cjekel / piecewise_linear_fit_py

fit piecewise linear data for a specified number of line segments
MIT License

Error for coefficients of linear equations #87

Open eloubrown opened 3 years ago

eloubrown commented 3 years ago

Is there a way to determine the standard error in the slopes for each fitted linear equation segment? Thanks

cjekel commented 3 years ago

The model parameters we fit are the beta parameters, in the form Aβ = y. For these beta parameters, it is straightforward to calculate a standard error for each. It is not straightforward to get p-values for the slopes, because the slopes are not independent in this formulation. If you were to fit a discontinuous piecewise linear model, getting the p-values for the slopes would be easier.

Also take a look at https://github.com/cjekel/piecewise_linear_fit_py/issues/86

doronbehar commented 5 months ago

Hey! I too wanted to open this issue. Thanks for creating this library, it is very useful!

For these beta parameters, it is straight forward to calculate a standard error for each.

If it were straightforward, I wouldn't have failed to find it in the docs, would I? :)

Also take a look at #86

I read https://github.com/cjekel/piecewise_linear_fit_py/issues/86#issuecomment-841886608 and I don't understand how relevant the p-values are to the errors of the slopes that we are asking about...

Ideally, I'd like to get from pwlf a correlation matrix for the fit, which I can plug into uncertainties.correlated_values()...

cjekel commented 5 months ago

Fair enough :) . I'm glad you've found this work useful!

Getting the standard errors for the model parameters is here https://jekel.me/piecewise_linear_fit_py/examples.html#non-linear-standard-errors-and-p-values

p = my_pwlf.p_values(method='non-linear', step_size=1e-4)
se = my_pwlf.se  # standard errors

So se[0] is for the y intercept, se[1] is for the first slope... but this is where it gets tricky. The second slope is beta[1]+beta[2]. So if you assume the standard errors follow normal distributions, then you would need to combine those two normal distributions to get a distribution for the second slope. And so forth: slope 3 is beta[1]+beta[2]+beta[3].
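As a rough sketch of that combination step: if one (incorrectly) treated the betas as independent, the variances of the increments would simply add. The numbers below are made up for illustration only; in reality the off-diagonal covariance terms matter, which is the whole point of the covariance-matrix discussion.

```python
import math

# hypothetical standard errors: se[0] = intercept, se[1..3] = slope increments
se = [0.05, 0.10, 0.08, 0.12]

# slope 1 = beta[1], slope 2 = beta[1]+beta[2], slope 3 = beta[1]+beta[2]+beta[3]
# IF the betas were independent normals, the variances would simply add:
se_slope1 = se[1]
se_slope2 = math.sqrt(se[1] ** 2 + se[2] ** 2)
se_slope3 = math.sqrt(se[1] ** 2 + se[2] ** 2 + se[3] ** 2)
```

Because the betas in this model are correlated, a faithful error estimate needs the full covariance matrix rather than this diagonal-only shortcut.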


The covariance matrix.

If you are doing linear regression (fits with known breakpoint locations), then I have my regression matrix A (shown as an image in the original comment).

You can calculate your own A regression matrix in the code with:

mypwlf.assemble_regression_matrix(breaks, x)

Then your covariance matrix should be Σ = σ²[AᵀA]⁻¹.

I do this calculation here in the code:

                A2inv = np.abs(linalg.pinv(np.dot(A.T, A)).diagonal())

https://github.com/cjekel/piecewise_linear_fit_py/blob/2e2883095fbf1783a71d52d1868f661e24f53f17/pwlf/pwlf.py#L1237C1-L1237C18
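As an illustration of the Σ = σ²[AᵀA]⁻¹ formula, here is a sketch with a plain straight-line regression matrix standing in for the pwlf one (all data below is synthetic; in pwlf the A matrix would come from `assemble_regression_matrix`):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 + 3.0 * x + rng.normal(0.0, 0.1, x.size)

# stand-in regression matrix with columns [1, x]; pwlf builds the
# piecewise-hinge version of this
A = np.column_stack([np.ones_like(x), x])

beta, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ beta
# unbiased estimate of the residual variance sigma^2
sigma2 = resid @ resid / (len(y) - A.shape[1])
# covariance matrix Sigma = sigma^2 [A^T A]^{-1}
cov = sigma2 * np.linalg.pinv(A.T @ A)
# standard errors of the beta parameters: sqrt of the diagonal
se = np.sqrt(np.abs(cov.diagonal()))
```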

Now if you are doing non-linear regression (unknown breakpoints), then you can use the delta method (finite differences) in a Taylor series approximation to linearize the non-linear equation. This gives you an approximate A matrix, from which you can also get an approximate covariance matrix using the same method. I do the delta routine here https://github.com/cjekel/piecewise_linear_fit_py/blob/2e2883095fbf1783a71d52d1868f661e24f53f17/pwlf/pwlf.py#L1199


I hope that helps. There is no easy way in the library to get the standard errors of the slopes, mainly because the slopes are not independent in this model form. If the slopes were independent, then the model would not be C0 continuous.

Maybe an easy strategy would be to assume each slope is independent within a set of breakpoints. Then perturb that slope to approximate a standard error. The problem here is the overall model form is willing to sacrifice a single slope region if this results in better overall model accuracy, so you would not be getting errors that reflect the model form that you used.

doronbehar commented 5 months ago

Wow thanks for the detailed answer! It is super appreciated. To help myself understand the terms you use, I'm relying upon your blogpost to get the definition of beta vs the slope...

So se[0] is for the y intercept, se[1] is for the first slope... but this is where it gets tricky. The second slope is beta[1]+beta[2]. So if you assume standard error is following a normal distribution, then you would need to combine those two normal distributions to get a distribution for the second slope. And so forth, as slope 3 is beta[1]+beta[2]+beta[3].

Reading the above was confusing for me: is the .se attribute the betas' standard errors, or the slopes' errors? The name se suggests slope errors, but could also mean standard errors... If it is the slopes' errors, then my issue is trivially solved, and the only problem is that .se is simply undocumented. If .se is the betas' errors, then uncertainties could help us calculate the correlation matrix! Here's an example:

import numpy as np
import pwlf
import uncertainties
import uncertainties.unumpy

pwlf_ = pwlf.PiecewiseLinFit(x, y)
pwlf_.fit(N)
# Need to calculate p_values so that the .se attribute will exist
pwlf_.p_values()
betas = uncertainties.unumpy.uarray(pwlf_.beta, pwlf_.se)
slopes = []
for topIdx in range(len(betas)):
    slopes.append(np.sum(betas[:topIdx+1]))
# slopes with errors to each are now contained in the `slopes` list, and the
# values are correlated. The correlation matrix could be shown with:
print(np.matrix(uncertainties.correlation_matrix(slopes)))

One thing I don't understand about my code is that the dimension of the matrix I get is (N+1)⨯(N+1) and not N⨯N. Why is that?

I also read your explanation of the difference between linear and non-linear fitting, but I haven't tried to implement it in my code yet, in the hope that the correct correlation matrix can be obtained from the code above.

cjekel commented 5 months ago

Yes, the equation of the line is y = β₁ + β₂(x − b₁) + β₃(x − b₂) + ⋯ + β_{n_b}(x − b_{n_b−1}), where n_b is the number of breakpoints,

but note the difference between the math starting at 1 and python starting at 0 for numbering.
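A minimal 0-indexed sketch of that equation, with hypothetical beta and breaks arrays; this mirrors the model form discussed here, not pwlf's internal implementation:

```python
import numpy as np

def piecewise_eval(x, beta, breaks):
    """Evaluate y = beta[0] + beta[1]*(x - breaks[0]) + beta[2]*(x - breaks[1]) + ...
    where each hinge term only turns on past its breakpoint (C0 continuous)."""
    x = np.asarray(x, dtype=float)
    y = np.full_like(x, beta[0])
    for k in range(1, len(beta)):
        y += beta[k] * np.where(x > breaks[k - 1], x - breaks[k - 1], 0.0)
    return y

# two segments: slope 1 on [0, 1], slope 1 + 1 = 2 after x = 1
y_vals = piecewise_eval([0.5, 1.5], beta=[0.0, 1.0, 1.0], breaks=[0.0, 1.0, 2.0])
```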

The .se attribute holds the standard errors of the model (beta) parameters, and in the non-linear case also of the breakpoints. I think it only gets populated after running the p-value calculation.

If .se is the betas' errors, then uncertainties could help us calculate the correlation matrix!

Yes, .se is the standard error for the beta values. If you do a non-linear fit, it will also include the breakpoints. I don't follow what uncertainties is doing, but yes that looks reasonable.

You may want to use the .calc_slopes() method after a fit, it's possible I over simplified the slopes in my previous message. It should return an array of the slopes from left to right.
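For reference, the slopes this returns are just cumulative sums of the slope-increment betas, per the model form above; a sketch with a made-up beta vector:

```python
import numpy as np

# hypothetical beta vector: [intercept, slope 1, increment 2, increment 3]
beta = np.array([1.0, 2.0, -0.5, 0.3])

# slope k is the running sum of the increments beta[1], ..., beta[k]
slopes = np.cumsum(beta[1:])  # [2.0, 1.5, 1.8]
```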

One thing I don't understand that happens to me in my code, is that the dimension of the matrix I get is (N+1)⨯(N+1) and not N⨯N. Why is that?

It's because you are grabbing the y intercept of the model.

doronbehar commented 5 months ago

I think it only gets populated after running the p-value calculation.

Yes I observed that as well, and added a comment in my code excerpt.

For some reason I also thought that pwlf_.fit(N) is considered non-linear... Reading this example once more now, it makes sense. Also, noting that β₁ is a constant added to all pieces of the fit, I understand why the dimension of the slopes I got is N+1. Here's a modified version of the code above:

pwlf_ = pwlf.PiecewiseLinFit(x, y)
pwlf_.fit(N)
# Need to calculate p_values so that the .se attribute will exist
pwlf_.p_values()
betas = uncertainties.unumpy.uarray(pwlf_.beta, pwlf_.se)
slopes = []
# Don't iterate beta[0], as it is not a slope, but a constant added to all
# pieces of the fit
for topIdx in range(len(betas[1:])):
    slopes.append(np.sum(betas[:topIdx+1]))
# slopes with errors to each are now contained in the `slopes` list, and the
# values are correlated. The correlation matrix could be shown with:
print(np.matrix(uncertainties.correlation_matrix(slopes)))

You may want to use the .calc_slopes() method after a fit

I don't think so, as I'd like uncertainties to calculate the slopes, relying on it to propagate the errors by itself...

and in the non-linear case also the breakpoints.

Could you please explain how the terms of the .se array are divided between the breakpoints and the betas in the non-linear case? In general I think .se, and this detail, should be documented. As a bonus, the documentation for this attribute, or the examples page, could include an uncertainties code example like mine - which would allow you to close this issue :).

doronbehar commented 5 months ago

Apparently both of my code excerpts above were counting the indices in pwlf_.beta wrongly... Here's a version I verified to be correct, via a numpy.allclose assertion.

pwlf_ = pwlf.PiecewiseLinFit(x, y)
pwlf_.fit(N)
# Need to calculate p_values so that the .se attribute will exist
pwlf_.p_values()
betas = uncertainties.unumpy.uarray(pwlf_.beta, pwlf_.se)
slopes = []
# Don't iterate beta[0], as it is not a slope, but a constant added to all
# pieces of the fit
for topIdx in range(len(betas[1:])):
    slopes.append(np.sum(betas[1:topIdx+2]))
# Check that the slopes calculated above are correct
assert np.allclose(
    [s.nominal_value for s in slopes],
    pwlf_.calc_slopes()
)
# slopes with errors to each are now contained in the `slopes` list, and the
# values are correlated. The correlation matrix could be shown with:
print(np.matrix(uncertainties.correlation_matrix(slopes)))