Open eloubrown opened 3 years ago
The model parameters we fit are the beta parameters, in the form of Aβ = y. For these beta parameters, it is straightforward to calculate a standard error for each. It is not straightforward to get p-values for the slopes, because the slopes are not independent in this formulation. If you were to fit a discontinuous piecewise linear model, getting the p-values for the slopes would be easier.
Also take a look at https://github.com/cjekel/piecewise_linear_fit_py/issues/86
Hey! I too wanted to open this issue. Thanks for creating this library, it is very useful!
For these beta parameters, it is straightforward to calculate a standard error for each.
If it were straightforward, I wouldn't have failed to find it in the docs? :)
Also take a look at #86
I read https://github.com/cjekel/piecewise_linear_fit_py/issues/86#issuecomment-841886608 and I don't understand how relevant the p-values are to the errors of the slopes that we're asking for...
Ideally, I'd like to get from pwlf a correlation matrix for the fit, which I can plug into uncertainties.correlated_values()...
Fair enough :) . I'm glad you've found this work useful!
Getting the standard errors for the model parameters is shown here: https://jekel.me/piecewise_linear_fit_py/examples.html#non-linear-standard-errors-and-p-values
p = my_pwlf.p_values(method='non-linear', step_size=1e-4)
se = my_pwlf.se # standard errors
So se[0] is for the y-intercept, se[1] is for the first slope... but this is where it gets tricky. The second slope is beta[1] + beta[2]. So if you assume each standard error follows a normal distribution, then you would need to combine those two normal distributions to get a distribution for the second slope. And so forth: slope 3 is beta[1] + beta[2] + beta[3].
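Numerically, combining those distributions means adding variances plus a covariance cross term, because the betas are correlated. A minimal sketch with purely hypothetical numbers (the se and covariance values here are made up for illustration):

```python
import numpy as np

# Hypothetical standard errors of beta[1] and beta[2], and their
# covariance (the betas are not independent in this model form)
se1, se2 = 0.10, 0.15
cov12 = -0.005

# Slope 2 = beta[1] + beta[2], so Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y)
var_slope2 = se1**2 + se2**2 + 2.0 * cov12
se_slope2 = np.sqrt(var_slope2)

# Pretending the betas were independent would drop the cross term and
# give a different (generally wrong) answer
se_slope2_indep = np.sqrt(se1**2 + se2**2)
```

With a negative covariance, as here, the correct standard error is smaller than the naive independent estimate.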
The covariance matrix.
If you are doing linear regression (fits with known breakpoint locations), then I have my regression matrix as A. You can calculate your own A regression matrix in the code with:
mypwlf.assemble_regression_matrix(breaks, x)
Then your covariance matrix should be
\Sigma = \sigma^2[A^TA]^{-1}
I do this calculation here in the code:
A2inv = np.abs(linalg.pinv(np.dot(A.T, A)).diagonal())
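The full covariance calculation can be sketched with plain numpy. This uses a toy two-column regression matrix standing in for the one `assemble_regression_matrix` would return, and estimates the noise variance σ² from the residuals; all data here is synthetic:

```python
import numpy as np

# Toy linear-regression setup: A plays the role of the pwlf regression matrix
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
A = np.column_stack([np.ones_like(x), x])          # intercept + slope columns
y = 2.0 + 3.0 * x + rng.normal(0.0, 0.1, x.size)   # true betas: [2, 3]

# Least-squares fit and residual-based noise variance estimate
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ beta
dof = A.shape[0] - A.shape[1]                      # degrees of freedom
sigma2 = resid @ resid / dof

# Covariance matrix: Sigma = sigma^2 [A^T A]^{-1}
cov = sigma2 * np.linalg.pinv(A.T @ A)
se = np.sqrt(np.abs(cov.diagonal()))               # standard errors of the betas
```

The diagonal of `cov` gives the squared standard errors, matching the `A2inv` line above; the off-diagonal terms are the covariances needed to propagate errors to the slopes.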
Now if you are doing non-linear regression (unknown breakpoints), then you can use the delta method (finite differences) in a Taylor series approximation to linearize the non-linear equation. This then gives you an approximate A matrix. From there you can also get an approximate covariance matrix using a similar method. I do the delta routine here https://github.com/cjekel/piecewise_linear_fit_py/blob/2e2883095fbf1783a71d52d1868f661e24f53f17/pwlf/pwlf.py#L1199
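The finite-difference idea can be sketched in a few lines. The function and model below are toy stand-ins of my own, not pwlf's internals: each column of the approximate A matrix is a forward difference of the model output with respect to one parameter.

```python
import numpy as np

def approx_regression_matrix(f, x, theta, step=1e-4):
    """Linearize y = f(x; theta) around theta via forward differences,
    producing one column per parameter (an approximate A matrix)."""
    theta = np.asarray(theta, dtype=float)
    y0 = f(x, theta)
    cols = []
    for j in range(theta.size):
        pert = theta.copy()
        pert[j] += step
        cols.append((f(x, pert) - y0) / step)  # approx. dy/dtheta_j
    return np.column_stack(cols)

# Toy non-linear model with a breakpoint-like parameter theta[1]
def f(x, theta):
    return theta[0] * np.maximum(x - theta[1], 0.0)

x = np.linspace(0.0, 1.0, 20)
A_approx = approx_regression_matrix(f, x, [2.0, 0.5])
```

From `A_approx` you can reuse the linear-regression covariance formula above to get approximate standard errors for the non-linear parameters.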
I hope that helps. There is no easy way in the library to get the standard errors of the slopes, mainly because the slopes are not independent in this model form. If the slopes were independent, then the model would not be C0 continuous.
Maybe an easy strategy would be to assume each slope is independent within a set of breakpoints. Then perturb that slope to approximate a standard error. The problem here is the overall model form is willing to sacrifice a single slope region if this results in better overall model accuracy, so you would not be getting errors that reflect the model form that you used.
Wow, thanks for the detailed answer! It is super appreciated. To help myself understand the terms you use, I'm relying on your blog post for the definitions of beta vs. the slopes...
So se[0] is for the y-intercept, se[1] is for the first slope... but this is where it gets tricky. The second slope is beta[1] + beta[2]. So if you assume each standard error follows a normal distribution, then you would need to combine those two normal distributions to get a distribution for the second slope. And so forth: slope 3 is beta[1] + beta[2] + beta[3].
Reading the above was confusing for me: is the .se attribute the betas' standard errors, or the slopes' errors? The acronym se suggests slopes' errors, but it could also stand for standard errors... If it is the slopes' errors, then my whole issue is trivially solved, and the only remaining problem is that .se is simply undocumented. If .se is the betas' errors, then uncertainties could help us calculate the correlation matrix! Here's an example:
import numpy as np
import pwlf
import uncertainties

pwlf_ = pwlf.PiecewiseLinFit(x, y)
pwlf_.fit(N)
# Need to calculate p_values so that the .se attribute will exist
pwlf_.p_values()
betas = uncertainties.unumpy.uarray(pwlf_.beta, pwlf_.se)
slopes = []
for topIdx in range(len(betas)):
    slopes.append(np.sum(betas[:topIdx + 1]))
# slopes with errors for each are now contained in the `slopes` list, and the
# values are correlated. The correlation matrix can be shown with:
print(np.matrix(uncertainties.correlation_matrix(slopes)))
One thing I don't understand about my code: the dimension of the matrix I get is (N+1)⨯(N+1) and not N⨯N. Why is that?
I also read your explanation of the difference between linear and non-linear fitting, but I haven't tried to implement it in my code yet, in the hope that the correct correlation matrix can be obtained from the code above.
Yes, the equation of the line is y = β₁ + β₂(x − b₁) + β₃(x − b₂) + ⋯ + β_{n_b}(x − b_{n_b−1}), but note the difference between the math numbering starting at 1 and Python starting at 0.
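So in Python's 0-based indexing, the slopes are cumulative sums of the betas after the intercept. A tiny sketch with made-up beta values:

```python
import numpy as np

# Made-up beta values: beta[0] is the intercept; the rest accumulate into slopes
beta = np.array([1.0, 2.0, -0.5, 0.25])

# slope of segment k (0-based) = beta[1] + ... + beta[k + 1]
slopes = np.cumsum(beta[1:])
# slopes[0] = beta[1], slopes[1] = beta[1] + beta[2], ...
```

This is the same relationship the loops in the code excerpts below are expressing one term at a time.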
The .se attribute holds the standard errors of the model (beta) parameters, and in the non-linear case also of the breakpoints. I think it only gets populated after running the p-value calculation.
If .se is the betas' errors, then uncertainties could help us calculate the correlation matrix!
Yes, .se is the standard error for the beta values. If you do a non-linear fit, it will also include the breakpoints. I don't follow what uncertainties is doing, but yes, that looks reasonable.
You may want to use the .calc_slopes() method after a fit; it's possible I oversimplified the slopes in my previous message. It should return an array of the slopes from left to right.
One thing I don't understand about my code: the dimension of the matrix I get is (N+1)⨯(N+1) and not N⨯N. Why is that?
It's because you are grabbing the y intercept of the model.
I think it only gets populated after running the p-value calculation.
Yes I observed that as well, and added a comment in my code excerpt.
For some reason I also thought that pwlf_.fit(N) is considered non-linear... Reading this example once more now, it makes sense. Also, noting that β₁ is a constant added to all pieces of the fit, I understand why the dimension of the slopes I got is N+1. Here's a modified version of the code above:
import numpy as np
import pwlf
import uncertainties

pwlf_ = pwlf.PiecewiseLinFit(x, y)
pwlf_.fit(N)
# Need to calculate p_values so that the .se attribute will exist
pwlf_.p_values()
betas = uncertainties.unumpy.uarray(pwlf_.beta, pwlf_.se)
slopes = []
# Don't iterate beta[0], as it is not a slope, but a constant added to all
# pieces of the fit
for topIdx in range(len(betas[1:])):
    slopes.append(np.sum(betas[:topIdx + 1]))
# slopes with errors for each are now contained in the `slopes` list, and the
# values are correlated. The correlation matrix can be shown with:
print(np.matrix(uncertainties.correlation_matrix(slopes)))
You may want to use the .calc_slopes() method after a fit

I don't think so, as I'd like uncertainties to calculate the slopes, while relying on it to propagate the errors by itself...
and in the non-linear case also the breakpoints.

Could you please explain how the terms of the .se array are divided between the betas and the breakpoints in the non-linear case? In general I think .se, and this detail, should be documented. A bonus addition to this attribute's documentation, or to the examples page, could be an uncertainties code example like mine - which would allow you to close this issue :).
Apparently both of my code excerpts above were counting the indices in pwlf_.beta wrongly... Here's a version I verified to be correct, via a numpy.allclose assertion.
import numpy as np
import pwlf
import uncertainties

pwlf_ = pwlf.PiecewiseLinFit(x, y)
pwlf_.fit(N)
# Need to calculate p_values so that the .se attribute will exist
pwlf_.p_values()
betas = uncertainties.unumpy.uarray(pwlf_.beta, pwlf_.se)
slopes = []
# Don't iterate beta[0], as it is not a slope, but a constant added to all
# pieces of the fit
for topIdx in range(len(betas[1:])):
    slopes.append(np.sum(betas[1:topIdx + 2]))
# Check that the slopes calculated above are correct
assert np.allclose(
    [s.nominal_value for s in slopes],
    pwlf_.calc_slopes(),
)
# slopes with errors for each are now contained in the `slopes` list, and the
# values are correlated. The correlation matrix can be shown with:
print(np.matrix(uncertainties.correlation_matrix(slopes)))
Is there a way to determine the standard error in the slopes for each fitted linear equation segment? Thanks