cjekel / piecewise_linear_fit_py

fit piecewise linear data for a specified number of line segments
MIT License

p-values do not seem accurate #86

Open · YungDurum opened this issue 3 years ago

YungDurum commented 3 years ago

Dear cjekel,

In my data analysis I tried to fit linear regressions (±160 segments) to a dataset. I determined the breakpoints of the data with an alternative package (ruptures). Subsequently, I used the pwlf package and the method .fit_with_breaks(my_breaks) to determine the slope of the lines. I now wish to determine the p-values of those slopes. For this I tried the method .p_values(method='linear', step_size=0.0001). I deleted the first value since it corresponds to the Y1 value, which I'm not interested in; I am only interested in the p-values of the slopes.
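For reference, a minimal sketch of that workflow (x, y, and my_breaks are placeholders for my data and the ruptures breakpoints):

import numpy as np
import pwlf

# x, y: the data; my_breaks: breakpoint locations found with ruptures
# (in the pwlf examples the breaks array also includes min(x) and max(x))
my_pwlf = pwlf.PiecewiseLinFit(x, y)
my_pwlf.fit_with_breaks(my_breaks)

slopes = my_pwlf.calc_slopes()
# p-values for the model parameters; the first one is the Y1/intercept term
p = my_pwlf.p_values(method='linear', step_size=1e-4)
p_rest = p[1:]  # drop the first value; these are what I took to be the slope p-values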

The results I now get do not seem accurate. Some lines that seem to have a very obvious correlation have a very high p-value, and some lines that seem very arbitrary have a low p-value. I added a few examples that show the p-values and the corresponding segment. Does anyone know what is wrong, or am I approaching this problem in the wrong way?

In the examples, the yellow hover box shows the p-value of the segment on the right and the other attributes of the line.

(Two screenshots of example segments with their fitted lines and p-values.)

I figured out that when I use my_pwlf.beta I get different values than from my_pwlf.calc_slopes().

cjekel commented 3 years ago

I'm going to work from the following code example.

import numpy as np
import matplotlib.pyplot as plt
import pwlf

# generate sin wave data
x = np.linspace(0, 10, num=100)
y = np.sin(x * np.pi / 2)
# add noise to the data
y = np.random.normal(0, 0.05, 100) + y

# initialize piecewise linear fit with your x and y data
my_pwlf = pwlf.PiecewiseLinFit(x, y)

res = my_pwlf.fitfast(10)

# predict for the determined points
xHat = np.linspace(min(x), max(x), num=10000)
yHat = my_pwlf.predict(xHat)

# plot the results
plt.figure()
plt.plot(x, y, 'o')
plt.plot(xHat, yHat, '-')
plt.show()

How do beta parameters relate to slopes?

This is a perfect place to post questions like this. I don't mind using issues like a discussion board!

The p-values you are looking at are for the beta values, not necessarily the slopes. In this formulation I fit the beta parameters, not the slopes directly. Each beta parameter can be considered to have its own p-value, but each slope is actually the result of multiple beta parameters.

So here you can see the translation between slopes and beta parameters.

slopes = my_pwlf.calc_slopes()
for i in range(my_pwlf.n_segments):
    slope = slopes[i]
    # the slope of segment i is the cumulative sum of beta[1] through beta[i+1]
    beta_slope = np.sum(my_pwlf.beta[1:i+2])
    print(f'Segment: {i}\tSlope: {slope:.3f}\tBeta Sum: {beta_slope:.3f}')

Which will print something like the following:

Segment: 0  Slope: 1.210    Beta Sum: 1.210
Segment: 1  Slope: -0.151   Beta Sum: -0.151
Segment: 2  Slope: -1.434   Beta Sum: -1.434
Segment: 3  Slope: -0.101   Beta Sum: -0.101
Segment: 4  Slope: 1.466    Beta Sum: 1.466
Segment: 5  Slope: 0.152    Beta Sum: 0.152
Segment: 6  Slope: -1.269   Beta Sum: -1.269
Segment: 7  Slope: 1.252    Beta Sum: 1.252
Segment: 8  Slope: -0.716   Beta Sum: -0.716
Segment: 9  Slope: -1.601   Beta Sum: -1.601

How do you calculate the p-values of the slopes?

In order to calculate the p-values of the slopes, you are going to need to use the delta method. This is an approximate method, which is needed here to translate small changes in the slopes into changes in the beta parameters. It is not currently implemented in pwlf.

The procedure will be similar to what is done in p_values with the non-linear method, except instead of applying the perturbations to the beta parameters we will apply them to the slopes. This requires translating small slope changes into small beta changes.

I think the following code is one way you can consider p-values.

from scipy import linalg, stats
n = my_pwlf.n_data
# nb is the number of model parameters that change: the betas plus the interior breakpoints
# (you should probably count the number of breakpoints, and the intercept...)
nb = my_pwlf.beta.size + my_pwlf.fit_breaks.size - 2
k = nb - 1
step_size = 1.0e-4
A = np.zeros((n, my_pwlf.n_segments))
slopes = my_pwlf.calc_slopes()
orig_beta = my_pwlf.beta.copy()
f0 = my_pwlf.predict(my_pwlf.x_data)
for i in range(my_pwlf.n_segments):
    # copy so the slice does not modify my_pwlf.beta (or orig_beta) in place
    betas = my_pwlf.beta[1:i+2].copy()
    n_betas = len(betas)
    # translate the step size in slope into a change in each contributing beta
    betas += (step_size / n_betas)
    my_pwlf.beta[1:i+2] = betas
    print(f'Slope {slopes[i]} Slope+FD {np.sum(betas)}')
    f = my_pwlf.predict(my_pwlf.x_data)
    A[:, i] = (f - f0) / step_size
    # restore the original betas (copy again so later iterations start clean)
    my_pwlf.beta = orig_beta.copy()

e = f0 - my_pwlf.y_data
variance = np.dot(e, e) / (n - nb)
A2inv = np.abs(linalg.inv(np.dot(A.T, A)).diagonal())
se = np.sqrt(variance * A2inv)

# calculate my t-value
t = slopes / se
p = 2.0 * stats.t.sf(np.abs(t), df=n - k - 1)
print(p)

You will not be happy with your p-values for slopes!

So you can play around with your degrees of freedom as to what is an actual model parameter and what isn't, but I suspect that you will not be happy with your p-values. A small change to a slope will result in large errors at some point in your model due to the continuity constraint. That large error is going to tank your p-value.

I don't know what's best to do here.

We could maybe try computing the errors in only one region at a time, but this really pushes on whether your model is fundamentally continuous or discontinuous.

What you should do instead

You should really look at doing lack-of-fit tests in the breakpoint regions. This will not give you a p-value for each slope, but rather a p-value for each slope region, by computing an F-statistic for each region. The F-statistic compares the variance in your data to the variance about your model.

Some resources on lack-of-fit

There is substantial lack-of-fit in the figures you posted.


I hope this helps a bit. What I would maybe first do is attempt to compute a lack-of-fit test for the entire model, then start to look at one region at a time. However, this has a very different meaning than p-values for a particular parameter.
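If it helps, here is a rough sketch of the region-at-a-time idea: for each breakpoint region, compare the spread of the data about its mean to the spread of the residuals about the piecewise fit with an F-ratio. This is only the variance comparison described above (a formal lack-of-fit test would need replicate x values for a pure-error estimate), and the degrees of freedom are approximate.

import numpy as np
from scipy import stats

# region-by-region F comparison of data variance vs variance about the model
# assumes my_pwlf has already been fit (so fit_breaks, x_data, y_data exist)
breaks = my_pwlf.fit_breaks
x = my_pwlf.x_data
y = my_pwlf.y_data
resid = y - my_pwlf.predict(x)

for i in range(my_pwlf.n_segments):
    mask = (x >= breaks[i]) & (x <= breaks[i + 1])
    n_i = int(mask.sum())
    if n_i < 3:
        continue  # not enough points in this region
    ss_total = np.sum((y[mask] - y[mask].mean()) ** 2)  # spread of the data about its mean
    ss_resid = np.sum(resid[mask] ** 2)                 # spread about the piecewise model
    # how much the model explains in this region relative to the leftover noise
    # (degrees of freedom are rough: 1 for the mean, ~2 for a local line)
    F = (ss_total - ss_resid) / (ss_resid / (n_i - 2))
    p = stats.f.sf(F, 1, n_i - 2)
    print(f'Region {i}: F = {F:.2f}, p = {p:.3g}')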

YungDurum commented 3 years ago

Thanks for the reply. I have a few questions related to your code.

  1. Why is the variance divided by a constant (n - nb) for every segment? Doesn't the variance depend on the number of data points per segment? I thought it should look something like this: (sum of squared errors / number of data points per segment).
  2. A2inv = np.abs(linalg.inv(np.dot(A.T, A)).diagonal()): what does this line of code do?
  3. Why is the p-value multiplied by two? Is this because it's a two-sided test?
  4. What is the reason for applying a small increment to the beta values? Isn't it possible to use the slope related to the segment directly?

Before your reply I gave it a try myself. I approached it by trying to calculate the SSE for each segment, using the slope and the breakpoints of the segments. From these values I tried to calculate the p-value.

I know it is a very poor fit. I tried many different parameters and settings to find the best fit. Unfortunately, due to time constraints, this is what I got and I have to make the best of it. In the end I wish to say something about the average speed (slope) of the whole dataset. Since the segments where the slope is zero are not relevant for this analysis, I wish to use only the segments where I can reject H0 (α = 0.05). Hopefully in this way I can filter out the regressions where the slope is zero, so they will not influence my conclusions too much. Thanks for the advice, I will definitely have a look at it and might use it.
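A simpler variant of this per-segment idea (not exactly what I did, but with the same intent) would be to refit each segment independently with scipy.stats.linregress and use its slope p-value directly. This ignores the continuity constraint, so it is only an approximation of the piecewise model:

import numpy as np
from scipy import stats

# refit each segment on its own and keep only segments whose slope is
# significantly different from zero (alpha = 0.05); continuity is ignored
breaks = my_pwlf.fit_breaks
x = my_pwlf.x_data
y = my_pwlf.y_data

significant = []
for i in range(my_pwlf.n_segments):
    mask = (x >= breaks[i]) & (x <= breaks[i + 1])
    if mask.sum() < 3:
        continue  # need at least 3 points for a slope p-value
    res = stats.linregress(x[mask], y[mask])
    if res.pvalue < 0.05:  # reject H0: slope = 0
        significant.append((i, res.slope, res.pvalue))

for i, slope, pval in significant:
    print(f'Segment {i}: slope = {slope:.3f}, p = {pval:.3g}')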

cjekel commented 3 years ago

I could be wrong about this, so feel free to question my logic here.

Why is the variance divided by a constant (n - nb) for every segment? Doesn't the variance depend on the number of data points per segment? I thought it should look something like this: (sum of squared errors / number of data points per segment).

I think because the model is continuous, each slope is dependent on all other slopes. Therefore, it seems appropriate to use the full dataset rather than the per-segment counts.

A2inv = np.abs(linalg.inv(np.dot(A.T, A)).diagonal()): what does this line of code do?

Follow the derivation in [1] https://mae.ufl.edu/nkim/Papers/paper54.pdf, specifically Section II.A. That line of code comes up in equation 9.
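In the notation of the code above, that line takes the diagonal of the inverse of np.dot(A.T, A), where column i of A is the finite-difference derivative of the model predictions with respect to slope i. Combined with the residual variance it gives the standard errors in the se = np.sqrt(variance * A2inv) line, i.e. the usual first-order estimate (what the paper derives, up to notation):

\hat{\sigma}^2 = \frac{e^\top e}{n - n_b}, \qquad \mathrm{SE}(\hat{m}_i) \approx \sqrt{\hat{\sigma}^2 \, \big[(A^\top A)^{-1}\big]_{ii}}

where m_i denotes the slope of segment i.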

Why is the p-value multiplied by two? Is this because it's a two-sided test?

Yes (at least I think so). The difference between one-sided and two-sided should just be this factor of 2.

What is the reason for applying a small increment to the beta values? Isn't it possible to use the slope related to the segment directly?

This is how the delta method works. You're trying to approximate the derivative of the model with respect to each parameter, and then you use these derivatives to compute a standard error. See [1] for the derivation.