cjekel / piecewise_linear_fit_py

fit piecewise linear data for a specified number of line segments
MIT License
305 stars 62 forks

pwlf with unknown line segments #88

Open GvdDool opened 3 years ago

GvdDool commented 3 years ago

I am trying to run the BayesianOptimization example, and am trying to understand the comments in your function def my_obj(x):

- define some penalty parameter l
- you'll have to arbitrarily pick this
- it depends upon the noise in your data --> how do you check this, and what are acceptable levels?
- and the value of your sum of the square of residuals --> how do I find/obtain this number?

Could you give some ranges and explain in more detail how the penalty parameter is affecting the results?

Your assistance would be most appreciated and a great help in understanding how the function works

cjekel commented 3 years ago

Penalty parameters generally range from 1e-1 to 1e-6, and yes it's super arbitrary.

If you are looking at automatically performing these fits in a more robust manner, check out this post https://github.com/cjekel/piecewise_linear_fit_py/issues/17#issuecomment-821732674 where I look for a variance ratio. You probably need at least 20 data points for that variance ratio to work. I think this is a very novel way to automatically fit these models (and I really need to write a paper on this).

So the Bayesian optimization is trying to minimize the sum of square of residuals (mypwlf.ssr) while penalizing the model complexity (number of line segments). As the number of line segments goes to infinity, the sum of square of residuals goes to zero. Also as the number of line segments goes to infinity, the penalty on model complexity also should go to infinity. It's a dance with the devil.

I would just try lambdas = [1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6] and see which one gives you the best visual fit. Do this for a couple cases in your data set, and then just fix the penalty parameter to that value.
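The trade-off described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: `segmented_ssr` is a hypothetical stand-in that fits independent lines on equal-count chunks (pwlf instead optimizes the breakpoints of a continuous fit), the helper names are made up for this example, and the data is synthetic. The objective has the form described in the thread: sum of squared residuals plus a penalty that grows with the number of segments.

```python
import random


def line_ssr(xs, ys):
    """Sum of squared residuals of an ordinary least-squares line."""
    n = len(xs)
    if n < 2:
        return 0.0  # zero or one point is fit exactly
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return sum((y - (mean_y + slope * (x - mean_x))) ** 2 for x, y in zip(xs, ys))


def segmented_ssr(xs, ys, n_segments):
    """Hypothetical stand-in for "fit n_segments, then read the SSR".

    Splits the x-sorted data into equal-count chunks and fits an
    independent line to each chunk.
    """
    n = len(xs)
    total = 0.0
    for k in range(n_segments):
        lo = k * n // n_segments
        hi = (k + 1) * n // n_segments
        total += line_ssr(xs[lo:hi], ys[lo:hi])
    return total


def penalized_objective(ssr, n_segments, lam):
    """Fit error plus a complexity penalty proportional to the segment count."""
    return ssr + lam * n_segments


def best_segment_count(xs, ys, lam, candidates=range(1, 7)):
    """Candidate segment count that minimizes the penalized objective."""
    return min(
        candidates,
        key=lambda n: penalized_objective(segmented_ssr(xs, ys, n), n, lam),
    )


# Synthetic data (not from this thread): the slope changes at x = 5.
random.seed(0)
xs = [i * 0.1 for i in range(100)]
ys = [(x if x < 5 else 10 - x) + random.gauss(0, 0.2) for x in xs]

for lam in [1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6]:
    print(f"lambda={lam:g} -> {best_segment_count(xs, ys, lam)} segments")
```

Because the penalty is added directly to the sum of squared residuals, a useful lambda depends on the magnitude of that sum, which is why the value has to be tuned per data set. With pwlf itself, the stand-in would presumably be replaced by fitting the model for a given number of segments and reading its `ssr` attribute.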

GvdDool commented 3 years ago

Thanks Charles. Knowing the range helps, although it may not solve my problem. I managed to run the model with 1e-1, but I fear my data set is too noisy to benefit from smaller values. I have daily nighttime light intensities for one location and am trying to fit the piecewise linear function through the data, but the variance is very high.

The function with a fixed number of lines (see below) runs fine up to 4 lines, but adding more lines increases the run time exponentially, and setting the maximum number of line segments to 20 takes 1.5 hours on my laptop in a Jupyter Notebook. [image]

GvdDool commented 3 years ago

I used 12 line segments because this is the first point in the optimised graph. Using the suggested 19 doesn't make a visual difference, and the optimised values are very similar (if not identical). [image]

cjekel commented 3 years ago

The variance is very high in your case, and you may benefit from trying this https://github.com/cjekel/piecewise_linear_fit_py/issues/17#issuecomment-821732674 but replace x and y with your own data. It should be biased toward using very few line segments, and it should also run much faster than the Bayesian optimization routine.

GvdDool commented 3 years ago

Thanks Charles, I will check the issue, and compare the results.

One other thing I am going to try is to smooth my data with a 7-day moving average, which should remove most of the noise in the data. I already tried this averaging to get the data stationary, and the 7-day period gives the best results (a clear trend).
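For reference, a 7-day moving average of the kind mentioned here can be sketched as follows. This is a minimal illustration with made-up values; the function name and the choice to drop partial windows are assumptions, not something specified in the thread.

```python
def moving_average(values, window=7):
    """Average of each full window of consecutive values.

    Returns len(values) - window + 1 points; partial windows at the
    ends are dropped rather than padded.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


daily = [1, 2, 3, 4, 5, 6, 7, 8]  # made-up daily values
smoothed = moving_average(daily, window=7)
print(smoothed)
```

The smoothed series is shorter than the input by `window - 1` points, so the x values (dates) need to be trimmed to match before fitting.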

The reason I am trying your method is to have the piece-wise linear lines to check if there is a trend change after a known date. I can use the (known) date, but that won't prove that there is a trend change, that will (in my understanding) only show a different trend.

GvdDool commented 3 years ago

This is a view on the smoothed data: image

The event date is at the beginning of August. What I was not expecting is the decline before the event; in theory the change should have been much more abrupt. So something else is happening before the event (likely the COVID-19 confinements are interfering with the nighttime light intensities in the area). Best, Gijs

GvdDool commented 3 years ago

Hi Charles, quick update: your method from #17 is giving some promising results. The method suggests 2 lines, but I think 3 segments tell the story better. [image]

cjekel commented 3 years ago

What was the F ratio for both cases?

    F = sigma_hat / sigma

Maybe it's better to pick the one that is closest to 1.0, since one is over and the other is under.
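The selection rule suggested here can be written down in a few lines. This is only a sketch: the function name is hypothetical, and the F values below are placeholders for illustration, not the values from this thread.

```python
def closest_to_one(f_ratios):
    """Return the segment count whose F ratio is nearest to 1.0.

    f_ratios maps a number of line segments to its F = sigma_hat / sigma.
    """
    return min(f_ratios, key=lambda n: abs(f_ratios[n] - 1.0))


# Placeholder F ratios (illustrative only): one above 1.0, one below.
example = {2: 1.4, 3: 0.9}
print(closest_to_one(example))
```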

trueParadise commented 2 years ago

Hi Charles, Did you check my pull request? Please let me know, thanks.
