Suggestion: batch data processing

cjekel / piecewise_linear_fit_py

fit piecewise linear data for a specified number of line segments

MIT License

Suggestion: batch data processing #61

Open vkhodygo opened 4 years ago

vkhodygo commented 4 years ago

I have a problem that requires a piecewise approximation of several similar datasets. My current solution is to fit the datasets one by one, extract the slopes, and then compute their mean and standard deviation. However, I found someone with a similar issue, and the solution offered there was to "fit a multilevel model". Would it be possible to implement this feature?

cjekel commented 4 years ago

What do you think about this approach in your case?

First, perform an initial fit to all of your datasets (combined into one single large dataset) to find the breakpoints. This part would be expensive.

Once you know the breakpoints, you can fit each individual dataset with fit_with_breaks. This part would be cheap.
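
The second, cheap stage can be sketched with plain NumPy. This is only an illustration of the idea, not pwlf's code: the helper names (design_matrix, fit_fixed_breaks), the synthetic data, and the breakpoint values are all assumptions, and the breakpoints are taken as already known from the pooled fit. With fixed breakpoints, a continuous piecewise linear fit reduces to ordinary linear least squares on a hinge basis.

```python
import numpy as np


def design_matrix(x, breaks):
    """Hinge basis for a continuous piecewise linear function.

    breaks holds every breakpoint, including both end points.
    """
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x), x - breaks[0]]
    for b in breaks[1:-1]:
        cols.append(np.clip(x - b, 0.0, None))
    return np.column_stack(cols)


def fit_fixed_breaks(x, y, breaks):
    """Least squares fit with known breakpoints; returns the segment slopes."""
    beta, *_ = np.linalg.lstsq(design_matrix(x, breaks), y, rcond=None)
    # Each hinge coefficient is a change in slope, so a running sum of
    # beta[1:] gives the slope of each segment.
    return np.cumsum(beta[1:])


# Synthetic stand-in data: 5 datasets on a shared x grid, each with its own slopes.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 41)
breaks = np.array([0.0, 4.0, 10.0])  # assumed output of the expensive pooled fit
datasets = []
for _ in range(5):
    s1 = 1.0 + 0.1 * rng.standard_normal()
    s2 = -0.5 + 0.1 * rng.standard_normal()
    y = np.where(x < 4.0, s1 * x, s1 * 4.0 + s2 * (x - 4.0))
    datasets.append(y + 0.05 * rng.standard_normal(x.size))

# One cheap linear solve per dataset, then summary statistics of the slopes.
slopes = np.array([fit_fixed_breaks(x, y, breaks) for y in datasets])
slope_mean = slopes.mean(axis=0)
slope_std = slopes.std(axis=0, ddof=1)
print(slope_mean, slope_std)
```

Each per-dataset fit is a single linear solve, which is why this stage is cheap compared with searching for the breakpoints.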

I've been thinking about creating a tools.py module in the library for things like this, and this might be a good candidate.

vkhodygo commented 4 years ago

@cjekel

> First, perform an initial fit to all of your datasets (combined into one single large dataset) to find the breakpoints. This part would be expensive.

Indeed, that is a convenient approach. However, all of my datasets are measured at the same x points, so the combined dataset would contain f(x_1) = y_1, f(x_1) = y_2, and so on. I'm not sure the fit has well-defined behaviour in that case, but I might be wrong.
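
For what it's worth, repeated x values are not a problem for least squares itself: the objective is just a sum of squared residuals over all points. A small NumPy check, assuming a shared x grid and a fixed breakpoint (the data and the values b0, b1 are made up for illustration), suggests that pooling replicates gives the same coefficients as fitting their point-wise mean.

```python
import numpy as np

# Shared x grid and fixed breakpoints (both are illustrative assumptions).
x = np.linspace(0.0, 10.0, 21)
b0, b1 = 0.0, 4.0

# Hinge basis for two continuous segments with the interior breakpoint at b1.
A = np.column_stack([np.ones_like(x), x - b0, np.clip(x - b1, 0.0, None)])

# Three noisy replicates measured at the same x locations.
rng = np.random.default_rng(1)
truth = np.where(x < b1, x, b1 - 0.5 * (x - b1))
Y = truth + 0.1 * rng.standard_normal((3, x.size))

# Fit 1: stack every replicate, so each x value appears three times.
A_pooled = np.vstack([A] * 3)
beta_pooled, *_ = np.linalg.lstsq(A_pooled, Y.ravel(), rcond=None)

# Fit 2: fit the point-wise mean of the replicates.
beta_mean, *_ = np.linalg.lstsq(A, Y.mean(axis=0), rcond=None)

# Compare the two coefficient vectors.
print(np.allclose(beta_pooled, beta_mean))
```

The pooled fit is therefore well defined, but it returns a single shared curve. It says nothing about how the slopes or breakpoints vary between datasets, which still requires fitting each dataset separately.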

cjekel commented 4 years ago

Do you know where the breakpoints should be ahead of time, or do you need to find them first?

I'll code an example shortly of what I'm thinking.

cjekel commented 4 years ago

This Jupyter notebook describes what I was thinking:

https://github.com/cjekel/piecewise_linear_fit_py/blob/master/examples/experiment_with_batch_process.py.ipynb

vkhodygo commented 4 years ago

Hey @cjekel, sorry, I'm knee-deep in my thesis and keep switching between writing and actual calculations.

a) I don't know the positions of the breakpoints, but I can use one of the sets to find them first. However, since these positions are not completely fixed from dataset to dataset, the result may vary.

b) I've checked the notebook, and my question stays the same: is it correct in general to pass a multivalued function instead of a bunch of regular ones?

cjekel commented 4 years ago

@vkhodygo Sorry for the delay.

Since the breakpoints may vary from set to set (nonlinear regression), it's not easy to apply that multilevel model (linear regression).

What you could do to speed things up, though, is use the breakpoints found for set j as the starting guess for fit_guess() on set j+1. This optimization should be much faster than running fit(). Then, at the end, you'll have to take the standard deviations of both your slopes and the breakpoints across all sets.
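
A rough sketch of that chaining, using NumPy only. Everything here is an assumption for illustration: the data are synthetic, there is a single interior breakpoint, and a grid search in a window around the previous result stands in for pwlf's guess-started optimizer (it is not pwlf's algorithm). The helper names ssr_and_slopes and best_break are hypothetical.

```python
import numpy as np


def ssr_and_slopes(x, y, b):
    """Least squares fit of two continuous segments with the break at b."""
    A = np.column_stack([np.ones_like(x), x - x[0], np.clip(x - b, 0.0, None)])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid), np.cumsum(beta[1:])


def best_break(x, y, candidates):
    """Return the candidate break location with the smallest residual."""
    return min(candidates, key=lambda b: ssr_and_slopes(x, y, b)[0])


# Synthetic stand-in data: the break location drifts a little between sets.
rng = np.random.default_rng(2)
x = np.linspace(0.0, 10.0, 101)
true_breaks = [4.0, 4.3, 3.8, 4.1]
datasets = [
    np.where(x < b, x, b - 0.5 * (x - b)) + 0.05 * rng.standard_normal(x.size)
    for b in true_breaks
]

# Set 0: expensive global search over the whole interior of the domain.
guess = best_break(x, datasets[0], x[2:-2])

breaks, slopes = [], []
for y in datasets:
    # Cheap local search in a window around the previous result.
    window = np.linspace(max(guess - 1.0, x[1]), min(guess + 1.0, x[-2]), 81)
    guess = best_break(x, y, window)
    breaks.append(guess)
    slopes.append(ssr_and_slopes(x, y, guess)[1])

breaks, slopes = np.array(breaks), np.array(slopes)
print("break mean/std:", breaks.mean(), breaks.std(ddof=1))
print("slope mean/std:", slopes.mean(axis=0), slopes.std(axis=0, ddof=1))
```

The window width is a design choice: it has to be wider than the largest expected drift of the breakpoint between consecutive sets, otherwise the local search can miss the true location.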

I've been thinking about adding a tools section to pwlf with scripts that could be useful for calling pwlf. This type of process could go there.