mrgreywater opened this issue 6 years ago
I definitely like this idea of making the cost function pluggable. If you're interested in pushing a PR, that'd be great. Otherwise, I can probably get around to adding this functionality eventually, although I'm not sure how soon.
@mrgreywater Thank you! @StephenKappel How can I increase the number of segments? I want to see red lines on the chart, but `min_stop_frac` doesn't help: https://puu.sh/AP63P/967d3d3243.png
I had similar issues. Initially I only got 2 regression partitions; after applying @mrgreywater's changes I got 3. However, no matter the value of `min_stop_frac`, I only get 3 partitions even though there are 4 very distinct partitions.
@StephenKappel any suggestions? I would love to play around with this and make a PR.
Hey @jrtechs -- I'd be happy to have a PR to make this better! Here are some ideas...
Currently, the algorithm only remembers the segments from what it thinks is the best state so far. It does this based on the increase in total error (i.e., the cost of the merge). It does not consider how many merges remain before it ends up with a single segment, nor does it consider granular information about the cost of merges in the past.
The algorithm is trying to catch the big jump in total error that normally accompanies the one-merge-too-far. Something like shown here...
I think the problem arises when that sudden ramp-up in error begins with an increase that isn't itself the single largest error increase.
For example, the algorithm will do the right thing with this series of merge costs:
1, 1, 1, 2, 3, 3, 3, 4, 5, 5, 5, 5, 100, 125, 100, 105
But it will go one merge too far for this series of merge costs:
1, 1, 1, 2, 3, 3, 3, 4, 5, 5, 5, 5, 50, 105, 100, 125
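To make the failure mode concrete, here's a small sketch of the biggest-jump heuristic (a standalone illustration of the idea described above, not piecewise's actual internals) applied to the two cost series:

```python
def tipping_point(merge_costs):
    """Index of the merge where the single largest jump in cost occurs.

    Mimics, in spirit, the heuristic of stopping at the biggest
    cost increase; this is an illustrative re-implementation only.
    """
    # Jump going into merge i+1 is costs[i+1] - costs[i].
    jumps = [b - a for a, b in zip(merge_costs, merge_costs[1:])]
    biggest = max(range(len(jumps)), key=jumps.__getitem__)
    return biggest + 1  # the merge that produced the jump

good = [1, 1, 1, 2, 3, 3, 3, 4, 5, 5, 5, 5, 100, 125, 100, 105]
bad  = [1, 1, 1, 2, 3, 3, 3, 4, 5, 5, 5, 5, 50, 105, 100, 125]

print(tipping_point(good))  # 12 -> stops right where the error explodes
print(tipping_point(bad))   # 13 -> one merge too far; the ramp-up began at merge 12
```

In the second series, the jump from 5 to 50 starts the explosion, but the larger single jump (50 to 105) comes one merge later, so the heuristic stops late.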
We could try remembering the full sequence of merge costs (and the corresponding segmentations) and choosing the stopping point retrospectively, after all merges are done. The hope would be that, in retrospect, we'd be able to make smarter choices than we can on the fly.
A simpler solution for the specific example you provide could be a parameter requiring a more significant increase in cost before a merge is considered the tipping point. On this line, `cost_increase == biggest_cost_increase` could become something like `cost_increase >= 2.0 * biggest_cost_increase`, where the `2.0` would be parameterized.
@StephenKappel I will definitely look into that. Just for clarification, what would be the difference between `min_stop_frac` and this new `error_increase_tolerance`? Does `min_stop_frac` define the increase in error allowed before we stop merging, while `error_increase_tolerance` would help push more partitions/buckets to be merged?
The main motivation for `min_stop_frac` is to prevent the algorithm from giving a suboptimal solution when the "best" solution is a single line segment. It prevents the algorithm from stopping merging too soon when no single merge has yet accounted for a large fraction of the total error.

A new parameter would help the algorithm stop sooner, but this wouldn't contradict `min_stop_frac`. That is, in order to stop merging, the `min_stop_frac`-based threshold would still have to be met. However, once that threshold is met (and it remains met for every subsequent iteration), we need a new parameter to prevent too much merging.
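To illustrate how the two parameters could compose (the helper and the `error_increase_tolerance` name are hypothetical; `min_stop_frac` here mirrors piecewise's parameter in spirit, not its exact code):

```python
def should_stop(cost_increase, total_error, biggest_cost_increase,
                min_stop_frac=0.03, error_increase_tolerance=2.0):
    """Sketch of a two-gate stopping condition.

    Gate 1 (min_stop_frac): a single merge must account for a large
    enough fraction of the total error before stopping is considered.
    Gate 2 (new parameter): the merge must also dwarf the biggest
    increase seen so far, to avoid stopping on a modest local bump.
    """
    gate1 = cost_increase >= min_stop_frac * total_error
    gate2 = cost_increase >= error_increase_tolerance * biggest_cost_increase
    return gate1 and gate2

# A big jump that clears both gates stops the merging:
print(should_stop(105, 200, 50))    # True
# A modest bump fails the fraction gate even if it is a new maximum:
print(should_stop(55, 2000, 50))    # False
```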
With my dataset, no matter what I entered as `min_stop_frac`, it would always return the same number of segments. It would merge large segments that previously had almost no deviation (creating a visible error) instead of merging small segments with bigger deviations. Taking the square root of the squared error returned by the OLS linear regression function fixed it, reduced the error, and improved the result immensely. I replaced this line: https://github.com/DataDog/piecewise/blob/3a15a1c3113cbbecf979bb318f19f2c7fbdc9408/piecewise/regressor.py#L301 with a version that takes the square root of that squared error.
I'm not sure if this should be pushed in a PR, as it would have to be checked against more data. But maybe it would make sense from an API standpoint to let the user supply their own linear regression/cost function, so one could use a more robust regression like the Theil–Sen estimator or RANSAC in case the dataset has outliers.
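piecewise doesn't currently expose a cost-function hook, but here is what a user-supplied robust cost might look like if it did: a minimal Theil–Sen fit (median of pairwise slopes) whose sum of absolute residuals shrugs off a single outlier. Everything here is a hypothetical sketch, including the `cost_fn` parameter mentioned afterwards.

```python
from itertools import combinations
import numpy as np

def theil_sen_cost(x, y):
    """Hypothetical pluggable cost function: Theil-Sen line fit
    (robust to outliers) scored by the sum of absolute residuals.
    """
    # Median of all pairwise slopes is robust to a minority of outliers.
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i, j in combinations(range(len(x)), 2)
              if x[j] != x[i]]
    slope = np.median(slopes)
    intercept = np.median(y - slope * x)
    return float(np.sum(np.abs(y - (slope * x + intercept))))
```

A pluggable interface might then be called as `piecewise(t, v, cost_fn=theil_sen_cost)` (a hypothetical signature, not the current API).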
Thank you lots for this library, it helped me quite a bit.