beeminder / road

Beebrain and Visual Graph Editor
http://graph.beeminder.com

Polynomial fit is often pretty dumb #152

Closed dreeves closed 2 years ago

dreeves commented 4 years ago

We get complaints about things like this occasionally, how the aura can do silly things like curve up at the end despite the data trending pretty unambiguously down:

[image: example graph where the aura curves up at the end despite the downward trend]

It would be really fun to find something better!

Notes / Ideas

  1. Tikhonov regularization
  2. Abe Othman recommended a slightly modified ridge regression (a special case of Tikhonov regularization)
  3. Kriging
  4. Discussion in Slate Star Codex comments from 2015: "Any kind of smoothing is OK, though I'd stay away from a box filter (they tend to introduce artifacts). I use a 7-day variance gaussian for my regular plot (along with the individual data points) since it makes longer-term trends neatly visible."
  5. Kernel density estimation
  6. Twitter thread (image below; looks like R's geom_smooth uses a loess smoother by default)
  7. StackOverflow question
  8. Local regression
  9. Beeminder forum thread
  10. Kalman filter
  11. Savitzky-Golay filter (see the sketch after this list)
  12. Mathematica's smoothing functions
  13. An implementation of smoothing splines
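
For a sense of scale, item 11 (Savitzky-Golay) is essentially a one-liner with SciPy; the window length and polynomial order below are made-up illustration values, not anything Beebrain uses:

```python
import numpy as np
from scipy.signal import savgol_filter

y = np.cumsum(np.random.normal(size=200))  # stand-in for a goal's daily data
# Window length must be odd; 31 days and a cubic are arbitrary choices here.
smoothed = savgol_filter(y, window_length=31, polyorder=3)
```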

[image: screenshot of the Twitter thread]

Verbata: math nerdery, data smoothing, polyfit, graph aesthetics, causal vs acausal filters, moving average,

pjjh commented 4 years ago

What happens if we constrain the degree of the polynomial based on the length of the x axis? i.e. how much time is in the viewport?

By default, that's the whole history, of course. A new goal could be a straight swathe (degree 1); once you've got more than N datapoints you get a quadratic, and then you add a degree every quarter or year?
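
A minimal sketch of one way that rule could look, with entirely made-up thresholds (30 datapoints before going quadratic, one extra degree per ~90 days of history, capped at degree 6):

```python
def poly_degree(n_datapoints, days_of_history):
    """Hypothetical rule: more history earns a higher polynomial degree."""
    if n_datapoints <= 30:                    # brand-new goal: straight line
        return 1
    return min(2 + days_of_history // 90, 6)  # quadratic, then +1 per quarter
```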

dreeves commented 3 years ago

@pjjh, smart, I think that would improve some cases but may be fussy to get right in general. Like, it might be hard to get a Pareto improvement from that trick? I'm personally more excited to get it Really Right with one of the ideas in the list above, but it obviously hasn't been the highest priority for me, so anyone should feel free to take a crack at it in the meantime!

Eugenio-Bruno commented 3 years ago

I don't know the right stats terminology, but what about "cross-validation"? You divide the data into 4 buckets, fit polynomials of degree 1...N on 75% of the dataset, and evaluate the error on the other 25%. Then you repeat this, using each of the buckets in turn for validation, and pick the polynomial of degree K with the lowest sum of validation errors (computed only on the 25% you didn't fit each time), where the error could be the absolute or squared difference from the points to the line.

Eugenio-Bruno commented 3 years ago

This is an example implementation of the above idea. I've run it once with absolute loss and once with squared loss; nothing is cherry-picked. Indeed, 6.png with absolute loss doesn't look great. I think that's because I don't allow polynomial degrees >10, but as you'll see when running this script in the terminal, the errors become gigantic when high-degree polynomials overfit, so it's probably safe to set the limit way higher. There are four magic constants, which I'm not sure would work super great on real data, but who knows: MIN_POLY_DEGREE (lowered when there are fewer than 10 × MIN_POLY_DEGREE datapoints) and MAX_POLY_DEGREE (capped when there are fewer than 3 × MAX_POLY_DEGREE datapoints), with 3 and 10 being the other two magic numbers. The rest is all as straightforward as you could implement it.

Click to expand examples and code:

[examples_abs_loss.zip](https://github.com/beeminder/road/files/5709891/examples_abs_loss.zip)
[examples_squared_loss.zip](https://github.com/beeminder/road/files/5709892/examples_squared_loss.zip)

```python
import itertools
import os
import random

import matplotlib.pyplot as plt
import numpy as np

MIN_POLY_DEGREE = 2
MAX_POLY_DEGREE = 10

os.makedirs("examples", exist_ok=True)

for plot_idx in range(10):
    # Generate a synthetic "true" weight series made of a few linear phases.
    weight = 0
    points = [weight]
    n_phases = random.randrange(1, 10)
    for phase_idx in range(n_phases):
        # How steeply are we losing or gaining?
        steepness = random.uniform(-1, 1)
        steepness = (steepness ** 2) * (-1 if steepness < 0 else 1)

        # How many points does this phase last for?
        n_points = random.randrange(5, 30)
        for point_idx in range(n_points):
            weight += steepness
            points.append(weight)

    points = np.array(points)
    plt.plot(points, ':', label="underlying weight")

    # How fuzzy are the datapoints?
    fuzzyness = random.uniform(0.2, 2)
    points += np.random.normal(scale=fuzzyness, size=points.size)
    plt.plot(points, label="fuzzy weighings")

    # This is wrong! The splits are not equal sized.
    # Good enough for this demo, hopefully.
    splits = [[] for _ in range(4)]
    for idx, point in enumerate(points):
        random.choice(splits).append((idx, point))

    max_poly_degree = min(MAX_POLY_DEGREE, len(points) // 3)
    min_poly_degree = min(MIN_POLY_DEGREE, len(points) // 10)

    degrees_validation_errors = [0 for _ in range(max_poly_degree - min_poly_degree)]

    for fit_degree in range(min_poly_degree, max_poly_degree):
        print(f"Fitting polynomial of degree {fit_degree}...")
        for i in range(4):
            # Train on the three buckets we're not holding out, validate on the fourth.
            training = list(itertools.chain.from_iterable(
                split for idx, split in enumerate(splits) if idx != i))
            validation = splits[i]

            poly = np.polyfit([point[0] for point in training],
                              [point[1] for point in training],
                              fit_degree)
            poly = np.poly1d(poly)

            for point in validation:
                degrees_validation_errors[fit_degree - min_poly_degree] += \
                    abs(poly(point[0]) - point[1])

        print(f"Error: {degrees_validation_errors[fit_degree - min_poly_degree]}")
        print()

    best_degree = np.argmin(degrees_validation_errors) + min_poly_degree
    print()
    print(f"Minimum error was {degrees_validation_errors[best_degree - min_poly_degree]} "
          f"for degree {best_degree}")

    # Refit on the full series with the degree chosen by cross-validation.
    poly = np.polyfit([idx for idx in range(len(points))], points, best_degree)
    poly = np.poly1d(poly)

    plt.plot([poly(x) for x in range(len(points))],
             label=f"polynomial fit of degree {best_degree}")
    plt.legend()
    plt.savefig(f"examples/{plot_idx}.png")
    plt.show()

    print()
    print()
    print()
    print()
```

dreeves commented 3 years ago

Thanks Eugenio! You mean doing cross-validation offline, right? To find smoothing parameters that work well for real-world Beeminder graphs? Reasonable!

As for the actual smoothing algorithm, quick recap from offline discussion:

  1. Polynomial fit is fundamentally the wrong approach -- it's always going to introduce unjustified wiggles
  2. Non-causal filters are 👍
  3. Matlab's "filtfilt" could be a good approach? (A rough sketch of the SciPy equivalent follows below.)
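
For what it's worth, here's a minimal sketch of that zero-phase (forward-backward) filtering idea using SciPy's filtfilt, which mirrors Matlab's. The Butterworth order and cutoff are made-up illustration values, not anything Beebrain actually uses:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def acausal_smooth(y, cutoff=0.05, order=2):
    """Low-pass Butterworth filter run forward and backward, so no lag."""
    b, a = butter(order, cutoff)  # cutoff is a fraction of the Nyquist rate
    y = np.asarray(y, dtype=float)
    # filtfilt needs the series to be longer than its padding; guard tiny goals.
    padlen = min(3 * max(len(a), len(b)), len(y) - 1)
    return filtfilt(b, a, y, padlen=padlen)
```
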
Eugenio-Bruno commented 3 years ago

> Thanks Eugenio! You mean doing cross-validation offline, right? To find smoothing parameters that work well for real-world Beeminder graphs? Reasonable!

I actually thought more about doing it "online", as in doing it every time a datapoint is added. It's not that computationally expensive; it's just a bunch more fits to do instead of just one.

> Polynomial fit is fundamentally the wrong approach -- it's always going to introduce unjustified wiggles

Actually, I kind of see the point but I'm not sure I agree... well, I do agree, but I'm not sure the fundamentally right approach is also the right approach in practice.

> Non-causal filters are 👍

Yes, this might be a better option, and filtfilt is just one kind of non-causal filter, so one could even do fancier stuff. But the polynomial approach does have the advantage of being really, really simple. Figuring out a good cutoff for, e.g., an EMA filter is tricky: there are many more choices for that value than for the degree of a polynomial, and the parameter space isn't as straightforward as the degree of a polynomial either... Maybe there is a simple way to do it that works as well as my examples above, but I don't know enough signal processing to know what it would be.

Eugenio-Bruno commented 3 years ago

Here I'm using cross-validation to find the best number of knots to fit for each individual data series. The spline code is mostly copy-pasted from the internet, so it's definitely breaking a couple thousand licenses; use it only to decide whether the spline method is a good fit (haha, nice pun, me!).

Click to expand the wall of non-cherrypicked examples!

![0](https://user-images.githubusercontent.com/36923671/102604335-52e06d80-4124-11eb-978b-ff3ba69615e1.png)
![1](https://user-images.githubusercontent.com/36923671/102604339-54119a80-4124-11eb-9095-5057fbae2167.png)
![2](https://user-images.githubusercontent.com/36923671/102604341-54119a80-4124-11eb-9af7-15f0d41dfafa.png)
![3](https://user-images.githubusercontent.com/36923671/102604343-54119a80-4124-11eb-8e18-475580951ecf.png)
![4](https://user-images.githubusercontent.com/36923671/102604344-54aa3100-4124-11eb-85ee-132c2e9f6791.png)
![5](https://user-images.githubusercontent.com/36923671/102604346-54aa3100-4124-11eb-9142-ca5b64f5bd0d.png)
![6](https://user-images.githubusercontent.com/36923671/102604348-5542c780-4124-11eb-85d1-6150b3c9e168.png)
![7](https://user-images.githubusercontent.com/36923671/102604349-5542c780-4124-11eb-99cc-b52c163fa15c.png)
![8](https://user-images.githubusercontent.com/36923671/102604351-5542c780-4124-11eb-8018-1f7c1e51ff2c.png)
![9](https://user-images.githubusercontent.com/36923671/102604352-55db5e00-4124-11eb-827e-77b5a3f9766b.png)

beeminder_fit.zip
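
For context (the zip above has the actual code), a minimal sketch of the knots-by-cross-validation idea, assuming SciPy's LSQUnivariateSpline, evenly spaced interior knots, and made-up candidate knot counts:

```python
import numpy as np
from scipy.interpolate import LSQUnivariateSpline

def cv_spline(x, y, knot_counts=range(1, 8), n_folds=4, seed=0):
    """Pick the number of interior knots for a cubic spline by k-fold CV."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    folds = rng.integers(0, n_folds, size=len(x))  # random (unequal) buckets
    errors = {}
    for n_knots in knot_counts:
        # Evenly spaced knots strictly inside the data range.
        knots = np.linspace(x[0], x[-1], n_knots + 2)[1:-1]
        err = 0.0
        for fold in range(n_folds):
            train = folds != fold
            try:
                spline = LSQUnivariateSpline(x[train], y[train], knots)
            except ValueError:  # too many knots for this little data
                err = np.inf
                break
            err += np.abs(spline(x[~train]) - y[~train]).sum()
        errors[n_knots] = err
    best = min(errors, key=errors.get)
    knots = np.linspace(x[0], x[-1], best + 2)[1:-1]
    return LSQUnivariateSpline(x, y, knots), best
```

Here x would be the day numbers (sorted and strictly increasing) and y the datapoint values.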

Eugenio-Bruno commented 3 years ago

Also, while I don't have access to your internal anonymized data, I do have my own weight data (courtesy of a Beeminder goal reminding me to weigh myself!). Using the exact same spline method from above, I get this (cross-validation chooses 4 knots):

[image: spline fit of my own weight data]

saranli commented 2 years ago

There seem to be two separate features to update in Beebrain, which I'm not sure the above discussion distinguishes: 1. the moving average and 2. the aura. Since most of the discussion centered around polynomials, I was assuming people were referring to the aura when talking about updates, but I think both need to be updated.

I think the two features differ in their goals:

  1. The moving average essentially wants to be a descriptive tool that shows what happened with past data once daily fluctuations are cleaned up.
  2. The aura (I think) wants to be both a predictive model, showing what one should expect future data to look like, possibly to help plan ahead, and a descriptive visual that shows an "envelope" of how the data changed in the past. I am not sure it can do both of these things well.

Here are some technical thoughts on both:

  1. The purple moving average is at the moment a simple exponential filter, which is why it always introduces a delay and seems to lag behind the datapoints. Some of the suggestions above, including spline interpolation, could be used to replace it, but it also seems to me that a properly designed low-pass filter applied non-causally (i.e., once forward and once backward) would address most of the issues with the purple averaging filter. It also seems like we could augment the output of this filter with a variance envelope to show how the spread of the data around this average changed over time, somewhat like what the aura does right now.

  2. The predictive nature of the aura was why polynomial fits were used originally (and currently), to my knowledge. Obviously, for large goals with years of data, low-degree polynomials will fail to capture time-varying trends in the data, leading to the oddities that prompted this issue. I am not sure it will be possible to provide a truly predictive model for Beeminder, since the model would probably have to depend on more information than the goal's data alone provides. Nevertheless, it should be possible to do better, either with chains of multiple polynomial fits as suggested above or with some other kind of predictive model. It seems inevitable that any model would need to be piecewise in nature, since I suspect people's goals, constraints, and other factors change significantly over the course of long-running goals in particular. This could be handled by first computing a proper averaging filter as in (1) above, then analyzing trends in that average to select regions for the predictive model.

So, the following are among the things one might do:

  a. Replace the moving average filter with a properly designed low-pass, non-causal filter that does not introduce delays.
  b. Augment the averaging filter in (a) with local variance values to show an "envelope" similar to the aura, except without any predictive claims.
  c. Replace the polynomial fit for the aura with a different model, either a chain of polynomials segmented based on monotonic regions in the average curve from (a), neural nets, etc.
  d. Get rid of the aura altogether, since (b) now provides the "envelope" visualization and it is not reasonable to expect that we can predict anything about people's data behavior :)
  e. Limit the aura to just the past few weeks to predict the next week, and use something similar to the polynomial fit we have right now, augmented with the regularization ideas from above.

It seems to me that (a) and (b) are doable and should be done. (c) needs ideas and experimentation on actual data. (d) is probably just me being lazy. (e) probably is a good compromise?
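
As a rough sketch of (a) plus (b), assuming the acausal_smooth() helper from the filtfilt sketch earlier in this thread, the envelope could be as simple as a +/- k-sigma band from a rolling window of residuals; the window size and sigma multiplier below are made-up parameters:

```python
import numpy as np

def envelope(y, window=14, k=2.0):
    """Smoothed line plus a +/- k-sigma band from a rolling window of residuals."""
    y = np.asarray(y, dtype=float)
    center = acausal_smooth(y)          # the delay-free replacement average
    residuals = y - center
    half = window // 2
    spread = np.array([residuals[max(0, i - half):i + half + 1].std()
                       for i in range(len(y))])
    return center - k * spread, center, center + k * spread
```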

dreeves commented 2 years ago

Love these thoughts and great point about the distinction between the moving average line and the aura.

You mentioned https://github.com/markert/fili.js which looks smart.

I totally agree about prediction being hopeless here.

So, yeah, let's turn the moving average into a nice smooth acausal filter, kill the polyfit, and what we now call the aura can instead just be an option to turn on an envelope around that fit line.

dreeves commented 2 years ago

Some before/after shots of Uluc's new acausal moving average smoothing:

[13 before/after graph screenshots]

dreeves commented 2 years ago

http://beeminder.com/changelog#4136 http://beeminder.com/changelog#4137