Great post, thanks!
👍
Small update: @raphaelsty has implemented the above code in a little library called abayes. He's also included an auto-regressive variant. The latter makes use of the fact that Bayesian linear regression can be fitted in an online manner.
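For readers who haven't seen the post's update step, here is a minimal sketch of what one online update looks like. The function and argument names are mine; it only assumes a posterior mean vector and an inverse covariance matrix, which is how the code discussed further down this thread stores the posterior.

import numpy as np

# Minimal sketch of one online Bayesian linear regression update,
# assuming `beta` is the (known) noise precision.
def learn_one(mean, cov_inv, x, y, beta):
    # Posterior precision: add the scaled outer product of the new feature vector
    new_cov_inv = cov_inv + beta * np.outer(x, x)
    # Posterior mean: blend the previous precision-weighted mean with the new target
    new_cov = np.linalg.inv(new_cov_inv)
    new_mean = new_cov @ (cov_inv @ mean + beta * y * x)
    return new_mean, new_cov_inv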
Hello Max,
That was a great read! I now have a good understanding of Bayesian linear regression. I am still wondering, though, what the point of defining self.alpha is, as it turns out it is not used anywhere.
Hello @kosi96, happy to help out! The alpha parameter is used to instantiate the cov_inv variable.
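Concretely, alpha seeds the prior precision of the weights, along the lines of the snippet below (a sketch of the idea, not a verbatim quote of the post):

import numpy as np

# alpha scales the prior (inverse) covariance of the weights
n_features = 3
alpha = 1.0
mean = np.zeros(n_features)                 # prior mean of the weights
cov_inv = alpha * np.identity(n_features)   # prior precision, seeded by alpha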
I am trying to tweak your code so I can predict n time steps ahead. Is that even possible? My feature vector x does include lag variables of the target y, although there are still other features that are not being predicted.
Not sure 100% what you mean @kosi96, but I would recommend looking at what @raphaelsty has done in abayes if you haven't already.
Thank you for pointing me in the right direction (abayes). However, it forecasts n time steps into the future, and those future predictions are based on the period of a single signal. I would like to tweak it so it forecasts based on the history of n signals and without any period (stock markets usually don't have any periodicity).
That sounds very interesting and relevant. Alas I don't see any simple way to adapt the above code, apart from using a regressor chain. I'm sure there's some way to formulate a multi-output variant of Bayesian linear regression, I'm just not aware of it.
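For what it's worth, here is a hedged sketch of what a regressor chain could look like on top of the post's BayesLinReg. The class name, the horizon parameter, the constructor signature I assume for BayesLinReg, and the use of model.mean for point predictions are all my own assumptions, not something from the post or from abayes.

import numpy as np

class BayesLinRegChain:
    """Sketch of a regressor chain: one BayesLinReg per forecast horizon.

    The model for horizon h sees the original features plus the point
    predictions made for horizons 1..h-1. This is not a true multi-output
    Bayesian regression, just h chained single-output models.
    """

    def __init__(self, n_features, horizon, alpha=1.0, beta=1.0):
        self.horizon = horizon
        self.models = [
            BayesLinReg(n_features=n_features + h, alpha=alpha, beta=beta)
            for h in range(horizon)
        ]

    def predict(self, x):
        preds = []
        for model in self.models:
            x_h = np.concatenate([x, preds])
            preds.append(x_h @ model.mean)  # point prediction (posterior mean)
        return np.array(preds)

    def learn(self, x, ys):
        # ys holds the next `horizon` observed targets
        preds = []
        for model, y in zip(self.models, ys):
            x_h = np.concatenate([x, preds])
            model.learn(x_h, y)
            preds.append(x_h @ model.mean)
        return self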
Great post, thank you very much!
Thank you for writing this, it helped me get started with online regression. The variant with a smoothing factor is bugging me a bit.
The update equation for the covariance matrix goes from:
S_{i+1} = (S_i^{-1} + \beta x_i^\intercal x_i)^{-1}
to
S_{i+1} = (\gamma S_i^{-1} + (1 - \gamma)\beta x_i^\intercal x_i)^{-1}
With a smoothing factor of 0.8, the current covariance matrix now gets a weight 4x that of the contribution from the new observation, instead of the even split we had earlier without a smoothing factor. This feels odd since the smoothing factor was introduced to counter concept drift; wouldn't that involve weighing the new observations more?
@ananis25 indeed, you can just set alpha to 0.2 to get the opposite behaviour. You can also switch the \gamma and 1 - \gamma if it bugs you that much. Does that make sense?
Oh, I see now how the hyperparameter alpha can also affect this; thank you! The bit about swapping the \gamma and 1 - \gamma coefficients also makes sense.
I apologize if it came across as complaining, I was just curious about the intuition that led you to this. The update equation for the cov_inv term was a linear recurrence relation,
S_{i+1}^{-1} = S_i^{-1} + \beta x_i^\intercal x_i
to which we added coefficients parameterized by \gamma,
S_{i+1}^{-1} = \gamma S_i^{-1} + (1 - \gamma)\beta x_i^\intercal x_i
Since the terms here are regular scalars and not something like probabilities, I find it interesting that no normalization factor was needed on the RHS; we are adding the terms after scaling each of them down by a factor less than 1.
Full disclaimer: I literally pulled these formulas out of thin air because they made sense to me. There's no proof that they're correct in any statistical sense whatsoever. I validated them empirically.
Yep, empirical trial and error is often the way to go. Just wanted to ascertain I did not miss an insight. Thank you for the post again.
Bishop's book is available for free here: https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf
@ananis25 The use of alpha is very reminiscent of the recursive formula for a low-pass filter. If you work it out recursively (write out the recursive form of S_i and plug it into S_{i+1} for a few recursions), you will see that the "older" data is weighted less and less compared to the newer data.
For example, take the original equation
S_{i+1}^{-1} = (\alpha)S_i^{-1} + (1-\alpha)\beta x_i^\intercal x_i
and plug in the form for the previous iteration, i.e.,
S_i^{-1} = (\alpha)S_{i-1}^{-1} + (1-\alpha)\beta x_{i-1}^\intercal x_{i-1}
to get
S_{i+1}^{-1} = (\alpha)[(\alpha)S_{i-1}^{-1} + (1-\alpha)\beta x_{i-1}^\intercal x_{i-1}] + (1-\alpha)\beta x_i^\intercal x_i
S_{i+1}^{-1} = \alpha^2 S_{i-1}^{-1} + \alpha(1-\alpha)\beta x_{i-1}^\intercal x_{i-1} + (1-\alpha)\beta x_i^\intercal x_i
..etc.
You can see that the last term, which corresponds to the newest data point (i), has a coefficient of (1-\alpha), whereas the term before it (i-1) is weighted by (1-\alpha)\alpha. Since
(1-\alpha)\alpha <= (1-\alpha) for 0 <= \alpha <= 1,
this means the newer terms are more heavily weighted. You can continue this backward in time and the pattern continues.
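To make the decay concrete, here is a quick numeric check: under the unrolled recursion above, the observation made k steps ago contributes to S^{-1} with coefficient (1-\alpha)\alpha^k (the values in the trailing comments are what the loop prints, rounded).

# Effective weight of the observation made k steps ago, for two smoothing values
for alpha in (0.8, 0.2):
    weights = [(1 - alpha) * alpha ** k for k in range(5)]
    print(alpha, [round(w, 3) for w in weights])

# 0.8 [0.2, 0.16, 0.128, 0.102, 0.082]   -> old observations fade out slowly
# 0.2 [0.8, 0.16, 0.032, 0.006, 0.001]   -> old observations are forgotten quickly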
On another note, @MaxHalford, I tried this approach and while including an appropriate value for the \alpha terms helped the mean react to a change in the underlying data, I got nonsense for the width of the posterior. Did you look at this, or just the mean?
Hello there @scottmcdonald44. No I did not, but that's a good point. What I did is admittedly a hack, so I'm not surprised.
Do you know where I could look to learn about estimating the beta and/or the alpha from data? Thanks
@juanmigutierrez it's a hyperparameter so there's no free lunch! I believe there are ways to optimize them based on the data, but it's a much more complex process.
@juanmigutierrez
Beta can be learned, using a normal-gamma prior. Look here for example:
https://statproofbook.github.io/P/blr-prior.html
The update equations don't become significantly more complicated than those used in the blog. I also find these slides to be helpful:
https://www2.stat.duke.edu/~sayan/Sta613/2017/read/chapter_9.pdf
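For reference, my reading of those two links gives batch posterior updates roughly like the following. The notation is mine, not the blog post's: \mu_0, \Lambda_0 are the prior mean and precision of the weights, a_0, b_0 the prior shape and rate of the noise precision, X the design matrix, y the targets and n the number of observations.
\Lambda_n = \Lambda_0 + X^\intercal X
\mu_n = \Lambda_n^{-1}(\Lambda_0 \mu_0 + X^\intercal y)
a_n = a_0 + n/2
b_n = b_0 + \frac{1}{2}(y^\intercal y + \mu_0^\intercal \Lambda_0 \mu_0 - \mu_n^\intercal \Lambda_n \mu_n)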
Perfect, thanks @scottmcdonald44. It took me some time to understand since I don't have a strong background in Bayesian statistics and couldn't directly find the update equations. This update of alpha and beta works fine:
import numpy as np

class RobustBayesLinReg(BayesLinReg):

    def __init__(self, smoothing, n_features, alpha, beta, log):
        super().__init__(n_features, alpha, beta, log)
        self.smoothing = smoothing
        self.n = 0

    def learn(self, x, y):
        x, y = self.process_input_variable(x, y)
        if self.log:
            x[0] = 1

        # Update the inverse covariance matrix, with smoothing
        cov_inv = self.smoothing * self.cov_inv + (
            1 - self.smoothing
        ) * self.beta * np.outer(x, x)

        # Update the mean vector, with smoothing
        cov = np.linalg.inv(cov_inv)
        mean = cov @ (
            self.smoothing * self.cov_inv @ self.mean
            + (1 - self.smoothing) * self.beta * y * x
        )

        # Update alpha and beta (shape and rate of the noise precision)
        self.alpha = self.alpha + self.n / 2
        self.beta = self.beta + 1 / 2 * ((y - x @ self.mean) ** 2)

        self.cov_inv = cov_inv
        self.mean = mean
        return self
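A hypothetical usage loop, assuming the modified base class above (the extra log flag and the process_input_variable helper are the commenter's, and `stream` is just a placeholder for whatever yields (x, y) pairs):

model = RobustBayesLinReg(smoothing=0.8, n_features=3, alpha=1, beta=1, log=False)
for x, y in stream:
    y_pred = x @ model.mean    # point prediction with the current posterior mean
    model = model.learn(x, y)  # then update the posterior with the revealed target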
Bayesian linear regression for practitioners - Max Halford
Motivation Suppose you have an infinite stream of feature vectors $x_i$ and targets $y_i$. In this case, $i$ denotes the order in which the data arrives. If you’re doing supervised learning, then your goal is to estimate $y_i$ before it is revealed to you. In order to do so, you have a model which is composed of parameters denoted $\theta_i$. For instance, $\theta_i$ represents the feature weights when using linear regression.
https://maxhalford.github.io/blog/bayesian-linear-regression/