Great post, thanks!
👍
Small update: @raphaelsty has implemented the above code in a little library called abayes. He's also included an auto-regressive variant. The latter makes use of the fact that Bayesian linear regression can be fitted in an online manner.
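For readers who haven't seen the post's update step, here is a minimal sketch of what one online update looks like. The function and argument names are mine; it only assumes a posterior mean vector and an inverse covariance matrix, which is how the code discussed further down this thread stores the posterior.

import numpy as np

# Minimal sketch of one online Bayesian linear regression update,
# assuming `beta` is the (known) noise precision.
def learn_one(mean, cov_inv, x, y, beta):
    # Posterior precision: add the scaled outer product of the new feature vector
    new_cov_inv = cov_inv + beta * np.outer(x, x)
    # Posterior mean: blend the previous precision-weighted mean with the new target
    new_cov = np.linalg.inv(new_cov_inv)
    new_mean = new_cov @ (cov_inv @ mean + beta * y * x)
    return new_mean, new_cov_inv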
Hello Max,
That was a great read! I now have a good understanding of Bayesian linear regression. I am still wondering, though, what the point of defining self.alpha is, as it turns out it is not used anywhere.
Hello @kosi96, happy to help out! The alpha parameter is used to instantiate the cov_inv variable.
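Concretely, alpha seeds the prior precision of the weights, along the lines of the snippet below (a sketch of the idea, not a verbatim quote of the post):

import numpy as np

# alpha scales the prior (inverse) covariance of the weights
n_features = 3
alpha = 1.0
mean = np.zeros(n_features)                 # prior mean of the weights
cov_inv = alpha * np.identity(n_features)   # prior precision, seeded by alpha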
I am trying to tweak your code so I can predict n time steps ahead. Is that even possible? My feature vector x does include lag variables of the target y, although there are still other features that are not being predicted.
Not sure 100% what you mean @kosi96, but I would recommend looking at what @raphaelsty has done in abayes if you haven't already.
Thank you for pointing me in the right direction (abayes). However, it forecasts n time steps into the future, and those future predictions are based on the period of a single signal. I would like to tweak it so it forecasts based on the history of n signals and without any period (stock markets usually don't have any periodicity).
That sounds very interesting and relevant. Alas I don't see any simple way to adapt the above code, apart from using a regressor chain. I'm sure there's some way to formulate a multi-output variant of Bayesian linear regression, I'm just not aware of it.
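For what it's worth, here is a hedged sketch of what a regressor chain could look like on top of the post's BayesLinReg. The class name, the horizon parameter, the constructor signature I assume for BayesLinReg, and the use of model.mean for point predictions are all my own assumptions, not something from the post or from abayes.

import numpy as np

class BayesLinRegChain:
    """Sketch of a regressor chain: one BayesLinReg per forecast horizon.

    The model for horizon h sees the original features plus the point
    predictions made for horizons 1..h-1. This is not a true multi-output
    Bayesian regression, just h chained single-output models.
    """

    def __init__(self, n_features, horizon, alpha=1.0, beta=1.0):
        self.horizon = horizon
        self.models = [
            BayesLinReg(n_features=n_features + h, alpha=alpha, beta=beta)
            for h in range(horizon)
        ]

    def predict(self, x):
        preds = []
        for model in self.models:
            x_h = np.concatenate([x, preds])
            preds.append(x_h @ model.mean)  # point prediction (posterior mean)
        return np.array(preds)

    def learn(self, x, ys):
        # ys holds the next `horizon` observed targets
        preds = []
        for model, y in zip(self.models, ys):
            x_h = np.concatenate([x, preds])
            model.learn(x_h, y)
            preds.append(x_h @ model.mean)
        return self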
Great post, thank you very much!
Thank you for writing this, it helped me get started with online regression. The variant with a smoothing factor is bugging me a bit.
The update equation for the covariance matrix goes from:
S_{i+1} = (S_i^{-1} + \beta x_i^\intercal x_i)^{-1}
to
S_{i+1} = (\gamma S_i^{-1} + (1 - \gamma)\beta x_i^\intercal x_i)^{-1}
With a smoothing factor of 0.8, the current covariance matrix now gets a weight 4x that of the contribution from the new observation, instead of the even split we had earlier without a smoothing factor. This feels odd since the smoothing factor was introduced to counter concept drift; wouldn't that involve weighing the new observations more?
@ananis25 indeed, you can just set alpha to 0.2 to get the opposite behaviour. You can also switch the \gamma and 1 - \gamma if it bugs you that much. Does that make sense?
Oh, I see now how the hyperparameter alpha can also affect this; thank you! The bit about swapping the \gamma and 1 - \gamma coefficients also makes sense.
I apologize if it came across as complaining, I was just curious about the intuition that led you to this. The update equation for the cov_inv term was a linear recurrence relation,
S_{i+1}^{-1} = S_i^{-1} + \beta x_i^\intercal x_i
to which we added coefficients parameterized by \gamma,
S_{i+1}^{-1} = \gamma S_i^{-1} + (1 - \gamma)\beta x_i^\intercal x_i
Since the terms here are regular scalars and not something like probabilities, I find it interesting that no normalization factor was needed on the RHS; we are adding the terms after scaling each of them down by a factor less than 1.
Full disclaimer: I literally pulled these formulas out of thin air because they made sense to me. There's no proof that they're correct in any statistical sense whatsoever. I validated them empirically.
Yep, empirical trial and error is often the way to go. Just wanted to ascertain I did not miss an insight. Thank you for the post again.
Bishop's book is available for free here: https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf
@ananis25 The use of alpha is very reminiscent of the recursive formula for a low-pass filter. If you work it out recursively (write out the recursive form of S_i and plug it into S_{i+1} for a few recursions), you will see that the "older" data is weighted less and less compared to the newer data.
For example, take the original equation
S_{i+1}^{-1} = (\alpha)S_i^{-1} + (1-\alpha)\beta x_i^\intercal x_i
and plug in the form for the previous iteration, i.e.,
S_i^{-1} = (\alpha)S_{i-1}^{-1} + (1-\alpha)\beta x_{i-1}^\intercal x_{i-1}
to get
S_{i+1}^{-1} = (\alpha)[(\alpha)S_{i-1}^{-1} + (1-\alpha)\beta x_{i-1}^\intercal x_{i-1}] + (1-\alpha)\beta x_i^\intercal x_i
S_{i+1}^{-1} = \alpha^2 S_{i-1}^{-1} + \alpha(1-\alpha)\beta x_{i-1}^\intercal x_{i-1} + (1-\alpha)\beta x_i^\intercal x_i
..etc.
You can see that the last term, which corresponds to the newest data point (i), has a coefficient of (1-\alpha), whereas the term before it (i-1) is weighted by (1-\alpha)\alpha. Since
(1-\alpha)\alpha <= (1-\alpha) for 0 <= \alpha <= 1,
this means the newer terms are more heavily weighted. You can continue this backward in time and the pattern continues.
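To make the decay concrete, here is a quick numeric check: under the unrolled recursion above, the observation made k steps ago contributes to S^{-1} with coefficient (1-\alpha)\alpha^k (the values in the trailing comments are what the loop prints, rounded).

# Effective weight of the observation made k steps ago, for two smoothing values
for alpha in (0.8, 0.2):
    weights = [(1 - alpha) * alpha ** k for k in range(5)]
    print(alpha, [round(w, 3) for w in weights])

# 0.8 [0.2, 0.16, 0.128, 0.102, 0.082]   -> old observations fade out slowly
# 0.2 [0.8, 0.16, 0.032, 0.006, 0.001]   -> old observations are forgotten quickly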
On another note, @MaxHalford, I tried this approach and while including an appropriate value for the \alpha terms helped the mean react to a change in the underlying data, I got nonsense for the width of the posterior. Did you look at this, or just the mean?
Hello there @scottmcdonald44. No I did not, but that's a good point. What I did is admittedly a hack, so I'm not surprised.
Do you know where I could look to learn about estimating the beta and/or the alpha from data? Thanks
@juanmigutierrez it's a hyperparameter so there's no free lunch! I believe there are ways to optimize them based on the data, but it's a much more complex process.
@juanmigutierrez
Beta can be learned, using a normal-gamma prior. Look here for example:
https://statproofbook.github.io/P/blr-prior.html
The update equations don't become significantly more complicated than those used in the blog. I also find these slides to be helpful:
https://www2.stat.duke.edu/~sayan/Sta613/2017/read/chapter_9.pdf
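For reference, my reading of those two links gives batch posterior updates roughly like the following. The notation is mine, not the blog post's: \mu_0, \Lambda_0 are the prior mean and precision of the weights, a_0, b_0 the prior shape and rate of the noise precision, X the design matrix, y the targets and n the number of observations.
\Lambda_n = \Lambda_0 + X^\intercal X
\mu_n = \Lambda_n^{-1}(\Lambda_0 \mu_0 + X^\intercal y)
a_n = a_0 + n/2
b_n = b_0 + \frac{1}{2}(y^\intercal y + \mu_0^\intercal \Lambda_0 \mu_0 - \mu_n^\intercal \Lambda_n \mu_n)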
Perfect, thanks @scottmcdonald44. It took me some time to understand since I don't have a strong background in Bayesian statistics and couldn't directly find the update equations. This update of alpha and beta works fine:
import numpy as np

class RobustBayesLinReg(BayesLinReg):

    def __init__(self, smoothing, n_features, alpha, beta, log):
        super().__init__(n_features, alpha, beta, log)
        self.smoothing = smoothing
        self.n = 0

    def learn(self, x, y):
        x, y = self.process_input_variable(x, y)
        if self.log:
            x[0] = 1

        # Update the inverse covariance matrix, with smoothing
        cov_inv = self.smoothing * self.cov_inv + (
            1 - self.smoothing
        ) * self.beta * np.outer(x, x)

        # Update the mean vector, with smoothing
        cov = np.linalg.inv(cov_inv)
        mean = cov @ (
            self.smoothing * self.cov_inv @ self.mean
            + (1 - self.smoothing) * self.beta * y * x
        )

        # Update alpha and beta (shape and rate of the noise precision)
        self.alpha = self.alpha + self.n / 2
        self.beta = self.beta + 1 / 2 * ((y - x @ self.mean) ** 2)

        self.cov_inv = cov_inv
        self.mean = mean
        return self
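A hypothetical usage loop, assuming the modified base class above (the extra log flag and the process_input_variable helper are the commenter's, and `stream` is just a placeholder for whatever yields (x, y) pairs):

model = RobustBayesLinReg(smoothing=0.8, n_features=3, alpha=1, beta=1, log=False)
for x, y in stream:
    y_pred = x @ model.mean    # point prediction with the current posterior mean
    model = model.learn(x, y)  # then update the posterior with the revealed target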
Bayesian linear regression for practitioners - Max Halford
Motivation Suppose you have an infinite stream of feature vectors $x_i$ and targets $y_i$. In this case, $i$ denotes the order in which the data arrives. If you’re doing supervised learning, then your goal is to estimate $y_i$ before it is revealed to you. In order to do so, you have a model which is composed of parameters denoted $\theta_i$. For instance, $\theta_i$ represents the feature weights when using linear regression.
https://maxhalford.github.io/blog/bayesian-linear-regression/