ACEsuit / ACEHAL


Growing the basis using BIC #23

Open casv2 opened 12 months ago

casv2 commented 12 months ago

Due to the simplicity of BIC, and from experience, it seems there's always a nice minimum. In terms of basis optimisation I think it may make sense to start from a low polynomial degree and increase it, terminating once the BIC rises past its minimum and keeping the BIC-optimal basis. Does this make sense?

I think it simplifies and speeds up the optimisation considerably and naturally adds complexity by growing the ACE basis as a function of data.
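
To make the proposal concrete, here is a minimal Python sketch of such a loop; `grow_degree_by_bic` and `build_design_matrix` are hypothetical names I'm using for illustration, not ACEHAL functions:

```python
import numpy as np

def bic(y, y_pred, n_params):
    """Bayesian information criterion for a Gaussian error model."""
    n = len(y)
    rss = float(np.sum((y - y_pred) ** 2))
    return n * np.log(rss / n) + n_params * np.log(n)

def grow_degree_by_bic(build_design_matrix, y, degrees):
    """Increase the polynomial degree until the BIC rises past its minimum,
    then return the BIC-optimal (degree, coefficients).

    `build_design_matrix(degree)` is a hypothetical hook returning the
    design matrix for an ACE basis truncated at that polynomial degree."""
    best = None  # (bic_value, degree, coefficients)
    for degree in degrees:
        A = build_design_matrix(degree)
        coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
        score = bic(y, A @ coeffs, A.shape[1])
        if best is None or score < best[0]:
            best = (score, degree, coeffs)
        else:
            break  # BIC has started increasing: keep the optimal basis so far
    return best[1], best[2]
```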

@bernstei Thoughts?

cortner commented 12 months ago

I am VERY interested in the basis growing thing and can help with that at the Julia end. Basically I could provide a function roughly like this:

newbasis = grow(oldbasis, ikeep, steps=2)

This would take the old basis, first reduce it to oldbasis[ikeep], then find all first and second neighbours (steps = 2) in the lattice of basis functions, add them to the basis, create the newbasis, and return it.
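
Purely to illustrate the interface, a rough Python sketch under the assumption that basis functions are labelled by integer multi-index tuples and that "neighbours" means points one step away along any lattice axis (the real implementation would live on the Julia side and may look quite different):

```python
def lattice_neighbours(idx):
    """All lattice points differing from `idx` by +/-1 in a single coordinate."""
    nbrs = []
    for i in range(len(idx)):
        for d in (-1, 1):
            nbr = list(idx)
            nbr[i] += d
            nbrs.append(tuple(nbr))
    return nbrs

def grow(oldbasis, ikeep, steps=2):
    """Reduce `oldbasis` (here: a list of multi-index tuples) to the kept
    entries, then add every lattice point reachable within `steps` hops."""
    kept = [oldbasis[i] for i in ikeep]
    grown = set(kept)
    frontier = set(kept)
    for _ in range(steps):
        frontier = {nbr for idx in frontier for nbr in lattice_neighbours(idx)}
        grown |= frontier
    # discard indices with negative entries, i.e. points outside the valid lattice
    return sorted(idx for idx in grown if all(c >= 0 for c in idx))
```

With that, the call above, newbasis = grow(oldbasis, ikeep, steps=2), would return the kept functions plus their first and second lattice neighbours.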

This would not be a big job for me, but I never had anybody to test it.

bernstei commented 12 months ago

I'm happy to try different optimization strategies.

wcwitt commented 12 months ago

Yuri mentioned they do something a bit like this. Underneath, I think they create the big basis all at once and then iteratively unveil it.

casv2 commented 11 months ago

@wcwitt "Unveiling iteratively" sounds an awful lot like forward stepwise regression? You start with the 'null' model and add features based on your favourite criterion (AIC, BIC, R²) and run until convergence. For us I think going the other way, backward stepwise regression, could be more effective. We typically have a fair bit of collinearity and recovering signal by going backwards appears to be more robust as you get to "decide" between features.

However, both methods appear to be quite dated and XGBoost seems to be the way to go nowadays. I'll experiment with it a bit more and see how it does. Using XGBoost we don't get to sample committees for free nor do we get the "zero mean" smoothness prior native to BRR/ARD. I think this means we'll have to rely more on our own smoothness priors. UQ predictions in XGBoost do seem a bit poor though, probably because it doesn't provide direct access to a mean and variance like in the Bayesian methods.
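
For contrast, the "direct access to a mean and variance" in the Bayesian solvers is essentially a one-liner, e.g. with scikit-learn's BayesianRidge / ARDRegression (shown only to illustrate the point, not as ACEHAL's actual fitting code):

```python
import numpy as np
from sklearn.linear_model import ARDRegression, BayesianRidge

# toy data standing in for a design matrix of ACE features
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 20))
y = A @ rng.normal(size=20) + 0.1 * rng.normal(size=200)

model = BayesianRidge().fit(A, y)               # or ARDRegression() for ARD
mean, std = model.predict(A, return_std=True)   # predictive mean and std per sample
```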

Perhaps a good middle ground is using the UQ estimates from the Bayesian solvers during training-database assembly, and then using XGBoost to fit a final model.

bernstei commented 11 months ago

Is it clear that the boosted ensemble doesn't provide good UQ?

cortner commented 11 months ago

"For us I think going the other way, backward stepwise regression, could be more effective"

I agree with that in principle. But in the past we found occasionally that "unveiling" in a physically inspired way led to smoother fits.

But I'm interested in growing rather than unveiling, for two reasons: (1) performance, especially for nonlinear models and iterative solvers, and (2) because I want to learn the best way to truncate, which could mean growing far more deeply into some directions in basis space than others; that would not be covered by our sparse selection.

In practice right now I think shrinkage may well be the right thing to do.

casv2 commented 11 months ago

@bernstei No, I meant that UQ doesn't come naturally to XGBoost, unlike in the Bayesian methods (there is no analytical description of the uncertainty). There are some extensions to XGBoost that provide UQ estimates, typically using some uncertainty calibration first. This may work very well for us; I guess we'll have to try it out.
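
One such route is quantile regression with a boosted model; a hedged sketch using scikit-learn's GradientBoostingRegressor as a stand-in for XGBoost (XGBoost has an analogous quantile objective, but I'm not relying on its exact parameter names here):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# toy data standing in for a design matrix of ACE features
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 20))
y = A @ rng.normal(size=20) + 0.1 * rng.normal(size=200)

# fit lower / median / upper quantile models to get a crude prediction interval
models = {q: GradientBoostingRegressor(loss="quantile", alpha=q).fit(A, y)
          for q in (0.05, 0.5, 0.95)}
lower, median, upper = (models[q].predict(A) for q in (0.05, 0.5, 0.95))
# the interval widths (upper - lower) would still need calibration on held-out errors
```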

But then there is also the occasional source claiming that ARD outperforms XGBoost in terms of raw (test) accuracy. It's likely that you can always find counterexamples depending on the database and how long you're willing to fiddle.

@cortner I agree that growing sounds much more appealing, especially considering nonlinear fitting costs. I do wonder how deeply we get to grow the (current) basis in practice though. We're already exploring fairly high polynomial degrees (14-16) using our current "backward" methods and I'm not sure there's a lot to gain by going higher. I think that 'message-passing'-like features (hopping, basically) actually carry much more "relevance" in defining the PES.

cortner commented 11 months ago

Were you in any of the GRACE discussions? The feature selection is much more complex there and the same principle applies. I'm thinking beyond our simple linear models.