Closed mikkelbue closed 4 months ago
Thank you so much for this - I've been interested in getting Hirsh et al.'s spike slab priors into pysindy for some time! And selfishly, I've wanted a reason to start interacting with jax. We implemented a bit of kernel smoothing from a paper of @jfelipeosorio in derivative, but we shied away from anything that required jax.
Giving this a quick look now (but not yet a full review), here are some more engineering, less mathematical considerations:

- Import: add a `[project.optional-dependencies]` section for the dependencies.
- Example: (publish it with `python publish_notebook.py XX_Bayesian_SINDy`)
- Code: what is the relationship between `SBR` and `SparseBayesianRegression`? I can tell the latter is used in the former, but the former being just the abbreviation of the latter doesn't give any indication of the role that these objects play. Maybe rename `SBR` -> `SBROptimizer`, but you're probably in the best place to suggest names.
- Use variable names of the form `meaning_letter`, e.g. if someone's doing L-1 regression, instead of naming a variable `lambda`, it would be `reg_coef_lambda`. It helps people who have read the paper but not recently (me in this example) to still be able to troubleshoot errors.
- Ok I lied, some math: think of `EnsembleOptimizer`, which subsamples the data and/or library, wraps any `BaseOptimizer`, and takes the mean/median of resulting models - should it be able to wrap this optimizer? My feeling is no, which indicates that our typing system needs to expand so as to prohibit this case.

Hi @Jacob-Stevens-Haas
Thanks for your comments and suggestions. I'm happy if this is useful!
I have made a few changes:

- Added `numpyro` as an optional dependency.
- Renamed variables: `tau_0 -> sparsity_coef_tau0`, `nu -> slab_shape_nu`, `s -> slab_shape_s`, `lamb -> noise_hyper_lambda`.
- Merged `SparseBayesianRegression` into `SBR`. I set it up like that because I thought it was neat to have a general-purpose sparse Bayesian regressor, but it's simpler to just make everything part of the optimizer. I'm not precious about naming, just thought `SBR` would fit into the existing convention.

I'll have a look at your other points when I have more time.
Also, just as a point of order:
This implementation uses the regularised horseshoe prior, not (yet) the spike and slab prior. The regularised horseshoe is computationally more convenient because it is differentiable, so you can just plug the whole shebang into a NUTS sampler.
The spike and slab prior AFAIK requires a compound step sampler or some other clever way of making categorical moves. It's something I can look into, but I am not currently sure what is the best way to achieve this in numpyro
.
Oh, sorry, yes I remember now. I meant to say horseshoe prior. I didn't mean to point you in a different direction.
@Jacob-Stevens-Haas quick update. 2 new commits:
Still looking to update the notebook with the stuff you requested.
@Jacob-Stevens-Haas Okay, I think this addresses all of your points. Please let me know if anything can be improved.
I'm not sure what to add in terms of tests. The notebook has a `testing` mode now so that should cover some of it.
Thank you! I'll go ahead and review it later this week. Before then, I'll figure out how to get it so your CI runs don't need manual approval. Sorry to keep you waiting.
For tests,

- see `test_optimizers` for tests that are parametrized to all optimizers
- add any `ValueError`s (bad arguments) you want to check for.

Ok I added you as a contributor with "write" permission to the repo. You may need to click an invite link in an email, but that should enable CI to run automatically on your pushes.
@Jacob-Stevens-Haas Thank you!
The tests are failing when building the docs and I can't seem to figure out why. This is the trace:
sphinx.errors.SphinxWarning: autodoc: failed to import module 'sbr' from module 'pysindy.optimizers'; the following exception was raised:
No module named 'jax'
Warning, treated as error:
autodoc: failed to import module 'sbr' from module 'pysindy.optimizers'; the following exception was raised:
No module named 'jax'
I have already included `jax` as an optional dependency in the `numpyro` group, so I don't understand how this is different from the way `cvxpy` is set up for `SINDyPI`. But I must have missed something.
The github workflow file creates a new environment and installs the current package, including all specified optional dependencies. You'd need to change this line to add your optional dependencies to the list.
Attention: 4 lines in your changes are missing coverage. Please review.
Comparison is base (9c73768) 94.40% compared to head (2060875) 94.40%. Report is 1 commit behind head on master.
Files | Patch % | Lines |
---|---|---|
pysindy/__init__.py | 50.00% | 2 Missing :warning: |
pysindy/optimizers/__init__.py | 50.00% | 2 Missing :warning: |
:umbrella: View full report in Codecov by Sentry.
Ok I added you as a contributor with "write" permission to the repo. You may need to click an invite link in an email, but that should enable CI to run automatically on your pushes.
Ah, I just found the invitation. It was on the front page of this repo :thinking: .. thank you!
Hey sorry for slow responses on my end - I'm responsible for #451 and #459, which are both pretty complex. I'll finish #451 in a few days and then I'll give this a thorough review.
Thanks for all your comments. I responded to one just now, while I had the thought. I'll address the remaining sometime next week.
Hi @Jacob-Stevens-Haas
I'm still working on this, just a little stuck on the next step. I'd like to include an example where the algorithm is used with `discrete_time`, as this is the case where the Bayesian error model relates directly to the states, as discussed above.
But I am struggling to find a good example. I haven't worked much with discrete time equations before. It's easy enough to take a differential equation and put it into difference equation form. But then you are essentially just doing some Euler, and the examples I have tried so far tend to diverge very fast, unless I use very short timesteps. But that is maybe fine?
Please have a look at this example and let me know if it makes sense. It's just a harmonic oscillator but rearranged as a difference equation with first order Euler integration.
Thanks!
Sorry I didn't respond - I think it's a good example. The only other common discrete system I know of is the logistic map from example 1, but either is fine. That said, I just realized that technically `discrete_time=True` might have the same issue as continuous; wouldn't the integral term for $\hat x_j$ in eq 3.3 be replaced with a sum term across timepoints until $t_i$?
Again though, I still think it's valuable in both cases, it just isn't the "full-Bayesian" form in the paper. The difference between the PR and the random variable in eq 3.3 is that $X$ is now also conditioned on the calculated derivatives from the first step of SINDy. While in reality those derivatives depend on $X$, we just treat them as independent measurements in the second step.
If that makes sense and checks out, I'm not sure we need an example specifically for discrete.
I'll take a look at the tests tomorrow and try to write some test for SBR/see which parametrized tests it should be added to.
Hi @Jacob-Stevens-Haas
No worries at all. I am aware that I have gone ahead and created extra work for you - without asking first :laughing:
That said, I just realized that technically discrete_time=True might have the same issue as continuous; wouldn't the integral term for $\hat x_j$ in eq 3.3 be replaced with a sum term across timepoints until $t_i$?
If the system is observed at all timesteps, then you shouldn't have to worry about that summation. The terms of the sum are baked into the right-hand side of the `discrete_time` difference-equation formulation. So you march the system forward with $x_{t+1} = f(x_t, t)$, with whichever terms you are adding together already part of $f(x_t, t)$. As far as I understand, anyway.
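To make that concrete, here's a minimal sketch of the harmonic-oscillator difference equation from the example (my own toy choices of `dt` and `omega`, not the PR's): the whole right-hand side is $f(x_t)$, and marching forward needs no explicit summation. It also shows why the Euler examples diverge unless the timestep is short: forward Euler injects a little energy at every step.

```python
import numpy as np

def f(state, dt=0.001, omega=1.0):
    # harmonic oscillator x'' = -omega^2 x as a first-order system,
    # discretized with forward Euler: x_{t+1} = f(x_t)
    x, v = state
    return np.array([x + dt * v, v - dt * omega**2 * x])

state = np.array([1.0, 0.0])
traj = [state]
for _ in range(5000):
    state = f(state)
    traj.append(state)
traj = np.array(traj)

# each Euler step multiplies the "energy" x^2 + v^2 by exactly
# (1 + dt^2 * omega^2), so it grows slowly for small dt and
# blows up fast for large dt
```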
Apparently, I am a complete masochist, so I have gone ahead and made the Hirsh formulation work with `diffrax` as an integrator. This is on a separate branch if you want to have a look. It's still pretty sketchy but it does work.
Some challenges:
- The `feature_library` classes do not handle `jax.numpy.array`s, so I have implemented an unsafe version of the `PolynomialFeatures` for use with the integrator.

This example now uses that formulation. If I can get this new branch polished up a bit, I would consider splitting the optimizer into two: `SBR` and `LiteSBR`, to allow using both, since they both have pros and cons.
What are your thoughts? This would take a little more work to get over the finish line, and the solution is a little less clean than the first one, but it might be useful despite its limitations.
Quick note: I have opted to make the existing "approximate" method available through a flag `exact=False` in the `__init__` method of `SBR`, rather than a separate `LiteSBR` optimizer. All this is still on a separate branch.
I think this is a pretty robust implementation at this point. The only glaring issues are:

- The `PIDController` from `diffrax` does not seem to play well with `numpyro.infer.MCMC`, and an `XlaRuntimeError` is raised every time I attempt to use it. I can't figure out the reason for this issue.
- `PolynomialLibrary` and whichever feature library we choose to "port" to the `ProxyFeatureLibrary`.

:sweat_smile: Ok this one's a thinker. Long story short, let's focus on this PR first. Long story still too short but a bit longer:
One of the thorniest issues I've dealt with in pysindy is the multiplicity of classes that involve repeating code to make a particular use case or method work, often signaled by `if isinstance(...` calls. It makes it really hard to know how to combine methods in code or which methods are mathematically compatible, and subtle changes to the internal API cause breakages in unexpected locations. A better approach is to create the type system that works and enforces mathematical sensibility. It's better to start with discussing, as an Issue, what about the feature library and optimizer API is missing to enable this method (germane: #407, #351).
E.g. are "feature library" and "optimizer" even the right abstractions to use, or is another layer implicit? Off the top of my head, this seems like it could be solved with a mixin class for all feature libraries (except maybe WeakPDELibrary, another example of a class not fitting into the existing abstractions but being forced to). But a mixin may be the wrong solution. Maybe it also requires cooperative multiple inheritance for both optimizers and feature libraries, and the `SINDy` class determines which flavor of feature library to inherit (default or diffrax) at runtime?
Edit: I realize I effed up on the last equation, this has a mistake. My main confusion is why we don't still have to compose k timesteps for the kth term in the discrete case. I've gotta come back to this another day.
As for the discrete stuff, I was wrong about the sum - it's composition! It took a while to figure out, and I'm not 100% sure I have it right. But the key for me was figuring out the subtly confusing notation in eq 3.3: Way above it, Hirsh et al. introduce $\mathbf{X} = [y_1, ..., y_n]^T$ as a measurement. I missed that and had stupidly assumed that a scalar element of $\mathbf{X}$ was $x_{ij}$, but it's actually $y_{ij}$ :sob:
But anyways, to illustrate my point, here's Eq 3.3, the data likelihood (ignoring $j$ and $\phi$ WLOG). I need to repurpose $x$, so I'm going to use $\mathbf{Z}$ and $z_i$ to refer to measurements and $\mathbf{X}$ and $x_i$ to refer to the underlying states. $$p(z_i | \mathbf{\Xi}, x_0) = \frac{1}{...} ||z_i - \hat x_i||^2 = \frac{1}{...}||z_i - x_0 - \int_0^{t_i} \mathbf{\Xi} \Theta(x(t')) dt'||^2$$ $$\prod_i p(z_i | \mathbf{\Xi}, x_0) = p(\mathbf{Z} | \mathbf{\Xi}, x_0)$$
For modeling a discrete system, it would be:
$$p(z_i | \mathbf{\Xi}, x_0) = \frac{1}{...}||z_i - \hat x_i||^2 = \frac{1}{...}||z_i - \mathbf{\Xi}\Theta(...\mathbf{\Xi}\Theta(x_0)...)||^2$$
with the same product expression holding. I think we've both seen that the related line in the SBR code, paraphrased here:
numpyro.sample("obs", Normal(jnp.dot(x, beta.T), sigma), obs=y)
instead represents the standard SINDy data likelihood (with `beta` = $\mathbf{\Xi}$, `x` = $\Theta(Z)=[..., \Theta(z_i),...]$, `y` = $\dot Z =[..., \dot z_i,...]$),
$$p(\dot z_i|\mathbf{\Xi}, x_i=z_i) = \frac{1}{\dots}||\dot z_i - \mathbf{\Xi} \Theta(z_i)||^2$$
$$\prod_i p(\dot z_i|\mathbf{\Xi}, x_i=z_i) = p(\mathbf{\dot Z} | \mathbf{\Xi}, \mathbf{X}=\mathbf{Z})$$
where we're treating the $\dot z$ calculated in the differentiation step as measurements (and $z$, smoothed in that same step, as measurements replacing the original measurements). For discrete cases, there's no smoothing/differentiation step, and IDK whether there's a probability expression we can use: $$p(z_i|\mathbf{\Xi}, x_{i-1} = z_{i-1}) = \frac{1}{\dots}||z_i - \mathbf{\Xi} \Theta(z_{i-1})||^2$$ $$\prod_i p(z_i|\mathbf{\Xi}, x_{i-1} = z_{i-1}) = ???$$
So... sorry if you've already addressed this in the other branch, or if I've made a mistake here. But I needed to work through it all to understand it.
So TL;DR from the previous comment is that I don't think any change is needed here on the PR to address discrete vs continuous. Hirsh et al. only deal with continuous, but I don't think the analogous data likelihood for discrete cases is what we use either.
First testing thoughts:

- Add SBR to tests of multiple optimizers: `test_supports_multiple_targets`, `test_fit`, `test_not_fitted`, `test_complexity_not_fitted`, `test_illegal_unbias`, `test_normalize_columns`, `test_pickle`
- Something that directly tests the horseshoe prior (`_sample_reg_horseshoe`) in a very simple case
- Something that tests overall SBR in a very simple case

@Jacob-Stevens-Haas thanks for all your comments and thoughts!
One of the thorniest issues I've dealt with in pysindy is the multiplicity of classes that involve repeating code to make a particular use case or method work, often signaled by if isinstance(... calls.
I completely agree that this type of design pattern does not spark joy :smiley:. The branch I have linked to is an attempt to make the integral approach work without changing anything in the core library. If I were to do some refactoring, I would probably equip the feature libraries with a minimal `_transform` method that would implement the core operation of the transform but without any checks, so that `_transform` could work for both `AxesArray` and `jax.numpy.array`, and then call that method directly from the `diffrax` integrator as well as from the `transform` method of the feature library. But that requires some careful engineering... I am happy to contribute to the discussion of how `pysindy` could be refactored to better support e.g. Bayesian SINDy.
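A rough sketch of that idea, with an illustrative stand-in class (not the real pysindy `PolynomialLibrary`), passing in the array namespace so the check-free core works for both numpy and jax arrays:

```python
import numpy as np

class PolynomialLibrary:
    """Illustrative stand-in for a pysindy feature library."""

    def _transform(self, x, xp=np):
        # core operation only -- no validation, no AxesArray handling --
        # so a diffrax integrator could call it with xp=jax.numpy on
        # traced jax arrays
        return xp.concatenate([xp.ones_like(x[:, :1]), x, x**2], axis=1)

    def transform(self, x):
        x = np.asarray(x)  # checks and axis bookkeeping would live here
        return self._transform(x)

lib = PolynomialLibrary()
features = lib.transform(np.array([[1.0, 2.0], [3.0, 4.0]]))
# columns: bias, x1, x2, x1^2, x2^2
```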
But anyways, to illustrate my point, here's Eq 3.3, ... But I needed to work through it all to understand it.
You are absolutely correct! I think I got confused because in the discrete case the left-hand-side $x_i$ are members of the data space, $x_i \in \mathcal D$, while that is not true for $\dot x_i$. So the inferred $\sigma$ has the correct unit for the discrete case, even when not composing multiple steps. If the steps are not composed (as in the current implementation), the underlying assumption is that the steps are independent, while when composing (or integrating), the assumption is that each step must be consistent with all the previous steps. This is a stronger constraint, which is also reflected in the more concentrated posteriors in the example that showcases both approaches.
we're treating the $\dot z$ calculated in the differentiation step as measurements (and $z_i$, smoothed in that same step, as measurements replacing the original measurements).
I wasn't aware that the differentiation method can also smooth the states! This is something to consider. I get why that is useful for "frequentist" SINDy, but it would change the character of the observation noise, which would be reflected in a different posterior distribution. It's not done for the default `FiniteDifference`, is it?
I agree we should go ahead and get this merged. My concern with the non-discrete case (as mentioned before) is that $\dot x_i \notin \mathcal D$, so the inferred $\sigma$ is not clearly interpretable. But I think that is okay as long as it's clearly explained in the docstring. Having this wrapped up will also make the next steps clearer. I will have a look at writing the appropriate tests some time this week.
Ah, that's a good point about $\sigma$ and $\mathcal D$. Just goes to show how deep it's possible to go in a paper.
@Jacob-Stevens-Haas
Starting to work on the tests. The GitHub workflow is giving me a strange error though:
sphinx.errors.SphinxWarning: /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pysindy/optimizers/sbr.py:docstring of pysindy.optimizers.sbr.SBR:7:Unknown target name: "mcmc".
Warning, treated as error:
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pysindy/optimizers/sbr.py:docstring of pysindy.optimizers.sbr.SBR:7:Unknown target name: "mcmc".
I think it's because it's interpreting the `mcmc_` as a reference but I don't know what to do about it.
I think it's because it's interpreting the `mcmc_` as a reference but I don't know what to do about it.
Never mind, I figured it out.
Test writing progress:

Add SBR to tests of multiple optimizers

SBR-specific `test_pickle`

This is now added by refactoring the existing `test_pickle` using `@pytest.mark.parametrize`.
Something that directly tests horseshoe prior (_sample_reg_horseshoe) in a very simple case
This is slightly problematic, since if called outside a model context, the `numpyro.sample` functions would need to take an additional argument `key` to produce a sample. I can adapt the function to optionally take this, but that feels a little roundabout just to get a unit test set up. What do you think?
Something that tests overall SBR in a very simple case
This is now done as `test_sbr_fit`. It just asserts whether `SBR` has an `mcmc_` attribute after fitting and whether it contains samples of the correct variables.
I can adapt the function to optionally take this, but that feels a little roundabout just to get a unit test set up. What do you think?
No, that makes sense. I always wonder about what the right way to test things is (and thereby, understand them), but having read more about the reg-horseshoe, I agree the roundabout work is not necessary here. I read a bit more about it, so I'll add some more info back into the docstrings in a suggested edit.
This is now done as
test_sbr_fit
. It just asserts whether SBR has an mcmc_ attribute after fitting and whether it contains samples of the correct variables.
I meant even simpler setup, but more rigorous test, a la test_sparse_subset
. Try this one below:
import numpy as np
from pysindy.optimizers import SBR

def test_sbr_fit():
    x = np.eye(2)
    y = np.array([[1], [1e-4]])
    expected = np.array([[1], [0]])
    result = SBR(num_warmup=10, num_samples=10).fit(x, y).coef_
    np.testing.assert_allclose(result, expected, atol=1e-1)
This is a good example of what it takes to get SBR to succeed in the simplest of cases, and I'm looking into why it's failing.
~Ok, I've got it but I have to run, will post it a bit later. `x` and `y` in my code need more samples - even dropping the noise prior near zero doesn't do it.~
There were two problems with my test, and I had only solved one. The first problem was recovering a reasonable answer to a simple regression problem, and that required more samples. The second problem was trying to tune the parameters to demonstrate some shrinkage. But I can't figure out a way to tune this regression to eliminate a parameter (that's why I spent a lot of time on understanding the init parameters this past week). The best I can get is fitting coefficients with no shrinkage:
def test_sbr_accurate():
    # It's really hard to tune SBR to get desired shrinkage
    # This just tests that SBR fits "close" to unregularized regression
    x = np.tile(np.eye(2), 4).reshape((-1, 2))
    y = np.tile([[1], [1e-1]], 4).reshape((-1, 1))
    opt = SBR(num_warmup=50, num_samples=50).fit(x, y)
    result = opt.coef_
    unregularized = np.array([[1, 1e-1]])
    np.testing.assert_allclose(result, unregularized, atol=1e-3)
    assert hasattr(opt, "mcmc_")
That's probably good enough for now. In an ideal world, we could have a test that kind of recreated the canonical "horseshoe" in Figure 2 of Handling Sparsity via the Horseshoe or Figure 1 of Sparsity Information and Regularization in the Horseshoe and Other Shrinkage Priors.
So that's all for testing. For the notebooks,

1) Combine them (just one notebook file) as example.ipynb
2) Save as a python file
3) `cd examples`, then `python publish_notebook.py 19_bayesian_sindy`.
4) Run `pytest --durations=10 --external_notebook 19_bayesian_sindy` to see if it tests (or you can run `pytest --durations=10 -m "notebook" -k test_notebook` and see if it gets picked up)
5) Set constants that determine how long the test takes with `if __name__ == "testing": <short value> else: <full value>` in order to speed up the test. `num_samples` and `num_warmup` are good candidates, as is anything that controls the length of the simulation.
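Step 5 in a hedged sketch (the constant names and values here are illustrative, not the notebook's actual settings): the notebook's `.py` export branches on `__name__`, which the notebook test harness sets to `"testing"`.

```python
# when the test harness executes the script with __name__ == "testing",
# pick cheap constants; a normal run gets the full-fidelity values
if __name__ == "testing":
    num_warmup, num_samples, n_timesteps = 10, 10, 100
else:
    num_warmup, num_samples, n_timesteps = 500, 500, 5000
```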
@Jacob-Stevens-Haas Thank you for keeping up with this.
The first problem was recovering a reasonable answer to a simple regression problem, and that required more samples.
This is why I avoided comparing to "theoretical" values for the test I originally wrote. MCMC will always require some samples to get a sensible answer, and I suspect that this test with 50 warmup and 50 samples is not yet a converged MCMC. But if this works, that is great.
But I can't figure out a way to tune this regression to eliminate a parameter (that's why I spent a lot of time on understanding the init parameters this past week).
Shrinkage does work as expected when running more elaborate examples, but parameters will hardly ever be exactly 0, since the `coef_` are the means of the MCMC samples. We could set a clipping threshold as in e.g. `STLSQ`, if we wanted to zap the small coefficients completely.
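That clipping idea could look something like the sketch below; the `threshold` name and value are illustrative, not an existing `SBR` argument.

```python
import numpy as np

def clip_coefficients(coef, threshold=0.1):
    # zero out posterior-mean coefficients below a hard threshold,
    # analogous to STLSQ's thresholding step
    coef = np.array(coef, dtype=float)
    coef[np.abs(coef) < threshold] = 0.0
    return coef

clipped = clip_coefficients(np.array([[1.02, 0.03], [-0.5, 0.07]]))
# -> array([[ 1.02,  0.  ], [-0.5 ,  0.  ]])
```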
I will have a look at converting the example to a script.
@Jacob-Stevens-Haas
So that's all for testing. For the notebooks, ...
All done. However `pytest --durations=10 --external_notebook 19_bayesian_sindy` fails with `pytest: error: unrecognized arguments: --external_notebook` and I can't find any documentation for that argument.
LGTM, and thank you for your dedication on this! I'll look into that pytest issue separately - pytest allows you to create command line arguments in conftest.py, which is what I did for `--external-notebook`. It existed before the split to `example.py`/`example.ipynb`, so probably not necessary anymore.
Looks like example 19 runs in 8.5 seconds under test which is great :)
Hi all!
This is a (possible) implementation of Bayesian SINDy. It uses `numpyro` under the hood to drive NUTS sampling.

- A new optimizer, `SBR` (for Sparse Bayesian Regression), is added to the `pysindy.optimizers` module.
- Arguments to the `SBR` optimizer include hyperparameters to the statistical model, i.e. `tau_0`, `nu` and `s` for the hyperprior and `lamb` for the exponential prior over the measurement noise.
- Sampling is driven by a `numpyro.infer.MCMC` object, and `self.coef_` is set to the mean of the posterior samples.
- The posterior samples can be examined through the `SINDy.optimizer.mcmc` object, using e.g. `arviz`.

This PR is a work in progress in that tests and docs are missing. But I thought it would be better to put it in early in the process, in case anyone has questions, ideas or comments with regards to the implementation. Please let me know if this is a bad idea.
Thanks for considering this PR.
Thanks for considering this PR.
TODO: