which method is more reliable?

franpoz commented 3 years ago

Hi Dan, amazing work here. Congrats.
While running some tests with the latest version of SPOCK, which incorporates the FeatureClassifier, DeepRegressor and NbodyRegressor, I found some results that I'm not sure how to interpret.

For example, for a 4-planet system I got:

feature_model.predict_stable(sim)
>>0.12605628

deep_model.predict_stable(sim)
>>0.67

nbody.predict_stable(sim,tmax=1e6*sim.particles[1].P)
>>1

From the FeatureClassifier, I would say that the system is likely unstable, but the DeepRegressor and NbodyRegressor suggest that it is stable. Any suggestions?

thanks!

dtamayo commented 3 years ago

Hi Fran,

Thanks for posting this question--I bet many others will have the same one! One great thing about the DeepRegressor is that it also gives you estimates of the model uncertainty. The way to look at this is to get the actual samples:

prob_stability, samples = deep_model.predict_stable(sim, return_samples=True)

More confidently predicted systems will have narrower distributions of samples, while ones the model is unsure about will be wide. A typical uncertainty is about +/- 1 dex (a factor of 10 in instability time). That might seem high, but note that an instability time you measure through a direct N-body integration also has an uncertainty due to chaos of about +/- 0.5 dex.
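As a rough sketch of how one might summarize those samples: the helper below assumes `samples` is a plain sequence of instability times in linear time units (the exact format returned with `return_samples=True` is not spelled out in this thread, so check it first). `summarize_log_samples` and `percentile` are hypothetical names, not part of SPOCK, and the input values are made up.

```python
import math


def percentile(sorted_vals, q):
    """Linearly interpolated percentile (q in [0, 100]) of a sorted list."""
    if not sorted_vals:
        raise ValueError("need at least one sample")
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def summarize_log_samples(samples):
    """Return (median, half_width) of log10(samples), both in dex.

    half_width is half the 16th-84th percentile range, i.e. roughly a
    1-sigma "+/-" value if the log-times were Gaussian.
    ASSUMPTION: samples are positive instability times in linear units.
    """
    logs = sorted(math.log10(t) for t in samples)
    median = percentile(logs, 50)
    half_width = 0.5 * (percentile(logs, 84) - percentile(logs, 16))
    return median, half_width


# Made-up samples for illustration only (not SPOCK output).
fake_samples = [10 ** x for x in (5.0, 5.5, 6.0, 6.5, 7.0)]
median_dex, half_width_dex = summarize_log_samples(fake_samples)
print(f"log10(t_inst) ~ {median_dex:.2f} +/- {half_width_dex:.2f} dex")
```

A half-width well above 1 dex would suggest the model is unsure about that system.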

Of course, our models are also not perfect! You can definitely find cases like this one where they disagree. I think it's important to think about what you care about in your application. If you want to err on the side of making sure that the systems you accept are stable (at the expense of throwing out some extra ones that were stable), you could be conservative and require that both classifiers agree. If, at the other end of the spectrum, you want to avoid throwing out stable systems, you could require only that either of the classifiers says it's stable. People do the same thing when combining different imperfect tests for diseases.
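The two strategies can be sketched in a few lines. This is only an illustration: `combine_stability` is a hypothetical helper, not part of SPOCK, and the 0.5 cutoff is an assumed threshold that you would tune for your own application.

```python
def combine_stability(p_a, p_b, threshold=0.5, mode="conservative"):
    """Combine two stability probabilities into one accept/reject decision.

    mode="conservative": accept only if BOTH models call the system stable
                         (fewer unstable systems slip through).
    mode="permissive":   accept if EITHER model calls it stable
                         (fewer stable systems get thrown out).
    ASSUMPTION: threshold=0.5 is an arbitrary illustrative cutoff.
    """
    stable_a = p_a >= threshold
    stable_b = p_b >= threshold
    if mode == "conservative":
        return stable_a and stable_b
    if mode == "permissive":
        return stable_a or stable_b
    raise ValueError(f"unknown mode: {mode!r}")


# Probabilities from the 4-planet example at the top of this thread.
p_feature, p_deep = 0.12605628, 0.67
print(combine_stability(p_feature, p_deep, mode="conservative"))
print(combine_stability(p_feature, p_deep, mode="permissive"))
```

With those numbers the conservative rule rejects the system and the permissive rule accepts it, which is exactly the disagreement described above.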

franpoz commented 3 years ago

Ok, great! Thanks Daniel!

One last question: has SPOCK been tested for non-compact systems? Could it be used for those scenarios?

dtamayo commented 3 years ago

We haven't pushed too far, but I think it's an interesting question how far the model can be pushed in different directions.

Big picture, what helped train the model, and the reason for specializing to compact systems (apart from that applying to most low-mass multis), is that it narrows the path to instability. In particular, most of these short-timescale instabilities in compact systems are driven by the overlap of mean motion resonances. That's why the hand-engineered features in the FeatureClassifier use analytic resonance models to help the classifier. As you start spacing systems further out (beyond period ratios of 2), you run out of strong mean motion resonances, and secular effects become important. Because secular resonances have longer timescales, instabilities through secular resonance overlap also take longer to develop (e.g., > 10^10 orbits for the solar system).

In a sense, SPOCK gives reasonable answers for these systems. If you pass the solar system to the FeatureClassifier, it tells you it should be stable over 10^9 orbits, which is the right answer. Secular chaos takes a long time. But if you started to take super-Jupiter planets with period ratios of 2-3 to make secular instabilities happen on shorter timescales (e.g. secular chaos for producing hot Jupiters), the features in our FeatureClassifier, and our training set in general, do not have the information to correctly classify those, so I would guess they'd get many of those systems wrong.

franpoz commented 3 years ago

Thanks a lot for all your answers Daniel!

MilesCranmer commented 3 years ago

Some extra notes: I noticed your tmax is different between the different models. The FeatureClassifier has tmax=1e9 fixed, but for NbodyRegressor and DeepRegressor you can change it (default is 1e9).

I'll also note that you can change the number of samples in DeepRegressor to get more accurate results, with, e.g., samples=10000.
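For a rough sense of why more samples help: if the reported probability is the fraction of samples that survive past tmax (an assumption about how the estimate is formed, not something confirmed in this thread), the sampling noise on that fraction shrinks as 1/sqrt(N). The helper name below is hypothetical, and this covers only Monte Carlo noise, not the model's own error.

```python
import math


def fraction_standard_error(p, n_samples):
    """Binomial standard error of a fraction p estimated from n_samples draws.

    ASSUMPTION: the samples are independent and the probability is the
    fraction of them that exceed tmax.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return math.sqrt(p * (1.0 - p) / n_samples)


# Sampling noise on a probability near 0.67 for two sample counts.
for n in (1000, 10000):
    print(n, round(fraction_standard_error(0.67, n), 4))
```

Going from 1000 to 10000 samples cuts this noise by a factor of about sqrt(10), roughly 3.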

By the way, here are the performance comparisons on resonant and non-resonant datasets. "Ours" is the DeepRegressor (https://arxiv.org/abs/2101.04117), and "Tamayo+20" is the FeatureClassifier. Note they are pretty comparable in classification accuracy at the 1e9 mark, but the DeepRegressor has much lower bias.

franpoz commented 3 years ago

Great! thanks a lot Miles!

MilesCranmer commented 3 years ago

@franpoz I just pushed a fix for something that might have affected your results. NbodyRegressor returns results in terms of simulation units, and now DeepRegressor does this too. Before, DeepRegressor was returning results in units of minP, meaning you would only get consistency between the methods if minP = 1. But now this is fixed!
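To illustrate the old mismatch with plain arithmetic: a time expressed in units of the shortest orbital period converts to simulation units by multiplying by minP. The function name below is hypothetical and not part of SPOCK.

```python
def minp_units_to_sim_units(t_in_minp, min_period):
    """Convert a time given in units of the shortest orbital period (minP)
    into simulation time units.

    min_period is minP itself, expressed in simulation time units.
    """
    if min_period <= 0:
        raise ValueError("min_period must be positive")
    return t_in_minp * min_period


# With minP = 1 the two conventions coincide; otherwise they differ by minP.
print(minp_units_to_sim_units(1e6, 1.0))
print(minp_units_to_sim_units(1e6, 0.1))
```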

Cheers, Miles

dtamayo commented 3 years ago

Fixed in version 1.3.0

dtamayo / spock

which method is more reliable? #16