FormingWorlds / PROTEUS

Coupled atmosphere-interior framework to simulate the temporal evolution of rocky planets.
https://fwl-proteus.readthedocs.io
Apache License 2.0

PROTEUS grid-search #204


timlichtenberg commented 5 days ago

To move towards an inverse method for PROTEUS sometime down the road, we need a computationally feasible approach for running many models to fit a given set of observations.

To give an example of the problem: let's assume a given exoplanet has the following known/observed parameters with uncertainties: stellar age, orbital distance, planet radius, planet mass, transmission/emission spectrum. Given these parameters, we would like to find the best-fitting PROTEUS models over a set of input parameters and compute a goodness-of-fit metric. This is essentially the description of an atmospheric retrieval, except that PROTEUS simulations are far too computationally expensive to perform 100k+ of them.
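As a minimal sketch of what such a goodness-of-fit metric could look like, assuming Gaussian uncertainties on the observed quantities; the parameter names, values, and the way model outputs are passed in are hypothetical placeholders, not part of the current code base:

```python
import numpy as np

# Hypothetical observed values and 1-sigma uncertainties for a single planet.
# Names and numbers are illustrative only.
observed = {
    "stellar_age_gyr":     (4.5, 0.5),
    "orbital_distance_au": (0.05, 0.002),
    "planet_radius_re":    (1.4, 0.1),
    "planet_mass_me":      (3.0, 0.4),
}

def log_likelihood(model_outputs: dict) -> float:
    """Gaussian log-likelihood comparing one forward model to the observations.

    `model_outputs` maps the same keys to the values predicted by a single
    PROTEUS run; a spectrum would add one chi-squared term per wavelength bin.
    """
    chi2 = 0.0
    for key, (obs, sigma) in observed.items():
        chi2 += ((model_outputs[key] - obs) / sigma) ** 2
    return -0.5 * chi2

# Example: score one (hypothetical) set of model outputs.
example_model = {
    "stellar_age_gyr": 4.2,
    "orbital_distance_au": 0.051,
    "planet_radius_re": 1.5,
    "planet_mass_me": 2.8,
}
print(log_likelihood(example_model))
```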

I am not yet certain what the best strategy is for approaching this problem. Here are a few options, each with opportunities and drawbacks:

nichollsh commented 4 days ago

I agree that this would be incredibly powerful. I can imagine that running an MCMC (or similar method) would be tricky because of the slow runtimes. When we are ready to look into this, maybe we could involve someone who has experience doing retrievals with large models?
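To make the runtime concern concrete, here is a rough sketch of how many forward-model calls even a modest ensemble MCMC would need. The sampler settings and parameter dimensionality are illustrative, and the log-probability below is a trivial stand-in for what would otherwise be a full PROTEUS run:

```python
import numpy as np
import emcee

# Illustrative retrieval dimensions and sampler settings (assumed, not tuned).
ndim, nwalkers, nsteps = 4, 32, 2000

calls = 0

def log_prob(theta):
    """Stand-in for a likelihood that would require one full PROTEUS run."""
    global calls
    calls += 1
    # A real implementation would launch a simulation here and compare its
    # outputs to the observations; we use a cheap Gaussian instead.
    return -0.5 * np.sum(theta ** 2)

p0 = np.random.randn(nwalkers, ndim)
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob)
sampler.run_mcmc(p0, nsteps)

# 32 walkers x 2000 steps is ~64,000 forward-model evaluations: at hours per
# PROTEUS run, this is clearly infeasible without a surrogate or emulator.
print(f"forward-model calls: {calls}")
```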

nichollsh commented 4 days ago

The ML paper you cited is interesting: they ran 50k simulations to train the model. I am finding that a grid of 22 simulations takes about 14 hours to run (on 22 threads), i.e. roughly 14 hours per simulation. If we scaled this to 50k simulations on 256 threads, it would take 50000 × 14 / 256 ≈ 2700 hours ≈ 114 days. We could of course speed this up by reducing the resolution, etc.
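For reference, a back-of-the-envelope helper for these wall-clock estimates, assuming one simulation per thread and perfect parallel scaling (which is optimistic):

```python
def wallclock_days(n_sims: int, hours_per_sim: float, n_threads: int) -> float:
    """Estimated wall-clock time for an embarrassingly parallel grid,
    assuming one simulation per thread and perfectly balanced execution."""
    return n_sims * hours_per_sim / n_threads / 24.0

# 22 simulations on 22 threads took ~14 h, i.e. ~14 h per simulation.
print(wallclock_days(50_000, 14, 256))   # ~114 days
print(wallclock_days(50_000, 14, 1024))  # ~28 days on a larger allocation
```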

timlichtenberg commented 4 days ago

I believe they need fewer simulations than a "normal" Bayesian model, which is one of their selling points. Nevertheless, even 100k simulations are not impossible when using a large-scale computing facility. We can and should do this sometime in the next year to build a large simulation grid, once the current plans with aragog and zephyrus are done. Cosmology solves this problem by running updated large-scale forward models every few years with high-performance codes (e.g. the TNG project) and then training machine-learning models on them. That is one way to go, but if we can find an algorithm that enables running highly specialised simulations to compute the Bayesian evidence directly for a single planet on a ~week(s) timescale, that would be preferable, I think.
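As a rough sketch of the grid-then-emulate approach described above: train a cheap surrogate on a precomputed simulation grid and call it inside the inference loop instead of the full simulation. The grid shape, the toy output function, and the choice of a Gaussian-process regressor are assumptions for illustration, not a proposal for a specific architecture:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Stand-in for a precomputed simulation grid: each row is one PROTEUS run with
# inputs theta (e.g. volatile inventory, instellation, ...) and a scalar output
# y (e.g. final atmospheric mass). Here we fake it with a cheap toy function.
rng = np.random.default_rng(0)
theta_grid = rng.uniform(0.0, 1.0, size=(500, 3))
y_grid = np.sin(3 * theta_grid[:, 0]) + theta_grid[:, 1] ** 2 - theta_grid[:, 2]

# Train the emulator once, offline, on the expensive grid.
emulator = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
emulator.fit(theta_grid, y_grid)

def log_likelihood(theta, obs=0.5, sigma=0.1):
    """Fast likelihood using the emulator instead of a full simulation."""
    pred = emulator.predict(theta.reshape(1, -1))[0]
    return -0.5 * ((pred - obs) / sigma) ** 2

# Each likelihood call now costs milliseconds, so the 100k+ evaluations needed
# by an MCMC or nested-sampling run become feasible on a single machine.
print(log_likelihood(np.array([0.2, 0.5, 0.7])))
```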