Autostronomy / AstroPhot

A fast, flexible, automated, and differentiable astronomical image 2D forward modelling tool for precise parallel multi-wavelength photometry
https://astrophot.readthedocs.io
GNU General Public License v3.0

Possibly don't need to optimize magnitude parameter for models #140

Open ConnorStoneAstro opened 8 months ago

ConnorStoneAstro commented 8 months ago

Is your feature request related to a problem? Please describe. In models such as Sersic there is a very strong covariance between n, Re, and Ie. This strong covariance may in fact be a perfect plane in parameter space. Imagine n and Re were already optimized: one could then determine Ie directly in a single step, since it enters as a single linear parameter of the residuals.

Describe the solution you'd like It's not clear what this would look like operationally. Ie would no longer be a parameter which is optimized; instead it could be determined as a final step after all other effects have been accounted for. Including this could also be challenging for some model types.

Describe alternatives you've considered Currently the Ie parameter is fit just like everything else, though this may mean that the whole fitting process is occurring on a plane in parameter space which could be simplified.
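To make the linearity concrete, here is a minimal sketch of the single-component case, assuming Gaussian pixel noise with known inverse-variance weights; the function and argument names are illustrative, not AstroPhot API:

```python
import torch

def linear_amplitude(unit_model, data, weights):
    # unit_model: the model image evaluated with Ie = 1 (n, Re, etc. fixed)
    # data:       the observed image, same shape as unit_model
    # weights:    per-pixel inverse-variance weights
    num = torch.sum(weights * unit_model * data)
    den = torch.sum(weights * unit_model * unit_model)
    return num / den  # closed-form chi^2 minimizer in the amplitude
```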

entaylor commented 3 months ago

This is correct, and is something that I have suggested for scarlet as well. In that context, I also found that supplying the linear result actually reduced the error by a small amount, too, since you are no longer limited by your sampler's ability to find the precisely optimal solution. This only happens where the statistical uncertainties are very small, but that can still be important: think flux calibrators!!

The nice thing is that both the amplitude and its uncertainty are easily and analytically computable through simple linear algebra. The only trick is multiple-component models, where some matrix inversions are necessary. I guess the difficulty is in the partial derivatives with respect to the other parameters, but given it's all linear, I think this is precisely what jax is supposed to do, right? (Does that extend to matrix inversions?)
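A hedged sketch of that multiple-component case, again assuming Gaussian noise: a single matrix inversion yields both the maximum-likelihood amplitudes and their covariance, conditional on all other parameters.

```python
import torch

def linear_fluxes(unit_models, data, weights):
    # unit_models: (Npix, K) design matrix, one unit-amplitude component per column
    # data:        (Npix,) flattened image; weights: (Npix,) inverse variances
    XtW = unit_models.T * weights      # (K, Npix), each row scaled by the weights
    A = XtW @ unit_models              # (K, K) normal matrix
    b = XtW @ data                     # (K,)
    cov = torch.linalg.inv(A)          # amplitude covariance, other params fixed
    fluxes = cov @ b                   # maximum-likelihood amplitudes
    return fluxes, cov
```

(For the solve alone, `torch.linalg.solve(A, b)` is numerically preferable; the explicit inverse is shown only because the covariance is wanted anyway.)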

ConnorStoneAstro commented 3 months ago

Thanks @entaylor ! For AstroPhot, the primary fitting algorithm is Levenberg-Marquardt which, when things are going well, does actually treat the model as linear and therefore should land on the exact optimal solution. As you mention though, this extends to multi-component models, where things get more challenging, so having the ability to remove a bunch of parameters that can trivially be optimized as a separate problem is really nice! And yes, autodiff codes (PyTorch for AstroPhot) can compute derivatives and get GPU acceleration for matrix inversion, which is pretty neat.
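A toy demonstration of that last point, showing gradients flowing through a linear solve in PyTorch (the exponential profile is a stand-in, not an AstroPhot model):

```python
import torch

theta = torch.tensor(1.5, requires_grad=True)   # stand-in shape parameter
x = torch.linspace(0.1, 5.0, 100)
unit_model = torch.exp(-x / theta)              # unit-amplitude profile
data = 3.0 * torch.exp(-x / 1.4)                # fake observation
A = (unit_model * unit_model).sum().reshape(1, 1)
b = (unit_model * data).sum().reshape(1)
flux = torch.linalg.solve(A, b)                 # analytic amplitude via linear solve
flux.sum().backward()                           # d(flux)/d(theta) by autodiff
print(theta.grad)
```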

The main problem is implementation. Doing this would significantly increase the complexity of the fitting code, which is already quite a handful. I have some ideas for some updates this year which will simplify a few key functionalities in AstroPhot, which would make it easier to play games of fitting subsets of parameters sequentially (some of this capability is already available in the Iter_LM fitter). So I'll be revisiting this issue when more pieces are in place :) If you are interested in being involved that would be great! I'd be happy to have some help with the code :)

entaylor commented 3 months ago

All makes good sense. (At least in the abstract! I won't pretend to understand the implementation challenges!)

As a smaller suggestion, one that would give the accuracy advantage if not the speed-up: (optionally?) compute the linear solution for the flux parameters at the end of the fit, and then maybe call that 'polish' function at the end of any multi-component fit (including background).
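Something like the following rough sketch; the `model.components` / `unit_image` names are hypothetical, not AstroPhot API:

```python
import torch

def polish_fluxes(model, data, weights):
    # Hold all shape parameters at their optimized values and reset every
    # component's flux to the exact weighted-least-squares solution.
    units = torch.stack([c.unit_image() for c in model.components])  # (K, Npix)
    XtW = units * weights              # (K, Npix), rows scaled by the weights
    A = XtW @ units.T                  # (K, K) normal matrix
    b = XtW @ data                     # (K,) with data flattened to (Npix,)
    return torch.linalg.solve(A, b)    # exact fluxes conditional on the shapes
```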

Happy to help as I can!

ConnorStoneAstro commented 3 months ago

Ah, this is interesting. Do you think it is possible to ignore the flux parameters entirely until the end? Or would they need to be updated at each iteration? I had thought the flux parameters would need to be updated at each step: you could imagine that if the brightness is way off, the fitter may decide that pushing the model out of frame is the best way to reduce chi^2. But if there is some way to show that these can be fully decoupled without messing up the fit, then waiting until the end to fit the fluxes would be a really big speedup :)

entaylor commented 3 months ago

Hm ... probably a voice conversation will be worthwhile, but just to keep thinking through the keyboard a little ...

I'm more used to thinking about MCMC sampling rather than optimisation, so forgive me for couching what follows in those terms. The trick is that flux(es) can be deterministically computed given other parameters. That means that flux does not have to be considered as a free parameter on the same footing as others; it is sort of a secondary output of the model. In emcee's convention, flux would be something you would track through a blob.
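A toy illustration of the blob idea, using emcee's real blob mechanism but a made-up one-parameter model:

```python
import numpy as np
import emcee

x = np.linspace(0.1, 5.0, 100)
data = 3.0 * np.exp(-x / 1.4) + np.random.normal(0, 0.01, x.size)

def log_prob(theta):
    scale, = theta
    if not 0.1 < scale < 5.0:
        return -np.inf, np.nan
    unit = np.exp(-x / scale)                 # unit-amplitude model
    flux = (unit @ data) / (unit @ unit)      # analytic amplitude, not a free param
    resid = data - flux * unit
    return -0.5 * np.sum(resid**2) / 0.01**2, flux  # flux returned as a blob

sampler = emcee.EnsembleSampler(32, 1, log_prob)
sampler.run_mcmc(np.random.uniform(1.0, 2.0, (32, 1)), 200)
flux_samples = sampler.get_blobs()            # (nsteps, nwalkers) flux per sample
```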

Trying to think more about an optimisation setting, you don't actually care about the 'actual' value of the flux until the end, right? So you would just eliminate any flux parameters from the set of parameters to be optimised.

Thinking about each model component as a distinct class/object, another way of viewing this is that a model can be fully defined by position, size, ellipticity, and shape parameter, and then the flux follows from those parameters applied to the image. So your 'generate_model_image' method would start with a 'generate_normalised_model' function, then include a 'compute_maxlike_flux' method, and finally scale the normalised model by the maxlike flux to get the best-fit model conditional on all the other parameters. Then you can evaluate the chi2 or likelihood.
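Roughly the shape of that flow; the method names follow the suggestion above and are hypothetical, not existing AstroPhot methods:

```python
import torch

def compute_maxlike_flux(norm, data, weights):
    # Conditional maximum-likelihood flux for a unit-flux model image.
    return (weights * norm * data).sum() / (weights * norm * norm).sum()

def generate_model_image(component, data, weights):
    norm = component.generate_normalised_model()   # unit-flux image built from
                                                   # position/size/shape params
    flux = compute_maxlike_flux(norm, data, weights)
    return flux * norm   # best-fit model conditional on the other parameters
```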

I can see issues with the uncertainties. It is straightforward to compute the variance-covariance among flux parameters with everything else held fixed. But you would want to make use of some autodiff something to get the full covariance between flux and non-flux parameters.

Following that line of thought ... could you have a switch in a model as to whether to compute the flux deterministically or to treat it as a free parameter? For optimisation, you flip the switch to 'flux as deterministic' and forget about any flux-related derivatives (which are anyway only important for moving through parameter space). Once that is done and you are happy you are at an optimal solution, you flip the switch to 'flux as free' to get the derivatives that you need for the Jacobian and variance-covariance matrix.

Maybe? It sort of feels simple in principle, but i am sure that the difficulty is the plumbing!
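A minimal toy of the switch, just to pin down the bookkeeping; the attribute and method names are illustrative only:

```python
class ToyModel:
    def __init__(self):
        self.flux_mode = "deterministic"   # or "free"

    def free_parameters(self, params):
        # While flux is deterministic, drop it from the optimized set;
        # once flipped to "free", it re-enters for the final Jacobian.
        if self.flux_mode == "deterministic":
            return {k: v for k, v in params.items() if k != "flux"}
        return params

m = ToyModel()
print(m.free_parameters({"n": 4.0, "Re": 2.0, "flux": 1.0}))  # flux excluded
m.flux_mode = "free"
print(m.free_parameters({"n": 4.0, "Re": 2.0, "flux": 1.0}))  # flux included
```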

ConnorStoneAstro commented 3 months ago

Ah, I understand now. That would be really easy to implement for a single model even now. But for group models (and even more for joint models) that would be pretty challenging... I'll have to think about this more. Another way I could do this inside the LM algorithm would be to compute the full Jacobian, then decompose it into a flux_jacobian and a general_jacobian, and my LM step would become two steps: a linear flux step, which is exact every time, and a general step, which moves the rest of the parameters. This would also require some careful thinking to implement correctly.
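A sketch of that two-step split, with illustrative index bookkeeping (not actual AstroPhot internals):

```python
import torch

def split_lm_step(J, resid, flux_idx, damping):
    # Partition the Jacobian columns into flux and non-flux parameters.
    other_idx = [i for i in range(J.shape[1]) if i not in flux_idx]
    J_f, J_g = J[:, flux_idx], J[:, other_idx]
    # Exact linear step for the fluxes, given the current residuals.
    d_flux = torch.linalg.lstsq(J_f, resid.unsqueeze(1)).solution.squeeze(1)
    resid_g = resid - J_f @ d_flux
    # Damped Levenberg-Marquardt step for everything else.
    A = J_g.T @ J_g + damping * torch.eye(J_g.shape[1])
    d_general = torch.linalg.solve(A, J_g.T @ resid_g)
    return d_flux, d_general
```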

There is an added complication right now: the flux parameters are represented in log space. I thought it was clever since it means you can't get negative fluxes; however, it does break the linearity of the flux, and so it wouldn't be a one-step solution without some extra work... I could just change to linear flux if it really speeds things up a lot.
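For a single component the extra work may amount to solving in linear space and mapping back; a sketch, assuming the best-fit flux comes out positive:

```python
import torch

def maxlike_log_flux(unit_model, data, weights):
    # Solve the exact linear-amplitude problem, then store its log.
    flux = (weights * unit_model * data).sum() / (weights * unit_model**2).sum()
    return torch.log(flux)
```

(The multi-component case is less trivial, since a joint linear solve can return negative amplitudes that have no log.)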

Indeed we could set up a call to discuss this, but maybe after I've had some time to get a few other things sorted and prototype some ideas!

ConnorStoneAstro commented 3 months ago

I've been thinking about this some more. I now think that in the LM algorithm this is all handled automatically. The Hessian encodes these linear relationships, so they are already handled in the update step. The place where this would really be helpful is with gradient descent and Hamiltonian Monte Carlo (or NUTS). Since the parameters could follow more precisely the hyperplane that gives valid results, we could use a larger step size and the algorithm wouldn't complain. However, I already hacked the Pyro HMC and NUTS implementation to take a Hessian as the mass matrix, so this may already be accounted for.

I could perhaps do this more explicitly by getting the eigendecomposition of the Hessian. Any eigenvalues that are zero would tell me there's a linear relationship between some parameters, and the corresponding eigenvector would give the relation. This is nice because it can be used to find any analytic relation, not just the ones I know about manually.
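A toy of that eigendecomposition check; the zero-eigenvalue threshold is arbitrary:

```python
import torch

H = torch.tensor([[2.0, 2.0], [2.0, 2.0]])   # toy Hessian with a flat direction
eigvals, eigvecs = torch.linalg.eigh(H)      # symmetric eigendecomposition
flat = eigvals.abs() < 1e-8 * eigvals.abs().max()
print(eigvecs[:, flat])                      # direction(s) of the degeneracy
```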