Another question to @pmelchior and @changhoonhahn : assume I sample 100 $\theta$ from $P(\theta)$ and then generate 100 spectra based on these $\theta$s. How could I compare these spectra with real observed spectra? i.e., how do we calculate a high-dimensional KL divergence?
Besides what we discussed during the group meeting yesterday, here's a kth nearest neighbor estimator for KL divergence in case you're interested: https://github.com/changhoonhahn/nevin/blob/master/nevin/nevin.py#L91
I used divergence estimators in an old paper of mine but as Peter said they can be finicky: https://ui.adsabs.harvard.edu/abs/2019MNRAS.485.2956H/abstract
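For reference, here is a minimal sketch of the flavor of kNN estimator linked above (a Wang/Kulkarni/Verdú-style estimator) using `sklearn`; this is not Chang's exact implementation, and the choice of `k` is just a placeholder.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def kl_divergence_knn(X, Y, k=1):
    """kNN estimate of D_KL(P || Q) from samples X ~ P (n, d) and Y ~ Q (m, d).

    Sketch of a Wang/Kulkarni/Verdu-style estimator; k is the neighbor order.
    """
    n, d = X.shape
    m = Y.shape[0]

    # rho: distance from each x_i to its k-th nearest neighbor among the other X's
    # (ask for k+1 neighbors because the closest "neighbor" is x_i itself)
    rho = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)[0][:, -1]

    # nu: distance from each x_i to its k-th nearest neighbor among the Y's
    nu = NearestNeighbors(n_neighbors=k).fit(Y).kneighbors(X)[0][:, -1]

    return d * np.mean(np.log(nu / rho)) + np.log(m / (n - 1.0))
```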
Good news: if I use logP as the loss and train for a certain number of epochs, I get a reasonably good result:
Bad news: if I train the NF for long enough, $P(\theta)$ just converges to a very tiny region, see the blue contours below. This is because the logP loss is WRONG in our case. Let's write the observed magnitude distribution as $P(x)$ and the model magnitude distribution as $Q(x)$ (where $x$ comes from sampling $P_{\rm NF}(\theta)$ and then forward modeling). $P(x)$ is already fit by an NDE, so we can evaluate it at any given $x_i$; but $Q(x)$ is not fit by an NDE. The KL divergence is $D_{\rm KL}(Q \,\|\, P) = \sum_x Q(x)\,\log\frac{Q(x)}{P(x)} \approx \frac{1}{N}\sum_{x_i \sim Q(x)} \log Q(x_i) - \frac{1}{N}\sum_{x_i \sim Q(x)} \log P(x_i)$. The second term (the cross-entropy) can be calculated, but the first term (the negative entropy of $Q$) cannot. So, the logP loss alone is wrong.
Solution: I'm trying to rewrite @changhoonhahn 's kNN estimator for the KL divergence in PyTorch, so that gradients can back-propagate through it. Or, if there is a non-parametric way to evaluate the entropy of $Q(x)$, we can just add it to the logP loss.
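For concreteness, this is roughly what the (wrong) logP-only objective looks like in PyTorch. The names `nf`, `data_nde`, and `forward_model` are placeholders for the flow over $\theta$, the NDE fit to the observed photometry, and the differentiable emulator; the missing entropy term is exactly what lets the flow collapse.

```python
import torch

def logp_only_loss(nf, data_nde, forward_model, n_samples=512):
    """The (incorrect) logP-only objective discussed above: it maximizes
    E_Q[log P(x)] but omits the entropy of Q, so the flow can collapse
    onto a tiny region. All object names are placeholders."""
    theta = nf.sample(n_samples)          # theta_i ~ P_NF(theta)
    x = forward_model(theta)              # x_i ~ Q(x)
    return -data_nde.log_prob(x).mean()   # only -E_Q[log P(x)]; no entropy term
```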
Hmmm, I see. Flipping the divergence doesn't change the fact that we still need to evaluate Q(x_i)... This might not be a good idea, but could you approximate
$Q(x_i) = \int P(x_i \mid \theta)\, P(\theta)\, d\theta \approx \sum_{\theta_i \sim P_{\rm NF}} P_{\rm NF}(x_i \mid \theta_i)$
With that said, it might be easier to use a kNN estimator. This `faiss` package I ran into a few months ago might be useful for implementing the kNN estimator: https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/
Just one correction: I think, Chang, you didn't mean to use the same index in the MC approximation. It should be,
$Q(x_i) = \int Q(x_i \mid \theta)\, P_{\rm NF}(\theta)\, d\theta \approx \sum_{\theta_j \sim P_{\rm NF}} Q(x_i \mid \theta_j)$
where $Q(x | \theta)$ is an amortized likelihood (model).
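As a sketch, the corrected Monte Carlo estimate could be evaluated in log space for stability; `amortized_likelihood.log_prob(x, context=theta)` is a hypothetical interface standing in for the amortized likelihood $Q(x \mid \theta)$, and `nf` for $P_{\rm NF}(\theta)$.

```python
import torch

def log_Q_mc(x_i, nf, amortized_likelihood, n_mc=1000):
    """Monte Carlo estimate of log Q(x_i) = log ∫ Q(x_i | theta) P_NF(theta) dtheta,
    averaging over theta_j ~ P_NF. All object names are placeholders."""
    theta = nf.sample(n_mc)                     # theta_j ~ P_NF(theta)
    x_rep = x_i.expand(n_mc, -1)                # repeat x_i for each theta_j
    log_q = amortized_likelihood.log_prob(x_rep, context=theta)
    # log of the sample mean: logsumexp(log_q) - log(n_mc)
    return torch.logsumexp(log_q, dim=0) - torch.log(torch.tensor(float(n_mc)))
```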
I have now (mostly) convinced myself that the likelihood model is the way to go. Here's my thinking. What you want is the posterior
$$ p(\theta\mid\lbrace x\rbrace) = p(\lbrace x\rbrace \mid\theta)p(\theta) $$
Let's ignore the prior term, then what's left is a pure likelihood model. The problem in this issue arises because $p(\lbrace x\rbrace)$ (or similar constructions) is actually a distribution over distributions! It's a statement about the probability of a particular photometry distribution. But we don't need to do this at all. Instead, we can do the usual expansion for independent data:
$$ p(\lbrace x\rbrace \mid\theta) = \prod_i p(x_i \mid\theta) $$
and every individual factor is simply a likelihood evaluation for a given parameter. The 3rd option of SBI. And it has the great advantage of allowing the set $\lbrace x\rbrace$ to be defined by the user.
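In code, the factorized likelihood evaluation would look something like the sketch below; `amortized_likelihood` again stands in for a trained neural likelihood estimator $p(x \mid \theta)$, and the interface names are placeholders.

```python
import torch

def population_log_likelihood(theta, x_set, amortized_likelihood):
    """log p({x} | theta) = sum_i log p(x_i | theta), evaluated with an
    amortized likelihood model. `theta` is a single parameter vector,
    `x_set` has shape (n_obs, d). Names are placeholders for this sketch."""
    context = theta.expand(x_set.shape[0], -1)   # the same theta for every x_i
    return amortized_likelihood.log_prob(x_set, context=context).sum()
```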
I think that's going back to combining individual likelihoods/posteriors. In fact, I think that's a good reason for not doing my MC approximation suggestion.
Both approaches can give us $p(\theta \mid \lbrace x_o \rbrace)$, but I think they are complementary.
Combining individual likelihoods/posteriors is nice because you can vary $\lbrace x_o \rbrace$, but it requires an accurate amortized neural density estimator (ANDE) of the full likelihood or posterior distributions (i.e. conditional distributions). In the current toy parameter space, building an ANDE is quite straightforward. But in more realistic scenarios it can be quite challenging to train an ANDE that accurately estimates the likelihood/posterior over the full parameter space, as we've found.
On the other hand, if we take the current approach, we're effectively doing the following: $$p(\theta \mid \lbrace x_o \rbrace) \approx p(\theta \mid \mathrm{argmin}\ \rho( \lbrace x_o \rbrace, F(\lbrace \theta \rbrace)))$$ This doesn't require any ANDE of conditional distributions. You only need an NDE to estimate $p_{\rm NF}(\theta)$ and some distance metric to enforce the conditional statement. In our case, we're using the KL divergence as the distance metric.
The KL divergence is well-motivated, but if we're more liberal about the distance metric, then this second approach can be scaled much more easily to higher-dimensional data vectors and becomes an easier density estimation problem to solve for $p(\theta \mid \lbrace x_o \rbrace)$.
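Schematically, that second approach is an optimization loop like the following, where $\rho$ is any differentiable distance between the forward-modeled and observed photometry distributions; all names here are placeholders, not the actual popsed implementation.

```python
import torch

def fit_population_nde(nf, forward_model, x_obs, distance,
                       n_epochs=250, n_samples=5000, lr=1e-3):
    """Optimize the normalizing flow p_NF(theta) so that the forward-modeled
    photometry F({theta}) matches the observed photometry {x_o} under some
    differentiable distance (e.g. a kNN KL estimate or a Sinkhorn loss).
    All object names are placeholders for this sketch."""
    optimizer = torch.optim.Adam(nf.parameters(), lr=lr)
    for epoch in range(n_epochs):
        optimizer.zero_grad()
        theta = nf.sample(n_samples)      # theta_i ~ p_NF(theta), must be differentiable
        x_model = forward_model(theta)    # F({theta}) via the emulator
        loss = distance(x_model, x_obs)   # rho({x_o}, F({theta}))
        loss.backward()
        optimizer.step()
    return nf
```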
Thanks for the thoughts! Well, it seems neither the ANDE nor the distance metric is easy when $\theta$ goes to higher dimensions and when noise is added. I do agree that the likelihood method has the advantage that once we have a good ANDE, we can take different SED samples without retraining. For the toy model here, I expect both ways to work. But if we add (complex) noise, I guess the distance method might be more straightforward than constructing an ANDE?
Wait, if we really have a nice ANDE (i.e., actually build an emulator for prospector), why do we need to do population-level inference? We could just infer the posterior for each galaxy and then combine the posteriors. It is precisely because building an ANDE is hard that we want other methods for population-level inference.
If we have a perfect ANDE, which Peter and I are working on, then we don't need population-level inference. But so far, and as we saw in class this week, ANDEs are not perfect and their performance can vary significantly across parameter- and data-space.
So I think it's good to have a complementary approach. Combining the likelihoods from an imperfect ANDE runs the risk of propagating the ANDE biases into $p(\theta \mid \lbrace x \rbrace)$.
While the `faiss` implementation of the kNN-based KL divergence is 3x faster than `sklearn`, it is still hard to make the KL divergence differentiable. `faiss` doesn't support full tensor-based search in some functions (such as `range_search`).
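For reference, the basic `faiss` pattern for exact kNN search (which is where the speedup over `sklearn` comes from) looks like this; the shapes and sample counts are purely illustrative.

```python
import numpy as np
import faiss

d = 5                                                # dimensionality of the photometry space
X = np.random.rand(10000, d).astype("float32")       # query points (e.g. model samples)
Y = np.random.rand(10000, d).astype("float32")       # database points (e.g. observed samples)

index = faiss.IndexFlatL2(d)     # exact (brute-force) L2 index, no training needed
index.add(Y)
D, I = index.search(X, 1)        # D: squared L2 distances, I: neighbor indices
nu = np.sqrt(D[:, -1])           # distance from each x_i to its nearest neighbor in Y
```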
Okay, I made a differentiable KL divergence at https://github.com/AstroJacobLi/popsed/blob/dd778a1b1b79a6a4ece465988bde5b44d8ab9489/popsed/nde.py#L350. The kNN here is done by brute force, so it's slow (for two 5-D distributions with 10000 samples each, the computing time for D_KL(X, Y) is about 13 s on CPU). Using Chang's `sklearn` implementation, the computing time is similar (12 s for the same case).
Although I've implemented a differentiable KL divergence, the gradient easily explodes during training, since the KL divergence is based on a crude brute-force kNN estimator. I spent a lot of time tuning the learning rate, etc., but the gradients still explode.
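For context, a brute-force differentiable version along these lines can be written with `torch.cdist` and `kthvalue`; this is just a sketch of the idea (not the exact implementation linked above), and the log of very small neighbor distances is one place where the gradients can blow up.

```python
import torch

def kl_divergence_knn_torch(X, Y, k=1):
    """Differentiable brute-force kNN estimate of D_KL(P || Q),
    with X ~ P of shape (n, d) and Y ~ Q of shape (m, d)."""
    n, d = X.shape
    m = Y.shape[0]

    # pairwise distances; take the (k+1)-th value in the X-X matrix to skip
    # the zero self-distance, and the k-th value in the X-Y matrix
    rho = torch.cdist(X, X).kthvalue(k + 1, dim=1).values
    nu = torch.cdist(X, Y).kthvalue(k, dim=1).values

    # tiny rho (or nu) makes log(nu / rho) blow up -- one source of exploding gradients
    return d * torch.log(nu / rho).mean() + torch.log(torch.tensor(m / (n - 1.0)))
```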
The KL divergence has trouble in high-dimensional spaces, especially when the supports of the two distributions don't overlap much. After a random dinner chat with a friend in the EECS department (an expert on GANs), I figured out that we can use the Wasserstein distance (or its entropy-regularized cousin, the Sinkhorn distance) for our purpose.
Using `geomloss`, the differentiable Sinkhorn distance can be calculated easily. Training only takes 250 epochs (~2 min on GPU). The results are very encouraging!! Now I think this is the right direction to go.
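For reference, the relevant `geomloss` interface is the `SamplesLoss` wrapper; roughly like the sketch below, where the `blur` value and sample sizes are placeholders rather than the exact settings used here.

```python
import torch
from geomloss import SamplesLoss

# Sinkhorn divergence between two point clouds; `blur` sets the
# entropic-regularization scale (values here are placeholders)
sinkhorn = SamplesLoss(loss="sinkhorn", p=2, blur=0.05)

x_model = torch.randn(5000, 5, requires_grad=True)   # e.g. forward-modeled photometry
x_obs = torch.randn(5000, 5)                          # e.g. observed photometry

loss = sinkhorn(x_model, x_obs)   # scalar, differentiable w.r.t. x_model
loss.backward()                   # populates x_model.grad
```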
That looks very promising! This is without a noise model or much realism, right?
We can think a bit more about how we want to compare $F(\lbrace \theta \rbrace)$ to $\lbrace x_o \rbrace$, but I think a distance metric like the one you used is a good way to go for now.
Nice! I'd like to see how you used geomloss (I've tried this for a different project but couldn't get it work).
One thing that's odd: the normalization of the distributions seems clearly lower for data than for the model. Is this a plotting artifact?
My next steps are:
Now stellar mass and redshift are added into the framework. However, everything is still noise-free. Below is a two-component Gaussian mixture model in `tage` and `log(tau)` space. The whole population (5000 mock galaxies) has Gaussian stellar mass and redshift distributions. We try to recover $P(\theta)$ from SDSS griz photometry. (The u-band is not used because the emulator doesn't cover ~2000 Å, which would be redshifted into the u-band at z=0.1.)
Overall, `popsed` does a good job of recovering $P(\theta)$, especially the stellar mass and redshift distributions. The constraining power on `tage` is weaker, though.
As we discussed last Friday, I tried to combine several (trained) NDEs and obtain a "distribution" of $P(\theta)$. The results are shown below, using SDSS `ugriz` photometry, still noise-free. The way of combining NDEs is:
The constraint on `log tau` is still weak. I don't think we can fully recover this peak using `ugriz` photometry.

I think I did one thing extremely wrong in the past experiments. The normalizing flow models the distribution of the physical parameters $\theta$, and the NF is initialized as a standard Gaussian centered at zero. In the past, I actually centered the initial NF on the center of $\theta$, which would not be available in a real application. This is the reason why we got good results in all the figures above.
If the NF is initialized as a standard Gaussian and not shifted, I get very ugly results... Maybe we can chat in depth on Friday.
As we discussed, the figure below demonstrates the randomness coming from different initial Gaussians and optimization runs. It now looks uglier than before, but it makes more sense: the shaded region now encompasses the truth.
I think this looks great! Are the 2D contours the mean distribution of multiple models or a single model?
Also, the third subpopulation in the bottom-left corner of the tage-logtau plane is interesting. One of the interesting things we might find when we apply this to real data is unexpected subpopulations like this, which cannot be ruled out by the observations.
I think I did some interesting experiments. Using the 6-parameter model (which I call the TZD, tau-metallicity-dust model) without noise, I compare the ability to recover the underlying parameter distributions between HSC grizy and GALEX NUV + HSC + JWST (3 NIR filters). The results are as expected: with more bands added, we get better results. With HSC, the inferred redshift distribution is actually flat, and so is the metallicity distribution. Adding NUV and a few NIR filters makes the inferred redshift/metallicity distributions much tighter, although the redshift distribution is biased towards low-z. In either case, the subtle mass-metallicity relation cannot be recovered.
That makes sense. NIR should help constrain dust and break degeneracies.
I find it impressive that we can recover stellar mass accurately for both cases.
Last week we discussed whether our results are mainly determined by the initialization. The figure below shows the case (SDSS with NSA noise) where the `tage` and `log(tau)` initial positions follow uniform priors (`tage ~ U(3, 10)`, `logtau ~ U(-1, 1)`). The other parameters still follow Gaussian priors. I think it's now clear that our data have no constraining power on `tage` and probably `log(Z)` (metallicity), but good constraining power on `log(tau)`, even though the prior is flat.
Just saw another package that might be a substitute for `geomloss`: https://ott-jax.readthedocs.io/en/latest/
Looks interesting, and fast. It does run in JAX, which I like a lot on its own, but there's so much PyTorch code already (e.g. the normalizing flows), which makes me hesitant to adopt JAX right now.
Now we have a forward model $f(\vec{\theta})$ (an NN emulator) that returns the SED for given physical parameters. Then we need to build a neural density estimator to describe $P(\vec{\theta} \mid \vec{x_0})$, where $\vec{x_0}$ is not a single spectrum but an ensemble of observed spectra.
As @changhoonhahn suggested, I tried to use the `sbi` package to do this, but I didn't find a way to evaluate the posterior at a given ensemble of spectra. It seems the provided examples are all about evaluating the posterior for a single spectrum. @changhoonhahn It would be great if you could take a look at https://github.com/AstroJacobLi/popsed/blob/main/notebooks/neural_density_estimator/sbi_sed.ipynb
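For reference, this is roughly the standard `sbi` (SNPE) workflow I'm referring to; each posterior evaluation conditions on a single observation `x_o`, which is why the ensemble case isn't directly covered. The shapes and data here are just toy placeholders, and the exact API may differ slightly across `sbi` versions.

```python
import torch
from sbi.inference import SNPE
from sbi.utils import BoxUniform

# toy setup: 5 physical parameters, 4-band photometry (shapes are illustrative)
prior = BoxUniform(low=torch.zeros(5), high=torch.ones(5))
theta = prior.sample((10000,))
x = torch.randn(10000, 4)          # stand-in for forward-modeled photometry

inference = SNPE(prior=prior)
density_estimator = inference.append_simulations(theta, x).train()
posterior = inference.build_posterior(density_estimator)

x_o = torch.randn(4)               # a single observed data vector
samples = posterior.sample((1000,), x=x_o)   # p(theta | x_o) for one observation only
```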