Another question to @pmelchior and @changhoonhahn : assume I sample 100 $\theta$ from $P(\theta)$ and then generate 100 spectra based on these $\theta$s. How could I compare these spectra with real observed spectra? i.e., how do we calculate a high-dimensional KL divergence?
Besides what we discussed during the group meeting yesterday, here's a kth nearest neighbor estimator for KL divergence in case you're interested: https://github.com/changhoonhahn/nevin/blob/master/nevin/nevin.py#L91
I used divergence estimators in an old paper of mine but as Peter said they can be finicky: https://ui.adsabs.harvard.edu/abs/2019MNRAS.485.2956H/abstract
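For reference, here is a minimal sketch of the flavor of kNN estimator linked above (a Wang/Kulkarni/Verdú-style estimator) using `sklearn`; this is not Chang's exact implementation, and the choice of `k` is just a placeholder.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def kl_divergence_knn(X, Y, k=1):
    """kNN estimate of D_KL(P || Q) from samples X ~ P (n, d) and Y ~ Q (m, d).

    Sketch of a Wang/Kulkarni/Verdu-style estimator; k is the neighbor order.
    """
    n, d = X.shape
    m = Y.shape[0]

    # rho: distance from each x_i to its k-th nearest neighbor among the other X's
    # (ask for k+1 neighbors because the closest "neighbor" is x_i itself)
    rho = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)[0][:, -1]

    # nu: distance from each x_i to its k-th nearest neighbor among the Y's
    nu = NearestNeighbors(n_neighbors=k).fit(Y).kneighbors(X)[0][:, -1]

    return d * np.mean(np.log(nu / rho)) + np.log(m / (n - 1.0))
```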
Good news: if I use logP as the loss and train for a certain number of epochs, I get a reasonably good result:
Bad news: if I train the NF for long enough, $P(\theta)$ just converges to a very tiny region, see the blue contours below. This is because the logP loss is WRONG in our case. Let's write the observed magnitude distribution as $P(x)$ and the model magnitude distribution as $Q(x)$ (where $x$ comes from sampling $P_{\rm NF}(\theta)$ and then forward modeling). $P(x)$ is already fit by an NDE, so we can evaluate it at any given $x_i$; but $Q(x)$ is not fit by an NDE. The KL divergence is $D_{\rm KL}(Q \,\|\, P) = \sum_x Q(x)\,\log\frac{Q(x)}{P(x)} \approx \frac{1}{N}\sum_{x_i \sim Q(x)} \log Q(x_i) - \frac{1}{N}\sum_{x_i \sim Q(x)} \log P(x_i)$. The second term (the cross-entropy) can be calculated, but the first term (the negative entropy of $Q$) cannot. So, the logP loss alone is wrong.
Solution: I'm trying to rewrite @changhoonhahn 's kNN estimator for the KL divergence in PyTorch, so that gradients can back-propagate through it. Or, if there is a non-parametric way to evaluate the entropy of $Q(x)$, we can just add it to the logP loss.
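For concreteness, this is roughly what the (wrong) logP-only objective looks like in PyTorch. The names `nf`, `data_nde`, and `forward_model` are placeholders for the flow over $\theta$, the NDE fit to the observed photometry, and the differentiable emulator; the missing entropy term is exactly what lets the flow collapse.

```python
import torch

def logp_only_loss(nf, data_nde, forward_model, n_samples=512):
    """The (incorrect) logP-only objective discussed above: it maximizes
    E_Q[log P(x)] but omits the entropy of Q, so the flow can collapse
    onto a tiny region. All object names are placeholders."""
    theta = nf.sample(n_samples)          # theta_i ~ P_NF(theta)
    x = forward_model(theta)              # x_i ~ Q(x)
    return -data_nde.log_prob(x).mean()   # only -E_Q[log P(x)]; no entropy term
```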
Hmmm, I see. Flipping the divergence doesn't change the fact that we still need to evaluate Q(x_i)... This might not be a good idea, but could you approximate
$Q(x_i) = \int P(x_i \mid \theta)\, P(\theta)\, d\theta \approx \sum_{\theta_i \sim P_{\rm NF}} P_{\rm NF}(x_i \mid \theta_i)$
With that said, it might be easier to use a kNN estimator. This `faiss` package I ran into a few months ago might be useful for implementing the kNN estimator: https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/
Just one correction: I think, Chang, you didn't mean to use the same index in the MC approximation. It should be,
$Q(x_i) = \int Q(x_i \mid \theta)\, P_{\rm NF}(\theta)\, d\theta \approx \sum_{\theta_j \sim P_{\rm NF}} Q(x_i \mid \theta_j)$
where $Q(x | \theta)$ is an amortized likelihood (model).
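As a sketch, the corrected Monte Carlo estimate could be evaluated in log space for stability; `amortized_likelihood.log_prob(x, context=theta)` is a hypothetical interface standing in for the amortized likelihood $Q(x \mid \theta)$, and `nf` for $P_{\rm NF}(\theta)$.

```python
import torch

def log_Q_mc(x_i, nf, amortized_likelihood, n_mc=1000):
    """Monte Carlo estimate of log Q(x_i) = log ∫ Q(x_i | theta) P_NF(theta) dtheta,
    averaging over theta_j ~ P_NF. All object names are placeholders."""
    theta = nf.sample(n_mc)                     # theta_j ~ P_NF(theta)
    x_rep = x_i.expand(n_mc, -1)                # repeat x_i for each theta_j
    log_q = amortized_likelihood.log_prob(x_rep, context=theta)
    # log of the sample mean: logsumexp(log_q) - log(n_mc)
    return torch.logsumexp(log_q, dim=0) - torch.log(torch.tensor(float(n_mc)))
```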
I have now (mostly) convinced myself that the likelihood model is the way to go. Here's my thinking. What you want is the posterior
$$ p(\theta\mid\lbrace x\rbrace) = p(\lbrace x\rbrace \mid\theta)p(\theta) $$
Let's ignore the prior term, then what's left is a pure likelihood model. The problem in this issue arises because $p(\lbrace x\rbrace)$ (or similar constructions) is actually a distribution over distributions! It's a statement about the probability of a particular photometry distribution. But we don't need to do this at all. Instead, we can do the usual expansion for independent data:
$$ p(\lbrace x\rbrace \mid\theta) = \prod_i p(x_i \mid\theta) $$
and every individual factor is simply a likelihood evaluation for a given parameter. The 3rd option of SBI. And it has the great advantage of allowing the set $\lbrace x\rbrace$ to be defined by the user.
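In code, the factorized likelihood evaluation would look something like the sketch below; `amortized_likelihood` again stands in for a trained neural likelihood estimator $p(x \mid \theta)$, and the interface names are placeholders.

```python
import torch

def population_log_likelihood(theta, x_set, amortized_likelihood):
    """log p({x} | theta) = sum_i log p(x_i | theta), evaluated with an
    amortized likelihood model. `theta` is a single parameter vector,
    `x_set` has shape (n_obs, d). Names are placeholders for this sketch."""
    context = theta.expand(x_set.shape[0], -1)   # the same theta for every x_i
    return amortized_likelihood.log_prob(x_set, context=context).sum()
```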
I think that's going back to combining individual likelihoods/posteriors. In fact, I think that's a good reason for not doing my MC approximation suggestion.
Both approaches can give us $p(\theta \mid \lbrace x_o \rbrace)$, but I think they are complementary.
Combining individual likelihoods/posteriors is nice because you can vary $\lbrace x_o \rbrace$, but it requires an accurate amortized neural density estimator (ANDE) of the full likelihood or posterior distributions (i.e. conditional distributions). In the current toy parameter space, building an ANDE is quite straightforward. But in more realistic scenarios it can be quite challenging to train an ANDE that accurately estimates the likelihood/posterior over the full parameter space, as we've found.
On the other hand, if we take the current approach, we're effectively doing the following: $$p(\theta \mid \lbrace x_o \rbrace) \approx p(\theta \mid \mathrm{argmin}\ \rho( \lbrace x_o \rbrace, F(\lbrace \theta \rbrace)))$$ This doesn't require any ANDE of conditional distributions. You only need an NDE to estimate $p_{\rm NF}(\theta)$ and some distance metric to enforce the conditional statement. In our case, we're using the KL divergence as the distance metric.
The KL divergence is well-motivated, but if we're more liberal about the distance metric, then this second approach can be scaled much more easily to higher-dimensional data vectors and becomes an easier density estimation problem to solve for $p(\theta \mid \lbrace x_o \rbrace)$.
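Schematically, that second approach is an optimization loop like the following, where $\rho$ is any differentiable distance between the forward-modeled and observed photometry distributions; all names here are placeholders, not the actual popsed implementation.

```python
import torch

def fit_population_nde(nf, forward_model, x_obs, distance,
                       n_epochs=250, n_samples=5000, lr=1e-3):
    """Optimize the normalizing flow p_NF(theta) so that the forward-modeled
    photometry F({theta}) matches the observed photometry {x_o} under some
    differentiable distance (e.g. a kNN KL estimate or a Sinkhorn loss).
    All object names are placeholders for this sketch."""
    optimizer = torch.optim.Adam(nf.parameters(), lr=lr)
    for epoch in range(n_epochs):
        optimizer.zero_grad()
        theta = nf.sample(n_samples)      # theta_i ~ p_NF(theta), must be differentiable
        x_model = forward_model(theta)    # F({theta}) via the emulator
        loss = distance(x_model, x_obs)   # rho({x_o}, F({theta}))
        loss.backward()
        optimizer.step()
    return nf
```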
Thanks for the thoughts! Well, it seems neither the ANDE nor the distance metric is easy when $\theta$ goes to higher dimensions and when noise is added. I do agree that the likelihood method has the advantage that once we have a good ANDE, we can take different SED samples without retraining. For the toy model here, I expect both ways to work. But if we add (complex) noise, I guess the distance method might be more straightforward than constructing an ANDE?
Wait, if we really have a nice ANDE (i.e., actually build an emulator for prospector), why do we need to do population-level inference? We could just infer the posterior for each galaxy and then combine the posteriors. It is precisely because building an ANDE is hard that we want other methods for population-level inference.
If we have a perfect ANDE, which Peter and I are working on, then we don't need population-level inference. But so far, and as we saw in class this week, ANDEs are not perfect and their performance can vary significantly across parameter- and data-space.
So I think it's good to have a complementary approach. Combining the likelihoods from an imperfect ANDE runs the risk of propagating the ANDE biases into $p(\theta \mid \lbrace x \rbrace)$.
While the `faiss` implementation of the kNN-based KL divergence is 3x faster than `sklearn`, it is still hard to make the KL divergence differentiable. `faiss` doesn't support full tensor-based search in some functions (such as `range_search`).
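For reference, the basic `faiss` pattern for exact kNN search (which is where the speedup over `sklearn` comes from) looks like this; the shapes and sample counts are purely illustrative.

```python
import numpy as np
import faiss

d = 5                                                # dimensionality of the photometry space
X = np.random.rand(10000, d).astype("float32")       # query points (e.g. model samples)
Y = np.random.rand(10000, d).astype("float32")       # database points (e.g. observed samples)

index = faiss.IndexFlatL2(d)     # exact (brute-force) L2 index, no training needed
index.add(Y)
D, I = index.search(X, 1)        # D: squared L2 distances, I: neighbor indices
nu = np.sqrt(D[:, -1])           # distance from each x_i to its nearest neighbor in Y
```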
Okay, I made a differentiable KL divergence at https://github.com/AstroJacobLi/popsed/blob/dd778a1b1b79a6a4ece465988bde5b44d8ab9489/popsed/nde.py#L350. The kNN here is done by brute force, so it's slow (for two 5-D distributions with 10000 samples each, the computing time for D_KL(X, Y) is about 13 s on CPU). Using Chang's `sklearn` implementation, the computing time is similar (12 s for the same case).
Although I've implemented a differentiable KL divergence, the gradient easily explodes during training, since the KL divergence is based on a crude brute-force kNN estimator. I spent a lot of time tuning the learning rate, etc., but the gradients still explode.
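For context, a brute-force differentiable version along these lines can be written with `torch.cdist` and `kthvalue`; this is just a sketch of the idea (not the exact implementation linked above), and the log of very small neighbor distances is one place where the gradients can blow up.

```python
import torch

def kl_divergence_knn_torch(X, Y, k=1):
    """Differentiable brute-force kNN estimate of D_KL(P || Q),
    with X ~ P of shape (n, d) and Y ~ Q of shape (m, d)."""
    n, d = X.shape
    m = Y.shape[0]

    # pairwise distances; take the (k+1)-th value in the X-X matrix to skip
    # the zero self-distance, and the k-th value in the X-Y matrix
    rho = torch.cdist(X, X).kthvalue(k + 1, dim=1).values
    nu = torch.cdist(X, Y).kthvalue(k, dim=1).values

    # tiny rho (or nu) makes log(nu / rho) blow up -- one source of exploding gradients
    return d * torch.log(nu / rho).mean() + torch.log(torch.tensor(m / (n - 1.0)))
```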
The KL divergence has trouble in high-dimensional spaces, especially when the supports of the two distributions don't overlap much. After a random dinner chat with a friend in the EECS department (an expert on GANs), I figured out that we can use the Wasserstein distance (or its entropy-regularized cousin, the Sinkhorn distance) for our purpose.
Using `geomloss`, the differentiable Sinkhorn distance can be calculated easily. Training only takes 250 epochs (~2 min on GPU). The results are very encouraging!! Now I think this is the right direction to go.
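For reference, the relevant `geomloss` interface is the `SamplesLoss` wrapper; roughly like the sketch below, where the `blur` value and sample sizes are placeholders rather than the exact settings used here.

```python
import torch
from geomloss import SamplesLoss

# Sinkhorn divergence between two point clouds; `blur` sets the
# entropic-regularization scale (values here are placeholders)
sinkhorn = SamplesLoss(loss="sinkhorn", p=2, blur=0.05)

x_model = torch.randn(5000, 5, requires_grad=True)   # e.g. forward-modeled photometry
x_obs = torch.randn(5000, 5)                          # e.g. observed photometry

loss = sinkhorn(x_model, x_obs)   # scalar, differentiable w.r.t. x_model
loss.backward()                   # populates x_model.grad
```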
That looks very promising! This is without a noise model or much realism, right?
We can think a bit more about how we want to compare $F(\lbrace \theta \rbrace)$ to $\lbrace x_o \rbrace$, but I think a distance metric like the one you used is a good way to go for now.
Nice! I'd like to see how you used geomloss (I've tried this for a different project but couldn't get it work).
One thing that's odd: the normalization of the distributions seems clearly lower for data than for the model. Is this a plotting artifact?
My next steps are:
Now stellar mass and redshift are added into the framework. However, everything is still noise-free. Below is a two-component Gaussian mixture model in `tage` and `log(tau)` space. The whole population (5000 mock galaxies) has Gaussian stellar mass and redshift distributions. We try to recover $P(\theta)$ from SDSS griz photometry. (The u-band is not used because the emulator doesn't cover ~2000 Å, which would be redshifted into the u-band at z=0.1.)
Overall, `popsed` does a good job of recovering $P(\theta)$, especially the stellar mass and redshift distributions. The constraining power on `tage` is weaker, though.
As we discussed last Friday, I tried to combine several (trained) NDEs and obtain a "distribution" of $P(\theta)$. The results are shown below, using SDSS `ugriz` photometry, still noise-free. The way of combining NDEs is:
The constraint on `log tau` is still weak. I don't think we can fully recover this peak using `ugriz` photometry.

I think I did one thing extremely wrong in the past experiments. The normalizing flow models the distribution of the physical parameters $\theta$, and the NF is initialized as a standard Gaussian centered at zero. In the past, I actually centered the initial NF on the center of $\theta$, which would not be available in a real application. This is the reason why we got good results in all the figures above.
If the NF is initialized as a standard Gaussian and not shifted, I get very ugly results... Maybe we can chat in depth on Friday.
As we discussed, the figure below demonstrates the randomness coming from different initial Gaussians and optimization runs. It now looks uglier than before, but it makes more sense: the shaded region now encompasses the truth.
I think this looks great! Are the 2D contours the mean distribution of multiple models or a single model?
Also, the third subpopulation in the bottom-left corner of the tage-logtau plane is interesting. One of the interesting things we might find when we apply this to real data is unexpected subpopulations like this, which cannot be ruled out by the observations.
I think I did some interesting experiments. Using the 6-parameter model (which I call the TZD, tau-metallicity-dust model) without noise, I compare the ability to recover the underlying parameter distributions between HSC grizy and GALEX NUV + HSC + JWST (3 NIR filters). The results are as expected: with more bands added, we get better results. With HSC, the inferred redshift distribution is actually flat, and so is the metallicity distribution. Adding NUV and a few NIR filters makes the inferred redshift/metallicity distributions much tighter, although the redshift distribution is biased towards low-z. In either case, the subtle mass-metallicity relation cannot be recovered.
That makes sense. NIR should help constrain dust and break degeneracies.
I find it impressive that we can recover stellar mass accurately for both cases.
Last week we discussed whether our results are mainly determined by the initialization. The figure below shows the case (SDSS with NSA noise) where the `tage` and `log(tau)` initial positions follow uniform priors (`tage ~ U(3, 10)`, `logtau ~ U(-1, 1)`). The other parameters still follow Gaussian priors. I think it's now clear that our data have no constraining power on `tage` and probably `log(Z)` (metallicity), but good constraining power on `log(tau)`, even though the prior is flat.
Just saw another package that might be a substitute for `geomloss`: https://ott-jax.readthedocs.io/en/latest/
Looks interesting, and fast. It does run in JAX, which I like a lot on its own, but there's so much PyTorch code already (e.g. the normalizing flows), which makes me hesitant to adopt JAX right now.
Now we have a forward model $f(\vec{\theta})$ (an NN emulator) that returns the SED for given physical parameters. Then we need to build a neural density estimator to describe $P(\vec{\theta} \mid \vec{x_0})$, where $\vec{x_0}$ is not a single spectrum but an ensemble of observed spectra.
As @changhoonhahn suggested, I tried to use the `sbi` package to do this, but I didn't find a way to evaluate the posterior at a given ensemble of spectra. It seems the provided examples are all about evaluating the posterior for a single spectrum. @changhoonhahn It would be great if you could take a look at https://github.com/AstroJacobLi/popsed/blob/main/notebooks/neural_density_estimator/sbi_sed.ipynb
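For reference, this is roughly the standard `sbi` (SNPE) workflow I'm referring to; each posterior evaluation conditions on a single observation `x_o`, which is why the ensemble case isn't directly covered. The shapes and data here are just toy placeholders, and the exact API may differ slightly across `sbi` versions.

```python
import torch
from sbi.inference import SNPE
from sbi.utils import BoxUniform

# toy setup: 5 physical parameters, 4-band photometry (shapes are illustrative)
prior = BoxUniform(low=torch.zeros(5), high=torch.ones(5))
theta = prior.sample((10000,))
x = torch.randn(10000, 4)          # stand-in for forward-modeled photometry

inference = SNPE(prior=prior)
density_estimator = inference.append_simulations(theta, x).train()
posterior = inference.build_posterior(density_estimator)

x_o = torch.randn(4)               # a single observed data vector
samples = posterior.sample((1000,), x=x_o)   # p(theta | x_o) for one observation only
```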