AstroJacobLi opened this issue 3 years ago
Aha, I profiled the code and found that we don't need to initialize the `ssp` object every time. Now the speed is much faster: 20 s for 1000 SEDs with different stellar mass, `tau`, and `tage`. Thus, we can generate a considerable amount of training data in ~hours.
@AstroJacobLi this looks like a great start. One small thing: you don't need to include M* in the emulator since it's a scaling factor. Also, the tau model is good to get started, but we'll likely want a better SFH parameterization.
Yes, you only need to initialize the `StellarPopulation` object once and, in fact, you may run into issues if you have multiple ones to parallelize things: see https://github.com/dfm/python-fsps/issues/60.
For general reference, one of my SED modeling python packages, `provabgs`, might be useful for this project.
Yuan-Sen's paper on how to populate the parameter space for typical SED fitting seems interesting: https://ui.adsabs.harvard.edu/abs/2016ApJ...826...83T/abstract (though that paper focuses on stellar spectra, not galaxies).
I'm now working on (re)building the emulator based on `speculator`. I generated 1600 mock spectra with different `tau` and `tage`; these spectra are used to train the emulator. Following the `speculator` paper and code, I compress the 1600 spectra to 30 PCA components. To be specific, the spectra are first transformed by taking `log10`, then normalized by subtracting the mean and dividing by the std at each wavelength. The PCA compression works pretty well; see the reconstruction error in the figure below. The PCA eigenvectors are also shown below.
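A minimal sketch of this compression pipeline, with random stand-in spectra in place of the FSPS output and `sklearn`'s PCA in place of `speculator`'s own implementation:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for the 1600 mock spectra, shape (n_spectra, n_wavelength);
# the real ones come from FSPS.
spectra = rng.lognormal(mean=0.0, sigma=0.3, size=(1600, 500))

log_spec = np.log10(spectra)                        # 1) take log10
mu, sigma = log_spec.mean(axis=0), log_spec.std(axis=0)
normed = (log_spec - mu) / sigma                    # 2) standardize per wavelength

pca = PCA(n_components=30)
coeff = pca.fit_transform(normed)                   # 3) compress to 30 coefficients

# Reconstruction inverts each step to get a spectrum back
recon = 10 ** (pca.inverse_transform(coeff) * sigma + mu)
frac_err = np.abs(recon / spectra - 1)
```

The emulator then only has to predict the 30 `coeff` values instead of the full wavelength grid.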
The NN includes `BatchNorm1d` and `Dropout(p=0.2)` layers to prevent overfitting. The input of the NN (the physical parameters, here `log tau` and `log tage`) is not normalized, but the output of the NN (the PCA coefficients) is normalized so that training is faster. The optimizer is `Adam`, with `lr=0.002` and `batch_size=128`. After 200 epochs of training, the predicted spectra have a reasonable overall shape and line depths, but the wrong continuum. In other words, the NN does a bad job predicting the first PCA coefficient (which corresponds to the continuum).
I guess the problem is related to the large `std(pca_coeff[0])` (since we've done normalization before feeding `pca_coeff` to the NN). Therefore, I tried adding more weight to the first PCA coefficient in the MSE loss function, but didn't get much improvement. I didn't find any related solutions in the `speculator` repo. @pmelchior @changhoonhahn Is there a standard way to solve this issue?

When you say that the first PCA coefficient is large: how much larger is it compared to the other ones? In general, neural nets don't like imbalance (like one coefficient being way higher than the others), but I'd ordinarily expect the dominant coefficient to be the one that is best recovered. So, this is probably related to the normalization. Where is the `std(pca_coeff[0])` normalization coming from?
The figure below shows the PCA coefficients before and after normalization. The normalization is done as `pca_coeff = (pca_coeff - np.mean(pca_coeff, axis=0)) / np.std(pca_coeff, axis=0)`.
I see. With this normalization you made it actively harder for the net to learn the bulk amplitude. Why do you think the normalization is necessary?
But the same issue exists even without normalization of the PCA coefficients: the NN still doesn't learn the bulk amplitude.
Seems like I solved the problem by removing the Batch Normalization layer and the Dropout layer from the NN 😂 The PCA coefficients are still normalized before being fed into the NN.
Below shows the loss curve, the prediction error on the validation set, and an example of the NN predicted spectrum. Everything looks reasonable now.
Those look like pretty good reconstruction errors. Should be more than sufficient for our purposes.
How do the fractional errors look below 3500A? You might want to quickly check that everything looks reasonable for the full wavelength range of the SDSS photometric bands.
@AstroJacobLi much better! So, `BatchNormalization` is often used, but it's very sketchy. You can search online for discussions on this topic. In many cases I've worked on, batch norm wasn't necessary. I generally recommend avoiding it.
Also, it looks like the net could use a little more training. There's still a small bulk offset. What's your termination criterion?
> Those look like pretty good reconstruction errors. Should be more than sufficient for our purposes.
> How do the fractional errors look below 3500A? You might want to quickly check that everything looks reasonable for the full wavelength range of the SDSS photometric bands.
Good point, I didn't generate mock spectra below 3500A in the first place. I forgot that the `u` band actually extends below 3500A. Will improve this soon.
> @AstroJacobLi much better! So, `BatchNormalization` is often used, but it's very sketchy. You can search online for discussions on this topic. In many cases I've worked on, batch norm wasn't necessary. I generally recommend avoiding it.
Thanks for the information on `BatchNormalization`!
> Also, it looks like the net could use a little more training. There's still a small bulk offset. What's your termination criterion?
Yes, I actually trained for 500 more epochs, and the loss decreased a little. But small bulk offsets exist in most cases, and I have no idea how to mitigate this. Maybe we can just choose a model with both a small loss and small bulk offsets (which we can calculate after training).
Repeat the training a bunch of times and evaluate + save each model. I never base decisions about failure on a single model instance (well, I don't do that anymore...)
Caveat: we should deal with unphysical inputs, such as negative age. The current NN emulator still produces a spectrum for unphysical inputs.
$P(\theta)$ is parameterized by the NF, so when the NF is not well trained, sampling $P(\theta)$ might yield unphysical inputs. We want to penalize the NF so that $P(\theta)$ has low power in unphysical regions. So I'm thinking that when the forward model gets an unphysical input, the output spectrum can just be `[-99, -99, ..., -99]`.
I see. If the forward model bombs with some unphysical $\theta$, then this has to be sanitized. But you have to do it in such a way that the resulting spectrum looks very different from a physical one. Not generating a spectrum for those $\theta$ values is not an option; otherwise the NF will never learn to avoid them.
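A sketch of the sentinel idea; the prior bounds in `is_physical` and the emulator stand-in are placeholders, not the project's actual parameterization:

```python
import numpy as np

SENTINEL = -99.0

def is_physical(theta):
    """Hypothetical prior-support check; bounds are placeholders."""
    log_tau, log_tage = theta
    return (-1.0 <= log_tau <= 1.0) and (0.0 < log_tage <= np.log10(13.8))

def forward_model(theta, n_pix=500):
    """Return a sentinel spectrum for unphysical inputs so the NF sees
    something very unlike a real spectrum, instead of crashing."""
    if not is_physical(theta):
        return np.full(n_pix, SENTINEL)
    return np.ones(n_pix)  # placeholder for the real emulator prediction
```

Returning a constant, clearly unphysical vector keeps the likelihood evaluable everywhere while strongly penalizing those regions.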
@pmelchior I'm now trying to add redshift to the emulator. It would be very helpful if you could share your code on convolving the transmission curve with the spectrum 😃
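Not @pmelchior's code, but a sketch of the standard photon-counting synthetic photometry, with a toy top-hat filter (the $(1+z)$ flux-dimming factors are omitted for brevity):

```python
import numpy as np

def _trapz(y, x):
    """Trapezoidal integral (avoids version-specific numpy names)."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def filter_flux(wave_rest, flux, zred, filt_wave, filt_trans):
    """Redshift a rest-frame spectrum, then average it through a filter
    transmission curve using the photon-counting convention."""
    wave_obs = wave_rest * (1.0 + zred)
    trans = np.interp(wave_obs, filt_wave, filt_trans, left=0.0, right=0.0)
    num = _trapz(flux * trans * wave_obs, wave_obs)
    den = _trapz(trans * wave_obs, wave_obs)
    return num / den

# Toy example: a flat spectrum through a top-hat "filter" gives back 1
wave = np.linspace(3000.0, 9000.0, 2000)
flux = np.ones_like(wave)
fw = np.linspace(4000.0, 5500.0, 100)
ft = np.ones_like(fw)
mean_flux = filter_flux(wave, flux, 0.05, fw, ft)
```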
A differentiable FSPS is now out: https://arxiv.org/pdf/2112.06830.pdf, which might be helpful for us. Code: https://github.com/ArgonneCPAC/dsps
Nice! Although at a glance the package does not seem particularly well documented.
I also ran into this paper, which might be similar to what we're doing: https://arxiv.org/abs/2002.09491
> I also ran into this paper, which might be similar to what we're doing: https://arxiv.org/abs/2002.09491
Yes! Recently I've been reading an ARAA paper on MCMC, and realized that what we are doing is quite like hierarchical Bayesian modeling, but parameterizing the population-parameter distribution with an NF (as you pointed out in this GW paper). Maybe this method could be powerful in the future.
@changhoonhahn I have a question about generating the training set for the forward-model emulator. I'm trying to expand the framework to include `tage`, `logtau`, dust (`dust2`), and metallicity (`logzsol`). I'm going to sample this parameter space on a 40×40×40×40 grid and generate SEDs using FSPS. But it took forever and the memory just exploded. How did you generate training SEDs for your PROVABGS work? Did you use anything like MPI?
@AstroJacobLi I didn't use a grid. Instead, I constructed my training sample by sampling the prior space extensively so I didn't need to use MPI.
In your case, are you compressing the SEDs in any way or is your emulator predicting the full SED?
@changhoonhahn My emulator predicts the PCA coefficients of the full spectrum. But either way, I still need to generate a ton of SEDs to train the emulator. For sure, I can switch from grid sampling to prior sampling, but I guess it would still take a long time to generate the training SEDs. How long did it take to generate your training sample?
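Chang's prior-sampling suggestion can be sketched like this; the prior ranges are hypothetical, and 10^5 draws replace the 40^4 = 2,560,000 grid points while still covering the space:

```python
import numpy as np

rng = np.random.default_rng(42)
n_train = 100_000  # far fewer evaluations than the 40**4 grid

# Hypothetical flat priors over the four parameters discussed above
theta = np.column_stack([
    rng.uniform(0.1, 13.8, n_train),   # tage [Gyr]
    rng.uniform(-1.0, 1.0, n_train),   # logtau
    rng.uniform(0.0, 3.0, n_train),    # dust2
    rng.uniform(-2.0, 0.2, n_train),   # logzsol
])
# each row of theta is one FSPS call; no grid needs to fit in memory
```

Random draws also avoid the grid's aliasing: training points never lie on axis-aligned planes, so the emulator sees the interior of the space.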
Just writing down some updates on training the emulator for the new SPS model:
SPS model: mostly from Chang's PROVABGS. I use the Illustris-based 4-component NMF SFH plus one burst component. I fix the metallicity history to a constant. The dust model is the same as PROVABGS (three params). In total, including redshift (i.e., t_age), there are 11 parameters. Because the four SFH coefficients are generated by sampling a flat Dirichlet distribution, their degrees of freedom are actually 3, so in total there are 10 params as input for the emulator. I generated 2.39 * 10^6 spectra, covering 1000 AA to 60,000 AA. The spectra are split into 5 chunks in wavelength (`.w1000_2000`, `.w2000_3600`, `.w3600_5500`, `.w5500_7410`, `.w7410_60000`), making it easier to train the PCA and the emulator.
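The Dirichlet trick can be sketched as follows: sample the four SFH coefficients from a flat Dirichlet, then feed only three of them to the emulator, since the fourth is fixed by the sum-to-one constraint:

```python
import numpy as np

rng = np.random.default_rng(7)

# Flat Dirichlet over the 4 NMF SFH coefficients: each row sums to 1,
# so only 3 of the 4 values are free parameters.
beta = rng.dirichlet(np.ones(4), size=5)
emulator_input = beta[:, :3]  # 4th coefficient is 1 - sum of the first 3
```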
PCA: the PCA is trained for the five wavelength chunks separately. The number of components for each chunk is [70, 50, 50, 50, 50]. The PCA recovery accuracy is generally very good (<0.2%) except at the very blue end (~1%).
Emulator: the architecture of the neural net is similar to Chang's and Justin Alsing's. I used 4 FC layers with 256 neurons each. It turned out that the batch size is crucial for the speed of convergence; after several trials, `batch_size=512` works best. The emulator for each wavelength chunk is trained for 500 epochs (~5 hours on Tiger using a single GPU). The fractional accuracy of the emulator (except at the very blue end) is now < 1%. However, at the blue end, the accuracy is ugly... Need to figure this out.
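One way to stitch the five per-chunk emulators back into a full spectrum (toy stand-ins replace the trained networks; segment lengths are illustrative):

```python
import numpy as np

# Wavelength chunk boundaries from the setup above (AA)
CHUNKS = [(1000, 2000), (2000, 3600), (3600, 5500), (5500, 7410), (7410, 60000)]

def make_toy_emulator(n_pix):
    """Stand-in for a trained per-chunk emulator (params -> spectrum segment)."""
    return lambda theta: np.full(n_pix, float(np.sum(theta)))

emulators = [make_toy_emulator(n) for n in (100, 100, 100, 100, 200)]

def predict_full_spectrum(theta, emulators):
    """Evaluate each chunk emulator and concatenate the segments in order."""
    return np.concatenate([emu(theta) for emu in emulators])

spec = predict_full_spectrum(np.ones(10), emulators)
```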
An effective and efficient forward model is required for population-level stellar population inference. Let's start with a toy model.
The simplest case is varying only stellar mass and current SFR, with everything else fixed, including redshift `zred`, metallicity `logzsol`, and the age of forming stars `tage`. Currently, they are fixed at `zred=0.05, logzsol=-0.5, tage=12.0 Gyr`, which means that the galaxy started to form stars at $z=6$. For the SFH, I use the simple tau model here. I sampled the mass-tau plane using an even grid, which looks like the following figure on the mass-SFR plane. SFR is integrated over the past 50 Myr, which is consistent with most UV+NIR SFR probes. I think this setup is reasonable for our current purpose. What do you think @changhoonhahn?

Then, in principle, we can generate SEDs/spectra for each point on the mass-SFR plane, and train a differentiable emulator on them. The figure below shows SEDs (in SDSS filters) for 10 galaxies with the same stellar mass (5e10) but different `tau` (ranging from 0.1 Gyr to 10 Gyr). This process is not fast: the average time for 1 galaxy is 1 min on my laptop. Is there any way to accelerate this process? Cache the SSP or something?

If you think generating the training sample in this way works, I can start to generate more SEDs and try to train an emulator.
I borrowed some code from Ben Johnson's `prospector` and `exspect` to calculate SFRs and generate mock SEDs. The code is in `popsed/mock.py`.