AstroJacobLi opened this issue 3 years ago
Aha, I profiled the code and found that we don't need to initialize the `ssp` object every time. Now the speed is much faster: 20 s for 1000 SEDs with different stellar mass, `tau`, and `tage`. Thus, we can generate a considerable amount of training data in ~hours.
@AstroJacobLi this looks like a great start. One small thing: you don't need to include M* in the emulator since it's a scaling factor. Also, the tau model is good to get started, but we'll likely want a better SFH parameterization.
Yes, you only need to initialize the `StellarPopulation` object once and, in fact, you may run into issues if you have multiple ones to parallelize things: see https://github.com/dfm/python-fsps/issues/60.
For general reference, one of my SED modeling python packages, `provabgs`, might be useful for this project.
Yuan-Sen's paper on how to populate the parameter space for typical SED fitting seems interesting: https://ui.adsabs.harvard.edu/abs/2016ApJ...826...83T/abstract (though that paper focuses on stellar spectra, not galaxies).
I'm now working on (re)building the emulator based on `speculator`. I generated 1600 mock spectra with different `tau` and `tage`; these spectra are used to train the emulator. Following the `speculator` paper and code, I compress the 1600 spectra to 30 PCA components. To be specific, the spectra are first transformed by taking `log10`, then normalized by subtracting the mean and dividing by the std at each wavelength. The PCA compression works pretty well; see the reconstruction error in the figure below. The PCA eigenvectors are also shown below.
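A minimal sketch of this compression pipeline, with random stand-in spectra in place of the FSPS output and `sklearn`'s PCA in place of `speculator`'s own implementation:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for the 1600 mock spectra, shape (n_spectra, n_wavelength);
# the real ones come from FSPS.
spectra = rng.lognormal(mean=0.0, sigma=0.3, size=(1600, 500))

log_spec = np.log10(spectra)                        # 1) take log10
mu, sigma = log_spec.mean(axis=0), log_spec.std(axis=0)
normed = (log_spec - mu) / sigma                    # 2) standardize per wavelength

pca = PCA(n_components=30)
coeff = pca.fit_transform(normed)                   # 3) compress to 30 coefficients

# Reconstruction inverts each step to get a spectrum back
recon = 10 ** (pca.inverse_transform(coeff) * sigma + mu)
frac_err = np.abs(recon / spectra - 1)
```

The emulator then only has to predict the 30 `coeff` values instead of the full wavelength grid.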
The NN includes `BatchNorm1d` and `Dropout(p=0.2)` layers to prevent overfitting. The input of the NN (the physical parameters, here `log tau` and `log tage`) is not normalized, but the output of the NN (the PCA coefficients) is normalized so that training is faster. The optimizer is `Adam`, with `lr=0.002` and `batch_size=128`. After 200 epochs of training, the predicted spectra have a reasonable overall shape and line depths, but the wrong continuum. In other words, the NN does a bad job predicting the first PCA coefficient (which corresponds to the continuum).
I guess the problem is related to the large `std(pca_coeff[0])` (since we've done normalization before feeding `pca_coeff` to the NN). Therefore, I tried adding more weight to the first PCA coefficient in the MSE loss function, but didn't get much improvement. I didn't find any related solutions in the `speculator` repo. @pmelchior @changhoonhahn Is there a standard way to solve this issue?

When you say that the first PCA coefficient is large: how much larger is it compared to the other ones? In general, neural nets don't like imbalance (like one coefficient being way higher than the others), but I'd ordinarily expect the dominant coefficient to be the one that is best recovered. So, this is probably related to the normalization. Where is the `std(pca_coeff[0])` normalization coming from?
The figure below shows the PCA coefficients before and after normalization. The normalization is done as `pca_coeff = (pca_coeff - np.mean(pca_coeff, axis=0)) / np.std(pca_coeff, axis=0)`.
I see. With this normalization you made it actively harder for the net to learn the bulk amplitude. Why do you think the normalization is necessary?
But the same issue exists even without normalization of the PCA coefficients: the NN still doesn't learn the bulk amplitude.
Seems like I solved the problem by removing the Batch Normalization layer and the Dropout layer from the NN 😂 The PCA coefficients are still normalized before being fed into the NN.
Below shows the loss curve, the prediction error on the validation set, and an example of the NN predicted spectrum. Everything looks reasonable now.
Those look like pretty good reconstruction errors. Should be more than sufficient for our purposes.
How do the fractional errors look below 3500A? You might want to quickly check that everything looks reasonable for the full wavelength range of the SDSS photometric bands.
@AstroJacobLi much better! So, `BatchNormalization` is often used, but it's very sketchy. You can search online for discussions on this topic. In many cases I've worked on, batch norm wasn't necessary. I generally recommend avoiding it.
Also, it looks like the net could use a little more training. There's still a small bulk offset. What's your termination criterion?
> Those look like pretty good reconstruction errors. Should be more than sufficient for our purposes.
> How do the fractional errors look below 3500A? You might want to quickly check that everything looks reasonable for the full wavelength range of the SDSS photometric bands.
Good point, I didn't generate mock spectra below 3500A in the first place. I forgot that the `u` band actually extends below 3500A. Will improve this soon.
> @AstroJacobLi much better! So, `BatchNormalization` is often used, but it's very sketchy. You can search online for discussions on this topic. In many cases I've worked on, batch norm wasn't necessary. I generally recommend avoiding it.
Thanks for the information on `BatchNormalization`!
> Also, it looks like the net could use a little more training. There's still a small bulk offset. What's your termination criterion?
Yes, I actually trained for 500 more epochs, and the loss decreased a little. But small bulk offsets exist in most cases, and I have no idea how to mitigate this. Maybe we can just choose a model with both a small loss and small bulk offsets (which we can calculate after training).
Repeat the training a bunch of times and evaluate + save each model. I never base decisions about failure on a single model instance (well, I don't do that anymore...)
Caveat: we should deal with unphysical inputs, such as negative age. The current NN emulator still produces a spectrum for unphysical inputs.
$P(\theta)$ is parameterized by the NF, so when the NF is not well trained, sampling $P(\theta)$ might yield unphysical inputs. We want to penalize the NF so that $P(\theta)$ has low power in unphysical regions. So I'm thinking that when the forward model gets an unphysical input, the output spectrum can just be `[-99, -99, ..., -99]`.
I see. If the forward model bombs with some unphysical $\theta$, then this has to be sanitized. But you have to do it in such a way that the resulting spectrum looks very different from a physical one. Not generating a spectrum for those $\theta$ values is not an option; otherwise the NF will never learn to avoid them.
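A sketch of the sentinel idea; the prior bounds in `is_physical` and the emulator stand-in are placeholders, not the project's actual parameterization:

```python
import numpy as np

SENTINEL = -99.0

def is_physical(theta):
    """Hypothetical prior-support check; bounds are placeholders."""
    log_tau, log_tage = theta
    return (-1.0 <= log_tau <= 1.0) and (0.0 < log_tage <= np.log10(13.8))

def forward_model(theta, n_pix=500):
    """Return a sentinel spectrum for unphysical inputs so the NF sees
    something very unlike a real spectrum, instead of crashing."""
    if not is_physical(theta):
        return np.full(n_pix, SENTINEL)
    return np.ones(n_pix)  # placeholder for the real emulator prediction
```

Returning a constant, clearly unphysical vector keeps the likelihood evaluable everywhere while strongly penalizing those regions.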
@pmelchior I'm now trying to add redshift to the emulator. It would be very helpful if you could share your code on convolving the transmission curve with the spectrum 😃
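Not @pmelchior's code, but a sketch of the standard photon-counting synthetic photometry, with a toy top-hat filter (the $(1+z)$ flux-dimming factors are omitted for brevity):

```python
import numpy as np

def _trapz(y, x):
    """Trapezoidal integral (avoids version-specific numpy names)."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def filter_flux(wave_rest, flux, zred, filt_wave, filt_trans):
    """Redshift a rest-frame spectrum, then average it through a filter
    transmission curve using the photon-counting convention."""
    wave_obs = wave_rest * (1.0 + zred)
    trans = np.interp(wave_obs, filt_wave, filt_trans, left=0.0, right=0.0)
    num = _trapz(flux * trans * wave_obs, wave_obs)
    den = _trapz(trans * wave_obs, wave_obs)
    return num / den

# Toy example: a flat spectrum through a top-hat "filter" gives back 1
wave = np.linspace(3000.0, 9000.0, 2000)
flux = np.ones_like(wave)
fw = np.linspace(4000.0, 5500.0, 100)
ft = np.ones_like(fw)
mean_flux = filter_flux(wave, flux, 0.05, fw, ft)
```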
A differentiable FSPS is now out: https://arxiv.org/pdf/2112.06830.pdf, which might be helpful for us. Code: https://github.com/ArgonneCPAC/dsps
Nice! Although at a glance the package does not seem particularly well documented.
I also ran into this paper, which might be similar to what we're doing: https://arxiv.org/abs/2002.09491
> I also ran into this paper, which might be similar to what we're doing: https://arxiv.org/abs/2002.09491
Yes! Recently I've been reading an ARAA paper on MCMC, and realized that what we are doing is quite like hierarchical Bayesian modeling, but parameterizing the population-parameter distribution with an NF (as you pointed out in this GW paper). Maybe this method could be powerful in the future.
@changhoonhahn I have a question about generating the training set for the forward-model emulator. I'm trying to expand the framework to include `tage`, `logtau`, dust (`dust2`), and metallicity (`logzsol`). I'm going to sample this parameter space on a 40×40×40×40 grid and generate SEDs using FSPS. But it took forever and the memory just exploded. How did you generate training SEDs for your PROVABGS work? Did you use anything like MPI?
@AstroJacobLi I didn't use a grid. Instead, I constructed my training sample by sampling the prior space extensively so I didn't need to use MPI.
In your case, are you compressing the SEDs in any way or is your emulator predicting the full SED?
@changhoonhahn My emulator predicts the PCA coefficients of the full spectrum. But either way, I still need to generate a ton of SEDs to train the emulator. For sure, I can switch from grid sampling to prior sampling, but I guess it would still take a long time to generate the training SEDs. How long did it take to generate your training sample?
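Chang's prior-sampling suggestion can be sketched like this; the prior ranges are hypothetical, and 10^5 draws replace the 40^4 = 2,560,000 grid points while still covering the space:

```python
import numpy as np

rng = np.random.default_rng(42)
n_train = 100_000  # far fewer evaluations than the 40**4 grid

# Hypothetical flat priors over the four parameters discussed above
theta = np.column_stack([
    rng.uniform(0.1, 13.8, n_train),   # tage [Gyr]
    rng.uniform(-1.0, 1.0, n_train),   # logtau
    rng.uniform(0.0, 3.0, n_train),    # dust2
    rng.uniform(-2.0, 0.2, n_train),   # logzsol
])
# each row of theta is one FSPS call; no grid needs to fit in memory
```

Random draws also avoid the grid's aliasing: training points never lie on axis-aligned planes, so the emulator sees the interior of the space.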
Just writing down some updates on training the emulator for the new SPS model:
SPS model: mostly from Chang's PROVABGS. I use the Illustris-based 4-component NMF SFH plus one burst component. I fix the metallicity history to a constant. The dust model is the same as PROVABGS (three params). In total, including redshift (i.e., t_age), there are 11 parameters. Because the four SFH coefficients are generated by sampling a flat Dirichlet distribution, their degrees of freedom are actually 3, so in total there are 10 params as input for the emulator. I generated 2.39 * 10^6 spectra, covering 1000 AA to 60,000 AA. The spectra are split into 5 chunks in wavelength (`.w1000_2000`, `.w2000_3600`, `.w3600_5500`, `.w5500_7410`, `.w7410_60000`), making it easier to train the PCA and the emulator.
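The Dirichlet trick can be sketched as follows: sample the four SFH coefficients from a flat Dirichlet, then feed only three of them to the emulator, since the fourth is fixed by the sum-to-one constraint:

```python
import numpy as np

rng = np.random.default_rng(7)

# Flat Dirichlet over the 4 NMF SFH coefficients: each row sums to 1,
# so only 3 of the 4 values are free parameters.
beta = rng.dirichlet(np.ones(4), size=5)
emulator_input = beta[:, :3]  # 4th coefficient is 1 - sum of the first 3
```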
PCA: the PCA is trained for the five wavelength chunks separately. The number of components for each chunk is [70, 50, 50, 50, 50]. The PCA recovery accuracy is generally very good (<0.2%) except at the very blue end (~1%).
Emulator: the architecture of the neural net is similar to Chang's and Justin Alsing's. I used 4 FC layers with 256 neurons each. It turned out that the batch size is crucial for the speed of convergence; after several trials, `batch_size=512` works best. The emulator for each wavelength chunk is trained for 500 epochs (~5 hours on Tiger using a single GPU). The fractional accuracy of the emulator (except at the very blue end) is now < 1%. However, at the blue end, the accuracy is ugly... Need to figure this out.
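One way to stitch the five per-chunk emulators back into a full spectrum (toy stand-ins replace the trained networks; segment lengths are illustrative):

```python
import numpy as np

# Wavelength chunk boundaries from the setup above (AA)
CHUNKS = [(1000, 2000), (2000, 3600), (3600, 5500), (5500, 7410), (7410, 60000)]

def make_toy_emulator(n_pix):
    """Stand-in for a trained per-chunk emulator (params -> spectrum segment)."""
    return lambda theta: np.full(n_pix, float(np.sum(theta)))

emulators = [make_toy_emulator(n) for n in (100, 100, 100, 100, 200)]

def predict_full_spectrum(theta, emulators):
    """Evaluate each chunk emulator and concatenate the segments in order."""
    return np.concatenate([emu(theta) for emu in emulators])

spec = predict_full_spectrum(np.ones(10), emulators)
```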
An effective and efficient forward model is required for population-level stellar population inference. Let's start with a toy model.
The simplest case is varying only stellar mass and current SFR, with everything else fixed, including redshift `zred`, metallicity `logzsol`, and the age of forming stars `tage`. Currently, they are fixed at `zred=0.05, logzsol=-0.5, tage=12.0 Gyr`, which means that the galaxy started to form stars at $z=6$. For the SFH, I use the simple tau model here. I sampled the mass-tau plane using an even grid, which looks like the following figure on the mass-SFR plane. SFR is integrated over the past 50 Myr, which is consistent with most UV+NIR SFR probes. I think this setup is reasonable for our current purpose. What do you think @changhoonhahn?

Then, in principle, we can generate SEDs/spectra for each point on the mass-SFR plane, and train a differentiable emulator on them. The figure below shows SEDs (in SDSS filters) for 10 galaxies with the same stellar mass (5e10) but different `tau` (ranging from 0.1 Gyr to 10 Gyr). This process is not fast: the average time for 1 galaxy is 1 min on my laptop. Is there any way to accelerate this process? Cache the SSP or something?

If you think generating the training sample in this way works, I can start to generate more SEDs and try to train an emulator.
I borrowed some code from Ben Johnson's `prospector` and `exspect` to calculate SFRs and generate mock SEDs. The code is in `popsed/mock.py`.