AstroJacobLi opened this issue 2 years ago
It seems this initialization acts more like a flat "prior" in a Bayesian sense. I then trained the NDE without the penalty (we don't need it anymore). The training looks okay. After combining 20 NDEs, this is what I get (not great): in short, very poor constraints on all parameters except dust2, redshift, and stellar mass. If we believe this is true, then the good results I showed last week are largely due to the "implicit non-flat prior" that comes from the way we initialize the NDEs.
Let me guess: when you sample from these posteriors, the photometry still matches well with observations?
True, the photometry matches the observations quite well.
Is the loss any higher than with the previous flow initialization?
I think I found a way to make the NDE training better. I tried tuning various hyper-parameters and found that the most important one is the `blur` used in calculating the Wasserstein loss. If I understand correctly, this parameter controls how sensitive the Wasserstein distance is to fine structure in the data. A small `blur` (e.g., 1e-3) makes the loss function very sensitive to the detailed structure of the two distributions. In the past I set `blur=0.1`, which apparently did not give the optimizer enough gradient. Now I train the NDEs with `blur=1e-3`. The results are shown below. Compared with the previous results, we have better constraints on stellar mass, redshift, metallicity, dust optical depth, and dust attenuation slope. Without a strong prior, the photometric data still lack constraining power on the SFH.
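For reference, here is a minimal sketch of how `blur` enters, assuming the Wasserstein distance is computed with `geomloss.SamplesLoss` (the tensor shapes and sample sources below are just placeholders):

```python
import torch
from geomloss import SamplesLoss

# Sinkhorn-based approximation to the Wasserstein distance between two point
# clouds; `blur` sets the entropic-regularization scale, so a smaller blur
# resolves finer structure in the two distributions (at higher cost).
wasserstein = SamplesLoss(loss="sinkhorn", p=2, blur=1e-3)

x = torch.randn(512, 16, requires_grad=True)  # e.g. samples drawn from the NDE
y = torch.randn(512, 16)                      # e.g. samples from the target distribution
loss = wasserstein(x, y)                      # scalar torch tensor
loss.backward()                               # gradients flow back to the NDE samples
```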
Nice, this does look a lot better.
Another good sign: with `blur=1e-3`, the results do not show a strong dependence on the neural-net architecture.
We have discussed for a while using the variable-transformation trick to avoid penalty functions for unphysical regions. I ran a few tests in this direction.
Assume we have a random variable $X$ with support $[a, b]$. If we believe our prior on $X$ is a tophat, we can transform $X$ to a new variable $Y = \Phi^{-1}\left(\frac{X-a}{b-a}\right)$ (where $\Phi$ is the CDF of the standard normal distribution), which then follows a standard normal distribution.
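For concreteness, a minimal sketch of this transform and its inverse (assuming `scipy` for the normal CDF; the range used in the check is just illustrative):

```python
import numpy as np
from scipy.stats import norm

def to_gaussian(x, a, b):
    """Map X ~ Uniform(a, b) to Y = Phi^{-1}((X - a) / (b - a)) ~ N(0, 1)."""
    return norm.ppf((x - a) / (b - a))

def to_physical(y, a, b):
    """Inverse map: Y ~ N(0, 1) back to X = a + (b - a) * Phi(Y) ~ Uniform(a, b)."""
    return a + (b - a) * norm.cdf(y)

# Quick check: a uniform sample becomes (approximately) standard normal.
x = np.random.uniform(0.0, 1.5, size=100_000)  # e.g. a dust2-like range
y = to_gaussian(x, 0.0, 1.5)
print(y.mean(), y.std())                        # ~0 and ~1
```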
With this said, all SPS parameters in our case have supports of the form $[a, b]$, and we want the prior on each to be a tophat. So we can initialize Gaussians in the $Y$ space (we can do this because a normalizing flow starts from a Gaussian base distribution), then transform back to $X$ and get a flat prior. The ranges $[a, b]$ are also relatively easy to determine; setting them to the emulator ranges works fine.
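And a sketch of the initialization idea itself, i.e. sampling the Gaussian base and pushing it through $\Phi$ to get tophats (the parameter names and ranges here are placeholders, not our actual emulator ranges):

```python
import numpy as np
from scipy.stats import norm

# Base distribution of an untrained flow: independent standard normals in Y space.
ranges = {"logM": (8.0, 12.0), "dust2": (0.0, 4.0), "z": (0.0, 1.5)}  # illustrative only
y0 = np.random.standard_normal((100_000, len(ranges)))

# Push each column through Phi and rescale to its [a, b]: X = a + (b - a) * Phi(Y).
x0 = np.column_stack([
    lo + (hi - lo) * norm.cdf(y0[:, i]) for i, (lo, hi) in enumerate(ranges.values())
])
# Each column of x0 is (approximately) uniform over its [a, b], i.e. a flat "prior",
# which is what the noisy blue contours in the figure below show.
```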
This figure shows the initialization scheme described above. The initial distribution is flat (shown as the noisy blue contours).
Before transformation (just standard normal distributions)
After transformation (standard normals become tophats)