LSSTDESC / DC2-production

Configuration, production, validation specifications and tools for the DC2 Data Set.
BSD 3-Clause "New" or "Revised" License

non-trivial galaxy morphology #32

Closed. rmandelb closed this issue 6 years ago.

rmandelb commented 6 years ago

@joezuntz , @mdschneider , @rmjarvis , @esheldon , @EiffL (and feel free to loop anybody else in, I just pinged some people who participated in these discussions):

Just to close the loop on the question of galaxy morphology beyond Sersics in DC2: the Sersic models that are already in there would already give non-elliptical galaxy isophotes, given that the bulge and disk components are not constrained to have the same shape. But it seemed from the discussion on Slack a while ago that there was significant interest in having a flag that would tell ImSim to do something more complex (resulting in different morphology for PhoSim and ImSim runs). The "something more complex" could be something like this: some fixed fraction of the disk flux goes into a galsim.RandomWalk component rather than into the smooth disk component, with the same RNG seed in each band to ensure the clumps have the same location in each band.

Questions:

  1. Are there alternate ideas for how this could be implemented?

  2. If people want to implement it in this way, then we have to decide (a) what fixed fraction of the disk flux to put into the clumps, and (b) what SED to give to the clumps. For (a), it could also be a redshift-dependent fraction, I imagine. For (b), the path of least resistance is clump SED = disk SED, but that's also not very realistic. We could choose an "actively star-forming SED" for them, in which case the flux normalization needs to be defined in some bolometric sense (i.e., some fraction of total disk flux goes into clumps, and the effective fraction in each band differs depending on the disk vs. clump SED). Thoughts?

  3. Implementing this would require some updates in ImSim (which calls GalSim for image rendering). Is anybody in WL willing to volunteer their effort to doing that? If not, then we have to weigh this request against other ImSim modifications given the available effort.

rmjarvis commented 6 years ago

Here is an example of the kind of multi-component galaxy that I was talking about. We've been using this kind of thing in most of the DES sims lately.

The relative fluxes of the three components (bulge, disk, knots) are not constant. The knots and disk have the same ellipticity, but the bulge is different. We haven't been bothering with the centers being different, but that might also be worth including.

rmandelb commented 6 years ago

@rmjarvis - thanks for pointing this out. Thinking about how to implement this in the context of our simulation catalogs, which come with predefined disk and bulge fluxes and sizes, I think that if we want to otherwise emulate your scheme, we'd do the following:

Does that sound right?

And that yaml file you linked looks like it's from single-band sims, so we would still have to decide what to do about SEDs.

rmjarvis commented 6 years ago

Yeah, we haven't done much with colors yet in DES. We're starting to think about it for a batch of sims we're starting to develop for Y3 but haven't implemented much of it yet.

I think having a different SED for each component makes sense. I'm not sure how to get the normalizations right given an input catalog with the fluxes already given. I guess probably match the fluxes in one band and then let the SEDs dictate what that means for the other bands.
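
For illustration, here is a minimal Python sketch of the "match the fluxes in one band" idea. The band names, the per-band relative fluxes of each SED, the reference band, and the 30% knot fraction are made-up values, and `scale_to_reference_band` is a hypothetical helper, not part of GalSim or imSim.

```python
def scale_to_reference_band(sed_band_fluxes, ref_band, ref_flux):
    """Rescale an SED's per-band fluxes so the flux in ref_band equals ref_flux."""
    scale = ref_flux / sed_band_fluxes[ref_band]
    return {band: scale * flux for band, flux in sed_band_fluxes.items()}

# Hypothetical relative band fluxes for a disk SED and a bluer, star-forming knot SED.
disk_sed = {"g": 0.8, "r": 1.0, "i": 1.1}
knot_sed = {"g": 1.4, "r": 1.0, "i": 0.8}

catalog_disk_flux_r = 100.0   # disk flux given by the input catalog in the reference band
knot_fraction_r = 0.3         # assumed fraction of the disk flux moved into knots in that band

# Match both components in the reference band; the SEDs set the other bands.
knot_fluxes = scale_to_reference_band(knot_sed, "r", knot_fraction_r * catalog_disk_flux_r)
smooth_fluxes = scale_to_reference_band(disk_sed, "r", (1.0 - knot_fraction_r) * catalog_disk_flux_r)

# The effective knot fraction then differs from band to band.
effective_fraction = {
    band: knot_fluxes[band] / (knot_fluxes[band] + smooth_fluxes[band])
    for band in knot_fluxes
}
```

With these toy numbers the knot fraction is 0.3 in r by construction but larger in g, which is the band-dependent effective fraction mentioned earlier in the thread. Note that the total disk flux is only conserved in the reference band under this scheme.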

rmjarvis commented 6 years ago

Oh, and as to the prescription you laid out, the one edit I might make is that we kind of consider the knots to be part of the disk. So I'd probably leave the bulge at the original flux and only reduce the disk flux by smooth_frac. Then (1-smooth_frac) * disk_flux would be the flux in the knot component.

You also don't have to have the random number be [0,1]. You can let the smooth component always take at least say 30% of the flux or something. We just picked uniform in [0,1] because it was easy, but I don't know if that's the most physically motivated choice.
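
The flux split described in the two paragraphs above can be sketched in Python as follows. This is only an illustration: `split_disk_flux` is a hypothetical helper, and the 30% floor is the example value mentioned above rather than a tuned choice.

```python
import random

def split_disk_flux(bulge_flux, disk_flux, rng, min_smooth_frac=0.3):
    """Leave the bulge untouched and share the disk flux between smooth disk and knots."""
    # smooth_frac is drawn uniformly in [min_smooth_frac, 1] instead of [0, 1].
    smooth_frac = rng.uniform(min_smooth_frac, 1.0)
    return {
        "bulge": bulge_flux,
        "disk": smooth_frac * disk_flux,
        "knots": (1.0 - smooth_frac) * disk_flux,
        "smooth_frac": smooth_frac,
    }

rng = random.Random(1234)  # fixed seed so the split is reproducible
fluxes = split_disk_flux(bulge_flux=40.0, disk_flux=60.0, rng=rng)
```

The smooth disk and the knots always sum to the catalog disk flux, so the total galaxy flux is unchanged.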

egawiser commented 6 years ago

Good point - a galaxy might look pretty unusual as just a bulge with a bunch of knots around it, so smooth_frac uniform over [0,0.7] seems more physical. In reality I bet the distribution is more like an exponential or gaussian with 0 most common and 1 very uncommon - is there a way to get a rough distribution from HSC or other surveys?

drphilmarshall commented 6 years ago

Apologies for my ignorance: are (or can) the properties of these knots (be) output in an instance catalog? This would provide a route to making PhoSim simulate the same clumpy galaxy.

rmjarvis commented 6 years ago

Sure. It's just a bunch of delta functions at a set of points, all of which have the same flux. This information can easily be accessed and written out. I'd assume this could be turned into something that PhoSim could use.

cwwalter commented 6 years ago

> Is there a way to get a rough distribution from HSC or other surveys?

Yes, it would be nice if we could roughly tie this to some sort of observation so that we are not only adding complexity but we are also making it somewhat more realistic. Then we can also report this information when we describe the model we used.

rmandelb commented 6 years ago

@egawiser and @cwwalter - There is no direct and straightforward way to tie this RandomWalk profile to current observations. You'd have to do some statistical analysis of how you can model galaxy light profiles as sums of smooth components plus a bunch of DeltaFunctions. It would be a big research project to get a directly data-driven version of this. So, I think if we wanted totally data-driven non-parametric morphology we'd have to go more in the direction of HST-based generative models of galaxy morphology, like this: http://adsabs.harvard.edu/abs/2016arXiv160905796R

These are seriously awesome and Francois would like to put them into GalSim, but that won't help us for DC2. (DC3, on the other hand, ...)

So in short, I think the answer is that we have a clear way to do something that isn't completely insane but also isn't rigorously data-driven using RandomWalk objects, with some decisions to be made about SEDs. With some more work, we could get the RandomWalk info into the catalogs such that PhoSim could use it too (but it's not clear to me this is required). And what I am hearing is that the weak lensers think that this RandomWalk approach would be preferable to just using pure sersics (which are, of course, not rigorously data-driven descriptions of morphology either :). Do we have any people on this thread who would want to volunteer to implement this? We need some effort to go into this if it is going to happen...

cwwalter commented 6 years ago

> So in short, I think the answer is that we have a clear way to do something that isn't completely insane but also isn't rigorously data-driven using RandomWalk objects, with some decisions to be made about SEDs. With some more work, we could get the RandomWalk info into the catalogs such that PhoSim could use it too (but it's not clear to me this is required). And what I am hearing is that the weak lensers think that this RandomWalk approach would be preferable to just using pure sersics (which are, of course, not rigorously data-driven descriptions of morphology either :). Do we have any people on this thread who would want to volunteer to implement this? We need some effort to go into this if it is going to happen...

So this is all implemented in GalSim, and we need someone to add the calling interface from imSim, right? There is a graduate student asking for an imSim-related project, but I want to understand the scope before I suggest this.

EiffL commented 6 years ago

To add to what @rmandelb was saying, I'm in the middle of building a simple ML model that would be able to generate realistic parameters for the RandomWalk model. I should be able to report whether it works or not by tomorrow. If it does, it would be a data-driven way of building compound galaxies with bulge + disk + knots.

rmandelb commented 6 years ago

Chris:

Yes, RandomWalks etc. are all in GalSim, so it would be a matter of coding up the prescription we've discussed on the ImSim side. And as Francois said, he took my statement that building a data-driven model for RandomWalks would be a big research project as a challenge to do the job in 1 day ;) so perhaps we can see what he comes up with, and then decide whether to go with the simple prescription discussed earlier in the thread or something more data-driven.

cwwalter commented 6 years ago

@EiffL Any update on this?

> To add to what @rmandelb was saying, I'm in the middle of building a simple ML model that would be able to generate realistic parameters for the RandomWalk model. I should be able to report whether it works or not by tomorrow. If it does, it would be a data-driven way of building compound galaxies with bulge + disk + knots

EiffL commented 6 years ago

Yes, although not as conclusive as what I would have hoped for. But here is the summary of my experiments last week:

The idea was that given such a model, for the DC2 simulations we could feed it the parameters of the bulge and disk and it would give us an appropriate distribution for the number of knots and flux ratio to make the image realistic.

As a by-product, the inference model that I trained also produces the posterior parameters of the generative model for a given galaxy image. Here is an example of what this looks like in a "successful case":

[image]

From left to right: a (bulge+disk+knots) image drawn from the posterior, the original (bulge+disk) image, the original COSMOS galaxy image, the residuals of real image - (bulge+disk+knots), and the residuals of real image - (bulge+disk). In this particular example, the quality of the fit is nicely improved, and this realization of the (bulge+disk+knots) model uses 35 knots and a flux ratio of 0.26 between knots and total disk.

Here is a mildly successful case:

[image]

where some of the structure of the real galaxy is captured by the knots (here 45 knots and a flux_ratio of 0.16).

And here is a case where it completely fails:

[image]

In this case the model defaults to a flux_ratio of 0.001 (the hard limit that I impose), meaning that essentially no flux is allocated to the knots.

So, this illustrates that my inference model is not 100% successful: in general it doesn't degrade the fit to the actual galaxy, but in many cases it doesn't improve it much either. The main limitation of this model is that it is based on approximate variational inference using neural networks, so it can be tricky to train and, as a result, quite approximate. I thought it would be enough to give us a rough idea of the RandomWalk parameters but maybe it isn't...

In the case of this last example however, I also suspect that the problem might come from the conservation of flux in the disk and limited maximum number of knots (100). Lowering the flux in the smooth component of the disk to allow for knots is going to decrease the quality of the fit more than the benefit of adding the knots...

This being said, it does learn a non-trivial prior for the number of knots and flux ratios. I parametrise the number of knots with a Binomial distribution. The blue histogram is the distribution of samples from the posterior on COSMOS images, and the orange histogram is the distribution of the means of the binomial distributions predicted by the conditional prior for galaxies in the COSMOS sample:

[image]

For the entire COSMOS sample, I get a mean number of knots of 41. However, I am quite disappointed by the fact that the prior remains concentrated near this mean value; it seems to indicate that the bulge+disk fit parameters are not good predictors of the number of knots necessary for the fit...

As for the distribution of flux ratios, I parametrise the prior using a Gaussian distribution in log scale. Here is what the distribution of prior means looks like, along with the distribution of flux ratio samples from the posteriors:

[image]

I apply a hard cut at flux_ratio = 1, which means 0 flux in the smooth disk. The prior distribution is centered around flux_ratio=0.14. Again, I'm a bit disappointed not to see more variation in the conditional prior for the flux ratio; it also seems to indicate that the bulge+disk fit parameters are not very good predictors of the flux ratio.
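
As a rough illustration of the two priors described above, the following Python sketch draws a number of knots from a Binomial distribution and a flux ratio from a Gaussian in log scale. The maximum of 100 knots, the mean of 41, the center of 0.14, and the limits 0.001 and 1 are taken from this comment; the success probability 0.41 is simply 41/100, and the 0.3 dex width and the use of clipping for the hard limits are assumptions, not values from the trained model.

```python
import math
import random

def sample_n_knots(rng, n_max=100, p=0.41):
    """Binomial draw: count successes in n_max Bernoulli trials with probability p."""
    return sum(1 for _ in range(n_max) if rng.random() < p)

def sample_flux_ratio(rng, center=0.14, sigma_dex=0.3, lower=0.001, upper=1.0):
    """Gaussian in log10(flux_ratio), clipped to the hard limits [lower, upper]."""
    log_ratio = rng.gauss(math.log10(center), sigma_dex)  # sigma_dex is an assumed width
    return min(upper, max(lower, 10.0 ** log_ratio))

rng = random.Random(2017)
draws = [(sample_n_knots(rng), sample_flux_ratio(rng)) for _ in range(1000)]
mean_knots = sum(n for n, _ in draws) / len(draws)
```

In the model discussed here the parameters of both distributions are predicted from the bulge and disk fit parameters; this sketch uses fixed values only to show the sampling step.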

So, to wrap up, here are my conclusions/remarks on this attempt:

I hope this is not too confusing, let me know if there are questions/comments/suggestions. But if it sounds reasonable I can also write it up more cleanly and put the code on GitHub.

rmjarvis commented 6 years ago

That's very nice work. To your last question, at 100 knots, the knot drawing already dominates the two Sersic components, so going to 1000 would take fully 10x longer. Plus I suspect we don't usually care much about the realism difference at that point. At least for wl shears, 100 knots will be already complicated enough to stress test the shear algorithms, so I think we can just go with that.

rmandelb commented 6 years ago

Hi Francois -

> In the case of this last example however, I also suspect that the problem might come from the conservation of flux in the disk and limited maximum number of knots (100). Lowering the flux in the smooth component of the disk to allow for knots is going to decrease the quality of the fit more than the benefit of adding the knots...

Right. I was wondering about this a bit. If the original fit was only to a smooth disk model, then is it possible this is not actually including all the flux of the galaxy, and the right answer is to let the knots add flux? In principle we could look at the difference between observed HST magnitude and fit magnitude, to see if there are signs that this is happening. Maybe you could deliberately choose some large galaxies with obvious knots of star formation, and look at this question for them? If it's a real problem, then that fit constraint is not the right one to use. I realize that relaxing that constraint may be opening a can of worms and turning this into a research project that is probably beyond the scope of DC2. We do need to converge on an approach for this problem fairly soon.

EiffL commented 6 years ago

Yes, so I went ahead and started fitting 60000 galaxies from the COSMOS sample with additional knots (up to 150) using a simple least-squares fit, keeping the constraint that the flux from the disk should be preserved. Relaxing this constraint indeed opens a huge can of worms... This fit is not as fancy as the first method I was trying but should tell us all we need to know. It's been running all night and will take a couple more hours. Here is an example of what the fits look like:

[image]

rmandelb commented 6 years ago

Well, that's very pretty. I'm curious how well it does in general; I guess you could do a 2D histogram of chi^2 for b+d fit versus chi^2 for b+d+k fit to get a sort of global picture?
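
A minimal Python sketch of the comparison suggested above: compute chi^2 for the b+d and b+d+k models against the same pixels and bin the pairs in a 2D histogram. The pixel values, the constant noise level, and the bin edges are toy numbers, and both helpers are hypothetical.

```python
from bisect import bisect_right

def chi2(data, model, sigma):
    """Chi-squared of a model against pixel values, assuming constant noise sigma."""
    return sum(((d - m) / sigma) ** 2 for d, m in zip(data, model))

def histogram2d(xs, ys, edges):
    """Count (x, y) pairs on a square grid of bins; points outside the edges are dropped."""
    nbins = len(edges) - 1
    counts = [[0] * nbins for _ in range(nbins)]
    for x, y in zip(xs, ys):
        i = bisect_right(edges, x) - 1
        j = bisect_right(edges, y) - 1
        if 0 <= i < nbins and 0 <= j < nbins:
            counts[i][j] += 1
    return counts

# Toy "images" flattened to pixel lists (made-up numbers).
data = [1.0, 2.0, 3.0]
model_bd = [1.5, 2.0, 2.0]    # bulge + disk fit
model_bdk = [1.1, 2.0, 2.9]   # bulge + disk + knots fit
sigma = 0.5

chi2_bd = chi2(data, model_bd, sigma)
chi2_bdk = chi2(data, model_bdk, sigma)

# One galaxy here; in practice xs and ys would hold one entry per fitted galaxy.
counts = histogram2d([chi2_bd], [chi2_bdk], edges=[0.0, 1.0, 2.0, 4.0, 8.0])
```

Galaxies where the knots help land below the diagonal of such a histogram, since their b+d+k chi^2 is smaller than their b+d chi^2.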

EiffL commented 6 years ago

Ok, done. I have made an extended parametric catalog for 59904 galaxies from the COSMOS sample, adding the following columns: n_knots, disk_flux_ratio, knots_coords. @rmandelb let me know how the fit_mad_* were computed and I can add a column for that as well. Here is the link to the FITS table in case people are curious: https://cloud.orioncloud.fr/index.php/s/bYHNDzplSCvDIDM

The coords column is in pixel coordinates along the major and minor axes of the disk; it would require a tiny bit of head scratching to put that in a format directly ingestible by the GalSim RandomWalk model.
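
One possible version of that conversion, as a Python sketch: rotate the (major, minor) pixel offsets by the disk position angle and multiply by the pixel scale. The angle convention (measured counter-clockwise from the +x axis to the major axis), the 0.03 arcsec/pixel scale, and the helper name are assumptions for illustration; the actual catalog conventions should be checked before using anything like this.

```python
import math

def knot_offsets_on_sky(coords_pix, position_angle_deg, pixel_scale):
    """Rotate (major, minor) pixel offsets into (x, y) offsets in arcsec."""
    beta = math.radians(position_angle_deg)
    cos_b, sin_b = math.cos(beta), math.sin(beta)
    return [
        (pixel_scale * (u * cos_b - v * sin_b),
         pixel_scale * (u * sin_b + v * cos_b))
        for u, v in coords_pix
    ]

# Two toy knots: one 10 pixels along the major axis, one 5 pixels along the minor axis.
offsets = knot_offsets_on_sky(
    [(10.0, 0.0), (0.0, 5.0)],
    position_angle_deg=90.0,  # major axis along +y under the assumed convention
    pixel_scale=0.03,         # assumed arcsec per pixel
)
```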

Here are a few plots for the fits:

[image]

[image]

EiffL commented 6 years ago

I have looked at a few hundred instances, and the fit is always improved compared to the b+d model. It's doing an amazing job for smaller and compact galaxies like this:

[image] [image] [image]

but just OK when the galaxy is larger/more diffuse:

[image] [image] [image] [image]

Also, in a significant number of cases the two-component parametric fits are just very bad, which leads to even more corrections.

Note that these fits completely forget about the RandomWalk properties and just put point sources where they are needed to improve the fit. I think this explains in part why I get a lower average number of knots in these fits (25) compared to the previous number (41) found with the previous statistical model (which assumed a random walk prior). Still I think it provides interesting insight.

esheldon commented 6 years ago

At some point we had discussed letting RandomWalk follow a specified profile rather than a classical random walk, expecting that indeed it would work better. e.g. follow an exponential and use the photon shooting machinery

rmandelb commented 6 years ago

Your images look really nice to me.

However, I am concerned about the distribution of knot/(knot+disk) flux ratio, particularly the pileup near 1. I can think of a few explanations:

To test this, can you (a) make images of a random selection of those things, (b) make a scatter plot of knot/(knot+disk) flux ratio vs. bulge/total flux ratio?

Also, for these fits, do you find any correlation between disk parameters and number of knots or knot flux ratio? (which we could use to make the statistical model for knot distributions)

EiffL commented 6 years ago

Here are a few typical images of flux_ratio=1 objects (showing also the disk fit that has 0 flux in the b+d+k image):

[image] [image] [image]

Looking at where these objects lie in the (disk flux, bulge flux) plane is quite interesting:

[image]

And here is the plot you requested of flux ratio vs. bulge/total flux ratio:

[image]

So essentially I think flux_ratio=1 objects correspond to galaxies with a fairly subdominant disk (compared to the bulge), and the flux from that component is better served using only knots.

But I don't think it indicates anything necessarily wrong with the fitter. Or at least, of all the examples I have looked at, I haven't seen any obvious failure, and there is no noise overfitting either.

As for making a statistical model from these fits, yes there are some trends with fit parameters. I can very easily fit a model to predict a number of knots and a flux ratio given bulge and disk parameters. My main worry is that these parameters don't correspond to a proper random walk model: the knots are more correlated in these fits to COSMOS images than they would be in a realization of a RandomWalk model. I'll try to generate a few of those tomorrow to see what they look like.

rmandelb commented 6 years ago

Nice. I agree that if this is happening a lot in subdominant disks, it's not necessarily an indication of any failure.

EiffL commented 6 years ago

I'm testing to see what happens if I randomly draw the knots (keeping the flux ratio and numbers directly taken from the fits) according to a Gaussian distribution matching the size and ellipticity of the disk. Here are 2 sets of images, the first one using the fitted knot positions and the other using a random distribution to position them.

[image] [image]

Without noise the same images look like this:

[image] [image]

This simple prescription to draw the knots is not completely perfect, I kind of see the difference between fitted and randomly drawn knots, but it's not completely crazy either, it even looks pretty nice.

So, if we adopt this simple prescription, saying that we will randomly distribute the knots according to a Gaussian of the same shape as the disk, the last piece of the puzzle is to fit a simple model that will predict the number of knots and flux ratios given disk and bulge parameters, which I can easily do.
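
The prescription can be sketched in plain Python as below. This is not the GalSim implementation: the function name is hypothetical, the way the axis ratio is applied (stretching and squeezing the two axes so that the area is preserved) is an assumed convention, and the input numbers are toy values. The relation sigma = half_light_radius / sqrt(2 ln 2) holds for a circular 2D Gaussian, and every knot gets the same flux.

```python
import math
import random

def draw_knots(n_knots, knot_flux, half_light_radius, axis_ratio, position_angle_deg, rng):
    """Draw knot positions from an elliptical Gaussian; all knots share the flux equally."""
    # Width of a circular 2D Gaussian with the requested half-light radius.
    sigma = half_light_radius / math.sqrt(2.0 * math.log(2.0))
    # Assumed convention: area-preserving stretch along the major and minor axes.
    sigma_major = sigma / math.sqrt(axis_ratio)
    sigma_minor = sigma * math.sqrt(axis_ratio)
    beta = math.radians(position_angle_deg)
    cos_b, sin_b = math.cos(beta), math.sin(beta)
    flux_per_knot = knot_flux / n_knots
    knots = []
    for _ in range(n_knots):
        u = rng.gauss(0.0, sigma_major)  # offset along the major axis
        v = rng.gauss(0.0, sigma_minor)  # offset along the minor axis
        knots.append((u * cos_b - v * sin_b, u * sin_b + v * cos_b, flux_per_knot))
    return knots

rng = random.Random(32)
disk_flux = 100.0
knots = draw_knots(
    n_knots=25,                 # average number of knots found in the fits above
    knot_flux=0.14 * disk_flux, # toy flux ratio
    half_light_radius=0.8,      # arcsec, toy value
    axis_ratio=0.5,
    position_angle_deg=30.0,
    rng=rng,
)
```

Each entry is an (x, y, flux) tuple, which is also the kind of information that could be written out for another simulator to reuse.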

The resulting knots model will be data-driven, but it makes a bunch of simplifications and also relies on the COSMOS fits, which are a bit wonky at times. @rmandelb @rmjarvis @esheldon how do we decide that it's good enough?

rmjarvis commented 6 years ago

> how do we decide that it's good enough?

These already look great to me. Nice work on this, Francois. I think anything that gets the flux ratios roughly in the right ball park will suit our needs just fine for DC2.

esheldon commented 6 years ago

Note that RandomWalk assumes the high-N limit, which is a Gaussian, so this is exactly equivalent.

EiffL commented 6 years ago

@rmandelb I got the last piece of a model running last week, that is predicting a flux ratio and number of knots for a given bulge+disk flux. It uses a Gaussian mixture model to output a pdf which we can then sample from.
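
To illustrate the sampling step only, here is a Python sketch that draws a (flux ratio, number of knots) pair from a Gaussian mixture. The two components, their weights, means, and widths are invented for the example and are not the fitted model; the conditioning on the bulge and disk fluxes is omitted; and the cap of 150 knots is borrowed from the least-squares fits discussed earlier.

```python
import random

# Hypothetical 2-component mixture over (log10 flux ratio, number of knots),
# with independent Gaussians in each dimension.
COMPONENTS = [
    {"weight": 0.7, "mean": (-0.85, 25.0), "std": (0.30, 8.0)},
    {"weight": 0.3, "mean": (-0.30, 45.0), "std": (0.20, 10.0)},
]

def sample_knot_parameters(components, rng, max_knots=150):
    """Pick a mixture component by weight, then draw from its Gaussians."""
    weights = [c["weight"] for c in components]
    comp = rng.choices(components, weights=weights, k=1)[0]
    log_ratio = rng.gauss(comp["mean"][0], comp["std"][0])
    n_knots = rng.gauss(comp["mean"][1], comp["std"][1])
    flux_ratio = min(1.0, 10.0 ** log_ratio)           # a ratio of 1 means no smooth disk
    n_knots = max(0, min(max_knots, round(n_knots)))   # integer count within bounds
    return flux_ratio, n_knots

rng = random.Random(2018)
samples = [sample_knot_parameters(COMPONENTS, rng) for _ in range(500)]
```

A conditional version would make the weights, means, and widths functions of the catalog bulge and disk parameters instead of constants.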

So here is the full summary of the proposed method:

The most important approximation here is that we would be using a RandomWalk model with parameters obtained from the COSMOS fits, which do not assume a random walk but simply put knots wherever they are needed to improve the fits. This should not be too bad though.

Then, the question is, at what stage do we want to get the random walk parameters and then the exact position of the knots for a given galaxy?

Finally, what is the process to check with all the affected groups to see if they are fine with this proposal ?

rmandelb commented 6 years ago

> but then these wouldn't appear in the Catsim outputs and not directly reproducible (unless the random seed is kept).

We would need to keep the random seed anyway to ensure consistency across bands. And speaking of that, we also need to decide on a knot SED. If we think these knots typically represent star formation regions, they should get a star-forming SED (more strongly star-forming than the disk), while if we think they represent morphological disturbances in disks, they could get the same SED as the disk. Realistically we have both scenarios, I am sure, so we should just pick one of these and go with it.
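
A small Python sketch of the seed bookkeeping: if the generator is seeded from the galaxy identifier alone, every band regenerates the same knot positions. The seeding formula, the circular Gaussian used for the positions, and the function name are illustrative assumptions, not the imSim or GalSim implementation.

```python
import random

def knot_positions(galaxy_id, n_knots, sigma, base_seed=0):
    """Regenerate the same knot positions for a galaxy whenever it is drawn."""
    # The seed depends only on the galaxy, never on the band being simulated.
    rng = random.Random(base_seed * 2**32 + galaxy_id)
    return [(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)) for _ in range(n_knots)]

# Each band calls the function independently and still gets identical knots.
positions_by_band = {
    band: knot_positions(galaxy_id=42, n_knots=5, sigma=0.3)
    for band in "ugrizy"
}
other_galaxy = knot_positions(galaxy_id=43, n_knots=5, sigma=0.3)
```

Only the knot fluxes would then change from band to band, according to whichever SED is chosen for the knots.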

My impression is that multiple people were OK with the GalSim and PhoSim sims having different morphologies (e.g., @rmjarvis @cwwalter ). So my suggestion is we don't try to put them in catsim. I'm not sure if we want to put this extension in GalSim or in ImSim, though... thoughts? @cwwalter ?

> Finally, what is the process to check with all the affected groups to see if they are fine with this proposal?

I think the process is we write in the DC2 planning document that we intend to do this, circulate the document to the working groups, and see if anybody screams. Once we have definitely converged, I can volunteer to write something about it in the plan.

cwwalter commented 6 years ago

> My impression is that multiple people were OK with the GalSim and PhoSim sims having different morphologies (e.g., @rmjarvis @cwwalter ). So my suggestion is we don't try to put them in catsim. I'm not sure if we want to put this extension in GalSim or in ImSim, though... thoughts? @cwwalter ?

Yes, I would like to have this implemented on the imSim side. I agree that it doesn't need to be in CatSim. It will be more of a "realistic morphology toggle", with your work as the first choice available to us (you could imagine others like GANs in the future). We will need to make it reproducible.

I think this should probably be a GalSim feature that we could call through an API, setting the parameters we need from imSim. If we need to do more complicated fitting, training, etc., and it is more appropriate for that to be in imSim, that is fine too, but I don't really know enough about the details to make a coherent suggestion for this part.

In the short term, as Rachel says, we want to let the working groups know this is what we are thinking and make sure it is OK.

cwwalter commented 6 years ago

Argh.. bad typo above:

Should be:

I agree that it doesn't need to be in CatSim.

will edit.

cwwalter commented 6 years ago

@EiffL can you give us a status update on this? We need to decide whether we will include this for DC2. Can it be ready, and if so are you willing/interested in doing the work or would we need to find someone else? Thanks!

EiffL commented 6 years ago

Yes, the proof of concept seemed conclusive. I can easily add this as an option into GalSim (~1-2 days worth of work). It can definitely be done for DC2 and I'm happy to implement and test it. I just had a look at the DC2 planning document, I don't see any specific plans for the galaxy models to use.

cwwalter commented 6 years ago

Great! I've assigned you.

Well, we get a set of Sersics for the bulges and disks and associated SEDs from the in-painting process. That is the only "specific plans for the galaxy models". Your work for the imSim simulation would make the morphology of those galaxies more realistic. Or did I not understand what you are asking?

EiffL commented 6 years ago

Sorry, I didn't actually phrase my thoughts as a question. I meant, I don't see a specific section on galaxy models used for image simulation in the planning document, do we want one for Friday ?

cwwalter commented 6 years ago

Oh, I see.. No it doesn't have to be done by then. I will add this to the table. If you start something short that explains what you plan to do that is enough for now. Then, you can fill it out later with the details. It would be good if you are there on Friday to answer questions about it though.

EiffL commented 6 years ago

Ok great, no problem.

EiffL commented 6 years ago

Here is an update on the status of this issue.

The only thing that remains is the ImSim parsing function that will actually draw the RandomWalk from the instance catalog.

cwwalter commented 6 years ago

> The only thing that remains is the ImSim parsing function that will actually draw the RandomWalk from the instance catalog.

FYI the parsing code is here:

https://github.com/LSSTDESC/imSim/blob/master/python/desc/imsim/imSim.py

EiffL commented 6 years ago

Yes, thanks, should have it working tonight.

EiffL commented 6 years ago

So, the whole pipeline now works. I'm still testing, but I have preemptively opened pull requests in all the affected software components, including imSim. Here is how the whole process is structured:

@jchiang87 @danielsf Let me know if you have questions/concerns and what tests you want me to do on these different pieces before integration

katrinheitmann commented 6 years ago

@EiffL Do you have a write-up on this somewhere? A DESC Note? It looks very complete, so it should be captured somewhere where it can be read more easily than on GitHub. Thanks very much!

EiffL commented 6 years ago

@katrinheitmann Yes, I'm writing up a DESC note about it, just let me know where it should be hosted/linked to. Thanks !

cwwalter commented 6 years ago

With https://github.com/LSSTDESC/imSim/pull/89 and the resulting addition of random walk knots to imSim and the CatSim interfaces we successfully added complex morphologies to DC2. Nice job @EiffL and all!