Open yymao opened 6 years ago
No problem! We are still in the middle of analysis, and still cleaning up stuff, but here are some initial plots. Here, we are looking at protoDC2 and buzzard_test for local samples (z< 0.2) and restricted to more luminous than M_r=-16 (which is where protoDC2 seems to cut off).
1. Histograms of the rest-frame SDSS g-r (protoDC2) and z=0.1(?) DES g-r (buzzard_test) galaxies in the local (z<0.2), M_r<-16 samples:
2. Rest-frame SDSS g-r vs. M_r (protoDC2) and z=0.1(?) DES g-r vs. M_r (buzzard_test) color-magnitude diagrams for galaxies in the local (z<0.2), M_r<-16 samples:
3. Same as in Item 2, but as a 2d histogram:
4. Same as Item 3, but with better density mapping (still under construction).
Although the buzzard_test "rest-frame" mags/colors appear to be actually for redshift z=0.1 and in the DES system instead of z=0.0 in the SDSS system (like the protoDC2 mags/colors), I think that mainly should cause small overall shifts and small changes in slopes in the plotted relations. I don't think this would cause large shifts or large changes in the overall morphology of the plots.
The situation looks a lot better. Looks like the protoDC2 color-mag distribution could still be improved, but the situation is not nearly as scary as the original plots initially suggested.
@erykoff and I think that this needs to be made a required validation in order to enable the planned CL WG analyses on DC2. Our thinking right now is that the best way to do this is to run redmapper on protoDC2 and compare the mean red sequence and scatter to what is found in either DES or SDSS.
The mean is not incredibly important, so long as there is a well defined red sequence, but the scatter will dictate the photo-z performance of the cluster finders and we want this to be as close to what is observed as possible.
These plots are taken from the overhauled DC2 catalog I have been working on. Not strictly speaking color-magnitude diagrams, but very closely related. Figures show restframe colors at z=0.1 compared to volume-limited SDSS samples. I can post color-magnitude and/or color-color diagrams if people are interested, though those look comparable.
@aphearin - thanks for these updated plots. For your comparison, has DC2 also been cut to a volume-limited sample using similar cuts as how the SDSS volume-limited samples were defined? Or when you said z=0.1, did you mean you're using a snapshot?
Might you be able to show 2D histograms instead of scatter plots? I find scatter plots can be a bit deceptive in terms of where is the maximum density, once you have a large number of objects.
I agree that this looks quite promising. As a way of closing the loop, I would be curious to see the color vs. luminosity plots that @DouglasLeeTucker was making before. The particular question I am interested in is whether (a) the volume-limited samples you've used are bright enough that some of the funny features at low luminosity in his plots are not visible, or (b) those funny features have actually been fixed. Or perhaps there is an option (c) those funny features at faint luminosity don't matter for our science applications? (I could imagine that might be true for cluster-finding, since redmapper relies on a color-luminosity relation for objects above a fixed fraction of L*, but if we're defining a clustering sample down to i=25.3 even at low redshift, it might care about those features...)
@rmandelb - these plots were generated from a DC2 snapshot catalog. I have not yet run the model on all the snapshots to regenerate a lightcone mock.
I have a few more scatter plots to show that are straight color-magnitude diagrams. Then I will show four-panel 1-d histograms at the end, which are easier to make and I also think easier to read.
(For convenient referencing, see https://github.com/LSSTDESC/descqa/issues/63 for plots of DC2 two-point functions that complement the conditional one-point distributions plotted here)
> The particular question I am interested in is whether (a) the volume-limited samples you've used are bright enough that some of the funny features at low luminosity in his plots are not visible, or (b) those funny features have actually been fixed. Or perhaps there is an option (c) those funny features at faint luminosity don't matter for our science applications? (I could imagine that might be true for cluster-finding, since redmapper relies on a color-luminosity relation for objects above a fixed fraction of L*, but if we're defining a clustering sample down to i=25.3 even at low redshift, it might care about those features...)
I am sure that the SDSS samples I am using in these validation plots do not go to faint enough magnitudes to make the plots you would like to see. The lack of reliable calibration data down to the magnitude limit that you are asking for (along with @janewman-pitt-edu in this comment), coupled with the demands on simulations to resolve halos down to those magnitudes, places this problem into the category of R&D. I am doing all I can to "reasonably extrapolate" the model into regimes where there is limited (and/or incomplete/unreliable) data, and into regimes beyond the resolution limits of currently available simulations of cosmological volume.
I can easily imagine continuing to improve upon the modeling in the regimes you are asking about, but that would likely require new simulations and also a more or less dedicated FTE working on this problem. Perhaps we should discuss this in the context of looking forward towards DC3 at the collaboration meeting. Tagging @drphilmarshall by way of proposing such a discussion
> I can easily imagine continuing to improve upon the modeling in the regimes you are asking about, but that would likely require new simulations and also a more or less full-time FTE working on this problem. Perhaps we should discuss this in the context of looking forward towards DC3 at the collaboration meeting. Tagging @drphilmarshall https://github.com/drphilmarshall by way of proposing such a discussion.
I think that sounds sensible. The Thursday afternoon session on Extragalactic Catalog Validation looks like the best place for it, although the current schedule looks pretty focused and it's the call of the session leaders (Eve and Simon). In the process of assessing in that session whether our efforts to prepare for DC2 Run 2.0 are likely to be "good enough," I think it will be natural to identify goals for DC3.
Sorry, somehow I missed this thread before... looks like it started when I was traveling.
What's special about -19 < M_r < -20 and r-i that makes it look so much worse than the other histograms (though g-r going redder than the real galaxies at the bright end is also an issue, it means the SED gamut won't be right)?
If anyone is still trying to compare SDSS-band to DES-band colors, please at least apply the (known) transformations between those systems. Those will cause not just shifts but also stretches to the color distributions.
Hey @janewman-pitt-edu , I'm not sure what is going on there, but I can say that it will be difficult to improve upon this level of agreement without significant further R&D.
@janewman-pitt-edu - Right now all I can show are SDSS band colors, but part of the pipeline we are overhauling involves matching these color distributions to a Galacticus galaxy with similar colors, and using that Galacticus galaxy's spectrum to generate fluxes in other filters. So there will not be any need to apply fitting function transformations to the fluxes, since they will be computed self-consistently in all bands with the model spectrum.
@drphilmarshall @aphearin The Extra-Galactic Validation session is quite full already and there will be a lot of issues to discuss. We have a CS technical session at 8am on Thursday and a session on "Post-processing models" at 2 pm on Wed. Either of those would have more time for a discussion of "DC3 Goals and Planning". Do either of you have a preference?
@drphilmarshall - I had not necessarily intended to suggest that the talk schedule be reworked to accommodate this as a dedicated discussion, which could easily happen "offline".
Got it, thanks Andrew. Still, DC3 is going to come around fast, so it's not at all crazy to spend some closing minutes in a session on punting things to DC3 and writing them down. Eve, I don't have a preference for where this should go - maybe the technical session makes more sense, given the content? But it's up to you.
Agenda item has been added to T-CS (Thurs 8 am)
It would be good if we can converge on this color-magnitude diagram test. I believe that PZ is likely to be the primary driver for this one (particularly now that the cluster red sequence test has been split off into a separate issue, #41).
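So far we have seen two approaches:
- The plots that @DouglasLeeTucker shared, which were traditional color-magnitude diagrams (no validation data, just looking at the sims).
- The plots that @aphearin shared, which are 1D histograms of colors in bins of fixed luminosity, compared with SDSS as a validation dataset.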
If we were to use approach (1), we'd need a validation dataset and criterion; if we use (2), we'd just need the latter. In both cases, we'd need software to be incorporated into the DESCQA framework, which as I understand it is not yet the case for the code used to produce either set of plots. I hope somebody will correct me if I'm wrong.
Since @janewman-pitt-edu commented on this thread earlier, I wonder if he might weigh in on the points in this comment.
I think even (2) is problematic right now: it really needs to be a comparison of equivalent apparent-magnitude-limited or volume-limited (i.e. with an absolute magnitude limit where SDSS is 100% complete for all colors and the whole redshift range) samples to make sense. Otherwise the relative weightings of red vs. blue galaxies as a function of absolute magnitude will be thrown off by the r band cut (the real SDSS sample and a mock slice shouldn't be expected to agree).
If we did have properly-implemented equivalent samples, I'd favor approach (2) over approach (1) as we don't need to figure out statistics and criteria for a 2D test in approach (2). However, I also think that if the color distribution tests in DESCQA2 are matched, a color-magnitude test would almost certainly be too; and vice versa. So this doesn't seem a high priority to make a DESCQA test. That said, plots like @DouglasLeeTucker 's are very informative about what the semi-analytics are doing and we should keep making them!
Just a technical side note: we can certainly have tests that generate plots but do not have specific criteria, as long as the test is not required by a specific scientific purpose.
Then having the CMD plots would be great :)
Another advantage of (2) is that it allows much more targeted/focused testing for catalog producers. For example, it is possible that the r-band LF is incorrect while Prob(g-r | Mr) is correctly captured by the model. That is quite useful information for a catalog producer. The formulation of test (2) could be done such that the g-r comparison is only done on matched distributions of Mr. A separate test could enforce the Mr LF. Getting both right simultaneously is mathematically equivalent to the CMD test, but doing things this way creates more clarity for which aspect of the joint distribution may be failing.
@janewman-pitt-edu - is there a reason why you think we do not have properly implemented volume-limited SDSS samples? It is straightforward to calculate the completeness redshift for the SDSS Main Galaxy Sample from the Petrosian r-band limit of r<17.7 (in fact, that is exactly what I did when making the plots higher up in this thread: I first selected only those SDSS galaxies passing an r-band completeness cut, then I distribution-matched the mock r-band LF to that in the SDSS sample, then I compared the g-r distributions, giving what I think is a pure test of the g-r distribution in the mock, properly controlling for/scaling away any potential shortcoming of the r-band LF).
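To make the "distribution-matched the mock r-band LF" step concrete, here is a minimal sketch of one way to do it. It is not necessarily the procedure used for the plots in this thread: it swaps in simple per-bin reweighting, and the function name `lf_matching_weights` and the bin edges are hypothetical. Each mock galaxy gets a weight so that the weighted mock M_r histogram has the same shape as the SDSS one; weighted g-r histograms can then be compared without the r-band LF entering.

```python
from bisect import bisect_right
from collections import Counter


def lf_matching_weights(mock_abs_r, sdss_abs_r, mag_edges):
    """Per-galaxy weights that give the mock the SDSS M_r histogram shape.

    Assumption: simple reweighting in bins of absolute magnitude; galaxies
    outside [mag_edges[0], mag_edges[-1]) get weight 0.
    """
    def bin_of(mag):
        if mag < mag_edges[0] or mag >= mag_edges[-1]:
            return None
        return bisect_right(mag_edges, mag) - 1

    mock_bins = [bin_of(m) for m in mock_abs_r]
    n_mock = Counter(b for b in mock_bins if b is not None)
    n_sdss = Counter(b for b in map(bin_of, sdss_abs_r) if b is not None)
    n_sdss_total = sum(n_sdss.values())

    weights = []
    for b in mock_bins:
        if b is None:
            weights.append(0.0)
        else:
            # SDSS fraction in this bin, shared equally among mock galaxies in it
            weights.append((n_sdss[b] / n_sdss_total) / n_mock[b])
    return weights


# Toy example with made-up magnitudes
mock = [-21.5, -20.5, -20.4, -20.3, -19.5]
sdss = [-21.5, -21.4, -20.5, -19.5]
w = lf_matching_weights(mock, sdss, mag_edges=[-22.0, -21.0, -20.0, -19.0])
```

In the toy example the three mock galaxies in the middle bin share the 25% of SDSS galaxies found there, so the weighted mock LF matches the SDSS LF bin by bin.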
They're very clearly not volume-limited. A volume-limited sample would have a hard cut (maximum) in M_r, and be dominated in number by objects near that faint cut. Note that to design a volume-limited sample for redshifts z1 < z < z2, you want to set the absolute magnitude limit such that objects of all colors are included at that limit in a z > z2 sample (as that limit corresponds to objects that will be included at any redshift z < z2). E.g., if you're looking at an envelope of points, you want to look at that envelope for the sample at higher redshifts than where you are going to use (and make sure it works at both the bluest and reddest ends of the color distribution).
I am referring to the purple vs. green histograms, in bins of r-band labeled at the top of each panel. I did exactly as you just described to make these plots. How can you see from the purple and green histograms that these distributions are clearly not volume limited? I do not follow.
Those don't tell you about if it's volume-limited (but if one sample effectively is and the other effectively isn't then that alone would cause differences). Absolute magnitude doesn't get plotted in them, after all... I'm looking at the 2D plots of color vs. absolute magnitude.
It's true that the purple vs. green histograms don't self-express that they were made using volume-limited samples. But they were indeed made using volume-limited SDSS samples, again, using the exact prescription that you described.
So the 2D plots are not the same samples? They don't look volume-limited (for the reasons I mentioned).
Ahh, I see the confusion then, which comes from my not having provided sufficient information about how I generated the plots (which were just "status updates", after all). Nope, those are different samples, and you are correct that they are not volume-limited. The 2-d scatter plots with the blue points were selected/matched based on stellar mass, not for any particularly good reason, other than that is the native quantity of the model.
Ah, that might explain why the blow up in SDSS r-i colors for one abs. mag. bin doesn't show up so clearly in the 2D plots...
It's good we clarified what was being plotted, since that was a source of confusion that I had forgotten to fill in. Now that that's clear, maybe we should return to the question posed by @rmandelb that got us here originally.
> So far we have seen two approaches:
> - The plots that @DouglasLeeTucker shared, which were traditional color-magnitude diagrams (no validation data, just looking at the sims).
> - The plots that @aphearin shared, which are 1D histograms of colors in bins of fixed luminosity, compared with SDSS as a validation dataset.
Since Prob(A, B) = Prob(A | B)*Prob(B), then there is mathematically no distinction between the two, it's just a matter of what is easier to implement, and what is easier to interpret. I still advocate for (2) because of how easy and clear it is to see failure modes in 1-d histograms of conditional distributions, as opposed to 2-d scatter plots. If end-users want CMDs to look at, of course that's fine too and there's nothing wrong with including that, but as a catalog producer I am less likely to use those validation tests to refine my modeling, instead favoring LF + conditional PDFs validations to tune the model, since I find it (much) easier to quickly identify model shortcomings that way.
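The factorization Prob(g-r, M_r) = Prob(g-r | M_r) * Prob(M_r) can be checked directly on binned data. The following is a minimal sketch with hypothetical names (`bin_index`, `factorized_histograms`) and made-up toy values, not the DESCQA implementation: it builds the joint histogram (the CMD), the marginal M_r histogram (the LF shape), and the conditional color histograms in each M_r bin.

```python
from bisect import bisect_right
from collections import Counter


def bin_index(value, edges):
    """Index of the bin [edges[i], edges[i+1]) holding value, else None."""
    if value < edges[0] or value >= edges[-1]:
        return None
    return bisect_right(edges, value) - 1


def factorized_histograms(abs_mag, color, mag_edges, color_edges):
    """Return (joint, conditional, marginal) probabilities on the bin grid.

    joint[(i, j)]       ~ Prob(M_r in bin i, color in bin j)
    conditional[(i, j)] ~ Prob(color in bin j | M_r in bin i)
    marginal[i]         ~ Prob(M_r in bin i)
    """
    counts = Counter()
    for mag, col in zip(abs_mag, color):
        i, j = bin_index(mag, mag_edges), bin_index(col, color_edges)
        if i is not None and j is not None:
            counts[(i, j)] += 1

    total = sum(counts.values())
    n_per_mag = Counter()
    for (i, _), n in counts.items():
        n_per_mag[i] += n

    joint = {key: n / total for key, n in counts.items()}
    conditional = {(i, j): n / n_per_mag[i] for (i, j), n in counts.items()}
    marginal = {i: n / total for i, n in n_per_mag.items()}
    return joint, conditional, marginal


# Toy data (made-up values): M_r and g-r for six galaxies
abs_mag = [-21.5, -21.2, -20.5, -20.3, -20.1, -19.5]
color = [0.9, 0.8, 0.9, 0.5, 0.4, 0.3]
joint, conditional, marginal = factorized_histograms(
    abs_mag, color, mag_edges=[-22.0, -21.0, -20.0, -19.0],
    color_edges=[0.0, 0.6, 1.2])
```

Validating `marginal` (the LF) and `conditional` (colors at fixed luminosity) separately carries the same information as validating `joint`, but a mismatch can be attributed to one factor or the other.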
I had already voiced an opinion on this but maybe wasn't clear -- we'd need to define new statistics and methodologies for (1), whereas (2) is a close analog of the color test already implemented. This makes (2) a lot easier. The (1) plots are very informative though so having them displayed but not associated with a test makes a lot of sense to me.
OK - so far what I am hearing is:
What we need are:
@janewman-pitt-edu - it would be quite helpful to me if you or someone in the photo-z group could implement the behavior of the following function:
def sdss_rband_completeness_redshift(sdss_absolute_magr_low, sdss_absolute_magr_high):
    # (body to be implemented: compute the completeness redshifts zlow, zhigh)
    return zlow, zhigh
The two arguments are absolute Petrosian r-band magnitude limits in SDSS Main Galaxy Sample DR7. This function plays a central role in how DC2 colors are assigned, and I am pretty sure the same is true for @j-dr with the Buzzard catalogs. My current methods for this are eyeball-based. I am entirely comfortable with my eyeball-based methods for code development purposes, but I think this is inadequate for production purposes, and we are very near the point of taking the overhauled protoDC2 to production.
Since SDSS colors serve as the low-redshift anchor of the full color distribution, and because the data used for SDSS colors plays a critical role in both Buzzard and DC2, I think it would be much better to arrive at a quantitative agreement on this specific function as a collaboration. Agreeing upon that is also a pre-requisite to the quantitative validation criteria requested by @rmandelb. Once we are agreed upon the behavior of this function, it makes it quite straightforward for me and @yymao to implement rigorous validation criteria in DESCQA. Without an agreement upon this, ambiguity will remain.
Deriving those quantities accurately is pretty non-trivial as you'd need to kcorrect the rest-frame colors to every possible z to do it right (calculating the limits for each galaxy and then using the most extreme values to set the overall limits). I also don't think the results would be so different from doing the cuts by eye so long as you err on the side of being conservative (and make sure to apply the same absolute magnitude and redshift cuts to observations & simulation).
If you did want to implement such a function, the simplest approximation I can think of is to linearly interpolate between the z=0 and z=0.1 passband absolute magnitudes from NYU VAGC. One can then determine at what redshifts you intersect the magnitude limit of the sample for a given SED (applying both the distance modulus and kcorrection terms, where you use the interpolation to give you the kcorrection piece). Next easiest (but certainly slower) would be to actually use kcorrect.
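A rough sketch of that interpolation approach for a single galaxy follows. Several things here are assumptions rather than anything stated in the thread: the function names, the flat LCDM cosmology (H0 = 70, Omega_m = 0.3), absolute magnitudes on the same h convention as that H0, and the reading that for a bandpass shifted to the galaxy's own redshift the remaining k-correction term is -2.5 log10(1+z). The r < 17.7 limit is the one quoted earlier in the thread.

```python
import math

C_KM_S = 299792.458  # speed of light in km/s


def distance_modulus(z, h0=70.0, omega_m=0.3, n_steps=1000):
    """Distance modulus for flat LCDM (assumed parameters), trapezoid rule."""
    def inv_e(zz):
        return 1.0 / math.sqrt(omega_m * (1.0 + zz) ** 3 + (1.0 - omega_m))

    dz = z / n_steps
    integral = 0.5 * (inv_e(0.0) + inv_e(z))
    for i in range(1, n_steps):
        integral += inv_e(i * dz)
    d_comoving = (C_KM_S / h0) * integral * dz   # Mpc
    d_lum = (1.0 + z) * d_comoving               # Mpc
    return 5.0 * math.log10(d_lum) + 25.0


def apparent_r(z, abs_r_z0, abs_r_z01):
    """Apparent r magnitude the galaxy would have if placed at redshift z."""
    # Linear interpolation (or extrapolation) between the absolute magnitudes
    # in the bandpasses shifted to z=0 and z=0.1.
    abs_r_at_z = abs_r_z0 + (abs_r_z01 - abs_r_z0) * (z / 0.1)
    # Assumed k-correction term for a bandpass shifted to the galaxy's redshift.
    return abs_r_at_z + distance_modulus(z) - 2.5 * math.log10(1.0 + z)


def max_redshift(abs_r_z0, abs_r_z01, r_limit=17.7,
                 z_lo=1e-4, z_hi=0.5, tol=1e-6):
    """Redshift where the galaxy crosses r_limit, found by bisection."""
    if not (apparent_r(z_lo, abs_r_z0, abs_r_z01) < r_limit
            < apparent_r(z_hi, abs_r_z0, abs_r_z01)):
        raise ValueError("magnitude limit is not bracketed by [z_lo, z_hi]")
    while z_hi - z_lo > tol:
        z_mid = 0.5 * (z_lo + z_hi)
        if apparent_r(z_mid, abs_r_z0, abs_r_z01) < r_limit:
            z_lo = z_mid
        else:
            z_hi = z_mid
    return 0.5 * (z_lo + z_hi)
```

To turn this into a completeness redshift for an absolute-magnitude bin, one would evaluate `max_redshift` for every galaxy in the bin and take the most conservative (smallest) value, in line with the "most extreme values" remark above.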
However, I fail to see why this would be worthwhile. I'd much prefer to do the selections in observed space and just do a matched selection from the catalog vs. the data. So long as the selections match, it doesn't matter if you are volume-limited or not... so long as we are dealing with light cones (and not slices) that's fine.
If you do want a test on a slice, I'd look at what @rongpu implemented in DESCQA1 and just do the same thing. We looked at things pretty closely when defining that.
i.e.: For a color test, you just need to make sure it's an apples-to-apples comparisons. They could be any of a variety of apples (volume-limited, 1/Vmax weighted, simple magnitude and redshift cuts, or whatever) so long as both apples are of the same type (and both datasets are complete enough to do the chosen comparison). You might as well just choose some simple cuts that you are confident will be complete enough in both simulations & data and do the comparison for them. We have well-developed statistical approaches for this from DESCQA1.
I think one thing we could think a little more about is just how we label different levels of disagreement between the color distributions. I would be shocked if any simulations achieved a level of agreement that would pass a test for being statistically indistinguishable; instead, we should probably define thresholds for "OK", "good", and "excellent" agreement (for instance).
> I would be shocked if any simulations achieved a level of agreement that would pass a test for being statistically indistinguishable
I would be too.
I would like a goal-driven validation criterion that starts with our science goals (what analyses do we plan to do for DC2, in order to learn various things about our analysis pipelines?). Given those goals, what level of agreement is needed between simulations and data?
For example, when defining validation criteria on the N(<limiting mag), we use our goal of having roughly realistic levels of blending. This is likely achievable as long as the number density is within a few tens of % of the real one.
As I recall, for DESCQA1, we defined things in terms of the RMS distance of the CDF of colors in the simulation from the true (data) CDF (this is related to the Cramer-von Mises statistic for differences between distributions). The goal given in the DESCQA1 paper was 0.05, but that's a somewhat arbitrary number (chosen in part by what seems achievable in the near term). I'd say excellent agreement would be a value < 0.01, good agreement < 0.05, fair < 0.1 . I see no reason to do anything differently when looking at color in bins of absolute magnitude than when looking at the overall color distribution.
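As an illustration of that statistic, here is a minimal sketch of an RMS distance between two empirical CDFs, with the thresholds quoted above. The details are assumptions: the DESCQA1 code may evaluate the CDFs on a different grid or weight them differently, the function names are hypothetical, and the "poor" label for values of 0.1 or more is added here only to make the function total.

```python
import math
from bisect import bisect_right


def ecdf(sorted_sample, x):
    """Empirical CDF of a sorted sample evaluated at x."""
    return bisect_right(sorted_sample, x) / len(sorted_sample)


def cdf_rms_distance(sim_colors, data_colors, n_grid=200):
    """RMS difference between the two empirical CDFs on a uniform color grid."""
    sim = sorted(sim_colors)
    data = sorted(data_colors)
    lo = min(sim[0], data[0])
    hi = max(sim[-1], data[-1])
    if hi == lo:
        return 0.0
    sum_sq = 0.0
    for i in range(n_grid):
        x = lo + (hi - lo) * i / (n_grid - 1)
        diff = ecdf(sim, x) - ecdf(data, x)
        sum_sq += diff * diff
    return math.sqrt(sum_sq / n_grid)


def agreement_label(rms):
    """Map the RMS CDF distance onto the thresholds suggested in the thread."""
    if rms < 0.01:
        return "excellent"
    if rms < 0.05:
        return "good"
    if rms < 0.1:
        return "fair"
    return "poor"  # label assumed; the thread only names the first three


# Toy example: a simulated color distribution shifted by 0.5 mag
data_colors = [i / 100 for i in range(100)]
sim_colors = [c + 0.5 for c in data_colors]
score = cdf_rms_distance(sim_colors, data_colors)
```

The same function can be applied separately in each absolute-magnitude bin, giving one score per bin rather than a single score for the overall color distribution.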
> If you did want to implement such a function, the simplest approximation I can think of is to linearly interpolate between the z=0 and z=0.1 passband absolute magnitudes from NYU VAGC.
Ok @janewman-pitt-edu, this is actually very similar to what I am already doing, so if you and @rmandelb are happy with eyeball-level estimates of SDSS completeness, then I am too.
> However, I fail to see why this would be worthwhile. I'd much prefer to do the selections in observed space and just do a matched selection from the catalog vs. the data. So long as the selections match, it doesn't matter if you are volume-limited or not... so long as we are dealing with light cones (and not slices) that's fine.
This highlights a difference between end-user tests of a final catalog vs. tests for catalogs that are in-production. Since the theoretically natural way to build models of the galaxy--halo connection is using snapshots and absolute magnitudes, then tests at fixed redshift compared to absolute magnitude distributions are (much) more useful during development stages of such models. If the only tests that we want in DESCQA are based on lightcones with apparent magnitude comparisons, that's fine by me, but from a catalog-producer standpoint such tests provide less targeted information about how to improve a model.
Using my own tests based on volume-limited SDSS subsamples, and after getting this critique on the level of agreement shown previously, here are results based on an update to the overhauled protoDC2 colors.
I can show more plots at the meeting. @dkorytov, @evevkovacs and I are working hard to produce a full lightcone catalog with all the protoDC2 properties, with a goal of running it through DESCQA and making it available to DESC this week.
We can look at the statistics of the distribution using the DESCQA tools, I imagine, but those are looking like a closer match than Buzzard was (which should have matched colors by construction, I would have thought, but didn't quite).
Hi all, just joined this DESCQA effort, thanks to @evevkovacs @duncandc for your help. Posted this here as it is related to color distributions. As an exercise for myself during the hack day, I implemented within the descqa class framework a trivial g-r vs r-i color distribution 'plotter' and simple photo-z calculation using scikit-learn using protoDC2 data. I believe even though photo-zs can be deceiving they can point towards potential features of the simulations as @morriscb showed during the meeting. I think this might overlap previous/ongoing efforts so I don't want to step on anyone's toes! So haven't PR'ed these yet until I find out what is useful and what is not. Example plots attached:
@nsevilla Thanks very much, I think that looks good. Have you tried running on another catalog besides protoDC2 (eg there is a Buzzard test catalog that would be good to try)?
Not yet @evevkovacs but will do.
@nsevilla That certainly looks related to Peter Freeman's results. I'm assuming here you ran the photo-z with no magnitude errors applied to the simulated photometry (which is what led to the strongest signatures of the simulation slices)?
Indeed @janewman-pitt-edu , as I found out during the implementation, @morriscb had already shown a similar thing (with a more sophisticated photo-z code). This is just me trying to get it implemented in descqa, though not terribly original. I believe the mags do not have any errors implemented at this stage (they are mag_{ugriz}_lsst), but I'm not sure.
@nsevilla That's correct, there are no mag errors.
@aphearin @janewman-pitt-edu @rmandelb @rongpu @nsevilla @morriscb
OK this is another thread where we have had lots of discussion but haven't reached a concrete plan. Let's try again:
I think we all agree that color-magnitude diagrams are nice to see, but not easy to compare with validation data sets. So we'll keep this issue, but remove "need validation data" and add "not required".
We need a color distribution test in fixed magnitude bins. I think the discussion of that test should go to #15.
There's a new thread on color-color diagram and photo-z. @nsevilla @morriscb is that a test you think would be important for PZ? If so, I feel that we need a different issue for it.
Color-color would be nice to match (there is at least empirical data for that) but it's not easy to define a good metric for comparing 2D distributions, so I would hold off on requiring that. The photo-z test doesn't really have a quantitative QA goal, but plotting it is a useful way of seeing if the k corrections to redshifts in between those of the slices are working right.
During the Sprint Week, @DouglasLeeTucker and @saharallam have made progress on creating color magnitude diagrams for GCR Catalogs (protoDC2 and Buzzard).
@DouglasLeeTucker and @saharallam, can you share some plots you made with us here, and then we can continue to make a validation test from what you have done?