Clarifications on data properties

AngusWright commented 4 years ago

I've just started having a play with the tomo_challenge data, and I'm curious about some of the quirks of the dataset. To this end, I've included here a bunch of observations & questions. I understand that the challenge is designed to be "idealised", however it occurs to me that many of these quirks might work to favour machine learning codes, for example, and disadvantage other methods (I give examples of what I mean).

My main observations are below. Apologies in advance for the long list (and for any silly observations...). I've also included a handful of figures to show what I mean. In each figure I have included data from both the (noisy) tomo_challenge dataset and from (noiseless!) MICE2 dataset, for comparison. Full disclosure that I am using MICE2 as a benchmark for 'what I roughly expect to see'.

The colour-redshift relation shown by individual SED tracks appears chaotic. The redshift axis is clearly discretised (due to snapshots in the simulation?), which is not an issue per-se, but between adjacent snapshots the colours appear uncorrelated. Colour-redshift tracks abruptly end and restart. This behaviour is mirrored exactly in both the training and validation datasets, so may not be overly problematic for machine learning classifiers, but such behaviour could unfairly hamper template-based or hybrid approaches that (i would say fairly) assume coherent colour-redshift evolution?
The template discrimination (i.e. the ease of discerning underlying templates in colour-redshift space) appears to become better as a function of redshift, rather than worse. At a given magnitude, the colour-redshift space becomes increasingly discretised as a function of redshift.
In the above figure, you can also see that the range of galaxy colours in the tomo_challenge data is considerably smaller than in MICE2, especially in the region 0.3≤z≤0.7 where lots of blue sources are missing. Is this because the tomo_challenge templates are missing low-mass (blue) sources? This could be important prior information for bayesian codes?
There appear to be small gaps in redshift at low-z, in between (what I'm guessing are) snapshots of the simulation. While these aren't themselves a problem, they may be symptomatic of an underlying bug.
In the above figures, you can also see that the tomo_challenge data appear to be lacking any large-scale structure? This would bias against hybrid approaches that wish to invoke cross-correlation.
The noisy tomo_challenge data extend to much higher redshift, at a given magnitude, than the sources in MICE2. Not sure if this is a problem (?), but mostly just an observation. If could hamper non-machine-learning codes invoke a data-driven redshift-magnitude prior.

EiffL commented 4 years ago

Hummmm very good points! So I know our DC2 photometry is not perfect, because we had to do a complicated matching between the base DC2 simulation and a separate galacticus run. This is somewhat explained in the simulation paper here: https://arxiv.org/pdf/1907.06530.pdf section ~5.3 But it's fair to say we expect some amount of weirdness ^^' For instance, I thiink the discontinuities you are seeing in your first plot are coming from switching from one discrete galacticus snapshot to the next. This being said, I hadn't compared them to Mice, nor did I expect the tracks to be that clear at higher redshift. I'm gonna tag @katrinheitmann and @aphearin on this issue who can probably comment more on this.
Concerning the range of galaxy colous/lack of blue, I don't know, but are you applying similar cuts in the MICE data as what we used to build the sample here? See for instance: https://github.com/LSSTDESC/tomo_challenge/blob/25af0db3639113e09b55b677f42111244f56a467/bin/generate-challenge-data.py#L19 We have cuts on SNR and size, and for instance I'm not 100% sure off the top of my head how the metacal SNR is computed here, but @joezuntz would know.
For cross-correlation, positions are not meant to be included by default in the available data, but LSS should be there in the input DC2 simulation.
As for how this impacts the challenge :-| this is a good question. I think we more or less assumed no external/prior knowledge, and yeah this:
```
such behaviour could unfairly hamper template-based or hybrid approaches that (i would say fairly) assume coherent colour-redshift evolution?
```
might be a limitation...

joezuntz commented 4 years ago

I've checked the original data that this i drawn from, and can confirm that the tracks you see are present in the original CosmoDC2 as well, and I didn't somehow screw up and add them somehow here.

We have cuts on SNR and size, and for instance I'm not 100% sure off the top of my head how the metacal SNR is computed here, but @joezuntz would know.

The SNR is found by:

adding noise to the objects following Ivezic, Jones, & Lupton
getting the snr as (n_photon / background_mean) per band
adding them in quadrature for the r, i, z bands

aphearin commented 4 years ago

Thanks for sharing the plots @AngusWright. So I see two distinct sources of discreteness in these plots. The first has to do with distinct redshift snapshots of the halo catalog. Dan’s matchup method was designed to alleviate this, but it doesn't resolve it entirely.

The second kind of discreteness is purely to do with color PDFs, and is not related to the finite collection of snapshots. The discreteness in color-color-z space is ultimately caused by discreteness in the SEDs of the library of Galacticus galaxies from which we sampled to make the mock, so it is sensible/expected that you are finding it in cosmoDC2 as well as DC2.

I'm not sure what is meant by this comment though:

In the above figures, you can also see that the tomo_challenge data appear to be lacking any large-scale structure?

Could you elaborate?

AngusWright commented 4 years ago

Hi @aphearin. Thanks for the comments. What I mean w.r.t. the large scale structure: If you look at this figure then you can see the large scale structure (over- & under-densities) in MICE as the lighter and darker vertical bands. However you don't see such structures in DC2 (except maybe at z=0.2, but this feature looks like it might be caused by an overlap of two discrete snaps in redshift? The inverse happens at z=0.1, where there are gaps between the two steps). Is this because, e.g., redshifts were randomly assigned to galaxies within the redshift steps? This would wash out structures and give you this behaviour, I would imagine.

Regarding the discreteness of the colour-redshift distribution: Ok, I understand how having a finite number of SEDs can create such overdense trails. But I'm still confused about the lack of coherence between individual redshift steps. Take the upper panel in this figure for example. At z=0.56 and g-i = 1.3, there is a very highly represented SED (the thick black line). However this SED is apparently absent at all other redshift steps (well, at least in the adjacent snaps). It's as if this galaxy SED template popped into existence at z=0.54, was assigned to lots of galaxies between 0.54<z<0.56, and then disappeared at z=0.56 again. And this seems to happen a lot; it's hard to see many continuous SED models as a function of redshift at all.

Essentially, even if there were a small number of SED templates, I would expect to see a small number of contiguous lines as a function of redshift, rather than a large number of disjoint ones? But perhaps my intuition is just wrong!

aphearin commented 4 years ago

@AngusWright - thanks for clarifying. Re: lack of LSS. I think this is difficult to gauge from a scatter plot of color vs. redshift. If you're concerned about a lack of LSS, the two-point function might be a more convincing way to see the LSS in DC2 galaxies.

Re: discreteness. You have a good eye! In fact, exactly the sort of selection you describe does indeed happen in the matchup method, in which over a particular redshift range only, the same exact SED (or the same small handful) is preferentially selected over and over.

joezuntz commented 4 years ago

Yes, there's definitely large-scale structure in there! The sample is over a wide region, so it might be that it's being washed out in the geographic average? This is a random sub-sample over 400 sq deg.

Here's a density slice in 0.5 < z < 0.6 in a small region of the underlying catalog in ra, dec:

download-9

aphearin commented 4 years ago

When using the "jumpiness" of a scatter plot vs. redshift as a gauge for the presence/absence of LSS, one thing to remember is that the sky area and intrinsic number density of the sample has a direct impact on the visual perception.

AngusWright commented 4 years ago

Hi @joezuntz @aphearin, sorry, my wording wasn't clear. Yes the on-sky distributions show some large scale structure, but the actual structures themselves don't appear very strongly as a function of reshift. That's why they don't pop out in my colour vs redshift plot. See the below cone plots (i.e. RA vs z) as a demonstration, where I've plotted equal volumes (10x10 sqdeg on-sky, z<0.5, matched number density) for DC2 and MICE2. There is clearly some structure in the DC2 catalogue, but it's considerably weaker than the level I would expect. Perhaps this isn't relevant when looking at 2-point statistics in tomographic bins with \Delta z = 0.2, but looking at 2-point stats in in tomographic bins with \Delta z = 0.05 would give quite different results, I would expect? ConePlot_DC2_MICE2

joezuntz commented 4 years ago

Hi @AngusWright - could you compare the magnitude distributions for the two samples? I'm wondering if we're seeing the fainter LSST-like objects there.

aphearin commented 4 years ago

Good point @joezuntz. @AngusWright - when you say you are comparing "matched number densities", could you say a bit more about how this was done?

AngusWright commented 4 years ago

@joezuntz @aphearin The effect doesn't seem to be caused by magnitude. See the below. I've subset the sources in both catalogues to:

0.1 < z < 0.3
22.0 < r < 24.0
10x10sqdeg

The sources numbers in the two catalogues are quite different too. Originally I down-sampled the MICE2 catalogue to have the same number of sources as DC2 in the volume, just to make the plotting look fairer. I've not looked at the number counts in DC2 vs MICE2, but I'm not sure that the difference in this regime is expected; especially because the MICE2 sim (with KiDS level noise) has considerably more sources in the volume than DC2 does (with LSST level noise); 32K vs 25K. But in any case, I think that this is enough to make clear that the difference isn't magnitude driven.

ConePlot_DC2_MICE2_v2

joezuntz commented 4 years ago

One note - if you're using the files I released for this challenge, it's not the full DC2 - I randomly downsampled to split into different samples.

I also apply these cuts:

S/N (riz) > 10
size / PSF size > 0.5

could one of these be driving the difference?

AngusWright commented 4 years ago

Ah cool, yes that could definitely explain the difference in absolute number between MICE2 and DC2, which is good 👍

aphearin commented 4 years ago

@AngusWright - would it be laborious to use a (much) larger sky area? I'm a bit worried about over-interpreting the LSS with a scatter plot made from such a small patch.

AngusWright commented 4 years ago

100 square degrees is half the dataset used here, but sure. Below is using (nearly) the whole tomo_challenge dataset: 20x20 sqdegrees. Compared to the same area from MICE2. But I think that this just muddies the waters? If we collapsed the hemisphere onto the RA axis then it would look ~uniform in both cases, but the underlying issue of differing LSS would still be there. For this test a thinner slice in DEC is useful. So the second plot is the full RA extent of the tomo_challenge dataset, but only a 4sqdeg slice in DEC.

ConePlot_DC2_MICE2_v3 ConePlot_DC2_MICE2_v4

aphearin commented 4 years ago

Thanks for the sanity check on cosmic variance @AngusWright - that's pretty convincing that the difference with DC2 is not due to a randomly selected low-density void. To better understand the differences between the two models, probably the most effective way would be a direct comparison of the underlying HODs.

evevkovacs commented 4 years ago

@AngusWright Another check you could do is to look at the number density of the synthetic galaxies that are contributing to this plot. These galaxies have a negative halo_id. They were added uniformly to compensate for the mass resolution of the simulation and boost the number density at faint magnitudes. However the turn-on for this effect was not sharp and we know that there are some synthetic galaxies at the brighter magnitudes. At low redshifts, where the number of "real" galaxies is not that large, it could be that the tail from these synthetic galaxies is washing out the structure. You could exclude galaxies with negative halo_ids from the above plots or plot them in a different color. Thanks

joezuntz commented 4 years ago

I didn't incorporate information like the halo ID or other extended information into this reduced catalog. If people want to look into that then they'd need to look into the full catalog.

@AngusWright - if you are able to share your plotting code I can make the above with the synthetic galaxies removed?

AngusWright commented 4 years ago

@evevkovacs As @joezuntz says, unfortunately the halo_id information isn't in the catalogue, so I can't run that test myself. I have generated a quasi-check for that effect, though, assuming that the number of synthetic sources correlates just with magnitude. Attached are cone plots for DC2 and MICE2 (left and right respectively) where the sources are coloured [red,green,blue] for three equal-N magnitude bins between 20 ≤ r ≤ 24. You can see that, in MICE2, the faint and bright galaxies cluster together. In DC2, the faint galaxies are much more uniformly distributed, I'd say, judging by the number of bone fide voids in the two simulations. This would suggest that a significant fraction (perhaps even the majority?) of faint sources are synthetic in the magnitude range ~ 23 < r < 24? If the fraction of synthetic sources increases with magnitude, then the effect close to the LSST magnitude limit could be very significant?

@joezuntz I can send you this code but it's written in R, and would require you to install a couple of packages. I'm sure that there would be astropy code to do spherical to cartesian projections though? That's all this really does, plus a few bells and whistles. ConePlot_DC2_MICE2_v5

joezuntz commented 4 years ago

In 23 < r < 24, about 92% of the galaxies in the pre-cut sample are "real" (associated with a a halo and not added synthetically), so that doesn't seem to be the cause of this at these magnitudes.

aphearin commented 4 years ago

Really nice test @joezuntz! That's pretty convincing that the unclustered synthetic ultra-faints are not the source of the difference.

As a quick reminder, the need for synthetic ultra-faints was so that the very faint of the LF displayed completeness down to r~28; this was quite critical for blending applications, but since galaxies this faint typically have logMstar/Msun < 8, their parent halos were unresolved in the Outer Rim simulation, and so we just sprinkled them in spatially randomly due to a lack of any other validation data. Because most such galaxies have logMstar/Msun < 8, it is reassuring that this is not the issue.

By eye, the MICE lightcone appears to have its high-mass halos much more heavily populated relative to the DC2 lightcone. This is the sort of thing that could be made more quantitative by plotting the HODs, but maybe a quick version of this would just be to compare the satellite fractions between the two samples?

evevkovacs commented 4 years ago

@joezuntz Thanks very much for making the above check. It is very reassuring to know we were not getting bitten by the tail of the ultra-faint distribution.

evevkovacs commented 4 years ago

@AngusWright It would be good to have a more quantitative measure of the clustering than just the cone plot. The clustering in cosmoDC2 has been validated against SDSS data (see Fig 15 in the cosmoDC2 paper) for r< 21. The DESCQA test could be adapted to select galaxies with fainter magnitudes. The problem is finding validation data in this region. Was the clustering in the MICE catalog validated for r>22?

sschmidt23 commented 4 years ago

On a couple of the non-clustering issues: For the higher tail at high redshift than MICE, the DC2 p(z|m) was checked against DEEP2, as shown in Figure 13 of the cosmoDC2 paper. You can see that the DEEP2 r<22.0 sample does have a small tail extending to z~1.5, so I'd wonder more about whether MICE is missing a high redshift population.

For the sudden jumps in templates, I think (Eve or Andrew or others should say whether this is correct) that the Galacticus SAM models (and thus effective SEDs) are not fixed to the same set throughout the entire simulation, but are instead sampled at each redshift snapshot (actually, I think the paper says that there is an interpolation to five slices between each snapshot to reduce these discreteness issues, but the idea is the same). So, I think the sudden jumps in narrow color tracks that are obvious are due to these discrete sets of Galacticus properties/effective SEDs that change between the snapshots/interpolated slices. This is at least my understanding based on looking at section 5.3.4 of the cosmoDC2 paper.

evevkovacs commented 4 years ago

I followed up on the question I asked in my previous post. See Figs 8 and 9 of MICE II paper. The MICE clustering was validated for z~ 0.9 - 1.1 and 17 < i < 24 (Fig 8) and z ~0.45-.74 and 17.5 < r < 22.5 (Fig 9). I should also add that the cosmoDC2 clustering was validated at higher redshifts by comparing to DEEP2 data in the range 0.74 < z < 1.05 and stellar mass cuts of M > 10.5 and M >10.8. Would have to check how these M* cuts translate to magnitudes. This plot was not included in the cosmoDC2 paper but you can see it in Fig 4 in the validation paper draft or on the DESCQA web interface

evevkovacs commented 4 years ago

@sschmidt23 Responding to your second point above: yes, you are correct. The Galacticus galaxies that were used in cosmoDC2 were only available at discrete snapshots. We interpolated between these snapshots in 5 steps and built kde trees at each substep to find the best match to the properties already in the empirical model. In some regions of color-magnitude space, only limited matches were available so the same galaxy was resampled multiple times and the streaks that you see come from the redshift evolution of this same galaxy. On a different substep however, a different galaxy and hence different SED might be chosen, leading to the "jumps".

LSSTDESC / tomo_challenge

Clarifications on data properties #20