LSSTDESC / DC2-production

Configuration, production, validation specifications and tools for the DC2 Data Set.
BSD 3-Clause "New" or "Revised" License

CosmoDC2_5000 production: remove extraneous galaxies and properties #336

Closed: evevkovacs closed this issue 5 years ago

evevkovacs commented 5 years ago

CosmoDC2_5000 will be delivered in healpixel files covering 1 octant of the sky, over 3 redshift ranges (0-1, 1-2, 2-3). Here is the slide from the DC2 telecon (https://docs.google.com/presentation/d/1T-E30F7JY-I5-BTGYyZO8mvgTMjOiGBJtT7e4uht2qg/edit#slide=id.g58c701cf6b_0_22) to frame the discussion:

Scaling up from cosmoDC2

- Size of cosmoDC2_5000 with no reductions will be ~70 TB
- Includes many ultra-faint galaxies, not needed by any proposed analysis
- Includes extra galaxy properties (native properties) not in the schema; cosmoDC2 has ~500 columns (~60 + ~30 LSST filters + 180 "SED" properties are in the schema)
- Would be more efficient for users to remove extra information
- Would reduce the number of files needed; with Nside=32 and no cuts: 4608 files with sizes 5 GB, 15 GB, 20 GB for the 3 z ranges
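
For reference, the 4608-file count quoted above follows directly from the healpixel geometry; a minimal sketch of the check (assuming healpy is available, this is not part of the production pipeline):

```python
import healpy as hp

nside = 32
npix_full_sky = hp.nside2npix(nside)  # 12 * nside**2 = 12288 pixels over the full sky
npix_octant = npix_full_sky // 8      # catalog covers 1 octant of the sky -> 1536 healpixels
n_z_ranges = 3                        # z = 0-1, 1-2, 2-3

print(npix_octant * n_z_ranges)       # 4608 files, as quoted above
```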

Possible reductions

- Up to a factor of 2 for removal of non-schema columns
- Factors for cuts on faint galaxies:

| Cut | r < 28 | r < 29 | Not all (u,g,r,i,z,Y) > 28 | Not all (u,g,r,i,z,Y) > 29 |
| --- | --- | --- | --- | --- |
| Factor | 4.6 | 2.4 | 2.8 | 1.8 |
rbiswas4 commented 5 years ago

@evevkovacs Quick question on this: when you say ultra-faint galaxies, do you mean galaxies from the same population that has been validated as part of the extragalactic catalog, but which are dim at high redshifts (not an extension of the population to lower brightnesses and smaller galaxies)?

yymao commented 5 years ago

Cutting the overall size of the dataset is not equivalent to making it easier for users to access. The former should be driven by the constraints on the data production side (e.g., disk space, production time).

Hence, I think we should first determine the production constraints, which will help us figure out how to actually trim down the dataset so that the production cost is reasonable.

Once that step is done, we can then further discuss how to facilitate data access (e.g., smartly partitioning the datasets, using the reader to help access) to satisfy the needs of various groups.

katrinheitmann commented 5 years ago

I disagree. We should start from the science cases that we need to support. If there is nobody in DESC who will ever look at the information we provide, we should not waste resources. If there is indeed a science case that needs all 500 columns and down to all the faintest galaxies we can provide that data. If there is nobody who needs this, we should use the space for more important data products.

As I said at the telecon, while maybe 70TB vs 35TB is not a huge deal, we have to get into a mode where we make informed decisions about data products that we will keep and data products that are not needed. If we continue to keep everything without justification, we will run out of space eventually.

evevkovacs commented 5 years ago

By "ultra-faint galaxies" I mean the synthetic galaxies that were added to cosmoDC2 to compensate for the mass resolution of the simulation. They fill out the population of galaxies at faint magnitudes for all redshifts and are needed to pass the dn/dmag test. See, for example, Apparent Magnitude Test

rbiswas4 commented 5 years ago

Thanks @evevkovacs ! So these are a population that would be pretty hard to find in other galaxy catalogs that people produce, and they have been validated. So, I wonder if, rather than cutting off these galaxies entirely through a magnitude cut, it makes sense to save space by cutting down from 500 columns to a few summary columns for them? Obviously, good summary columns depend on science cases ... so we need to list those first.

Just to clarify: we are talking about what science cases will find the catalog-level (not image) information on such galaxies useful.

Because these galaxies have small stellar masses but are numerous, what fraction of the total stellar mass do they contribute when added up? I suppose I should be able to do that calculation with the DC2 extragalactic catalog and GCRCatalogs, right?
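
A minimal sketch of that calculation with GCRCatalogs, using a plain magnitude cut as a stand-in for identifying the synthetic ultra-faints (the quantity names and the r > 28 proxy are my assumptions, not a definitive selector for the synthetic population):

```python
import GCRCatalogs

# Load the small extragalactic catalog for a quick estimate.
cat = GCRCatalogs.load_catalog('cosmoDC2_v1.1.4_small')
data = cat.get_quantities(['stellar_mass', 'mag_true_r_lsst'])

# Proxy for the ultra-faint synthetic population: everything fainter than the
# proposed cut (an explicit synthetic-galaxy flag, if present, would be better).
faint = data['mag_true_r_lsst'] > 28.0

frac = data['stellar_mass'][faint].sum() / data['stellar_mass'].sum()
print(f"fraction of total stellar mass in r > 28 galaxies: {frac:.2%}")
```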

fjaviersanchez commented 5 years ago

Just my two cents: ideally, having the full simulation would be great. However, if storage is a concern, and assuming that we already have the ~400 sq deg catalog going fainter than r=28 (or 29 or any other cut), I don't see many science cases that will need the full 5000 sq deg going that deep (maybe photo-z?).

joezuntz commented 5 years ago

Is there a list somewhere of which columns you're considering removing? And/or a description of the ultra-faint galaxies (e.g. are they clustered or randomly positioned?)

Also, if it's possible (and I know this may be a pain!), an approximate SNR histogram of the objects that would be removed would be useful to help understand the impact. To be even more of a pain, some images with and without those objects, just to eyeball, would also be useful.

rbiswas4 commented 5 years ago

@fjaviersanchez Good point! 400 sq degrees might be plenty for some of the things we might think of.

evevkovacs commented 5 years ago

Again, the ultra-faint galaxies were included in the catalog to supply galaxies for weak-lensing and deblending studies. Ultra-faints brighter than r=29 were used in the image simulations. The goal was to provide a set of galaxies with stellar masses that are complete to values that would fall below the mass resolution of the simulation. The ultra-faints are randomly distributed within a healpixel. They have been assigned a synthetic halo mass and an associated stellar mass based on an extrapolation of the subhalo mass function for the MultiDark Planck 2 simulation. (This is described in the cosmoDC2 paper.) The only validation test that has been done on these objects is the number density test (dn/dmag). We know that number density as a function of the assigned magnitudes falls within the specified criteria. No other validation test had observational data that was relevant for these objects (because they are too dim).

evevkovacs commented 5 years ago

This issue also exists in the cosmoDC2 repo as [cosmoDC2 issue #65](https://github.com/LSSTDESC/cosmodc2/issues/65). @rmandelb posted comments there, which I reproduce here for convenience:

> Just to record some factors relevant to "science impact" based on today's DC2 telecon plus my own thoughts:

evevkovacs commented 5 years ago

@joezuntz You would have to make the SNR plots from DM products, since cosmoDC2 does not have magnitude errors. The ~300 quantities not included in the schema are as follows:

- LSST filter luminosities for disk, bulge and total, with and without host extinction
- SDSS filter luminosities for disk, bulge and total, with and without host extinction
- baseDC2 properties (see below)
- emission lines for disk, bulge and total
- B, V, He, LyC and OxygenContinuum filter luminosities (used for determining A_V, R_V and emission-line properties)
- disk, bulge and total star-formation rates and metallicities
- black-hole properties (mass, accretion rate, Eddington ratio)
- some Galacticus information pertaining to the matchup between empirical and Galacticus galaxies

The baseDC2 properties, which are generated from the empirical model, are: _obs_sm_orig_um_snap, galaxy_id, halo_id, host_centric_vx, host_centric_vy, host_centric_vz, host_centric_x, host_centric_y, host_centric_z, host_halo_mvir, host_halo_vx, host_halo_vy, host_halo_vz, host_halo_x, host_halo_y, host_halo_z, hostid, is_on_red_sequence_gr, is_on_red_sequence_ri, lightcone_id, lightcone_replication, lightcone_rotation, mpeak, mvir, obs_sfr, obs_sfr_percentile, obs_sm, restframe_extincted_sdss_abs_magg, restframe_extincted_sdss_abs_magi, restframe_extincted_sdss_abs_magr, restframe_extincted_sdss_gr, restframe_extincted_sdss_ri, sfr, sfr_percentile, sm, source_halo_id, source_halo_mvir, target_halo_fof_halo_id, target_halo_id, target_halo_mass, target_halo_vx, target_halo_vy, target_halo_vz, target_halo_x, target_halo_y, target_halo_z, upid, vmax, vpeak, vx, vy, vz, x, y, z. Some of these are identical to those in the schema.
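
For anyone who wants to inspect the schema/native split programmatically, the GCR reader exposes both lists; a small sketch (this only shows what the reader reports, not exactly which columns would be dropped in production):

```python
import GCRCatalogs

cat = GCRCatalogs.load_catalog('cosmoDC2_v1.1.4_small')

schema = set(cat.list_all_quantities())         # quantities defined in the public schema
native = set(cat.list_all_native_quantities())  # raw columns stored in the files

# Native columns with no direct schema counterpart are the candidates discussed
# above (note some schema quantities are derived from several native columns,
# so this is only a rough comparison).
print(len(schema), len(native), len(native - schema))
```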

katrinheitmann commented 5 years ago

(and just for completeness the r~29 cut for the image simulations was discussed here: https://docs.google.com/presentation/d/1FekyEB5lEMy7m8Jst4_N5pK-a_iWsL-Po3Wx7E9wA-E/edit#slide=id.g47b4924efc_0_0 on page 14)

cwwalter commented 5 years ago

@evevkovacs I wonder, if you set this procedure up to make the files with a well defined cut, how much work would it be to re-run it if needed?

If no one thinks of anything compelling now that really justifies this (I have a couple of ideas, but I'm not really sure), perhaps you could make the sample now with Mike's suggested cut (at least 29 in all bands), giving people something concrete to use. Then, if after trying it someone identifies an actual case they are missing, one could re-run.

FWIW, I think the possible cases might come up when we use the full depth processed images to build emulators for things like WL cases and then want to run on the full larger sample.

katrinheitmann commented 5 years ago

I am not understanding your last comment. We don't have full depth processed images. As mentioned above, for the image simulations we cut at r~29 (as we had agreed to).

katrinheitmann commented 5 years ago

I also think it would be great if people would use the cosmoDC2_440 catalog (which has no cuts at all) to determine what would be missed if we do make cuts. Maybe that would take too long to arrive at a decision?

cwwalter commented 5 years ago

> I am not understanding your last comment. We don't have full depth processed images. As mentioned above, for the image simulations we cut at r~29 (as we had agreed to).

Oh.. so what I meant was there was also a suggestion to possibly cut shallower (e.g. at 28). So, at least for the purposes of emulators that we build using processed images that we want to then use with the larger catalog, it might be good to make sure the catalog is always at least as deep as the image one.

Then, in addition, there could be other reasons that people have (even at 29... I don't know), and I was asking generally how bad it would be to only re-run if people tried to use the catalog and saw there was a problem.

sschmidt23 commented 5 years ago

For photo-z, this is much fainter than we think will be usable. I recently brought up the fact that it seems a huge waste of compute time and storage space to compute p(z) for a bunch of faint galaxies that will have very poorly constrained photo-z's. Even at full 10-year survey depth the S/N in i-band, for example, is something like five at i=26.5-26.8, and we do not expect very good photo-z's beyond this point, so we were thinking of truncating p(z) compute/storage at around i<26.5. The median r-i color for the i~26.5 sample is about 0.5, and only about 1% of the i<26.5 galaxies have r>28. So, cutting galaxies fainter than r=28 seems to be fine for photo-z for the 5000 sq deg sample.

rmandelb commented 5 years ago

@sschmidt23 - question about something you wrote:

> Even at full 10-year survey depth the S/N in i-band, for example, is something like 26.5-26.8

Should this be "Even at full 10-year survey depth the S/N in i-band, for example, is something like 5 for galaxies at 26.5-26.8"? (italicized bit was added by me - could be wrong, but I think something may have been missing...)

@joezuntz - going back to your question about SNR, in the absence of images you can use the single-visit point-source depths to estimate what it would be if you treat the galaxies as point-sources (which is of course a terrible approximation for the big/bright galaxies but OK for the really faint ones). I think Sam's calculations may be a useful guide.
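
Concretely, for a background-limited point source that estimate is just a scaling from the 5-sigma limiting magnitude; a minimal sketch (the m5 value below is a placeholder, to be replaced with the relevant single-visit or coadd depth):

```python
def point_source_snr(mag, m5):
    """Approximate S/N of a point source of magnitude `mag`, given a
    5-sigma limiting magnitude `m5` (background-limited approximation)."""
    return 5.0 * 10.0 ** (-0.4 * (mag - m5))

# Example: with a placeholder i-band depth of m5 = 26.8, an i = 26.5 point
# source comes out at S/N ~ 6.6, consistent with Sam's numbers above.
print(point_source_snr(26.5, 26.8))
```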

sschmidt23 commented 5 years ago

Yes, sorry, I went to double check my numbers and then forgot to actually enter the S/N~5.

evevkovacs commented 5 years ago

Here is a plot showing the magnitude distributions (for cosmoDC2_v1.1.4_small) in u, g, r, i, z, Y for the "Jarvis" cut suggested above (any(u, g, r, i, z, Y) < m_cut, where m_cut = 28):

[figure: per-band magnitude distributions with the cut applied]

The cut is most pronounced for Y band. Other distributions have tails extending to fainter magnitudes. This is for all galaxies, with no cuts on redshift.
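
For reference, a minimal sketch of applying the same "Jarvis" cut when reading the catalog with GCRCatalogs (the quantity names assume the mag_true_<band>_lsst convention with lowercase y for the Y band, and the reduction factor here is in galaxy count, not file size):

```python
import numpy as np
import GCRCatalogs

cat = GCRCatalogs.load_catalog('cosmoDC2_v1.1.4_small')
bands = 'ugrizy'
m_cut = 28.0

quantities = [f'mag_true_{b}_lsst' for b in bands]
mags = cat.get_quantities(quantities)
mag_stack = np.vstack([mags[q] for q in quantities])

# "Jarvis" cut: keep a galaxy if it is brighter than m_cut in at least one
# band, i.e. drop it only when it is fainter than m_cut in all six bands.
keep = np.any(mag_stack < m_cut, axis=0)
print(f"kept {keep.sum()} of {keep.size} galaxies "
      f"(reduction factor {keep.size / keep.sum():.2f})")
```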

drphilmarshall commented 5 years ago

Checking in from the SL catalog detection "Magnificat" project: @jiwoncpark and I don't need any objects fainter than 24 or so, so we can work with pretty much any magnitude-cut catalog the rest of you want to make. Looking at the columns, we might need all of them for our emulator (I can imagine halo properties, star-formation properties, red-sequence properties, etc. all helping inform derp's attempts to predict LSST measurements), but we'd be fine with joining sub-tables together if you wanted to split the data by type of astrophysical property (e.g. halo properties could go in one table, spectroscopic properties in another, etc.).

katrinheitmann commented 5 years ago

@evevkovacs Has a conclusion been reached on this issue? If so, we should close it. If not, we should come to an agreement and close it ...

rbiswas4 commented 5 years ago

Just to be more explicit: I checked with the people I thought might be interested, but they were not that interested in extending the catalogs beyond 400 sq deg.

boutigny commented 5 years ago

I had raised some concerns about the proposed magnitude cut. This was related to a project to use CosmoDC2_5000 for gravitational-wave stochastic background simulation, but we haven't been able to reach a definite conclusion on the usefulness of keeping those faint galaxies. So, as far as this GW project is concerned, I would suggest going ahead with the "Jarvis cut". And yes, we are very interested in the 5000 deg² catalog! Thanks to the people involved in this catalog production.

rmandelb commented 5 years ago

And just to chime in from what I've seen, I am aware of at least two active projects that would be ready to make use of the 5000 deg² catalog with the Jarvis cut.

evevkovacs commented 5 years ago

We will go ahead with the Jarvis cut. The tentative conclusion on properties is to keep all of them. We are just now incorporating a few model improvements into our pipeline, so are not ready to "push the button" quite yet.

katrinheitmann commented 5 years ago

With Eve's summary, I close this now. If anybody feels we need to discuss more, this would have to happen very soon...