LSSTDESC / DC2-production

Configuration, production, validation specifications and tools for the DC2 Data Set.
BSD 3-Clause "New" or "Revised" License

Using HPSS to free space at NERSC on projecta #343

Closed heather999 closed 4 years ago

heather999 commented 5 years ago

We are down to 77 TB on /global/projecta/projectdirs/lsst. We will need more room to store upcoming DC2 Run2.1i simulation and DM processed data.

We have shared project space on HPSS which is accessible by all DESC NERSC users via Globus, /home/projects/desc. This makes it much easier and faster for DESC collaborators to access data that is on tape. Given that, we should feel free to use HPSS as much as is reasonable.

A new Confluence page is being written which contains a list of what is in HPSS and how to access it: https://confluence.slac.stanford.edu/display/LSSTDESC/NERSC+HPSS

fjaviersanchez commented 5 years ago

@heather999, please feel free to remove everything except for the deepCoadd-results (168GB) subdirectory once they are in HPSS (actually, I think they are already backed up and that we don't even need IcSrc, IcExp, or ref_cats).

Also, please keep the registry (registry.sqlite3) and the mapper (_mapper).

katrinheitmann commented 5 years ago

@danielsf and @villarrealas Just to be a little more concrete about the instance catalogs: Once Antonio has done the image sims (e.g. now for Y1 and Y2), do we need the instance catalogs for anything else? For example, truth information? Of course we want to keep the instance catalogs that have not been simulated on projecta for now. Thanks a lot!

villarrealas commented 5 years ago

Naively, I think so, but I'm not sure what people would like to keep available as truth. I would imagine the CosmoDC2 catalog that was an input into the instance catalogs would be sufficient, but I could be wrong.

yymao commented 5 years ago

Some numbers:

All the pre-cosmoDC2 CS catalogs are just about 1TB (stored in /global/projecta/projectdirs/lsst/groups/CS/descqa/catalog)

For cosmoDC2 (stored in /global/projecta/projectdirs/lsst/groups/CS/cosmoDC2):

version    size (TB)
v0.1       1.5
v0.2       1.7
v1.0.x     18
v1.1.0     11

yymao commented 5 years ago

/global/projecta/projectdirs/lsst/groups/CS/MBII-DMO amounts to 2.3 TB. This potentially can be removed once backed up (did we back up the halo catalogs and trees?)

It was used by the DESCQA paper (which is now done) and some other DESC members, but I am not sure if they still need them.

cwwalter commented 5 years ago

What is the plan for making truth tables now? I mean what will go in them?

There is stuff in the instance catalogs that is not in CosmoDC2 that is useful (for example) for matching. There are stars of course but also the various components of the objects, disk/bulge/knots. We also have information in the centroid files so we might want to see if those have everything we need for truth tables.

danielsf commented 5 years ago

Anything that is in the InstanceCatalogs can be reproduced directly from cosmoDC2 and the various supporting data files in /projecta/.../groups/SSim/DC2/. My opinion is that, once an InstanceCatalog has been simulated, it can be moved to tape. Antonio is the final arbiter of which catalogs have been simulated and which have not.

heather999 commented 5 years ago

@villarrealas how do we go about determining which catalogs have been simulated and are ready to be stored only on tape?

villarrealas commented 5 years ago

Two methods - one is seeing if their subdirectory exists in y1-y2-wfd (or similar). That would suggest the data has already been run. The alternative is I could provide a list corresponding to that same information once Cori is back.
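A minimal sketch of the first method, assuming (hypothetically) that each instance catalog sits in its own subdirectory and that a finished simulation leaves a subdirectory of the same name under the output area. The function name and both path arguments are illustrative; a matching subdirectory only suggests the visit was run, so the result is a candidate list to cross-check against the list mentioned above.

```python
from pathlib import Path


def split_by_simulated(instcat_root, output_root):
    """Split instance-catalog subdirectories by whether output exists for them.

    Assumption (hypothetical layout): every catalog is a subdirectory of
    instcat_root, and a simulated catalog has a subdirectory with the same
    name under output_root (e.g. the y1-y2-wfd area).
    """
    simulated, pending = [], []
    for entry in sorted(Path(instcat_root).iterdir()):
        if not entry.is_dir():
            continue  # ignore stray files next to the catalog directories
        has_output = (Path(output_root) / entry.name).is_dir()
        (simulated if has_output else pending).append(entry.name)
    return simulated, pending
```

Only the names in the first returned list would be candidates for tape-only storage; the second list stays on disk.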

cwwalter commented 5 years ago

> Anything that is in the InstanceCatalogs can be reproduced directly from cosmoDC2 and the various supporting data files in /projecta/.../groups/SSim/DC2/. My opinion is that, once an InstanceCatalog has been simulated, it can be moved to tape. Antonio is the final arbiter of which catalogs have been simulated and which have not.

OK, that's good! But, I still think hearing from @wmwv about what truth databases we are making and how we will actually use those CosmoDC2 files to do this makes sense. It's good to understand if it would take a lot of CPU and I/O resources to re-create the information we are putting in the instance catalogs if we need that information (so that we don't burn CPU doing the same thing we have already done). In that case it might be good going forward to use those files before they are moved off disk.

heather999 commented 5 years ago

But just to be clear, those instance catalogs would be available on HPSS even if they are removed from projecta. They can easily be copied to CSCRATCH if people need them. We can now do this using Globus, and that will make such transfer much easier and faster.

heather999 commented 5 years ago

After discussion at the DC2 meeting this morning - I feel we have general agreement that we can back up all of the instance catalogs to HPSS, and remove those that have been simulated from projecta. @villarrealas it would be helpful to get a list of visits that have been simulated - just as independent confirmation that we're in agreement about which ones can be removed. In the meantime, I'll start a back up of all the instance catalogs.

danielsf commented 5 years ago

Just to add a soupcon of urgency to this conversation: the final InstanceCatalog generation job started after Cori came back up last night. I expect it will end up generating between 50TB and 55TB of data. Because of the workflow (generate the catalog; gzip the catalog; tar the catalog; gzip the tarball) I cannot promise that the footprint on disk won't actually peak at some value higher than that before the job ends.

danielsf commented 5 years ago

> OK, that's good! But, I still think hearing from @wmwv about what truth databases we are making and how we will actually use those CosmoDC2 files to do this makes sense. It's good to understand if it would take a lot of CPU and I/O resources to re-create the information we are putting in the instance catalogs if we need that information (so that we don't burn CPU doing the same thing we have already done). In that case it might be good going forward to use those files before they are moved off disk.

My experience doing validation of the InstanceCatalogs has been that reading an InstanceCatalog into a dataframe and crossmatching it in such a way as to produce, for instance, a light curve for a variable source is itself a very expensive process. Given that all of the truth information for static galaxies is now directly contained in cosmoDC2 and we only need to generate special truth tables for variable sources, my educated-but-still-back-of-a-napkin guess is that regenerating the light curves from scratch will be the most efficient option.

katrinheitmann commented 5 years ago

Do you mind summarizing the answer to Chris's question about the truth catalogs here so we have it all in one place for later access? Thanks a lot!

danielsf commented 5 years ago

@villarrealas

What if we moved the year 7-10 InstanceCatalogs to HPSS on the assumption that we aren't going to get there any time soon? When you think we are getting close to simulating those years, we can copy them from HPSS back to projecta.

danielsf commented 5 years ago

(or am I forgetting about the UK ImSim efforts?)

danielsf commented 5 years ago

Katrin backchanneled me and asked me to state the reasoning behind my belief that we do not need the InstanceCatalogs for truth information. Here it is:

1) Everything in the InstanceCatalogs can be generated from other data sources, and those data sources store the information more efficiently because they are not text files and do not duplicate information (recall that InstanceCatalogs represent pointings of the telescope meaning that, even if an object is perfectly static, it must be replicated in each InstanceCatalog that observes it).

2) Static truth information for galaxies is now contained in cosmoDC2. Since we hacked the SED normalization to demand that the magnitudes simulated by ImSim exactly match the mag_*_lsst columns in cosmoDC2, we do not need the InstanceCatalogs to tell us what the simulated magnitudes of static galaxies are. Even the disk/bulge/knot breakdown can be calculated from columns in the cosmoDC2_v1.1.4_image_addon_knots version of cosmoDC2. The only effect not included in cosmoDC2 is Milky Way dust. In order to include that in the truth information, we have to read in the galaxy's SED from disk, normalize and redshift the SED, and then apply dust extinction. This must happen even if we start from InstanceCatalogs, since InstanceCatalogs only contain the SED file name, the redshift, and the normalizing magnitude (which is not the observed magnitude). Reading and processing the SED in this way is usually the most expensive step in any process that involves handling SEDs, so we won't gain anything by starting from InstanceCatalogs as opposed to the data products used to produce the InstanceCatalogs.

3) As was stated above: because InstanceCatalogs only contain an SED file name and normalizing magnitude, we will still need to read in the SED and process it in order to produce light curves for time varying sources. There are more efficient, vectorized ways to do this in CatSim for all sources except supernovae. Even the supernovae, however, will probably be more efficiently generated from the data products underlying the InstanceCatalogs since they will be able to manipulate the SED entirely in memory, rather than reading a text file from disk for each observation of each supernova.

villarrealas commented 5 years ago

Moving wfd year 1-2 and year 7-10 to tape seems like it should be fine, given we do not need instance catalogs to validate.

danielsf commented 5 years ago

@heather999

Given Antonio's blessing above, you can move the following subdirectories of

/global/projecta/projectdirs/lsst/production/DC2_ImSim/Run2.1i/instCat/

to HPSS:

00000000to00071840/
00071840to00133541/
00133541to00201989/
00201989to00262897/
00262897to00327707/
00327707to00385844/
00385844to00445379/
00445379to00497969/
01713247to01784692/
01784692to01853668/
01853668to01920327/
01920327to01977250/
01977250to02044996/
02044996to02107672/
02107672to02168945/
02168945to02221327/
02221327to02280341/
02280341to02336181/
02336181to02391522/
02391522to02447932/

evevkovacs commented 5 years ago

The following directories in /global/projecta/projectdirs/lsst/groups/CS/cosmoDC2 can be moved to HPSS:

GalacticusLibraries
baseDC2_cosmoDC2_v0.1_v0.2
baseDC2_9.8C_v1.1
cosmoDC2_v0.2.0
cosmoDC2_v1.1.0
cosmoDC2_v1.1.0_knots_addon
cosmoDC2_v1.1.0_shear
cosmoDC2_v1.1.3_rs_scatter_query_tree

If the tar files have the same names as these directories, it will be clear what they pertain to. Thanks
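To illustrate the naming suggestion, here is a small sketch that builds one `htar -cvf` command line per directory, with each tar file named after the directory it holds. The helper name `plan_archives` and the HPSS destination passed to it are assumptions for illustration; the function only returns the command strings and does not run anything.

```python
import shlex


def plan_archives(directories, hpss_dir):
    """Return one htar command line per directory.

    Each archive is named <directory>.tar so that its contents are obvious
    from the file name. hpss_dir is a placeholder destination on HPSS.
    """
    commands = []
    for name in directories:
        name = name.rstrip("/")
        archive = f"{hpss_dir.rstrip('/')}/{name}.tar"
        # -c create, -v verbose, -f archive file name
        commands.append(shlex.join(["htar", "-cvf", archive, name]))
    return commands


# Example with two of the directories named above; the destination is assumed.
for line in plan_archives(["cosmoDC2_v0.2.0", "cosmoDC2_v1.1.0_shear"],
                          "/home/projects/desc"):
    print(line)
```

The commands would be run from inside the cosmoDC2 directory so that the archive members keep short relative paths.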

heather999 commented 5 years ago

Status report: we are now at 56 TB after the last round of instance catalog generation and some additional transferring of older DC1 data to HPSS. We are expecting about 26 TB from IN2P3 soon. I will quickly move to get the identified instance catalogs copied over to HPSS.

heather999 commented 5 years ago

Also working to clean up /global/projecta/projectdirs/lsst/production/DC1/DM/DC1-imsim-dithered, which is backed up on HPSS in /home/d/desc/DC1-imsim-dithered. I spent some time running fdiff on random files under calexp and deepCoadd and it does indeed look like the same directory of files. I found the email exchange from Sept. 2017 where we discussed removing icExp and would skip saving that directory to HPSS. I will go ahead and remove it from projecta as well. As requested by @fjaviersanchez, I will leave deepCoadd-results, _mapper, and registry.sqlite3 on projecta.
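A spot check of this kind could be sketched as follows. This is not the procedure actually used here; it assumes the archived copy has first been retrieved from tape to a disk location (`restored_root`, a hypothetical argument) and then byte-compares a random sample of files using Python's `filecmp` module.

```python
import filecmp
import random
from pathlib import Path


def spot_check(original_root, restored_root, sample_size=20, seed=0):
    """Byte-compare a random sample of files between two directory trees.

    Files are sampled from original_root and compared with the files at the
    same relative paths under restored_root. Returns the relative paths that
    are missing from restored_root or whose contents differ.
    """
    original_root, restored_root = Path(original_root), Path(restored_root)
    files = sorted(p.relative_to(original_root)
                   for p in original_root.rglob("*") if p.is_file())
    sample = random.Random(seed).sample(files, min(sample_size, len(files)))
    mismatches = []
    for rel in sample:
        other = restored_root / rel
        # shallow=False forces a content comparison, not just size and mtime
        if not other.is_file() or not filecmp.cmp(original_root / rel, other,
                                                  shallow=False):
            mismatches.append(str(rel))
    return sorted(mismatches)
```

An empty return value means every sampled file matched; it does not prove that the unsampled files match, so the sample size is a trade-off between confidence and I/O.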

heather999 commented 5 years ago

@danielsf At the CI meeting we discussed and agreed that rather than moving Y7-10 instance catalogs to tape right now, we will move Y3-5 instead. Can you help identify precisely which instance catalog directories on projecta that refers to?

danielsf commented 5 years ago

For future reference, the mapping between years and obsHistID is

year 1 ends with 262897
year 2 ends with 497969
year 3 ends with 741642
year 4 ends with 991924
year 5 ends with 1235518
year 6 ends with 1476730
year 7 ends with 1713247
year 8 ends with 1977250
year 9 ends with 2221327

In terms of the sub-directories on cori, this means that year 3-5 are in

00497969to00560434
00560434to00609114
00609114to00677003
00677003to00741642
00741642to00808341
00808341to00869159
00869159to00934266
00934266to00991924
00991924to01059856
01059856to01116617
01116617to01181839
01181839to01235518

(assuming that when you say "Y3-5" that is an inclusive range)
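For anyone repeating this lookup for another range of years, the selection can be sketched in a few lines. The year-end values are copied from the mapping above; the function names are illustrative, and the sketch assumes the directory names always follow the `STARTtoEND` pattern and that a requested range is inclusive.

```python
# obsHistID at which each survey year ends, copied from the list above.
YEAR_END = {1: 262897, 2: 497969, 3: 741642, 4: 991924, 5: 1235518,
            6: 1476730, 7: 1713247, 8: 1977250, 9: 2221327}


def year_bounds(first_year, last_year):
    """Return (start, end) obsHistID for an inclusive range of years."""
    start = 0 if first_year == 1 else YEAR_END[first_year - 1]
    return start, YEAR_END[last_year]


def dirs_for_years(dir_names, first_year, last_year):
    """Select 'STARTtoEND' directories that lie entirely inside the range."""
    start, end = year_bounds(first_year, last_year)
    selected = []
    for name in dir_names:
        lo, hi = (int(part) for part in name.strip().rstrip("/").split("to"))
        if lo >= start and hi <= end:
            selected.append(name.strip())
    return selected


print(year_bounds(3, 5))  # (497969, 1235518)
```

The bounds for years 3 to 5 line up with the first and last directories in the list above, which start at 00497969 and end at 01235518.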

heather999 commented 5 years ago

Due to the transfer of the y1-y2-wfd calexps (~90TB), we are down to 37 TB on projecta. Working to verify that the instance catalogs Y1-Y3 have been properly copied to tape, and then they will be removed. Will then move on to Y4-6 and do the same. Ongoing discussion with UK concerning whether they are done transferring Y7-10 (as of today, they are not): https://lsstc.slack.com/archives/CJ50YVDD3/p1560420608006300?thread_ts=1560183781.005600&cid=CJ50YVDD3

katrinheitmann commented 4 years ago

@heather999 Since projecta is now retired, I suppose we can close this issue? Thanks!