LSSTDESC / ComputingInfrastructure

Gathering place for CI - Computing and Infrastructure - issues

Storing OSWG files at NERSC #60

Open heather999 opened 2 years ago

heather999 commented 2 years ago

OSWG co-convener, @humnaawan, reached out concerning data storage at NERSC. Right now they are predominately using CSCRATCH and it seems clear that some or all of this data belongs under CFS. Likely some of that data belongs in the /global/cfs/cdirs/lsst/shared area.

Here is what I understand: the Rubin survey simulations team creates simulation sqlite databases, available at NCSA, that the OSWG brings over to NERSC so they can work with those sqlite dbs and perform some analysis. Some of these simulation databases have been used for their metrics paper - this data may more properly be stored to tape only, depending on OSWG plans to use that data now or in the future.

Draft To Do list

humnaawan commented 2 years ago

thank you @heather999 - your summary is accurate. i think most of the data we are working with right now is under active use so we don't need to put it on tape. this might change later of course.

i've (re)downloaded one set of the simulations at /global/cscratch1/sd/awan/dbs_fbs_v1.7 which should have lsst affiliation access enabled. there will be at least 2-3 similar folders that will be good to keep in the CFS area so that folks who need them can copy them directly, instead of having to keep downloading them (since local scratch is purged after 12 weeks) or deal with HPSS tape.

can you please clarify if one can directly download things to CFS? and is there a time limit for this space like the one for normal scratch space (12 weeks, i think)?

heather999 commented 2 years ago

Hi @humnaawan, CFS has no purge policy like CSCRATCH; we maintain that area ourselves and it is meant to be a more permanent place to store data. It's still a good idea generally to back things up to tape, but having it on CFS keeps the data available for active use.

Now the management of CFS depends on which subdirectory we're talking about. When I set up a new directory for the OSWG area, anyone in DESC will be able to write to that area - so you can download directly into that space. That area will have a 10TB quota. The OS co-conveners have the responsibility to keep an eye on the area and have the authority to organize how they prefer. Of course if you need any help just let us know. I'm in touch with NERSC to get that area set up.

The /global/cfs/cdirs/lsst/shared area is under more control and not just anyone in DESC can write to that space. It is meant to store data that is of general interest to DESC. I can imagine storing Rubin survey simulation output to /global/cfs/cdirs/lsst/shared/external/rubin-sim-data so it is generally available. To initiate a transfer into the shared area, you can reach out here on GitHub and open an issue. In the future, I think we have to provide a facility for DESC members to transfer data into shared or its successor, directly, but we need to talk about that more within the CO WG. A possible workflow would be for you to download the rubin sims into your /global/cfs/cdirs/lsst/groups/OS area or your CSCRATCH and open an issue here and we'll copy the data into shared.

Once I confirm with Joanne and Yao about a naming convention, I'll go ahead and copy the contents of /global/cscratch1/sd/awan/dbs_fbs_v1.7 into a new area under /global/cfs/cdirs/lsst/shared/external/rubin-survey-sim-data/dbs_fbs_v1.7. Would that be an acceptable directory name for this first set of sims?

yymao commented 2 years ago

Sounds reasonable to me. I don't have a strong opinion on the directory name. I think more importantly we should create a README file under rubin-survey-sim-data and document how the datasets were obtained, when, and by whom.

We can also consider adding an entry to GCRCatalogs to store these metadata.

heather999 commented 2 years ago

Sounds good. @humnaawan Can you help document how these datasets are obtained, when, etc? That will allow us to populate a README and then also utilize that information when an entry in GCRCatalogs is created.

heather999 commented 2 years ago

We now have a /global/cfs/cdirs/lsst/groups/OS area that the OS WG can use - just note the 10 TB quota. I have also copied /global/cscratch1/sd/awan/dbs_fbs_v1.7 into /global/cfs/cdirs/lsst/shared/external/rubin-survey-sim-data. We should still add a README and think about whether this data should have an entry in GCRCatalogs.

heather999 commented 2 years ago

Hi @humnaawan any input so we can try to add some entries for these sim files to GCRCatalogs and provide a useful README?

humnaawan commented 1 year ago

hi all, my apologies for dropping the ball on this. i wanted to share some updates:

now a few questions:

and a request:

thanks so much!

JoanneBogart commented 1 year ago

Others should correct me if I'm wrong, but I think the main reason for adding the sims to GCRCatalogs would be to make them easier to find; that is, make use of GCRCatalogs' registry functionality. As it happens, we're in the process of designing and implementing a new, still nameless, package to act as a registry, but not implement other GCRCatalogs functionality. Since - I think - all we're interested in here is the registry functionality, you have a choice of putting it in GCRCatalogs now (a simple process just involving the creation of a small yaml file) or waiting for the new thing, which could take a couple months. Once the new thing is ready we'll adapt GCRCatalogs to use it internally for registration and will make entries for all catalogs currently accessible via GCRCatalogs.

heather999 commented 1 year ago

Hi @humnaawan I think Joanne addressed the question about GCRCatalogs - how you want to proceed with that is up to you. Ultimately it is useful to provide a GCRCatalogs registry entry or use the new system to make the databases easier to find... but we don't absolutely need to deal with that now.

As far as storing the files - I would really suggest copying whatever you want to share into /global/cfs/cdirs/lsst/shared/external/. The idea was to accumulate all files of interest to DESC in one place to hopefully make them easier to find, even if they are also available at SDF or elsewhere. This area is also being copied over to CC-IN2P3 and ultimately backed up to tape at NERSC. I think databases that were utilized as part of DESC papers should appropriately be in this area, and potentially all of those rubin-sims databases have a reason to be there too. A few hundred GBs is not very large. The /global/cfs/cdirs/lsst/shared area is also set read-only for all DESC members, removing any concerns about inadvertently overwriting the contents, while the OS area is set up to be writable by all by default (that can be adjusted). So I think I'm suggesting that OS point people to a copy under /global/cfs/cdirs/lsst/shared and they can either read it directly from there or, as you suggested, copy it and make any changes they need to.
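
Since the shared copy is read-only, a reader can open a database in place without copying it first. The following is only a minimal sketch, assuming the files are ordinary SQLite databases (as described earlier in this thread) and using Python's standard sqlite3 module. The helper names `open_read_only` and `list_tables` are hypothetical, and nothing here assumes a particular table layout.

```python
import sqlite3
from pathlib import Path


def open_read_only(db_path):
    """Open a SQLite file in read-only mode so the file cannot be modified."""
    # as_uri() yields a file:// URI; mode=ro asks SQLite not to write.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


def list_tables(conn):
    """Return the names of the tables in the database, sorted alphabetically."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]
```

Calling `open_read_only` on a .db file under the shared area and then `list_tables` on the connection would show what the file contains. Any attempt to write through that connection raises `sqlite3.OperationalError`, so changes still require a private copy.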

I'll remove /global/cfs/cdirs/lsst/shared/external/rubin-survey-sim-data/dbs_fbs_v1.7 and I'm more than happy to copy over the whole /global/cfs/cdirs/lsst/groups/OS/rubin-sim-dbs to the /global/cfs/cdirs/lsst/shared/external area and get rid of /global/cfs/cdirs/lsst/external/rubin-survey-sim-data.

Also pinging @yymao for any other comments.

yymao commented 1 year ago

Not much besides what @JoanneBogart has said. At the current stage it'd be nice to have a registry entry (just a simple yaml file) in GCRCatalogs so that we can keep track of all the data being shared. No need to provide a reader.

humnaawan commented 1 year ago

thanks all!

re registering the dbs with GCRCatalogs: I don't have strong feelings re registering the dbs now with GCRCatalogs vs. later with the unnamed package - I'm happy to put this task on my todo list and I can try to get to it before the new package is released; I can most likely do this e.g. during Sprint Week. where can I find the documentation re the yaml file(s) that need to be created?

re the storage itself: @heather999 thanks! I'm happy for the dbs to be copied over to /global/cfs/cdirs/lsst/shared and that is the path we can advertise; it does relieve me that the files are read-only. I can add a note in /global/cfs/cdirs/lsst/groups/OS/ to not mess with the rubin-survey-sim-data folder but use the shared path for access. also thank you for removing the outdated /global/cfs/cdirs/lsst/shared/external/rubin-survey-sim-data/dbs_fbs_v1.7 folder.

yymao commented 1 year ago

Documentation here: https://github.com/LSSTDESC/gcr-catalogs/blob/master/CONTRIBUTING.md#preparing-a-catalog-config

But since in this case there's no reader, the specification is more or less free form. I think you just need to add something like:

path: ^/path/to/catalog
creators: ["Creator Name 1", "Creator Name 2", "Creator Name 3"]
description: "A short, human-readable description of this specific catalog."
is_pseudo_entry: true

JoanneBogart commented 1 year ago

One small thing - we unfortunately don't have a uniform way to specify the path to the catalog, but there are a small number of regularly-used keywords. As far as I know, path isn't one of them. For catalogs consisting of several files in a directory (so the directory is the thing to be referenced), base_dir is probably the most common, so that's what I recommend. If you just want to point to a single file, maybe use filename. I've seen some examples of that.
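
To make the suggestion concrete, here is a rough sketch of generating such a reader-less entry with base_dir in place of path. It combines the keys mentioned in this thread (base_dir, creators, description, is_pseudo_entry) and is an assumption about what a minimal entry could look like, not a confirmed GCRCatalogs schema. The function name `make_registry_entry` is hypothetical. It uses only the Python standard library and relies on JSON strings and lists also being valid YAML.

```python
import json


def make_registry_entry(base_dir, creators, description):
    """Build the text of a reader-less (pseudo) catalog entry as YAML."""
    lines = [
        # json.dumps produces double-quoted strings, which YAML also accepts.
        f"base_dir: {json.dumps(base_dir)}",
        # A JSON list is a valid YAML flow sequence.
        f"creators: {json.dumps(list(creators))}",
        f"description: {json.dumps(description)}",
        "is_pseudo_entry: true",
    ]
    return "\n".join(lines) + "\n"
```

For example, `make_registry_entry("^/path/to/catalog", ["Creator Name 1"], "A short description.")` returns text that could be saved as a small yaml file. The placeholder values are the ones from the example above and would need to be replaced with the real directory and names.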

heather999 commented 1 year ago

@humnaawan I've started the copy of the data into shared. I just wanted to clarify one point. Do you want me to rename /global/cfs/cdirs/lsst/shared/external/rubin-survey-sim-data to /global/cfs/cdirs/lsst/shared/external/rubin-sim-dbs so it matches the OS directory name?

humnaawan commented 1 year ago

hi @heather999 thank you for updating the directory name to /global/cfs/cdirs/lsst/shared/external/rubin-survey-sim-data - it looks great.

I have a couple of questions/requests:

heather999 commented 1 year ago

Hi @humnaawan Concerning the old subdirectory - I had just assumed anything in old was no longer of interest. Should this subdirectory be included in the /global/cfs/cdirs/lsst/shared/external/ area?

I'll start a recopy and just go ahead and include old. We can always remove it later.

heather999 commented 1 year ago

Hi @humnaawan I have re-copied the /global/cfs/cdirs/lsst/groups/OS/rubin-sim-dbs area to /global/cfs/cdirs/lsst/shared/external. The old subdirectory has one file (that I can see), minion_1016_desc_dithered_v3.db, and I don't have read access to it. I can copy over old if you do chgrp -R lsst old
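
Before or after a recursive group change like the one above, it can help to list which entries would still block a copy. This is only a sketch using Python's standard os and stat modules. The function name `find_group_access_problems` is hypothetical, and the check is deliberately simple: it reports entries that belong to a different group or lack the group-read bit, and ignores ACLs and "other" permissions.

```python
import os
import stat


def find_group_access_problems(top, gid):
    """Return sorted paths under `top` that are not owned by group `gid`
    or that lack the group-read permission bit."""
    problems = []
    for dirpath, dirnames, filenames in os.walk(top):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            info = os.lstat(path)  # lstat: do not follow symlinks
            if info.st_gid != gid or not info.st_mode & stat.S_IRGRP:
                problems.append(path)
    return sorted(problems)
```

On a Unix system the numeric id for the lsst group can be looked up with `grp.getgrnam("lsst").gr_gid` and passed as `gid`, with the old directory as `top`. An empty result suggests the group can read everything under that directory, within the limits noted above.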

humnaawan commented 1 year ago

hi @heather999 so the db in old is the db we used for DC2, I believe. at least a variation of it also went into some of the DESC publications (e.g., COSEP, 2018 DESC SRD); I say a variation since I think we used the original/undithered-but-post-processed-dithered version in these publications but I don't recall the details 100%.

I thought having an old benchmark might be good in case someone wants to do comparisons (although a lot has changed since minion_1016 so maybe this is not really something to consider). if you think we should not have old in the external folder, that's fine.

thanks for moving over the latest simulations. we're likely going to have another update in the coming days so I'll post here once I've downloaded them to /global/cfs/cdirs/lsst/groups/OS/rubin-sim-dbs.