Open heather999 opened 2 years ago
thank you @heather999 - your summary is accurate. i think most of the data we are working with right now is under active use so we don't need to put it on tape. this might change later of course.
i've (re)downloaded one set of the simulations at /global/cscratch1/sd/awan/dbs_fbs_v1.7, which should have lsst affiliation access enabled. there will be at least 2-3 similar folders that will be good to keep in the CFS area so that folks who need them can copy them directly, instead of having to keep downloading them (since local scratch is purged after 12 weeks) or deal with HPSS tape.
can you please clarify if one can directly download things to CFS? and is there a time limit for this space like the one for normal scratch space (12 weeks, i think)?
Hi @humnaawan CFS has no purge policy like CSCRATCH; we maintain that area ourselves, and it is meant to be a more permanent place to store data. It's still a good idea generally to back things up to tape, but having it on CFS keeps the data available for active use.
Now the management of CFS depends on which subdirectory we're talking about. When I set up a new directory for the OSWG area, anyone in DESC will be able to write to that area - so you can download directly into that space. That area will have a 10TB quota. The OS co-conveners have the responsibility to keep an eye on the area and have the authority to organize how they prefer. Of course if you need any help just let us know. I'm in touch with NERSC to get that area set up.
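For keeping an eye on usage against the 10TB quota, something like the following could help. This is only a minimal sketch: it assumes the quota is counted as the sum of file sizes and that 1 TB means 10^12 bytes, neither of which is confirmed here, and the function names are made up for illustration.

```python
import os

QUOTA_BYTES = 10 * 10**12  # 10 TB, assuming decimal terabytes


def directory_size(root):
    """Return the total size in bytes of all regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so linked data is not counted twice.
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def quota_fraction(root, quota_bytes=QUOTA_BYTES):
    """Return the fraction of the quota used by files under root."""
    return directory_size(root) / quota_bytes
```

Calling `quota_fraction("/global/cfs/cdirs/lsst/groups/OS")` would then give a rough number to compare against 1.0; the filesystem's own quota tools remain the authoritative source.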
The /global/cfs/cdirs/lsst/shared area is under more control and not just anyone in DESC can write to that space. It is meant to store data that is of general interest to DESC. I can imagine storing Rubin survey simulation output to /global/cfs/cdirs/lsst/shared/external/rubin-sim-data so it is generally available. To initiate a transfer into the shared area, you can reach out here on GitHub and open an issue. In the future, I think we have to provide a facility for DESC members to transfer data into shared, or its successor, directly, but we need to talk about that more within the CO WG. A possible workflow would be for you to download the rubin sims into your /global/cfs/cdirs/lsst/groups/OS area or your CSCRATCH and open an issue here and we'll copy the data into shared.
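The copy step in that workflow could be sketched as below. This is not the procedure the CO WG actually uses, just one way to copy a directory tree and confirm the copy matches; the helper names `sha256_of` and `copy_and_verify` are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(src_dir, dst_dir):
    """Copy src_dir to dst_dir and return relative paths whose checksums differ."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    # copytree raises FileExistsError if dst_dir already exists, which
    # protects an existing copy from being silently overwritten.
    shutil.copytree(src_dir, dst_dir)
    mismatched = []
    for src_file in sorted(src_dir.rglob("*")):
        if src_file.is_file():
            dst_file = dst_dir / src_file.relative_to(src_dir)
            if not dst_file.is_file() or sha256_of(src_file) != sha256_of(dst_file):
                mismatched.append(str(src_file.relative_to(src_dir)))
    return mismatched
```

An empty list from `copy_and_verify` means every file in the source has an identical copy at the destination.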
Once I confirm with Joanne and Yao about a naming convention - I'll go ahead and copy the contents of /global/cscratch1/sd/awan/dbs_fbs_v1.7 into a new area under /global/cfs/cdirs/lsst/shared/external/rubin-survey-sim-data/dbs_fbs_v1.7. Would that be an acceptable directory name for this first set of sims?
Sounds reasonable to me. I don't have a strong opinion on the directory name. I think more importantly we should create a README file under rubin-survey-sim-data and document how the datasets were obtained, when, and by whom.
We can also consider adding an entry to GCRCatalogs to store these metadata.
Sounds good. @humnaawan Can you help document how these datasets are obtained, when, etc? That will allow us to populate a README and then also utilize that information when an entry in GCRCatalogs is created.
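The how/when/by-whom information for the README could be captured in a fixed layout so every dataset is documented the same way. A minimal sketch; the field labels and the function name `readme_entry` are illustrative assumptions, not an agreed DESC format.

```python
from datetime import date


def readme_entry(dataset, source, downloaded_on, downloaded_by, notes=""):
    """Return a plain-text provenance block for one dataset."""
    lines = [
        f"Dataset: {dataset}",
        f"Source: {source}",
        f"Downloaded on: {downloaded_on.isoformat()}",
        f"Downloaded by: {downloaded_by}",
    ]
    # Notes are optional, e.g. the exact download command that was used.
    if notes:
        lines.append(f"Notes: {notes}")
    return "\n".join(lines) + "\n"
```

The same fields could later be reused when a registry entry is written, so the README and the registry do not drift apart.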
We now have a /global/cfs/cdirs/lsst/groups/OS area that the OS WG can use - just note the 10 TB quota.
I have also copied /global/cscratch1/sd/awan/dbs_fbs_v1.7 into /global/cfs/cdirs/lsst/shared/external/rubin-survey-sims-data. We should still add a README and think about whether this data should have an entry in GCRCatalogs.
Hi @humnaawan any input so we can try to add some entries for these sim files to GCRCatalogs and provide a useful README?
hi all, my apologies for dropping the ball on this. i wanted to share some updates:
- there's now a folder, rubin-sim-dbs, in /global/cfs/cdirs/lsst/groups/OS/ which contains some of the released simulations (from scheduler version 1.5, 1.6, 1.7, 2.0, 2.1 so far). there's also a readme in the folder now that includes some details, including how I downloaded the dbs etc.

now a few questions:
- which dbs should be copied over to /global/cfs/cdirs/lsst/shared/external/? i was thinking that perhaps we can copy those that are used in DESC papers - but that itself is a lot since e.g. Locher+2021 used v1.5, 1.6, 1.7 dbs. maybe it doesn't make sense to copy anything over, especially since the dbs would, hopefully, remain permanently at the SDF facility now. what are your thoughts?
- how should folks use the dbs in /global/cfs/cdirs/lsst/groups/OS/? naively i would think anyone can just read the dbs from the space directly. if they need to update the databases or want a different folder structure etc, they should copy the relevant data over to e.g. their scratch space. i'll be curious to hear your thoughts re e.g. preventing/discouraging overwriting the downloaded data (since it does take a while to download - the total size of everything in the rubin-sim-dbs folder is 357G).
- i'm not sure what an entry in GCRCatalogs would entail and what the envisioned purpose would be - can someone please elaborate? in my experience, we've mostly, though not strictly, worked with derived quantities from these sims (using rubin_sim which has replaced MAF) so this would essentially be a way for folks to load the data without having to locate the .db files? even then, rubin_sim reads in the .db file so i'm not sure i'm seeing the utility - apologies if i should know this already.

and a request:
- can the folder, dbs_fbs_v1.7, be deleted from /global/cfs/cdirs/lsst/shared/external/rubin-survey-sim-data/ (note that folder name has changed from rubin-survey-sims-data mentioned above to rubin-survey-sim-data)? i say this since that folder has an un-nested structure (compared to /global/cfs/cdirs/lsst/groups/OS/rubin-sim-dbs/sims_featureScheduler_runs1.7/) which means the sims categories are not apparent at first glance (which could be confusing unless you know what you are looking for). btw if this un-nested structure, in your view, is better, i'm happy to un-nest the subfolders in /global/cfs/cdirs/lsst/groups/OS/rubin-sim-dbs/.

thanks so much!
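On reading the dbs directly without risking overwrites: assuming the .db files are sqlite databases, they can be opened read-only with Python's standard sqlite3 module. The sketch below only lists tables and row counts; it makes no assumption about which tables the simulation files contain, and `table_row_counts` is a made-up helper name.

```python
import sqlite3
from pathlib import Path


def table_row_counts(db_path):
    """Return {table_name: row_count} for every table in a sqlite file."""
    # mode=ro opens the file read-only, so the downloaded db cannot be
    # modified by accident through this connection.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    connection = sqlite3.connect(uri, uri=True)
    try:
        names = [
            row[0]
            for row in connection.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        return {
            name: connection.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        connection.close()
```

This is independent of rubin_sim and is only meant for a quick look at what a given file contains.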
Others should correct me if I'm wrong, but I think the main reason for adding the sims to GCRCatalogs would be to make them easier to find; that is, make use of GCRCatalogs' registry functionality. As it happens, we're in the process of designing and implementing a new, still nameless, package to act as a registry, but not implement other GCRCatalogs functionality. Since - I think - all we're interested in here is the registry functionality, you have a choice of putting it in GCRCatalogs now (a simple process just involving the creation of a small yaml file) or waiting for the new thing, which could take a couple months. Once the new thing is ready we'll adapt GCRCatalogs to use it internally for registration and will make entries for all catalogs currently accessible via GCRCatalogs.
Hi @humnaawan I think Joanne addressed the question about GCRCatalogs - how you want to proceed with that is up to you. Ultimately it is useful to provide some GCRCatalog registry entry or use the new system to make the databases easier to find... but we don't absolutely need to deal with that now.
As far as storing the files - I would really suggest copying whatever you want to share into /global/cfs/cdirs/lsst/shared/external/. The idea was to accumulate all files of interest to DESC in one place to hopefully make them easier to find, even if they are also available at SDF or elsewhere. This area is also being copied over to CC-IN2P3 and ultimately backed up to tape at NERSC. I think databases that were utilized as part of DESC Papers should appropriately be in this area and potentially all of those rubin-sims databases likely have a reason to be there too. A few hundred GBs is not very large. The /global/cfs/cdirs/lsst/shared area is also set read-only for all DESC members - removing any concerns about over-writing the contents inadvertently, while the OS area is obviously set up to be writable by all by default (that can be adjusted).
So I think I'm suggesting that OS point people to a copy under /global/cfs/cdirs/lsst/shared and they can either read it directly from there or, as you suggested, copy it and make any changes they need to.
I'll remove /global/cfs/cdirs/lsst/shared/external/rubin-survey-sim-data/dbs_fbs_v1.7 and I'm more than happy to copy over the whole /global/cfs/cdirs/lsst/groups/OS/rubin-sim-dbs to the /global/cfs/cdirs/lsst/shared/external area and get rid of /global/cfs/cdirs/lsst/external/rubin-survey-sim-data
Also pinging @yymao for any other comments.
Not much besides what @JoanneBogart has said. At the current stage it'd be nice to have a registry entry (just a simple yaml file) in GCRCatalogs so that we can keep track of all the data being shared. No need to provide a reader.
thanks all!
re registering the dbs with GCRCatalogs: I don't have a strong feeling re registering the dbs now with GCRCatalogs vs. later with the unnamed package - I'm happy to put this task on my todo list and I can try to get to it before the new package is released; I can most likely do this e.g. during Sprint Week. where can I find the documentation re the yaml file(s) that need to be created?
re the storage itself: @heather999 thanks! i'm happy for the dbs to be copied over to /global/cfs/cdirs/lsst/shared and that is the path we can advertise; it does relieve me that the files are read-only. I can add a note in /global/cfs/cdirs/lsst/groups/OS/ to not mess with the rubin-survey-sim-data folder but use the shared path for access. also thank you for removing the outdated /global/cfs/cdirs/lsst/shared/external/rubin-survey-sim-data/dbs_fbs_v1.7 folder.
Documentation here: https://github.com/LSSTDESC/gcr-catalogs/blob/master/CONTRIBUTING.md#preparing-a-catalog-config
But since in this case there's no reader, the specification is more or less free form. I think you just need to add something like:
path: ^/path/to/catalog
creators: ["Creator Name 1", "Creator Name 2", "Creator Name 3"]
description: "A short, human-readable description of this specific catalog."
is_pseudo_entry: true
One small thing - we unfortunately don't have a uniform way to specify the path to the catalog, but there are a small number of regularly-used keywords. As far as I know path isn't one of them. For catalogs consisting of several files in a directory (so the directory is the thing to be referenced), base_dir is probably the most common so that's what I recommend. If you just want to point to a single file, maybe use filename. I've seen some examples of that.
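Putting the two comments above together, a reader-less entry using base_dir could be generated as sketched below. The keys base_dir, creators, description and is_pseudo_entry are taken from this thread; whether GCRCatalogs expects any further fields is not confirmed here, and `registry_entry_yaml` is a hypothetical helper, not part of GCRCatalogs.

```python
import json


def registry_entry_yaml(base_dir, creators, description):
    """Return YAML text for a reader-less (pseudo) registry entry."""
    # json.dumps output is valid YAML for strings, booleans and flat lists,
    # which avoids depending on a YAML library just to quote values safely.
    fields = {
        "base_dir": base_dir,
        "creators": list(creators),
        "description": description,
        "is_pseudo_entry": True,
    }
    return "".join(f"{key}: {json.dumps(value)}\n" for key, value in fields.items())
```

Writing the returned text to a small .yaml file would give a starting point to check against the CONTRIBUTING guide linked above.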
@humnaawan I've started the copy of the data into shared. I just wanted to clarify one point. Do you want me to rename /global/cfs/cdirs/lsst/shared/external/rubin-survey-sim-data to /global/cfs/cdirs/lsst/shared/external/rubin-sim-dbs so it matches the OS directory name?
hi @heather999 thank you for updating the directory name to /global/cfs/cdirs/lsst/shared/external/rubin-survey-sim-data - it looks great.
I have a couple of questions/requests:
- /global/cfs/cdirs/lsst/shared/external/rubin-survey-sim-data does not contain the old subfolder that is in /global/cfs/cdirs/lsst/groups/OS/rubin-sim-dbs?
- can you re-copy rubin-sim-dbs from /global/cfs/cdirs/lsst/groups/OS/rubin-sim-dbs? I realized that there were a lot of empty folders in the folders, which I have cleaned up a little (readme is updated to reflect this). there's also a new folder sims_featureScheduler_runs2.99 with the latest sims.

Hi @humnaawan
Concerning the old subdirectory - I had just assumed anything in old was no longer of interest. Should this subdirectory be included in the /global/cfs/cdirs/lsst/shared/external/ area?
I'll start a recopy and just go ahead and include old. We can always remove it later.
Hi @humnaawan I have re-copied the /global/cfs/cdirs/lsst/groups/OS/rubin-sim-dbs area to /global/cfs/cdirs/lsst/shared/external. The old subdirectory has one file (that I can see). I don't have read access to it: minion_1016_desc_dithered_v3.db. I can copy over old if you do chgrp -R lsst old
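To see which paths would still need the group change, a quick check like the following could be run first. It is a sketch that only compares each path's group name with the expected one; it does not look at permission bits or ACLs, and the function names are hypothetical.

```python
import grp
import os


def group_name(path):
    """Return the group name owning path, or the numeric gid if unnamed."""
    gid = os.lstat(path).st_gid
    try:
        return grp.getgrgid(gid).gr_name
    except KeyError:
        return str(gid)


def paths_not_in_group(root, expected_group):
    """Return root and everything under it whose group is not expected_group."""
    candidates = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        candidates.extend(os.path.join(dirpath, name) for name in dirnames + filenames)
    return sorted(path for path in candidates if group_name(path) != expected_group)
```

For example, `paths_not_in_group("old", "lsst")` would list whatever a recursive chgrp still has to touch.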
hi @heather999 so the db in old is the db we used for DC2, I believe. at least a variation of it also went into some of the DESC publications (e.g., COSEP, 2018 DESC SRD); I say a variation since I think we used the original/undithered-but-post-processed-dithered version in these publications but I don't recall the details 100%.
I thought having an old benchmark might be good in case someone wants to do comparisons (although a lot has changed since minion_1016 so maybe this is not really something to consider). if you think we should not have old in the external folder, that's fine.
thanks for moving over the latest simulations. we're likely going to have another update in the coming days so I'll post here once i've downloaded them to /global/cfs/cdirs/lsst/groups/OS/rubin-sim-dbs.
OSWG co-convener, @humnaawan, reached out concerning data storage at NERSC. Right now they are predominantly using CSCRATCH and it seems clear that some or all of this data belongs under CFS. Likely some of that data belongs in the /global/cfs/cdirs/lsst/shared area.
Here is what I understand: The Rubin survey simulations team creates simulation sqlite databases, available at NCSA, that the OSWG brings over to NERSC so they can work with those sqlite dbs and perform some analysis. Some of these simulation databases have been used for their metrics paper - this data may more properly be stored to tape only, depending on OSWG plans to use that data now or in the future.
Draft To Do list
- shared or in the OS group space while work is ongoing - CO + OS
- shared - Joanne, Yao, Heather + OS