Updates to ACS data management plan

hot007 commented 2 years ago

While we're here, can we rename 'AWAP' to 'AGCD' please! :)

DamienIrving commented 2 years ago

@hot007 The existing AWAP directory on xv83 is actually some old modified AWAP data. The provenance of the modifications is a little unclear so it should probably be migrated to a personal directory in xv83 (and probably shouldn't be included in the new xv84).

The AGCD data that can go in the new xv84 is currently at /g/data/xv83/dbi599/agcd (and https://github.com/AusClimateService/agcd)

hot007 commented 2 years ago

Sorry about commenting on a merged PR, it just seems like the right place for it... Paola and I are reviewing this now. Comments: for xv84, maybe change the wording in some places to say "ACS and other researchers" to make clear it's available semi-publicly. The corollary is that data like 'authoritative' is for limited reuse (we don't want people publishing on it before it's fully QC'd and ready to move to xv85, but they may need access for validation or other model development), so that data shouldn't be too public. At the risk of project proliferation, we might need two projects: one for authoritative and one for replica/post-process.

hot007 commented 2 years ago

Re AGCD, we need to be careful: the research-only collection at NCI is in zv2 (I think, can check), but CSIRO have a commercially licensed product on our internal project, which we should draw from (even if it just means copying the same data back and forth!) so we're clear on permission to use it.

chloemackallah commented 2 years ago

Regarding separating authoritative vs replicas, I'm hesitant to split this out into even more projects due to the overhead. Perhaps some clear directions on which datasets are ready for use in publication; i.e. pretty much anything that isn't either a replica or in xv85 shouldn't be published on?

Also, I'm wondering whether post-processed reference datasets and straight-up replicas should be kept separate, as that could create confusion. Perhaps they need to be kept under a single directory 'reference_datasets', then split by replica/processed.

DamienIrving commented 2 years ago

I agree that we probably don't want even more projects. As we're finding out just from having two projects (xv83 and the just discovered ia39), estimating how the 3PB of storage associated with ACS should be split between just two projects is hard enough.

While I can see that it is potentially a little confusing to have the same dataset in both the replicas (e.g. original version) and post-processed (e.g. regridded version) directories, the nice distinguishing feature of the authoritative/, post_processed/ and replicas/ directories is the documentation requirements that go with anything stored in them:

authoritative/ requires basically all the information required of a dataset listed in the NCI data catalogue
replicas/ requires details of when and where the data was downloaded
post_processed/ requires a code subdirectory (which is preferably a git repo) that has details of the code, environment and data processing steps used to do the post processing

I guess the alternative would be to have dataset names as the highest directory level and then authoritative/, post_processed/ and replicas/ underneath. e.g.

ia39/
├── admin/
├── CAFE60v1/
│   └── authoritative/
│       └── README
└── AGCD/
    ├── replicas/
    │   └── README
    └── post_processed/
        ├── code/
        └── data/
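
As a rough illustration of how the documentation requirements could still be checked with dataset names at the top level, here is a minimal Python sketch. The function name `missing_docs`, the use of a `README` file as the marker for authoritative/ and replicas/, and the `code` subdirectory check for post_processed/ are assumptions for illustration only, not an agreed convention.

```python
from pathlib import Path

# Assumed (hypothetical) marker for each category: a README file for
# authoritative/ and replicas/, a code/ subdirectory for post_processed/.
REQUIRED = {
    "authoritative": "README",
    "replicas": "README",
    "post_processed": "code",
}


def missing_docs(project_root):
    """Return relative paths of required documentation entries that are absent."""
    root = Path(project_root)
    missing = []
    # Each top-level directory is treated as a dataset (or admin/, which has
    # no category subdirectories and so is skipped naturally).
    for dataset_dir in sorted(root.iterdir()):
        if not dataset_dir.is_dir():
            continue
        for category, required in REQUIRED.items():
            category_dir = dataset_dir / category
            # Only check categories that actually exist for this dataset.
            if category_dir.is_dir() and not (category_dir / required).exists():
                missing.append((category_dir / required).relative_to(root).as_posix())
    return missing
```

Calling `missing_docs` on a project root returns a list of the entries that would still need to be added. It only checks that the entry exists, not that its contents meet the requirement.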

hot007 commented 2 years ago

That's an interesting proposal. Might be worth surveying the user base to see which they prefer? I can see advantages each way but actually given people usually know the name of the dataset they're looking for (and are far more likely to know that than what state of copy it is in at NCI!) the second might be the more intuitive way to store it, even though it's arguably less functional on a management level?

DamienIrving commented 2 years ago

I think it's almost certain that users will find things easier to navigate if the dataset name is the highest directory level.

chloemackallah commented 2 years ago

That kind of structure is a good idea, and the inversion of dataset name/type will allow more flexibility later.

We will just have to keep in mind how we want to specify/name datasets that are less clearly defined, such as various scales of downscaling/modelling/bias-correction/etc. With ESCI, a challenge was clearly identifying which datasets were which because of this complexity (e.g. driving_GCM, CMIP6_exp, RCM, bias-corr_methodology), and this is going to be even more complex in ACS. Possibly a consideration for down the track, but perhaps these would be self-contained in Work Package-defined directories, and then with some best-practice guidance on naming, etc.
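
To make the naming question a bit more concrete, here is a minimal Python sketch of building a directory name from a fixed set of facets. The four facets come from the ESCI examples above; the class name `DownscaledDataset`, the facet order, the underscore separator and the placeholder values are all assumptions for illustration, not a proposed standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DownscaledDataset:
    # Facet names follow the examples mentioned in the comment;
    # the order used below is an assumption.
    driving_gcm: str
    cmip6_exp: str
    rcm: str
    bias_corr_methodology: str

    def directory_name(self, sep="_"):
        """Join the facets in a fixed order, rejecting ambiguous values."""
        facets = [
            self.driving_gcm,
            self.cmip6_exp,
            self.rcm,
            self.bias_corr_methodology,
        ]
        for facet in facets:
            # A facet containing the separator could not be parsed back
            # unambiguously, so refuse it up front.
            if not facet or sep in facet:
                raise ValueError(f"facet {facet!r} is empty or contains {sep!r}")
        return sep.join(facets)


# Placeholder values only; real facet vocabularies would need to be agreed.
example = DownscaledDataset("gcmA", "expB", "rcmC", "methodD")
name = example.directory_name()
```

Rejecting facet values that contain the separator is one way to keep names parseable later; a controlled vocabulary per facet would be a stricter alternative.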

AusClimateService / data-code-group

Updates to ACS data management plan #3