cedadev / ccmi-2022

CCMI-2022 - in support of the WMO/UNEP Scientific Assessment of Ozone Depletion Report 2022
BSD 2-Clause "Simplified" License

Directory Structure for CCMI-2022 #4

Open charliepascoe opened 3 years ago

charliepascoe commented 3 years ago

Agree a directory structure for CCMI-2022

charliepascoe commented 3 years ago

Here is the directory structure for CCMI-1:
data/CCMI-1/output/<institution_id>/<model_id>/<experiment_id>/<frequency>/<modeling_realm>/<mipTable>/<ensemble-id>/<version>/<variable>
e.g. data/CCMI-1/output/MOHC/HadGEM3-ES/refC1/mon/atmos/monthly/r1i1p1/v1/zmo3
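As an illustration of how the CCMI-1 template maps facet values onto a path, here is a minimal Python sketch. The function name `ccmi1_path` and the dictionary keys are assumptions made for this example (the template's `<ensemble-id>` is spelled `ensemble_id` so it can be used as a key); only the facet order and the example values come from the comment above.

```python
from pathlib import PurePosixPath

# Facet order copied from the CCMI-1 template quoted in the thread.
CCMI1_FACETS = (
    "institution_id", "model_id", "experiment_id", "frequency",
    "modeling_realm", "mipTable", "ensemble_id", "version", "variable",
)


def ccmi1_path(facets, root="data/CCMI-1/output"):
    """Join the facet values, in template order, beneath the archive root."""
    missing = [name for name in CCMI1_FACETS if name not in facets]
    if missing:
        raise KeyError(f"missing facets: {missing}")
    return PurePosixPath(root, *(facets[name] for name in CCMI1_FACETS))


# Values taken from the example path given in the thread.
example = {
    "institution_id": "MOHC",
    "model_id": "HadGEM3-ES",
    "experiment_id": "refC1",
    "frequency": "mon",
    "modeling_realm": "atmos",
    "mipTable": "monthly",
    "ensemble_id": "r1i1p1",
    "version": "v1",
    "variable": "zmo3",
}
print(ccmi1_path(example))
# data/CCMI-1/output/MOHC/HadGEM3-ES/refC1/mon/atmos/monthly/r1i1p1/v1/zmo3
```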

charliepascoe commented 3 years ago

The only thing I would change about the directory structure for CCMI-2022 would be in the way we implement the version directory. I would recommend that we use the date of the upload (yyyymmdd) as the version identifier rather than an incremental integer. In this way the CCMI-2022 directory structure would be more similar to the conventions used by CMIP6.

ccmi1-test commented 3 years ago

I am quite happy with the suggestion to move from a version number to a date. I noticed the CMIP6 DRS specifies the <version> to have a 'v' at the beginning. I think it would be helpful here, if it is not too much trouble? CMIP6 also puts the <version> number at the very lowest level in the directory structure. I can see advantages to having it at the end, as people would normally search for particular variables, and it is only at that point that different versions become important. On a less rational level, not being the most technically savvy user, I do sometimes root around in the archive manually and the directory always bugged me. It feels a bit redundant given the <mipTable>. Does it simplify things for organizing the data or data discovery, or could it be dropped?

ccmi1-test commented 3 years ago

I just realized my previous comment was missing an important word. Where I was commenting about what I felt was a 'redundant' directory level, I was referring to the directory.

ccmi1-test commented 3 years ago

Maybe it wasn't me... Seems when I put the word 'frequency' in angle brackets it disappears. Maybe <_frequency_> works better?

charliepascoe commented 3 years ago

I agree that the <frequency> directory seems redundant in the directory tree (given that we also have the <mipTable> directory). However, since we want the CCMI-2022 directory structure to be compatible with the ESGF CMIP6 data, I think we need to keep it.

charliepascoe commented 3 years ago

> Maybe it wasn't me... Seems when I put the word 'frequency' in angle brackets it disappears. Maybe <_frequency_> works better?

Yes, it is weird. I'm using the "insert code" feature from the menu above the comment box to make my angle-bracket text appear.

charliepascoe commented 3 years ago

With respect to the directory structure of the archive, we have a little more time to work that out. Aside from the version, the filenames will have sufficient information to be able to place them in the directory tree. I use the Unix find command to select the files that people upload, so it really doesn't matter what structure they arrange them in. They can upload either a flat list of files or files in a directory tree.
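To show why the upload layout does not matter when files are selected recursively, here is a minimal Python sketch of the same idea. It is a stand-in for the Unix find command, not the script actually used at CEDA; the function name `find_netcdf_files` and the assumption that uploads are `.nc` files are illustrative.

```python
from pathlib import Path


def find_netcdf_files(upload_root):
    """Return every .nc file below upload_root, however deeply nested.

    rglob() walks the whole tree, so a flat list of files and a
    directory tree are handled in the same way.
    """
    return sorted(p for p in Path(upload_root).rglob("*.nc") if p.is_file())
```

Calling `find_netcdf_files` on a user's upload directory returns the same set of files whether they were deposited at the top level or inside nested subdirectories.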

My suggestion for the structure of the upload area is that users make a ccmi-2022 directory and then create version directories beneath it, one for each batch of files they upload.
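A small Python sketch of how a batch directory name could be derived, assuming the date-based version identifier proposed earlier in this thread together with the 'v' prefix used by CMIP6. The function name `batch_directory` and the example date are made up for illustration.

```python
from datetime import date
from pathlib import PurePosixPath


def batch_directory(upload_date, delivery="ccmi-2022"):
    """Return '<delivery>/vYYYYMMDD' for a batch uploaded on upload_date."""
    return PurePosixPath(delivery) / upload_date.strftime("v%Y%m%d")


# Hypothetical upload date, used only to show the resulting name.
print(batch_directory(date(2021, 6, 1)))  # ccmi-2022/v20210601
```

Because the identifier is zero-padded year, month, and day, sorting the directory names alphabetically also sorts the batches chronologically.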

ccmi1-test commented 3 years ago

Sounds good. I'll add directions on how to submit data that reflect your comments. So when a user creates an account and logs on to upload data, would they be in their own subdirectory or a shared space? To put the question another way, do they need to create an initial account along the lines of <source_id>?

charliepascoe commented 3 years ago

When a user logs in to upload data they would be in their own subdirectory on our arrivals machine. I then look in their personal directory on the arrivals machine to find their ccmi-2022 data uploads.

ccmi1-test commented 3 years ago

I am trying a test upload to CEDA now, using the web interface on arrivals.ceda.ac.uk. I have created a new delivery as the first step, which takes me to the page where I need to assign a name. When I give the delivery the name 'ccmi-2022' I can see that the name becomes part of the directory structure, and with ftp (or rsync, I assume?) users can then create a version subdirectory (vYYYYMMDD) and deposit data under that. I have done that and uploaded a single file by ftp. But when I come back to the web page and go to 'Review submission' I am told the delivery was not submitted because of the missing metadata.yaml file. If the CCMI data submission requires a yaml file, can we provide groups with a template?

ccmi1-test commented 3 years ago

One other question I have - Under 'Other upload methods' for ftp or rsync the system generates a new password. Does this password only affect the arrivals machine for ftp or rsync? The next time I go to log on to the arrivals web interface, will I need to use the new password? And is there an option to use my original password for the data transfer or must users always click on 'Generate new password' and use the system generated one?

charliepascoe commented 3 years ago

I've just had a look in your dplummer directory in our arrivals/users area and I can see the toz_monthly_CMAM_refC1_r1i1p1_196001-201012.nc file that you uploaded today so it looks like that part worked.

The metadata.yaml file is used to collect information for the catalogue records. As far as I am aware, the .yaml file is not a requirement for uploading data but we can use yaml files as a method for the ccmi-2022 modelling groups to provide information for their catalogue records.

I haven't heard of the other upload methods generating new passwords; I'll follow that up with my colleagues in the morning.

ccmi1-test commented 3 years ago

Through the web interface I can see the file is on the arrivals machine, but then the website asks me to review and confirm submission before submitting delivery and that is where the process is stopped by the requirement for the yaml file. Maybe I am making this too complicated by going through the web interface? Is it okay if users just ftp or rsync to arrivals.ceda.ac.uk and create the required directories (CCMI-2022/vYYYYMMDD) and deposit data there?

ccmi1-test commented 3 years ago

I can also add that while I can still log on to my CEDA account using my original password, this does not work for the ftp connection to arrivals.ceda. I now need to use the password that was generated by the upload page when I clicked on "Other upload methods > ftp".

ccmi1-test commented 3 years ago

I just double-checked the CMIP6 DRS document and the directory structure is given as:
<mip_era>/<activity_id>/<institution_id>/<source_id>/<experiment_id>/<member_id>/<table_id>/<variable_id>/<grid_label>/<version>
This would then be the exact same structure used here?

charliepascoe commented 3 years ago

Yes, we'd use the same directory structure.
Martin noticed that the ccmi-2022 CV has set cmip6 as the mip_era; he has suggested we use post-cmip6 instead: https://github.com/cedadev/ccmi-2022/issues/8

charliepascoe commented 3 years ago

Comparing with our cmip6 data, the ccmi-2022 directory structure will be:
ccmi-2022/data/<mip_era>/<activity_id>/<institution_id>/<source_id>/<experiment_id>/<member_id>/<table_id>/<variable_id>/<grid_label>/<version>
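A short Python sketch of how a path following this template could be split back into its facets, which may be handy when checking uploads against the agreed structure. The function name `parse_archive_path` is an assumption, and every value in the example path apart from `post-cmip6` is a placeholder invented for illustration, not an entry from the CCMI-2022 CV.

```python
from pathlib import PurePosixPath

# Fixed leading directories and facet order, as given in the template above.
PREFIX = ("ccmi-2022", "data")
FACETS = (
    "mip_era", "activity_id", "institution_id", "source_id",
    "experiment_id", "member_id", "table_id", "variable_id",
    "grid_label", "version",
)


def parse_archive_path(path):
    """Map each directory level of an archive path onto its facet name."""
    parts = PurePosixPath(path).parts
    if parts[:len(PREFIX)] != PREFIX:
        raise ValueError(f"path does not start with {'/'.join(PREFIX)}")
    values = parts[len(PREFIX):]
    if len(values) != len(FACETS):
        raise ValueError(f"expected {len(FACETS)} facets, found {len(values)}")
    return dict(zip(FACETS, values))


# Placeholder values only; they are not real CV entries.
example_path = (
    "ccmi-2022/data/post-cmip6/example-activity/example-institution/"
    "example-model/example-experiment/r1i1p1f1/Amon/toz/gn/v20210601"
)
facets = parse_archive_path(example_path)
print(facets["mip_era"], facets["version"])  # post-cmip6 v20210601
```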

charliepascoe commented 3 years ago

The ccmi-2022 CV has mip_era set to cmip6. Our archive structure has mip_era set to post-cmip6. Is it a problem for downstream software if they are not the same? https://github.com/cedadev/ccmi-2022/issues/8