OBS data cmorization procedure, checklist and current issues

valeriupredoi commented 5 years ago

Summary

This is to summarize the procedures for cmorizing and including new OBS data. This is meant for those who are not aware of the procedures just yet. It also raises a few points that need to be discussed.

Approach

get the RAW data
write a cmorizer script for that data based on existing scripts as seen in cmorizers; remember there is a set of utilities that you can use, depending on whether the script is Python or NCL (utilities.py or utilities.ncl);
open a github issue summarizing your requirements and actions and a pull request with your new cmorizer script; assign @mattiarighi as reviewer to your PR for scientific and code review (also assign @bouweandela or @valeriupredoi or @jvegasbsc for Python code review)
follow the checklist below;

Checklist (by @mattiarighi as posted in #931 )

As checklist for the observation cmorizer:

test the cmorized data using the check recipe recipes/example/recipe_check_obs.yml, to make sure Iris can read them without errors
add the new dataset to the table in the documentation
inform @mattiarighi that a new dataset is available, it will be added to the OBS data pool at DKRZ and synchronized with Jasmin
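
As a quick pre-check before running the check recipe, something like the sketch below could flag files that Iris cannot load. It does not replace recipes/example/recipe_check_obs.yml. The helper name `find_unreadable` and its `loader` argument are hypothetical and only serve the illustration; the default loader is `iris.load`, imported lazily.

```python
def find_unreadable(files, loader=None):
    """Return {path: reason} for every file the loader fails to read.

    Hypothetical helper, not part of ESMValTool.
    """
    if loader is None:
        import iris  # imported lazily; only needed for the default loader
        loader = iris.load
    failures = {}
    for path in files:
        try:
            cubes = loader(str(path))
        except Exception as exc:  # record any load error and keep going
            failures[str(path)] = f"{type(exc).__name__}: {exc}"
            continue
        if len(cubes) == 0:
            failures[str(path)] = "no cubes loaded"
    return failures
```

Calling `find_unreadable(list_of_cmorized_files)` with no second argument uses Iris; an empty result means every file loaded into at least one cube.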

Data export and availability

new cmorized data should be available on DKRZ and, by virtue of mirroring, on the Jasmin/esmeval group workspace (main points of contact for mirroring: @bjoernbroetz and @mattiarighi);
on Jasmin the data is to be found in /group_workspaces/jasmin4/esmeval/obsdata-v2; you will need membership to the group (see https://accounts.jasmin.ac.uk/services/group_workspaces/esmeval/ ); @valeriupredoi is the admin of the group and if you apply for membership you should be approved in a jiffy; @jvegasbsc has recently allowed group write access to the repo;
on DKRZ: the data is to be found in /mnt/lustre02/work/bd0854/DATA/ESMValTool2/OBS; you will need membership to the group (contact @axel-lauer or @mattiarighi)
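
To see whether your account can already reach either pool, a small check along these lines may help. It is only a sketch: the function name `pool_access` is hypothetical, and the two paths are copied from the list above and may have changed since.

```python
import os

# Paths as quoted in this issue; treat them as assumptions, not current facts.
OBS_POOLS = {
    "Jasmin": "/group_workspaces/jasmin4/esmeval/obsdata-v2",
    "DKRZ": "/mnt/lustre02/work/bd0854/DATA/ESMValTool2/OBS",
}


def pool_access(pools):
    """Map each site name to a short access status for its directory."""
    status = {}
    for site, path in pools.items():
        if not os.path.isdir(path):
            # Also the result when a parent directory is not traversable.
            status[site] = "missing or not visible"
        elif os.access(path, os.R_OK | os.X_OK):
            # Read + execute is what is needed to list and enter a directory.
            status[site] = "ok"
        else:
            status[site] = "no permission"
    return status


if __name__ == "__main__":
    for site, state in pool_access(OBS_POOLS).items():
        print(f"{site}: {state}")
```

A status other than "ok" on the machine that hosts the pool usually means group membership is still missing, so the contacts named above are the next step.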

Current issues

Q: How do we ensure that new data from Jasmin/esmeval gets mirrored to DKRZ? A: Each new dataset needs a CMORizer, which would need to go through a PR and a review. I always add new data to the DKRZ pool when testing the PRs. In that sense, DKRZ is the primary source and Jasmin the mirrored one...

mattiarighi commented 5 years ago

Thanks! Can you check whether this is consistent with what I wrote in the corresponding section in the user guide? It should be, but maybe I missed something.

inform @mattiarighi that a new dataset is available, it will be added to the OBS data pool at DKRZ and synchronized with Jasmin

For that, you can assign me as a reviewer of the corresponding PR.

How do we ensure that new data from Jasmin/esmeval gets mirrored to DKRZ?

Each new dataset needs a CMORizer, which would need to go through a PR and a review. I always add new data to the DKRZ pool when testing the PRs. In that sense, DKRZ is the primary source and Jasmin the mirrored one...

jvegreg commented 5 years ago

About the raw data: should we put this in a centralized location on JASMIN or DKRZ for testing the pull requests? We must still check whether the download instructions are OK, but it is better to do that by downloading one year instead of 50.

valeriupredoi commented 5 years ago

I'd argue the raw data should be downloaded from the remote source, last thing we want is to maintain a DB of raw data and God forbid if we don't account for their latest versions and someone publishes results using an obsolete version - plus, there isn't much room on Jasmin. We could have samples of raw data locally stored on Jasmin and DKRZ for testing the cmorization. Speaking of which - we don't really have a quality control for the cmorization itself - we should probably think of getting a set of QC routines for the output data itself. I can think of a set of already-available code

jvegreg commented 5 years ago

I was just talking about keeping the full RAW data in a common place before merging, to make it easier for other people to test and contribute.

mattiarighi commented 4 years ago

This is in the documentation now, can we close? @valeriupredoi

bouweandela commented 4 years ago

I don't think the procedure for getting access to the data on Jasmin and DKRZ is in the documentation yet

valeriupredoi commented 3 years ago

the procedure is now mentioned in the doc header for each cmorizer script, whether it be JASMIN or elsewhere, so I am gonna close, pls reopen if deemed unresolved :+1:

bouweandela commented 3 years ago

I meant the procedure for getting access to the already cmorized data available on Jasmin and mistral. Is that documented anywhere?

remi-kazeroni commented 3 years ago

I meant the procedure for getting access to the already cmorized data available on Jasmin and mistral. Is that documented anywhere?

I have not found that in the documentation. I guess the only way at the moment is to look into the config-user.yml file and then discuss with your colleagues administrating the shared directories on Mistral and Jasmin to gain access to the Tier2 and/or Tier3 data.

bouweandela commented 3 years ago

Yes, it would be nice to advertise a bit more clearly who those colleagues are and how to contact them, because we do get questions from people who are trying to cmorize all that data themselves, while they could just have used the already available data.

valeriupredoi commented 3 years ago

OK I'll open a PR with documentation for that: on JASMIN I have myself and I need another bod in case I am on holidays in the wilderness (second admin of esmeval GWS is unreliable unfortunately) - can I volunteer any of Remi or Klaus, although I believe the second person should also be an admin of esmeval. For DKRZ we have Remi and who's the second fella?

remi-kazeroni commented 3 years ago

Thanks for reviving this V! For Jasmin, I don't think I'd be of any help because I never run computations there. It would be more appropriate to have someone using resources there. For DKRZ, I guess it should be @axel-lauer and myself. Nevertheless, the situation is quite different from Jasmin. Since we (almost) only have Tier2 data on Jasmin, there is no license issue when giving access to obs data. It is less clear to me how widely we can give access to Tier3 data on Mistral. Note that Mistral Tier2 data are already readable by any user of the machine.

Also, if someone is granted access to esmeval on Jasmin, does the user get automatically access to computing resources (cpu time, ...)? At DKRZ access to the shared groupspace is not decorrelated from resources. So giving access to our groupspace where obs data are stored would grant the user access to our computing time, disk space,... We may rather need to give user read access to Tier3 data on a case to case basis. So, you can start with the PR and we will see how we can contribute for the DKRZ part.

valeriupredoi commented 3 years ago

Also, if someone is granted access to esmeval on Jasmin, does the user get automatically access to computing resources (cpu time, ...)?

No, they get full access (read + execute) to the files inside esmeval, no other compute privileges, those are standard socialist ones for any user that gets access to JASMIN, unless they request access to SLURM high-mem nodes or high performance data transfer nodes, but that's after a special request and evaluation process. The problem with access to esmeval is that they get access to Tier3 data, and I, as a GWS admin, have no way of tracking what they do with it, but I guess that's off my responsibilities/duty anyway.

OK I'll get cracking on the PR then! :beer:

schlunma commented 2 years ago

Moving this to v2.6 since the corresponding PR moved, too.

remi-kazeroni commented 2 years ago

Here is another thought on the Tier3 data accessibility issue. ERA data (raw format) are readily available on Mistral for all users. Quoting the DKRZ documentation: "ERA5 data is open access and free to download for all uses, including commercial use, after agreement to the terms of use. ERA Interim and all older ERA versions (ERA 40, 15) are now under “CC by” license, so they can be used and shared if cited properly."

Basically, anyone with an account on Mistral can access the ERA data. It is the user's responsibility to check the requirements to use ERA5 data and click on the link to sign the terms of use.

I was wondering if that could be an example of how we could open the access to our Tier3 datasets.

@valeriupredoi, out of curiosity: would you know if similar documentation exists for using ERA5 data on Jasmin? I think I heard the raw data are also publicly available to the users of the machine.

bouweandela commented 2 years ago

Note that ERA-Interim changed from Tier 3 to Tier 2 some time ago: https://github.com/ESMValGroup/ESMValTool/issues/1780.

sloosvel commented 2 years ago

How is the draft PR going? Can this be included in 2.6?

remi-kazeroni commented 2 years ago

@axel-lauer, do you think we could proceed with writing in the docs how users at DKRZ/Jasmin/IPSL can get access to our pools of cmorizer data (see #2385 for a draft)? Would it be possible to document that now or shall this wait until after a legal team has been consulted as discussed during the last workshop?

sloosvel commented 2 years ago

If discussion is still needed I would rather take this out from the milestone, sorry! Feel free to add it back if you are ready.

axel-lauer commented 2 years ago

@remi-kazeroni I am actually a little bit hesitant to encourage people to get access to Jasmin/DKRZ/etc to access the obs data. At least for DKRZ, that would typically also imply adding those people to our computing project, which I am not a fan of. I guess at this point I would prefer to keep this rather quiet until we found a more general solution.

valeriupredoi commented 2 years ago

this seems to be an ongoing debate so I'll move one click up to 2.8

remi-kazeroni commented 1 year ago

I'm bumping this to the next milestone. I think we first need to finish collecting feedback in the discussion that @rswamina and myself have initiated after the last workshop. I would invite everyone to contribute to the document that is linked from this discussion https://github.com/ESMValGroup/Community/discussions/70.

The feedback will be used to update our documentation and this could hopefully go into the v2.9 release.

zklaus commented 1 year ago

I know that in the meantime there were some changes in the documentation, but with changes in staffing, this needs a thorough review and should be tackled for the next milestone.

mo-gill commented 6 months ago

Hi, we are currently working on the ESMValTool release for v2.11.0. We're wondering if you'd be able to finalise this issue by the end of next week (Friday 10th May).

Otherwise, please let us know, and we'll move it into the next milestone for you :slightly_smiling_face:

bouweandela commented 6 months ago

This issue is ancient, mostly done, and many people in this discussion have now moved on to new jobs. Please open a new issue if more work is still needed.

ESMValGroup / ESMValTool

OBS data cmorization procedure, checklist and current issues #1072