Closed valeriupredoi closed 6 months ago
Thanks! Can you check whether this is consistent with what I wrote in the corresponding section in the user guide? It should, but maybe I missed something.
inform @mattiarighi that a new dataset is available, it will be added to the OBS data pool at DKRZ and synchronized with Jasmin
For that, you can assign me as a reviewer of the corresponding PR.
How do we ensure that new data from Jasmin/esmeval gets mirrored to DKRZ?
Each new dataset needs a CMORizer, which would need to go through a PR and a review. I always add new data to the DKRZ pool when testing the PRs. In that sense, DKRZ is the primary source and Jasmin the mirrored one...
About the raw data: should we put it in a centralized location on JASMIN or DKRZ for testing the pull requests? We must still check that the download instructions are OK, but it is better to do that by downloading one year instead of 50.
I'd argue the raw data should be downloaded from the remote source. The last thing we want is to maintain a DB of raw data, and God forbid we fail to track the latest versions and someone publishes results using an obsolete version. Plus, there isn't much room on Jasmin. We could have samples of raw data stored locally on Jasmin and DKRZ for testing the cmorization. Speaking of which, we don't really have quality control for the cmorization itself; we should probably think of getting a set of QC routines for the output data. I can think of a set of already-available code
I was just talking about keeping the full RAW data in a common place before merging, to make it easier for other people to test and contribute.
This is in the documentation now, can we close? @valeriupredoi
I don't think the procedure for getting access to the data on Jasmin and DKRZ is in the documentation yet
the procedure is now mentioned in the doc header for each cmorizer script, whether it be JASMIN or elsewhere, so I am gonna close, pls reopen if deemed unresolved :+1:
I meant the procedure for getting access to the already cmorized data available on Jasmin and mistral. Is that documented anywhere?
I have not found that in the documentation. I guess the only way at the moment is to look into the config-user.yml file and then discuss with your colleagues administrating the shared directories on Mistral and Jasmin to gain access to the Tier2 and/or Tier3 data.
Yes, it would be nice to advertise a bit more clearly who those colleagues are and how to contact them, because we do get questions from people who are trying to cmorize all that data themselves, while they could just have used the already available data.
OK I'll open a PR with documentation for that: on JASMIN I have myself and I need another bod in case I am on holidays in the wilderness (the second admin of the esmeval GWS is unreliable unfortunately) - can I volunteer any of Remi or Klaus, although I believe the second person should also be an admin of esmeval. For DKRZ we have Remi and who's the second fella?
Thanks for reviving this V! For Jasmin, I don't think I'd be of any help because I never run computations there. It would be more appropriate to have someone using resources there. For DKRZ, I guess it should be @axel-lauer and myself. Nevertheless, the situation is quite different from Jasmin. Since we (almost) only have Tier2 data on Jasmin, there is no license issue when giving access to obs data. It is less clear to me how widely we can give access to Tier3 data on Mistral. Note that Mistral Tier2 data are already readable by any user of the machine.
Also, if someone is granted access to esmeval on Jasmin, does the user get automatically access to computing resources (cpu time, ...)? At DKRZ access to the shared groupspace is not decorrelated from resources. So giving access to our groupspace where obs data are stored would grant the user access to our computing time, disk space,... We may rather need to give users read access to Tier3 data on a case-by-case basis. So, you can start with the PR and we will see how we can contribute for the DKRZ part.
Also, if someone is granted access to esmeval on Jasmin, does the user get automatically access to computing resources (cpu time, ...)?
No, they get full access (read + execute) to the files inside esmeval, no other compute privileges; those are the standard socialist ones for any user that gets access to JASMIN, unless they request access to SLURM high-mem nodes or high performance data transfer nodes, but that's after a special request and evaluation process. The problem with access to esmeval is that they get access to Tier3 data, and I, as a GWS admin, have no way of tracking what they do with it, but I guess that's off my responsibilities/duty anyway.
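For anyone unsure whether they already have the read + execute access described above, here is a minimal sketch of a check using only the Python standard library. The two paths are the pool locations given in the issue description; the site labels and the helper name `can_read_pool` are illustrative assumptions, not part of ESMValTool.

```python
import os


def can_read_pool(path):
    """Return True if `path` is a directory the current user can list and read."""
    return os.path.isdir(path) and os.access(path, os.R_OK | os.X_OK)


# Pool locations as given in the issue description; the site labels are
# inferred from the discussion and may need adjusting.
POOLS = {
    "Jasmin": "/group_workspaces/jasmin4/esmeval/obsdata-v2",
    "DKRZ": "/mnt/lustre02/work/bd0854/DATA/ESMValTool2/OBS",
}

if __name__ == "__main__":
    for site, path in POOLS.items():
        status = "readable" if can_read_pool(path) else "not accessible"
        print(f"{site}: {path} is {status}")
```

A "not accessible" result only says that the directory is missing or not readable from the current machine and account; it does not say whether group membership has been requested or granted.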
OK I'll get cracking on the PR then! :beer:
Moving this to v2.6 since the corresponding PR moved, too.
Here is another thought on the Tier3 data accessibility issue. ERA data (raw format) are readily available on Mistral for all users. Quoting the DKRZ documentation: "ERA5 data is open access and free to download for all uses, including commercial use, after agreement to the terms of use. ERA Interim and all older ERA versions (ERA 40, 15) are now under “CC by” license, so they can be used and shared if cited properly."
Basically, anyone with an account on Mistral can access the ERA data. It is the user's responsibility to check the requirements to use ERA5 data and click on the link to sign the terms of use.
I was wondering if that could be an example of how we could open the access to our Tier3 datasets.
@valeriupredoi, out of curiosity: would you know if similar documentation exists for using ERA5 data on Jasmin? I think I heard the raw data are also publicly available to the users of the machine.
Note that ERA-Interim changed from Tier 3 to Tier 2 some time ago: https://github.com/ESMValGroup/ESMValTool/issues/1780.
How is the draft PR going? Can this be included in 2.6?
@axel-lauer, do you think we could proceed with writing in the docs how users at DKRZ/Jasmin/IPSL can get access to our pools of cmorizer data (see #2385 for a draft)? Would it be possible to document that now or shall this wait until after a legal team has been consulted as discussed during the last workshop?
If discussion is still needed I would rather take this out from the milestone, sorry! Feel free to add it back if you are ready.
@remi-kazeroni I am actually a little bit hesitant to encourage people to get access to Jasmin/DKRZ/etc to access the obs data. At least for DKRZ, that would typically also imply adding those people to our computing project, which I am not a fan of. I guess at this point I would prefer to keep this rather quiet until we have found a more general solution.
this seems to be an ongoing debate so I'll move one click up to 2.8
I'm bumping this to the next milestone. I think we first need to finish collecting feedback in the discussion that @rswamina and myself have initiated after the last workshop. I would invite everyone to contribute to the document that is linked from this discussion https://github.com/ESMValGroup/Community/discussions/70.
The feedback will be used to update our documentation and this could hopefully go into the v2.9 release.
I know that in the meantime there were some changes in the documentation, but with changes in staffing, this needs a thorough review and should be tackled for the next milestone.
Hi, we are currently working on the ESMValTool release for v2.11.0. We're wondering if you'd be able to finalise this issue by the end of next week (Friday 10th May).
Otherwise, please let us know, and we'll move it into the next milestone for you :slightly_smiling_face:
This issue is ancient, mostly done, and many people in this discussion have now moved on to new jobs. Please open a new issue if more work is still needed.
Summary
This is to summarize the procedures for cmorizing and including new OBS data. This is meant for those who are not aware of the procedures just yet. It also raises a few points that need to be discussed.
Approach
utilities.py or utilities.ncl);
Checklist (by @mattiarighi as posted in #931)
As a checklist for the observation cmorizer:
recipes/example/recipe_check_obs.yml, to make sure Iris can read them without errors
Data export and availability
/group_workspaces/jasmin4/esmeval/obsdata-v2; you will need membership of the group (see https://accounts.jasmin.ac.uk/services/group_workspaces/esmeval/); @valeriupredoi is the admin of the group and if you apply for membership you should be approved in a jiffy; @jvegasbsc has recently allowed group write access to the repo;
/mnt/lustre02/work/bd0854/DATA/ESMValTool2/OBS; you will need membership of the group (contact @axel-lauer or @mattiarighi)
Current issues
Q: How do we ensure that new data from Jasmin/esmeval gets mirrored to DKRZ? A: Each new dataset needs a CMORizer, which would need to go through a PR and a review. I always add new data to the DKRZ pool when testing the PRs. In that sense, DKRZ is the primary source and Jasmin the mirrored one...
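One possible way to check that the mirror has not fallen behind the primary pool is to compare the file listings of the two trees. The sketch below is only an illustration under the assumption that both directory trees are visible from one machine (or that a copy of each listing has been brought to the same place); the function names `list_files` and `missing_on_mirror` are hypothetical and not part of ESMValTool.

```python
import os


def list_files(root):
    """Return the set of file paths under `root`, relative to `root`."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.add(os.path.relpath(full, root))
    return found


def missing_on_mirror(primary_root, mirror_root):
    """Return sorted relative paths present in the primary pool but not in the mirror."""
    return sorted(list_files(primary_root) - list_files(mirror_root))
```

Calling `missing_on_mirror(primary, mirror)` with the DKRZ pool as `primary` and the Jasmin pool as `mirror` would list the files that still need to be synchronized. It compares names only, so a file that exists on both sides but differs in content or version would not be reported.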