ESMValGroup / ESMValTool

ESMValTool: A community diagnostic and performance metrics tool for routine evaluation of Earth system models in CMIP
https://www.esmvaltool.org
Apache License 2.0

Recipe testing and comparison for release 2.7.0 #2881

Closed valeriupredoi closed 1 year ago

valeriupredoi commented 1 year ago

Sister and logical evolution of #2852 - I am commencing testing and comparison of recipes and their results in order to release 2.7.0 at the end of this week (hopefully). System parameters are below; the work is done on DKRZ/Levante, with submit files in /home/b/b382109/submit and output in /scratch/b/b382109/esmvaltool_output

System and settings

conda/mamba

(base) mamba --version
mamba 0.27.0
conda 22.9.0

Git branch and state

Date: 25 October 2022 14:22 BST

(base) git status
On branch release_270stable
Your branch is up to date with 'origin/release_270stable'.

nothing to commit, working tree clean

Environment

On Levante:

mamba env create -n tool270Test -f environment.yml
conda activate tool270Test

Environment file

ToolEnv270Test.yml

Extraneous file movements

I moved the autoassess-specific files to /home/b/b382109/autoassess_files - the run was successful for the AA recipes then :+1:

Ad-hoc hacks (code changes)

Mods to config user file

Added DKRZ downloaded data pool as:

  CMIP6:
    - /work/bd0854/DATA/ESMValTool2/CMIP6_DKRZ
    - /work/bd0854/DATA/ESMValTool2/download/CMIP6
  CMIP5:
    - /work/bd0854/DATA/ESMValTool2/CMIP5_DKRZ
    - /work/bd0854/b309141/additional_CMIP5
    - /work/bd0854/DATA/ESMValTool2/download/cmip5/output1
    - /work/bd0854/DATA/ESMValTool2/download/cmip5

as @schlunma and @remi-kazeroni have suggested :beer:
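Since several of the Missing Data failures in this thread came down to rootpath entries, a quick pre-flight check of the configured directories may save a round of failed jobs. The following is only a minimal sketch: the function name `check_rootpaths` is hypothetical (it is not part of ESMValTool), and it only tests that each directory exists and is readable, not that the data inside is complete.

```sh
# Hypothetical helper (not part of ESMValTool): report which data pool
# directories exist and are readable from the current machine.
check_rootpaths() {
    missing=0
    for p in "$@"; do
        if [ -d "$p" ] && [ -r "$p" ]; then
            echo "ok $p"
        else
            echo "MISSING $p"
            missing=1
        fi
    done
    # Non-zero exit status if at least one path is missing or unreadable.
    return "$missing"
}

# Example call with two of the paths listed above (run on Levante):
#   check_rootpaths /work/bd0854/DATA/ESMValTool2/CMIP6_DKRZ \
#                   /work/bd0854/DATA/ESMValTool2/download/CMIP6
```

The non-zero exit status makes it easy to stop a submit loop early if a pool is not mounted or not readable.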

Recipe runs

Recipe runs results (as of final on 27 October 2022) are listed in https://github.com/ESMValGroup/ESMValTool/issues/2881#issuecomment-1291878142 (with very many thanks to @remi-kazeroni for running the impossible to run ones!) and are as follows:

(*) the count depends on whether the one recipe that had a DiagnosticError (since fixed, but not yet PR-ed) is included

Running the comparison

Login and access to the DKRZ esmvaltool VM

Results from recipe runs are stored on the VM; login with:

ssh youraccount@esmvaltool.dkrz.de

Get and install miniconda on VM

E.g. scp Miniconda3-py39_4.12.0-Linux-x86_64.sh b382109@esmvaltool.dkrz.de:~ from a file already on Levante.

Setting up the input files

If you wrote recipe runs output to Levante /scratch partition be aware that the data will be removed after two weeks, so you will have to move the output data to the /work partition, via e.g. a nohup job:

nohup cp -r /scratch/b/b382109/esmvaltool_output/* /work/bd0854/b382109/v270

/work is visible by the VM so you can run the compare tool straight on the VM.

NOTE do not store final release results on the VM including /preproc/ dirs, the total size for all the recipes output, including /preproc/ dirs is in the 4.5TB ballpark, much too high for the VM storage capacity

Running compare tool at VM

Input/output/run

Sanity check, as outputted by compare.py

Comparing recipe run(s) in:
/work/bd0854/b382109/v270
to reference in /mnt/esmvaltool_disk2/shared/esmvaltool/v2.6.0rc4

First pass result

Running the compare.py results in a few recipes not-OK (NOK) wrt plots differing from previous release v2.6.0, summary in https://github.com/ESMValGroup/ESMValTool/issues/2881#issuecomment-1294735465

Detailed plots inspection

Inspection of the differing plots, for the 34 recipes that have them, is happening in https://github.com/ESMValGroup/ESMValTool/issues/2881#issuecomment-1295001054

valeriupredoi commented 1 year ago

@sloosvel I am in dire pain after realizing blithering DKRZ's SLURM emails me for every recipe :face_with_spiral_eyes:

valeriupredoi commented 1 year ago

@sloosvel what's these jobs up to?

(tool270Test) squeue -u b382109
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           2378977   compute recipe_z  b382109 PD       0:00      1 (AssocMaxJobsLimit)
           2378976   compute recipe_w  b382109 PD       0:00      1 (AssocMaxJobsLimit)
           2378975   compute recipe_w  b382109 PD       0:00      1 (AssocMaxJobsLimit)
           2378974   compute recipe_w  b382109 PD       0:00      1 (AssocMaxJobsLimit)

sloosvel commented 1 year ago

@sloosvel I am in dire pain after realizing blithering DKRZ's SLURM emails me for every recipe :face_with_spiral_eyes:

You can comment that if it's not useful to you, to me it was!

@sloosvel what's these jobs up to?

I think there is a limit on the number of jobs an account can run simultaneously on Levante. They will be pending until other jobs finish I guess

remi-kazeroni commented 1 year ago

@sloosvel what's these jobs up to?

On Levante, a user can't have more than 20 Slurm jobs running at a time. As soon as a job is finished, the next one should start
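To see at a glance how many jobs are held back and why, the squeue listing can be summarized per pending reason. This is a rough sketch that assumes the default squeue column layout shown earlier in this thread (state in column 5, reason in column 8); the function name `count_pending_by_reason` is made up for illustration.

```sh
# Hypothetical helper: read `squeue` output on stdin and count pending (PD)
# jobs per reason. Assumes the default column layout:
# JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
count_pending_by_reason() {
    awk 'NR > 1 && $5 == "PD" {
             reason = $8
             gsub(/[()]/, "", reason)   # "(AssocMaxJobsLimit)" -> "AssocMaxJobsLimit"
             n[reason]++
         }
         END { for (r in n) print r, n[r] }' | sort
}

# Example (run on Levante):
#   squeue -u "$USER" | count_pending_by_reason
```

Jobs listed under AssocMaxJobsLimit are simply waiting for a slot and should start on their own as earlier jobs finish.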

valeriupredoi commented 1 year ago

They will be pending until other jobs finish I guess

Cheers! More emails then :man_facepalming: :rofl:

valeriupredoi commented 1 year ago

OK guys - first (and only) sbatch session over on Levante (I have one stray recipe still running, it's a zombie though) and this is how it looks:

Recipe running session 2022-10-26 13:13:41.568698

Successfully run recipes

122 out of 127 final

Recipes that failed with DiagnosticError

0 out of 127 (1 fixed, not PR-ed yet)

Recipes that failed of Missing Data

2 out of 127 final

Recipes that failed of other reasons

3 out of 127 final

Obsolete/resolved issues comment:

The Julia ones are totally my bad - I forgot to install Julia after installing esmvaltool. The autoassess ones either hit the old bug that @alistairsellar is fixing now, or need aux data that is only on JASMIN. The Missing Data ones are bothering me badly, since I have turned on auto downloads but they are still missing data - what do you guys recommend doing about those? @sloosvel @remi-kazeroni @bouweandela ? I will post detailed postmortems for the ones that have failed for odd reasons below :+1:

valeriupredoi commented 1 year ago

Postmortem of failed recipes OTHER THAN Missing Data

Recipes that failed with DiagnosticError

0 out of 127 (1 fixed, not yet PR-ed)

Recipes that failed of other reasons or are still running

1 out of 127

remi-kazeroni commented 1 year ago

Hi @valeriupredoi, great job with the testing! I forgot to mention it, but we have a central pool of downloaded data on Levante at /work/bd0854/DATA/ESMValTool2/download/CMIP6, /work/bd0854/DATA/ESMValTool2/download/cmip5/output1, and /work/bd0854/DATA/ESMValTool2/download/cmip5/output1. Maybe you could add those to your path on top of your download directory? This should help solve the time limit issues (lots of fx files searched on ESGF and/or downloaded I guess).

remi-kazeroni commented 1 year ago

recipe_smpi.yml - too slow Elapsed time : 04:00:19 (Timelimit=04:00:00)

For this one, I would recommend using:

#SBATCH --partition=compute
#SBATCH --time=08:00:00
#SBATCH --constraint=512G

valeriupredoi commented 1 year ago

Indeed, cheers @remi-kazeroni - smpi is a memory gobbler - I restarted it on SLURM and promptly got kicked out coz mem limit (this time around I think all data has been downloaded, hence it went to intensive processing). I'll resubmit with mem reqs. What do you recommend about those that really-really are missing data?

valeriupredoi commented 1 year ago

recipe_smpi.yml - too slow Elapsed time : 04:00:19 (Timelimit=04:00:00)

For this one, I would recommend using:

#SBATCH --partition=compute
#SBATCH --time=08:00:00
#SBATCH --constraint=512G

even with 512G still fails out of MEM :open_mouth:

valeriupredoi commented 1 year ago

oh crap, forgot to change the partition :face_in_clouds:

remi-kazeroni commented 1 year ago

recipe_smpi.yml - too slow Elapsed time : 04:00:19 (Timelimit=04:00:00)

For this one, I would recommend using:

#SBATCH --partition=compute
#SBATCH --time=08:00:00
#SBATCH --constraint=512G

even with 512G still fails out of MEM 😮

You can try with 1024G then! But that's the highest available

valeriupredoi commented 1 year ago

recipe_smpi.yml - too slow Elapsed time : 04:00:19 (Timelimit=04:00:00)

For this one, I would recommend using:

#SBATCH --partition=compute
#SBATCH --time=08:00:00
#SBATCH --constraint=512G

even with 512G still fails out of MEM :open_mouth:

You can try with 1024G then! But that's the highest available

totally user-side - forgot to change the partition to compute - cheers, dude! :beer:
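Since the partition was the setting that got forgotten here, generating the submit file from one place may help keep partition, time limit and memory constraint together. The sketch below is only an illustration: `make_submit_script` is a made-up name, the three `#SBATCH` values are the ones recommended above, `--mail-type=NONE` is added on the assumption that the per-recipe emails are not wanted, and any `--account` line your project needs is left out.

```sh
# Hypothetical helper: print a Slurm submit script for one recipe.
# Arguments: recipe file, time limit, memory constraint (e.g. 512G).
make_submit_script() {
    recipe=$1
    time_limit=$2
    mem_constraint=$3
    cat <<EOF
#!/bin/bash
#SBATCH --partition=compute
#SBATCH --time=${time_limit}
#SBATCH --constraint=${mem_constraint}
#SBATCH --mail-type=NONE

esmvaltool run ${recipe}
EOF
}

# Example (run on Levante):
#   make_submit_script recipe_smpi.yml 08:00:00 512G > smpi_submit.sh
#   sbatch smpi_submit.sh
```

Printing to stdout rather than writing a file directly makes it easy to eyeball the header before submitting.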

sloosvel commented 1 year ago

I never managed to run the smpi recipes, @remi-kazeroni did it for me in the last release. Maybe the batch script settings for this recipe can be changed in #2883

valeriupredoi commented 1 year ago

with correct SLURM settings as recommended by @remi-kazeroni (:beer:) those smpi monsters are happily plodding along now - yes, we should change the settings for sure. @sloosvel how did you fix the runs for those recipes that really-really dont have data, like I found in https://github.com/ESMValGroup/ESMValTool/issues/2881#issuecomment-1291878142

remi-kazeroni commented 1 year ago

I don't have a definitive answer for the really-really missing data cases. As said in this comment, you could try to rerun the recipes adding these paths to your config file. But that data pool is 2 releases old. One could argue that we should delete it and re-download everything as /work/bd0854/DATA/ESMValTool2/download/ may contain data retracted from ESGF...

Taking a closer look at some of these (currently) 13 cases:

sloosvel commented 1 year ago

I think for recipe_climate_change_hotspot.yml, I ended up running it on JASMIN

valeriupredoi commented 1 year ago

Hi @remi-kazeroni @sloosvel awesome, thanks a lot! Here's the thing(s):

I'll have a closer look at the meeh and schlund ones, and will ping @schlunma asap

katjaweigel commented 1 year ago

Yes, the version of recipe_flato13ipcc.yml currently in #2156 is running. The cost is removing/commenting out the data sets that do not work on Levante (and fixing a wrong time period for one model). There was already some discussion on how to deal with such cases, and if I remember right @axel-lauer, who is the maintainer of the original recipe_flato13ipcc.yml, did not agree on removing data sets? It should also be noted that the option --skip_nonexistent does not work for all diagnostics in recipe_flato13ipcc.yml, because several of them need data sets from e.g. two different experiments and do not work if only one is there. Therefore I was going to ask in this issue which version of recipe_flato13ipcc.yml should end up in #2156. (Unfortunately I'm also not completely done with some issues in recipe_flato13ipcc_figures_938_941.yml; I hope to finish them soon.)

schlunma commented 1 year ago

V, can you adapt the permissions of /scratch/b/b382109/esmvaltool_output so I can have a look at the logs?

valeriupredoi commented 1 year ago

/scratch/b/b382109/esmvaltool_output

@schlunma Manu, they are here /home/b/b382109/manu_logs

valeriupredoi commented 1 year ago

Yes, the version of recipe_flato13ipcc.yml currently in #2156 is running. The cost is removing/commenting out the data sets that do not work on Levante (and fixing a wrong time period for one model). There was already some discussion on how to deal with such cases, and if I remember right @axel-lauer, who is the maintainer of the original recipe_flato13ipcc.yml, did not agree on removing data sets? It should also be noted that the option --skip_nonexistent does not work for all diagnostics in recipe_flato13ipcc.yml, because several of them need data sets from e.g. two different experiments and do not work if only one is there. Therefore I was going to ask in this issue which version of recipe_flato13ipcc.yml should end up in #2156. (Unfortunately I'm also not completely done with some issues in recipe_flato13ipcc_figures_938_941.yml; I hope to finish them soon.)

@katjaweigel many thanks for your clarification! I will consider this recipe at-risk for now, and will not faff about it until you guys fix it - not the first and not the last time we include not really fully working recipes in a release :grin:

schlunma commented 1 year ago

cd: permission denied: /home/b/b382109/manu_logs :cry:

valeriupredoi commented 1 year ago

cd: permission denied: /home/b/b382109/manu_logs :cry:

bugger! :face_exhaling: Here they are, bud

meeh_log.txt schlund20esd_log.txt

schlunma commented 1 year ago

I just manually searched for the files on DKRZ's ESGF node and found all files. Not sure what's going on there, but like @remi-kazeroni I would recommend adding our shared pool (/work/bd0854/DATA/ESMValTool2/download/CMIP6) to your config-user.yml file :+1:

schlunma commented 1 year ago

* recipe_anav13jclim.yml - this is not optimal if "special" cmip5 data is needed, that is not available on ESGF - I would add this recipe to the list of those we have to see what to do about it wrt obsolete data

It's not "special" CMIP5 data, it's rather that our DRS do not take output into account. See discussion here: https://github.com/ESMValGroup/ESMValTool/issues/2408#issuecomment-1049955903

valeriupredoi commented 1 year ago

I just manually searched for the files on DKRZ's ESGF node and found all files. Not sure what's going on there, but like @remi-kazeroni I would recommend adding our shared pool (/work/bd0854/DATA/ESMValTool2/download/CMIP6) to your config-user.yml file :+1:

cheers, bud! Added, firing those up now :rocket:

valeriupredoi commented 1 year ago

OK I added the extra paths and some of them recipes have started plodding along, still - a few doggedly refuse to run still complaining of missing data:

- recipe_anav13jclim.yml
- recipe_check_obs.yml
- recipe_climate_change_hotspot.yml
- recipe_climwip_brunner2019_med.yml
- recipe_collins13ipcc.yml
- recipe_seaice.yml

I'll have to see about running those on JASMIN

schlunma commented 1 year ago

recipe_anav13jclim.yml needs

CMIP5: /work/bd0854/DATA/ESMValTool2/download/cmip5

in addition to the default paths.

valeriupredoi commented 1 year ago

FFS man - what's this - we're gathering data like they're sheep on a field in Wales? What are we gonna do about this - suboptimal data storage to put it politely :angry:

remi-kazeroni commented 1 year ago

I'm rerunning all recipes listed as "Recipes that failed of Missing Data" in this comment and the 2 recipe_bock20jgrfig* listed in the section below but using previously downloaded data in /work/bd0854/DATA/ESMValTool2/download/. Recipe runs can be found in /scratch/b/b309192/esmvaltool_output. Current status:

Running successfully

Failed recipes

@valeriupredoi, feel free to grab the successful runs and put them in your directory.

EDIT: 4 more successes

valeriupredoi commented 1 year ago

Anav dies yet again even with that extra data source :man_facepalming: - JASMIN it is for these showstoppers, only JASMIN is slow like a snail :snail:

schlunma commented 1 year ago

This is just our default download directory on Levante for data that has not been provided by DKRZ directly, Remi mentioned that here.

And as mentioned in my previous comment, anav13 is a special case since data from output2 cannot be read with the default DRS (our fault, not CMIPs!). You could also try:

CMIP5: /work/bd0854/DATA/ESMValTool2/download/cmip5/output2

valeriupredoi commented 1 year ago

@remi-kazeroni that's brilliant! How you managed to get seaice to run is a true mystery, mine failed like 4 times in the past hour :laughing: - what's your path, bud?

remi-kazeroni commented 1 year ago

FFS man - what's this - we're gathering data like they're sheep on a field in Wales? What are we gonna do about this - suboptimal data storage to put it politely 😠

Yeah, I know this is not optimal. But this is the directory into which several developers working on Levante automatically download their data, to avoid having too many copies of the same datasets... Maybe we should revisit that for the next release and not use previously downloaded data.

remi-kazeroni commented 1 year ago

@remi-kazeroni that's brilliant! How you managed to get seaice to run is a true mystery, mine failed like 4 times in the past hour 😆 - what's your path, bud?

For the data:

rootpath:
  CMIP6: [/work/ik1017/CMIP6/data/CMIP6, /work/bd0854/DATA/ESMValTool2/download/CMIP6]
  CMIP5: [/work/kd0956/CMIP5/data/cmip5/output1/, /work/bd0854/DATA/ESMValTool2/download/cmip5/output1, /work/bd0854/DATA/ESMValTool2/download/cmip5/output2]

For the runs: /scratch/b/b309192/esmvaltool_output. So you have the seaice in: /scratch/b/b309192/esmvaltool_output/recipe_seaice_20221026_142333
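When a recipe has been rerun several times, as seaice was here, picking the most recent run directory by hand gets tedious. A small sketch, assuming run directories follow the `<recipe>_<YYYYMMDD>_<HHMMSS>` pattern visible in the path above; the function name `latest_run` is hypothetical, and the newest directory is not necessarily a successful run.

```sh
# Hypothetical helper: print the newest run directory of a recipe.
# Arguments: output directory, recipe name without the .yml suffix.
latest_run() {
    out_dir=$1
    recipe=$2
    for d in "$out_dir"/"$recipe"_[0-9]*_[0-9]*; do
        if [ -d "$d" ]; then
            echo "$d"
        fi
    done | sort | tail -n 1   # timestamps sort correctly as plain text
}

# Example (run on Levante):
#   latest_run /scratch/b/b309192/esmvaltool_output recipe_seaice
```

Requiring a digit right after the recipe name keeps recipes that merely share a prefix out of the match.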

valeriupredoi commented 1 year ago

This is just our default download directory on Levante for data that has not been provided by DKRZ directly, Remi mentioned that here.

And as mentioned in my previous comment, anav13 is a special case since data from output2 cannot be read with the default DRS (our fault, not CMIPs!). You could also try:

CMIP5: /work/bd0854/DATA/ESMValTool2/download/cmip5/output2

We need to get DKRZ on board to organize/populate their data in their ESGF node - this is a hot mess as it is right now - you guys and me are scraping for data like mad. JASMIN has it much better organized, only problem is JASMIN is abysmally slow compared to Levante and lacking memory. If Levante and Jasmin made a baby, then that'd be perfect :grin:

valeriupredoi commented 1 year ago

very many thanks, chaps! I'll let those run (both on me and Remi's partitions) and am off home, tomorrow I'll pick up the results, with all these "missing data" in (or most of them) I'll be able to run the comparison tomorrow - we're on track still :train2:

valeriupredoi commented 1 year ago

@schlunma meeh ran fine with the extra Welsh sheep data, bud! Mee-ha! Off to make dinner 🍕

valeriupredoi commented 1 year ago

OK guys final count 122/127 recipes successfully run - we are legend :beer: Now, on to the comparison dread :grin:

valeriupredoi commented 1 year ago

@sloosvel after trying to make the comparison script work for a bit of a while (not the most straightforward code and with quite a few missed catches, @bouweandela - sorry) I have realized that your runs in /scratch/b/b381943/esmvaltool_output are all empty shells - dir structure is there, but no files at all - can you please tell me what's going on? ASAP, please

valeriupredoi commented 1 year ago

@sloosvel also you ran a whole lot of other recipes on top of the standard ESMValTool release ones there - is there anywhere else where you moved the 2.6.0 release output? (one of the things that tripped the compare script)

sloosvel commented 1 year ago

The outputs you need are in the esmvaltool VM: https://esmvaltool.dkrz.de/shared/esmvaltool/v2.6.0rc4/ . I ran the comparison tool in there because it's where outputs from other versions are. The other outputs are personal work, no need to compare them!

valeriupredoi commented 1 year ago

OK cool! How do I get access to those files via a terminal please - I need to run the compare tool via command line, or is there any other way to do that? :beer:

sloosvel commented 1 year ago

You can log in the machine using your levante credentials: ssh youraccount@esmvaltool.dkrz.de

First move your outputs in /shared/esmvaltool/ (I used rsync excluding the preproc folder for all outputs) and then run the comparison tool against the output for other versions
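The copy step, leaving out the preproc folders, could look roughly like the sketch below. It is not the command that was actually used: rsync would be the natural tool (for example `rsync -a --exclude 'preproc/' SRC/ DST/`), and the sketch swaps in a tar pipe only so that it depends on nothing beyond standard tools. The function name `copy_without_preproc` is hypothetical.

```sh
# Hypothetical helper: copy a tree of recipe run outputs while skipping
# every directory (or file) named "preproc".
# Arguments: source directory, destination directory.
copy_without_preproc() {
    src=$1
    dst=$2
    mkdir -p "$dst"
    # Pack everything except preproc entries and unpack it at the destination.
    tar -C "$src" --exclude='preproc' -cf - . | tar -C "$dst" -xf -
}

# Example (run on the VM; the destination name is only a placeholder):
#   copy_without_preproc /work/bd0854/b382109/v270 /shared/esmvaltool/<release_dir>
```

Skipping preproc at copy time avoids ever landing the multi-TB preprocessor output on the VM disk.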

valeriupredoi commented 1 year ago

Thanks - I will do that now. But I am confused by the lack of standardization and disk backups - I reckon this shouldn't be done for the next release, and the RM should keep the data on the actual Levante disk too eg I am looking at the output from https://esmvaltool.dkrz.de/shared/esmvaltool/v2.6.0/debug.html and see files listed eg /scratch/b/b381943/esmvaltool_output/recipe_autoassess_landsurface_permafrost_20220712_101605/preproc/aa_landsurf_permafrost/tas/CMIP6_ACCESS-CM2_Amon_historical_r1i1p1f1_tas_gn_1992-2002.nc - those don't exist, what happened to them? Are they behind a virtual OS layer?

valeriupredoi commented 1 year ago

OK this is suboptimal to say the very least - I managed to get into the VM but it's barren - I need to create a conda env, and depending on the deps (hopefully not many have changed since two days ago when I created the testing env), we may get different results based on what deps the compare script ingests - I may have to use a condalock for that from the actual Levante env, let alone very heavy data duplication (those outputs even without the preproc/ dirs, which I didn't output anyway, are not small). Why didn't you keep the output on Levante, or run there in the first place?

sloosvel commented 1 year ago

Because I don't have the output for other versions on Levante. Other releases were run on Mistral. The outputs are in the virtual machine. And it's indicated in the documentation anyway: https://docs.esmvaltool.org/en/latest/utils.html#comparing-recipe-runs

valeriupredoi commented 1 year ago

and on top of this all I don't have write permissions to /shared/esmvaltool to move the data - I really don't want to move it in my $HOME on the VM first and then move them back or symlink them; @remi-kazeroni who is supposed to give me rwx+ to that partition please?