EC-Earth / ece2cmor3

Post-processing and cmorization of ec-earth output
Apache License 2.0

cmorisation fails due to multiple loaded modules on HPC #789

Closed klauswyser closed 9 months ago

klauswyser commented 9 months ago

I updated ece2cmor3 this morning and also added the extra cc diagnostics. Then I try cmorising 1 leg of a test experiment with the latest output control files.

The LPJG error is pretty weird, why should there be a nemo subdirectory in lpjg?

The IFS error could possibly be related to #768, but not sure.

The model output (1 year) is available from http://exporter.nsc.liu.se/18929416faf7443497a154b4f2378e11

treerink commented 9 months ago

Concerning the IFS part of your question:

When I look into your output/ifs/001/ dir I do see both files:

ICMGGCD05+000000
ICMSHCD05+000000

So that looks correct. For instance, I get this kind of message as well when I run my test-all test without using restart files. It's an INFO message.

treerink commented 9 months ago

Omon fgco2 & Oyr fgco2 are only active in the ECE3-CC version (following the admin in ece2cmor3/resources/prefs.py). I have to admit that this variable is outside my normal test-all test with the AOGCM. It comes from the PISCES part. In the OptimESM request only Omon fgco2 is requested. It is not at all asked in the LPJG part?

So no idea, an unintended modification in the json file??

klauswyser commented 9 months ago

So no idea, an unintended modification in the json file??

@etiennesky and @nierad, any idea about fgco2? Presumably this variable is only relevant in emission driven experiments, and could then be saved by the CO2 box model, but not sure.

etiennesky commented 9 months ago

fgco2 is absolutely required in both concentration and emission-driven runs - it is the air-sea co2 flux and we need it as a diagnostic.

The issue with this variable is that I had added special code to combine fgco2 with fco2nat (in the Lmon table) to create fco2nat (in the Amon table). But I think we should drop this altogether, so that fco2nat comes only from LPJG, directly in the Amon table.

etiennesky commented 9 months ago

so in summary I recommend we remove all references to fco2nat in lpjg cmorization code, I can do a MR if you want, along with adding support for co2box model.

@Klaus do you get fco2nat in Amon alongside fco2antt, despite the warning?

klauswyser commented 9 months ago

Concerning the IFS part of your question: [...] So that looks correct. For instance I have this kind of message as well when I run my test-all test without using restart files. It's a INFO message.

Sure, the IFS output ends with these INFO lines. However, the problem then is that processing just hangs, no output, not even any temporary files are created, nada. When following progress in the past I could see that as one of the first actions GRIB output files would be split into temporary files for the different vars, but now nothing seems to happen any longer, the temporary directory remains empty, the process just hangs.

etiennesky commented 9 months ago

@klaus if you cannot obtain fco2nat you can just replace this line

if outname=="fco2nat" and freq=="mon" and table=="Amon" and (cmor.get_cur_dataset_attribute("source_id") == "EC-Earth3-CC"):

with

if False and outname=="fco2nat" and freq=="mon" and table=="Amon" and (cmor.get_cur_dataset_attribute("source_id") == "EC-Earth3-CC"):

If this works, I think this simple change could be done to the ece2cmor package.

nierad commented 9 months ago

I updated ece2cmor3 this morning and also added the extra cc diagnostics. Then I try cmorising 1 leg of a test experiment with the latest output control files.

  • processing of lpjg fails with:
ERROR:ece2cmor3.lpjg2cmor: Cannot find any nemo output files in /home/sm_wyser/optimesm/sm_wyser/ece-run/CD05/output/lpjg/nemo/
ERROR:ece2cmor3.lpjg2cmor: NEMO variable fgco2 needed for target fco2nat in table Amon was not found in nemo output... 
ERROR:ece2cmor3.lpjg2cmor: There was a problem adding nemo variable fgco2 to fco2nat in table Amon... exiting ece2cmor3

The LPJG error is pretty weird, why should there be a nemo subdirectory in lpjg?

The IFS error could possibly be related to #768, but not sure.

The model output (1 year) is available from http://exporter.nsc.liu.se/18929416faf7443497a154b4f2378e11

Hi @klauswyser , the NEMO comment is a legacy error from the guy who originally created it. I will remove/change it.

klauswyser commented 9 months ago

so in summary I recommend we remove all references to fco2nat in lpjg cmorization code, I can do a MR if you want, along with adding support for co2box model.

Yes, please go ahead and do the necessary changes. However, will this then disable something that we would need for emission driven runs?

@klaus do you get fco2nat in Amon alongside fco2antt, despite the warning?

So far I only got Amon/fco2antt, but not sure from which component. Definitely not from IFS.

goord commented 9 months ago

Hi @klauswyser I will have a look at the IFS issue, is there a varlist file somewhere?

klauswyser commented 9 months ago

Thanks @goord. The varlist is from the ECE repository: https://svn.ec-earth.org/ecearth3/branches/projects/optimesm/runtime/classic/ctrl/output-control-files/optimesm/optimesm-request-EC-EARTH-CC-varlist.json

treerink commented 9 months ago

The basic OptimESM request is also available here in ece2cmor (used by genecec).

OK, I did not understand it at first, as I looked from the data request perspective, and there is no fgco2 in the LPJG request list. I now see that the lpjg2cmor.py code contains hard-coded statements such as: if outname=="fco2nat" and freq=="mon" and table=="Amon", which trigger the fgco2 request when Amon fco2nat is requested from the LPJG component. That explains where the error message comes from (and how it could occur at all).

So when Amon fco2nat is requested also Omon fgco2 is asked (well monthly fgco2, directly from the raw NEMO ECE output).

treerink commented 9 months ago

Concerning the IFS issue: ah ok hanging, I overlooked / forgot that comment. Hmm @plesager recently had also an issue with hanging IFS cmorisation on the KNMI HPC for FOCI if I remember correctly, while I could complete the same job at Bologna hpc2020 without any problem. And today I could run the test-all test at the KNMI HPC as well -- just sharing recent info. I can try the FOCI IFS cmorisation at the KNMI HPC whether I can repeat the hanging situation, didn't do that yet.

@klauswyser could you repeat with the updated OptimESM request (svn repo) where I took off the two fx variables, just to be sure they are not the cause?

treerink commented 9 months ago

Concerning the monthly LPJG Amon fco2nat: Would it be easiest to just plainly cmorise LPJG Amon fco2nat and NEMO Omon fgco2, and thereafter run a tiny post-process script (nco or cdo) which combines the two and delivers the final Amon fco2nat, with a note about this on the wiki recommended strategies?

treerink commented 9 months ago

FYI: The FOCI IFS/NEMO/TM5 cmorisation of one leg on the KNMI HPC did not hang for me, only the #783 issue stops the IFS cmorisation in the end (I left the fx areacella & fx sftlf in the request on purpose this time). Now repeating without these two fx variables in order to check that I can complete a seamless IFS cmorisation of the FOCI request on the KNMI HPC as well. Anyway, I do not seem able to reproduce the hanging case here.

@klauswyser Just a stupid question, did you wait long enough? With a substantial request the initial part takes rather long, though after a while you should see appear the filter messages (and you obviously do not report them to appear). Maybe share your submit script? Are there model level variables included?

klauswyser commented 9 months ago

@klauswyser Just a stupid question, did you wait long enough? With a substantial request the initial part takes rather long, though after a while you should see appear the filter messages (and you obviously do not report them to appear). Maybe share your submit script? Are there model level variables included?

Please see below the command that I'm using. No, I don't try to process model levels (haven't saved any in the output anyway). And I did wait for minutes, sometimes more than 10 mins, but didn't get a single temporary file. And this morning I have updated the varlist.json file, the latest update removes the fx variables from IFS, but it doesn't help.

c=ifs
./ece2cmor.py --exp CD05 --varlist ~/optimesm/sm_wyser/ece3-optimesm5/runtime/classic/ctrl/output-control-files/optimesm/optimesm-request-EC-EARTH-CC-varlist.json --meta ~/optimesm/sm_wyser/ece3-optimesm5/runtime/classic/ctrl/output-control-files/optimesm/metadata-cmip6-CMIP-historical-EC-EARTH-CC-${c}-template.json --${c} --odir $HOME/optimesm/sm_wyser/ece2cmor3 ~/optimesm/sm_wyser/ece-run/CD05/output/${c}/001/ --tmpdir $SNIC_TMP 2>&1 | tee logfile_${c}

treerink commented 9 months ago

Before further investigation I would be a bit more patient and allow the job 2 hours or more to be sure or to get a clue. The FOCI cmorisation requires over 4 hours for both IFS and TM5 on two platforms per leg.

klauswyser commented 9 months ago

I now made a new test, waited for 2 hours. There is no progress at all, not one single temporary file, nothing.

goord commented 9 months ago

Hi @klauswyser if you can rerun with a single process, we can spot where the application hangs from the log file

treerink commented 9 months ago

Ok, clearly something going wrong. The link to 1 year of data in the upper post here concerns the data to test for this issue?

klauswyser commented 9 months ago

@treerink : yes, http://exporter.nsc.liu.se/18929416faf7443497a154b4f2378e11 is the model output that I try to process

@goord : I tried with --npp 1 but don't get more information, it just hangs. Is there a debug mode in ece2cmor3?
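Whether ece2cmor3 has a debug mode of its own is not answered in this thread. Independently of that, Python's standard `faulthandler` module can show where a hung Python process is stuck. The following is a minimal sketch, not part of ece2cmor3: the helper names `arm_hang_watchdog` and `disarm_hang_watchdog` are hypothetical, and the sketch assumes you can call them from a small wrapper script before the cmorisation starts.

```python
import faulthandler
import sys


def arm_hang_watchdog(seconds, stream=sys.stderr):
    """Dump the stack of every thread to `stream` every `seconds` seconds.

    `stream` must be a real file object, because faulthandler writes
    through its file descriptor (it needs a working fileno()).
    """
    faulthandler.dump_traceback_later(seconds, repeat=True, file=stream)


def disarm_hang_watchdog():
    """Stop the periodic dumps once the suspicious section has finished."""
    faulthandler.cancel_dump_traceback_later()
```

If the same stack frame shows up in dump after dump, that is a strong hint about where the job hangs. Note that the watchdog only covers the process in which it was armed, so worker processes started separately would each need their own.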

treerink commented 9 months ago

What is the simple command line command to download this data set to my laptop?

klauswyser commented 9 months ago

wget -r

treerink commented 9 months ago

Do you have the exact command to download all of it, or at least all of IFS? I only get a few kB of nonsense metadata or errors when I try the link or the browser path to ifs/001.

klauswyser commented 9 months ago

wget -r -e robots=off http://exporter.nsc.liu.se/18929416faf7443497a154b4f2378e11/ifs/001

goord commented 9 months ago

I am able to reproduce the issue, also on my laptop the cmorization hangs

goord commented 9 months ago

For single process on my laptop, the cmorization went ok, odd.

klauswyser commented 9 months ago

I think the problem comes from some incompatibility between modules and conda envs! ece2cmor.py works fine after unloading the eccodes module that I had loaded by default when logging in.

So far I have just tested one month with one variable, I'm now starting a full scale test with 1 year of OptimESM output.

klauswyser commented 9 months ago

Hooray, the test was passed successfully. It's not clear to me why the eccodes module all of a sudden should lead to problems, I had loaded this module all the time before, even for tests I did a few weeks ago. Neither eccodes nor python-eccodes have been updated recently. And it is not clear why Gijs could run with single but not with double precision. Two unsolved mysteries...

Gijs, does this help you to run ece2cmor3 on your laptop?

When I wrote "passed successfully" I mean the script processed 1 year of data. There were some error messages:

ERROR:ece2cmor3.grib_filter: Field missing in the first day of file: code 248.128, level type 109, level -1. Dismissing task cl in table Amon
ERROR:ece2cmor3.grib_filter: Field missing in the first day of file: code 247.128, level type 109, level -1. Dismissing task cli in table Amon
ERROR:ece2cmor3.grib_filter: Field missing in the first day of file: code 246.128, level type 109, level -1. Dismissing task clw in table Amon
ERROR:ece2cmor3.grib_filter: Field missing in the first day of file: code 157.128, level type 109, level -1. Dismissing task hur in table CFday
ERROR:ece2cmor3.grib_filter: Field missing in the first day of file: code 133.128, level type 109, level -1. Dismissing task hus in table CFday
ERROR:ece2cmor3.grib_filter: Field missing in the first day of file: code 130.128, level type 109, level -1. Dismissing task ta in table CFday
ERROR:ece2cmor3.grib_filter: Field missing in the first day of file: code 131.128, level type 109, level -1. Dismissing task ua in table CFday
ERROR:ece2cmor3.grib_filter: Field missing in the first day of file: code 132.128, level type 109, level -1. Dismissing task va in table CFday
ERROR:ece2cmor3.grib_filter: Field missing in the first day of file: code 135.128, level type 109, level -1. Dismissing task wap in table CFday
ERROR:ece2cmor3.grib_filter: Field missing in the first day of file: code 129.128, level type 109, level -1. Dismissing task zg in table CFday

Doesn't level type 109 designate output on model level? If so it's obvious, I didn't save any model level variables. Would I need to set --skip_alevel_vars to get rid of these messages?

goord commented 9 months ago

Hi @klauswyser great news! Yes the modules on hpc systems may interfere with conda in unpredictable ways. So indeed it is essential to unload any module before loading the conda environment.

You're right those are model level variables and can be skipped with that flag
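To check quickly whether every dismissed task in a log really is a model-level one, the messages can be grouped by level type. This is a small sketch that relies only on the message format quoted above; the function name `dismissed_by_level_type` is made up for illustration and is not part of ece2cmor3.

```python
import re
from collections import defaultdict

# Matches the grib_filter message format quoted in this thread.
DISMISSED = re.compile(
    r"code (?P<code>\d+\.\d+), level type (?P<leveltype>\d+), "
    r"level (?P<level>-?\d+)\. "
    r"Dismissing task (?P<var>\w+) in table (?P<table>\w+)"
)


def dismissed_by_level_type(log_lines):
    """Group dismissed (table, variable) pairs by GRIB level type."""
    groups = defaultdict(list)
    for line in log_lines:
        match = DISMISSED.search(line)
        if match:
            groups[int(match["leveltype"])].append(
                (match["table"], match["var"])
            )
    return dict(groups)
```

Feeding it the lines of a logfile (for example `dismissed_by_level_type(open("logfile_ifs"))`) returns a dictionary keyed by level type. If 109 is the only key and no model-level output was saved, the messages are expected.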

klauswyser commented 9 months ago

The IFS issue can be put ad acta, but the LPJG issue still exists:

ERROR:ece2cmor3.lpjg2cmor: Cannot find any nemo output files in /home/sm_wyser/optimesm/sm_wyser/ece-run/CD05/output/lpjg/nemo/
ERROR:ece2cmor3.lpjg2cmor: NEMO variable fgco2 needed for target fco2nat in table Amon was not found in nemo output... 
ERROR:ece2cmor3.lpjg2cmor: There was a problem adding nemo variable fgco2 to fco2nat in table Amon... exiting ece2cmor3

I tried yesterday with updated ece2cmor and updated tables and varlist, still get the same error. I thought @nierad had fixed it last week, maybe the fix hasn't been merged to the trunk yet?

treerink commented 9 months ago

Maybe @klauswyser change the subject of this issue to : cmorisation fails due to multiple loaded modules on HPC and close it.

For different issues please use a new or other issue ;)

treerink commented 9 months ago

I changed the subject.

The lesson learned is to be careful with multiple loaded modules on your system.
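A pre-flight check along these lines could flag the situation before a long job is started. It is only a sketch under assumptions: it presumes the cluster uses Environment Modules or Lmod, both of which export the loaded modules as a colon-separated list in the `LOADEDMODULES` environment variable. The function names and the default keyword list are illustrative; eccodes is the only module identified as a culprit in this thread.

```python
import os


def loaded_modules(environ=os.environ):
    """Return the names listed in LOADEDMODULES (empty list if unset)."""
    raw = environ.get("LOADEDMODULES", "")
    return [name for name in raw.split(":") if name]


def suspicious_modules(environ=os.environ, keywords=("eccodes",)):
    """Return loaded modules whose name contains one of `keywords`."""
    return [
        name
        for name in loaded_modules(environ)
        if any(key in name.lower() for key in keywords)
    ]
```

Calling `suspicious_modules()` inside the activated conda environment and finding a non-empty list suggests unloading those modules (or starting from a clean module state) before running ece2cmor.py.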

klauswyser commented 9 months ago

@treerink, why did you change the subject and close the issue without consulting me first? The issue was about cmorising for OptimESM and included both IFS and LPJG problems. The IFS problem is solved but not the LPJG problem. In my eyes it's stupid to close unsolved issues and advise an author to open a new issue instead. This is good for statistics ("wow, see how many issues we have solved") but pisses off users that don't get a solution.

nierad commented 9 months ago

The IFS issue can be put ad acta, but the LPJG issue still exists:

ERROR:ece2cmor3.lpjg2cmor: Cannot find any nemo output files in /home/sm_wyser/optimesm/sm_wyser/ece-run/CD05/output/lpjg/nemo/
ERROR:ece2cmor3.lpjg2cmor: NEMO variable fgco2 needed for target fco2nat in table Amon was not found in nemo output... 
ERROR:ece2cmor3.lpjg2cmor: There was a problem adding nemo variable fgco2 to fco2nat in table Amon... exiting ece2cmor3

I tried yesterday with updated ece2cmor and updated tables and varlist, still get the same error. I thought @nierad had fixed it last week, maybe the fix hasn't been merged to the trunk yet?

Hi @klauswyser , Sorry, I wasn't quite clear on the fix: There was use of old nemo-code which I thought was misleading here. The actual issue is the demand for fgco2 which was introduced by @etiennesky I think. I will look into this right now.

nierad commented 9 months ago

@klauswyser : I have created a branch that should fix the issue with not finding the right nemo files. Would you be able to test it quickly? I do not have a full set up running at the moment. The branch is "789-import-fgco2-in-LPJG"

treerink commented 9 months ago

Concerning the monthly LPJG Amon fco2nat: Would it be easiest to just plainly cmorise LPJG Amon fco2nat and NEMO Omon fgco2, and thereafter run a tiny post-process script (nco or cdo) which combines the two and delivers the final Amon fco2nat, with a note about this on the wiki recommended strategies?

This seems to me still by far the easiest solution for delivering Amon fco2nat.

nierad commented 9 months ago

@treerink : Well, I don't know :) It should work now and this is only ever supposed to happen in CC runs, no?

klauswyser commented 9 months ago

The cmorisation of LPJG output fails with Lars' branch that addresses reading of fgco2 from nemo:

INFO:ece2cmor3.taskloader: Created 137 ece2cmor tasks from input variable list.
INFO:ece2cmor3.ece2cmorlib: Selected 137 LPJG tasks from 137 input tasks
INFO:ece2cmor3.lpjg2cmor: Executing 137 lpjg tasks...
INFO:ece2cmor3.lpjg2cmor: Cmorizing lpjg tasks...
Traceback (most recent call last):
  File "/home/sm_wyser/cmorize/ece2cmor3/ece2cmor3/./ece2cmor.py", line 157, in <module>
    main()
  File "/home/sm_wyser/cmorize/ece2cmor3/ece2cmor3/./ece2cmor.py", line 146, in main
    ece2cmorlib.perform_lpjg_tasks(args.datadir, args.tmpdir, args.exp, refdate)
  File "/home/sm_wyser/cmorize/ece2cmor3/ece2cmor3/ece2cmorlib.py", line 244, in perform_lpjg_tasks
    lpjg2cmor.execute(lpjg_tasks)
  File "/home/sm_wyser/cmorize/ece2cmor3/ece2cmor3/lpjg2cmor.py", line 224, in execute
    if not check_time_resolution(lpjgfile, freq):
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sm_wyser/cmorize/ece2cmor3/ece2cmor3/lpjg2cmor.py", line 333, in check_time_resolution
    elif freq.startswith("yr"):
         ^^^^^^^^^^^^^^^^^^^^^
TypeError: startswith first arg must be bytes or a tuple of bytes, not str

It seems the fgco2 error is gone, maybe this is a new (different) problem?

However, in the meantime something else seems to have happened with the master branch since yesterday:

INFO:ece2cmor3.taskloader: Created 139 ece2cmor tasks from input variable list.
INFO:ece2cmor3.ece2cmorlib: Selected 139 LPJG tasks from 139 input tasks
INFO:ece2cmor3.lpjg2cmor: Executing 139 lpjg tasks...
INFO:ece2cmor3.lpjg2cmor: Cmorizing lpjg tasks...
INFO:ece2cmor3.lpjg2cmor: Processing file /home/sm_wyser/optimesm/sm_wyser/ece-run/CD05/output/lpjg/001/fco2antt_monthly.out
INFO:ece2cmor3.lpjg2cmor: Creating lpjg netcdf file for variable fco2antt for year 2000
INFO:ece2cmor3.lpjg2cmor: CMORizing variable fco2antt in table Amon from fco2antt in file fco2antt.out...
[...]
INFO:ece2cmor3.lpjg2cmor: Processing file /home/sm_wyser/optimesm/sm_wyser/ece-run/CD05/output/lpjg/001/fco2nat_monthly.out
INFO:ece2cmor3.lpjg2cmor: Creating lpjg netcdf file for variable fco2nat for year 2000
INFO:ece2cmor3.lpjg2cmor: Using the following cdo version for conservative remapping
Climate Data Operators version 2.3.0 (https://mpimet.mpg.de/cdo)
System: x86_64-conda-linux-gnu
CXX Compiler: /home/conda/feedstock_root/build_artifacts/cdo_1697853988338/_build_env/bin/x86_64-conda-linux-gnu-c++ -fPIC -DPIC -g -O2 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/sm_wyser/.conda/envs/ece2cmor3/include -fdebug-prefix-map=/home/conda/feedstock_root/build_artifacts/cdo_1697853988338/work=/usr/local/src/conda/cdo-2.3.0 -fdebug-prefix-map=/home/sm_wyser/.conda/envs/ece2cmor3=/usr/local/src/conda-prefix -fopenmp -pthread
[...]
cdo    add (Abort): Latitude orientation differ! First grid: N->S; second grid: S->N
ERROR:ece2cmor3.lpjg2cmor: There was a problem adding remapped fgco2 variable from nemo output file /home/sm_wyser/optimesm/sm_wyser/ece-run/CD05/output/nemo/001/CD05_1m_20000101_20001231_pisces_grid_T_2D.nc to fco2nat in table Amon...
ERROR:ece2cmor3.lpjg2cmor: There was a problem adding nemo variable fgco2 to fco2nat in table Amon... exiting ece2cmor3

To process fgco2 cdo is called for remapping (the nemo output?), but it fails. Add -invertlat? Full logfile

Should we follow Thomas' advice and open a new dedicated issue for this? In addition to this fgco2 issue (hopefully solved soon) there are still a dozen or so error messages about the new LPJGday and LPJGmon tables.
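The cdo abort above says that the two fields being added store their rows in opposite latitude order (one north to south, the other south to north). cdo's `invertlat` operator reverses the latitude order of a field. The plain-Python sketch below only illustrates the idea of aligning the orientation before adding; it is not what lpjg2cmor.py does, and all function names are hypothetical.

```python
def orientation(lats):
    """Return 'N->S' or 'S->N' for a monotonic list of latitudes."""
    return "N->S" if lats[0] > lats[-1] else "S->N"


def align_rows(lats, rows, target):
    """Reverse latitudes and data rows if their order differs from `target`."""
    if orientation(lats) == target:
        return list(lats), [list(r) for r in rows]
    return list(reversed(lats)), [list(r) for r in reversed(rows)]


def add_fields(lats_a, rows_a, lats_b, rows_b):
    """Add field b to field a after flipping b to a's latitude order."""
    lats_b, rows_b = align_rows(lats_b, rows_b, orientation(lats_a))
    if lats_b != list(lats_a):
        raise ValueError("latitudes still differ after alignment")
    return [
        [x + y for x, y in zip(row_a, row_b)]
        for row_a, row_b in zip(rows_a, rows_b)
    ]
```

Each field here is a list of rows, one row per latitude. Flipping only the second field keeps the result on the first field's grid, which matters when the output has to follow a fixed target grid.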

nierad commented 9 months ago

Hi @klauswyser , yes, maybe this does deserve a new issue. The "startswith()" issue is incredibly weird. It just disappeared when I checked out a new branch, and I also think that the error message is wrong if you look at the definition of it...

EDIT: Yes, the invertlat might fix it.

I wonder if this ever worked before...?
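On the "startswith()" message: Python raises exactly this TypeError when `startswith` is called on a bytes object with a str argument, so at that point `freq` was most likely bytes rather than str. Where the bytes value came from is not established in this thread. The sketch below shows one defensive way to handle it; `as_text` and `is_yearly` are hypothetical helpers, not functions from lpjg2cmor.py.

```python
def as_text(value, encoding="utf-8"):
    """Return `value` as str, decoding it first if it is a bytes object."""
    if isinstance(value, (bytes, bytearray)):
        return value.decode(encoding)
    return value


def is_yearly(freq):
    """Hypothetical stand-in for the frequency test seen in the traceback."""
    return as_text(freq).startswith("yr")
```

Normalising the value once, where it is read, is usually cleaner than guarding every later comparison.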

nierad commented 9 months ago

@klauswyser , I have committed two fixes: one for the "startswith" issue, and for now I have removed the "invertlat" in the cdo command for fgco2. I am not sure, though, that this is enough. If this doesn't fix it I will open a new issue.

treerink commented 9 months ago

My general preference is one problem per issue (not an iron format), but with that it is easier to solve issues one by one.

Lars, I think you branched off from an out-of-date master. I tried to merge the master into your branch, but meanwhile you made a commit, so I now first have to resolve the mixed-up git state.

nierad commented 9 months ago

Hmm, now I start to see the downsides of git :) Shall I restart my changes and delete the branch?


treerink commented 9 months ago

No just do another pull in your branch, I have merged the master a second ago.

nierad commented 9 months ago

Ok, done!

Thanks and sorry!

