Closed klauswyser closed 9 months ago
Concerning the IFS part of your question:
When I look into your output/ifs/001/ dir I do see both files:
ICMGGCD05+000000
ICMSHCD05+000000
So that looks correct. For instance I have this kind of message as well when I run my test-all test without using restart files. It's an INFO message.
Omon fgco2 & Oyr fgco2 are only active in the ECE3-CC version (following the admin in ece2cmor3/resources/prefs.py). I have to admit that this variable is outside my normal test-all test with the AOGCM. It comes from the PISCES part. In the OptimESM request only Omon fgco2 is requested. Is it not requested at all in the LPJG part?
So no idea; an unintended modification in the json file?
@etiennesky and @nierad, any idea about fgco2? Presumably this variable is only relevant in emission driven experiments, and could then be saved by the CO2 box model, but not sure.
fgco2 is absolutely required in both concentration and emission-driven runs - it is the air-sea co2 flux and we need it as a diagnostic.
The issue with this variable is that I had added special code to combine fgco2 with fco2nat (in Lmon table) to create fco2nat (in Amon table). But I think we should drop this altogether, so that fco2nat in the Amon table comes only from LPJG directly.
so in summary I recommend we remove all references to fco2nat in lpjg cmorization code, I can do a MR if you want, along with adding support for co2box model.
@Klaus do you get fco2nat in Amon alongside fco2antt, despite the warning?
Concerning the IFS part of your question: [...] So that looks correct. For instance I have this kind of message as well when I run my test-all test without using restart files. It's an INFO message.
Sure, the IFS output ends with these INFO lines. However, the problem then is that processing just hangs, no output, not even any temporary files are created, nada. When following progress in the past I could see that as one of the first actions GRIB output files would be split into temporary files for the different vars, but now nothing seems to happen any longer, the temporary directory remains empty, the process just hangs.
@klaus if you cannot obtain fco2nat you can just replace this line
if outname=="fco2nat" and freq=="mon" and table=="Amon" and (cmor.get_cur_dataset_attribute("source_id") == "EC-Earth3-CC"):
with
if False and outname=="fco2nat" and freq=="mon" and table=="Amon" and (cmor.get_cur_dataset_attribute("source_id") == "EC-Earth3-CC"):
If this works, I think this simple change could be done to the ece2cmor package.
I updated ece2cmor3 this morning and also added the extra cc diagnostics. Then I try cmorising 1 leg of a test experiment with the latest output control files.
- processing of lpjg fails with:
ERROR:ece2cmor3.lpjg2cmor: Cannot find any nemo output files in /home/sm_wyser/optimesm/sm_wyser/ece-run/CD05/output/lpjg/nemo/
ERROR:ece2cmor3.lpjg2cmor: NEMO variable fgco2 needed for target fco2nat in table Amon was not found in nemo output...
ERROR:ece2cmor3.lpjg2cmor: There was a problem adding nemo variable fgco2 to fco2nat in table Amon... exiting ece2cmor3
The LPJG error is pretty weird, why should there be a nemo subdirectory in lpjg?
The IFS error could possibly be related to #768, but not sure.
The model output (1 year) is available from http://exporter.nsc.liu.se/18929416faf7443497a154b4f2378e11
Hi @klauswyser , the NEMO comment is a legacy error from the guy who originally created it. I will remove/change it.
so in summary I recommend we remove all references to fco2nat in lpjg cmorization code, I can do a MR if you want, along with adding support for co2box model.
Yes, please go ahead and do the necessary changes. However, will this then disable something that we would need for emission driven runs?
@klaus do you get fco2nat in Amon alongside fco2antt, despite the warning?
So far I only got Amon/fco2antt, but not sure from which component. Definitely not from IFS.
Hi @klauswyser I will have a look at the IFS issue, is there a varlist file somewhere?
Thanks @goord. The varlist is from the ECE repository: https://svn.ec-earth.org/ecearth3/branches/projects/optimesm/runtime/classic/ctrl/output-control-files/optimesm/optimesm-request-EC-EARTH-CC-varlist.json
The basic OptimESM request is also available here in ece2cmor (used by genecec).
Ok I did not understand it, as I looked from the data request perspective, and there is no fgco2 in the lpjg request list. I now see in the lpjg2cmor.py code some hard-coded variable statements like:
if outname=="fco2nat" and freq=="mon" and table=="Amon"
triggering the fgco2 request when Amon fco2nat is requested from the lpjg component. That makes clear where the error message comes from (and how it could happen at all).
So when Amon fco2nat is requested, Omon fgco2 is requested as well (well, monthly fgco2, directly from the raw NEMO ECE output).
Concerning the IFS issue: ah ok, hanging, I overlooked / forgot that comment. Hmm, @plesager recently also had an issue with hanging IFS cmorisation on the KNMI HPC for FOCI if I remember correctly, while I could complete the same job at Bologna hpc2020 without any problem. And today I could run the test-all test at the KNMI HPC as well -- just sharing recent info. I can try the FOCI IFS cmorisation at the KNMI HPC to see whether I can reproduce the hanging situation; I didn't do that yet.
@klauswyser could you repeat with the updated OptimESM request (svn repo) where I took off the two fx variables, just to be sure they are not the cause?
Concerning the monthly LPJG Amon fco2nat: Would the easiest be to just plainly cmorise LPJG Amon fco2nat and NEMO Omon fgco2, and thereafter run a tiny post-process script (nco or cdo) which combines the two and delivers the final Amon fco2nat, with a note about this on the wiki recommended strategies?
FYI: The FOCI IFS/NEMO/TM5 cmorisation of one leg on the KNMI HPC did not hang for me, only the #783 issue stops the IFS cmorisation in the end (I left the fx areacella & fx sftlf in the request on purpose this time). Now repeating without these two fx variables in order to check that I can complete a seamless IFS cmorisation of the FOCI request on the KNMI HPC as well. Anyway, I do not seem able to reproduce the hanging case here.
@klauswyser Just a stupid question, did you wait long enough? With a substantial request the initial part takes rather long, though after a while you should see the filter messages appear (and you obviously do not report seeing them). Maybe share your submit script? Are there model level variables included?
Please see below the command that I'm using. No, I don't try to process model levels (haven't saved any in the output anyway). And I did wait for minutes, sometimes more than 10 mins, but didn't get a single temporary file. And this morning I have updated the varlist.json file; the latest update removes the fx variables from IFS, but it doesn't help.
c=ifs
./ece2cmor.py --exp CD05 --varlist ~/optimesm/sm_wyser/ece3-optimesm5/runtime/classic/ctrl/output-control-files/optimesm/optimesm-request-EC-EARTH-CC-varlist.json --meta ~/optimesm/sm_wyser/ece3-optimesm5/runtime/classic/ctrl/output-control-files/optimesm/metadata-cmip6-CMIP-historical-EC-EARTH-CC-${c}-template.json --${c} --odir $HOME/optimesm/sm_wyser/ece2cmor3 ~/optimesm/sm_wyser/ece-run/CD05/output/${c}/001/ --tmpdir $SNIC_TMP 2>&1 | tee logfile_${c}
Before further investigation I would be a bit more patient and allow the job 2 hours or more to be sure or to get a clue. The FOCI cmorisation requires over 4 hours for both IFS and TM5 on two platforms per leg.
I now made a new test, waited for 2 hours. There is no progress at all, not one single temporary file, nothing.
Hi @klauswyser if you can rerun with a single process, we can spot where the application hangs from the log file
Ok, clearly something is going wrong. Is the link to 1 year of data in the post above the data to test for this issue?
@treerink : yes, http://exporter.nsc.liu.se/18929416faf7443497a154b4f2378e11 is the model output that I try to process
@goord : I tried with --npp 1 but don't get more information, it just hangs. Is there a debug mode in ece2cmor3?
What is the simple command line command to download this data set to my laptop?
wget -r
Do you have the exact example to download all of it or at least all of IFS? I get only a few kB of nonsense metadata or errors when I try the link or the browser path to the ifs/001.
wget -r -e robots=off http://exporter.nsc.liu.se/18929416faf7443497a154b4f2378e11/ifs/001
I am able to reproduce the issue, also on my laptop the cmorization hangs
For single process on my laptop, the cmorization went ok, odd.
I think the problem comes from some incompatibility between modules and conda envs! ece2cmor.py works fine after unloading the eccodes module that I had loaded by default when logging in.
So far I have just tested one month with one variable, I'm now starting a full scale test with 1 year of OptimESM output.
Hooray, the test was passed successfully. It's not clear to me why the eccodes module all of a sudden should lead to problems, I had loaded this module all the time before, even for tests I did a few weeks ago. Neither eccodes nor python-eccodes have been updated recently. And it is not clear why Gijs could run with single but not with double precision. Two unsolved mysteries...
Gijs, does this help you to run ece2cmor3 on your laptop?
When I wrote "passed successfully" I mean the script processed 1 year of data. There were some error messages:
ERROR:ece2cmor3.grib_filter: Field missing in the first day of file: code 248.128, level type 109, level -1. Dismissing task cl in table Amon
ERROR:ece2cmor3.grib_filter: Field missing in the first day of file: code 247.128, level type 109, level -1. Dismissing task cli in table Amon
ERROR:ece2cmor3.grib_filter: Field missing in the first day of file: code 246.128, level type 109, level -1. Dismissing task clw in table Amon
ERROR:ece2cmor3.grib_filter: Field missing in the first day of file: code 157.128, level type 109, level -1. Dismissing task hur in table CFday
ERROR:ece2cmor3.grib_filter: Field missing in the first day of file: code 133.128, level type 109, level -1. Dismissing task hus in table CFday
ERROR:ece2cmor3.grib_filter: Field missing in the first day of file: code 130.128, level type 109, level -1. Dismissing task ta in table CFday
ERROR:ece2cmor3.grib_filter: Field missing in the first day of file: code 131.128, level type 109, level -1. Dismissing task ua in table CFday
ERROR:ece2cmor3.grib_filter: Field missing in the first day of file: code 132.128, level type 109, level -1. Dismissing task va in table CFday
ERROR:ece2cmor3.grib_filter: Field missing in the first day of file: code 135.128, level type 109, level -1. Dismissing task wap in table CFday
ERROR:ece2cmor3.grib_filter: Field missing in the first day of file: code 129.128, level type 109, level -1. Dismissing task zg in table CFday
Doesn't level type 109 designate output on model level? If so it's obvious, I didn't save any model level variables. Would I need to set --skip_alevel_vars to get rid of these messages?
Hi @klauswyser great news! Yes the modules on hpc systems may interfere with conda in unpredictable ways. So indeed it is essential to unload any module before loading the conda environment.
You're right those are model level variables and can be skipped with that flag
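To see at a glance which tasks were dismissed for missing model level fields, the log can be scanned for the grib_filter message quoted above. This is only a minimal sketch: the regular expression is written against the exact wording of the lines in this thread, and the helper name `dismissed_tasks` is made up for illustration, not part of ece2cmor3.

```python
import re

# Pattern matching the grib_filter message as it appears in this thread.
MISSING_FIELD = re.compile(
    r"Field missing in the first day of file: code (?P<code>\d+\.\d+), "
    r"level type (?P<leveltype>\d+), level (?P<level>-?\d+)\. "
    r"Dismissing task (?P<var>\w+) in table (?P<table>\w+)"
)


def dismissed_tasks(log_text, leveltype=109):
    """Return (table, variable, grib code) for tasks dismissed at the given level type."""
    found = []
    for match in MISSING_FIELD.finditer(log_text):
        if int(match.group("leveltype")) == leveltype:
            found.append((match.group("table"), match.group("var"), match.group("code")))
    return found


# Two of the lines reported above, used as sample input.
sample = (
    "ERROR:ece2cmor3.grib_filter: Field missing in the first day of file: "
    "code 248.128, level type 109, level -1. Dismissing task cl in table Amon\n"
    "ERROR:ece2cmor3.grib_filter: Field missing in the first day of file: "
    "code 130.128, level type 109, level -1. Dismissing task ta in table CFday\n"
)

for table, var, code in dismissed_tasks(sample):
    print(f"{table} {var} (code {code})")
```

If every dismissed task turns out to be at level type 109, the messages only reflect that no model level output was saved.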
The IFS issue can be put ad acta, but the LPJG issue still exists:
ERROR:ece2cmor3.lpjg2cmor: Cannot find any nemo output files in /home/sm_wyser/optimesm/sm_wyser/ece-run/CD05/output/lpjg/nemo/
ERROR:ece2cmor3.lpjg2cmor: NEMO variable fgco2 needed for target fco2nat in table Amon was not found in nemo output...
ERROR:ece2cmor3.lpjg2cmor: There was a problem adding nemo variable fgco2 to fco2nat in table Amon... exiting ece2cmor3
I tried yesterday with updated ece2cmor and updated tables and varlist, still get the same error. I thought @nierad had fixed it last week, maybe the fix hasn't been merged to the trunk yet?
Maybe @klauswyser change the subject of this issue to: cmorisation fails due to multiple loaded modules on HPC, and close it.
For different issues please use a new or other issue ;)
I changed the subject.
The lesson learned is to be careful with multiple loaded modules on your system.
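A quick pre-flight check along these lines could be run before starting ece2cmor.py. It is a sketch under assumptions: it relies on the `LOADEDMODULES` environment variable, which Environment Modules and Lmod maintain as a colon-separated list, and the `suspects` tuple only contains eccodes because that is the module named in this thread. The function names are hypothetical.

```python
import os


def loaded_modules(environ=None):
    """Return the currently loaded environment modules as a list of names."""
    environ = os.environ if environ is None else environ
    raw = environ.get("LOADEDMODULES", "")
    return [name for name in raw.split(":") if name]


def suspicious_modules(environ=None, suspects=("eccodes",)):
    """Return loaded modules whose name contains one of the suspect strings."""
    return [
        name
        for name in loaded_modules(environ)
        if any(suspect in name.lower() for suspect in suspects)
    ]


hits = suspicious_modules()
if hits:
    print("Loaded modules that may clash with the conda env:", ", ".join(hits))
else:
    print("No suspicious modules loaded.")
```

On a system without a module tool the variable is simply absent and the check reports nothing, so it is safe to leave in a submit script.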
@treerink, why did you change the subject and close the issue without consulting me first? The issue was about cmorising for OptimESM and included both IFS and LPJG problems. The IFS problem is solved but not the LPJG problem. In my eyes it's stupid to close unsolved issues and advise an author to open a new issue instead. This is good for statistics ("wow, see how many issues we have solved") but pisses off users that don't get a solution.
Hi @klauswyser , Sorry, I wasn't quite clear on the fix: There was use of old nemo-code which I thought was misleading here. The actual issue is the demand for fgco2 which was introduced by @etiennesky I think. I will look into this right now.
@klauswyser : I have created a branch that should fix the issue with not finding the right nemo files. Would you be able to test it quickly? I do not have a full set up running at the moment. The branch is "789-import-fgco2-in-LPJG"
Concerning the monthly LPJG Amon fco2nat: Would it be the easiest to just plainly cmorise LPJG Amon fco2nat and NEMO Omon fgco2 and run thereafter a tiny post-process script (nco or cdo) which combines the two and delivers the final Amon fco2nat with a note about this on the wiki recommended strategies?
This seems to me still by far the easiest solution for delivering Amon fco2nat.
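The post-process idea could look roughly like the sketch below, which only builds the two cdo command lines and prints them without running anything. Everything here is an assumption for illustration: the file names, the helper `combine_commands`, the choice of `remapcon` (which needs cell bounds in the NEMO file), and `add` as the combining operator, taken from the "adding nemo variable fgco2 to fco2nat" wording in the error messages. Sign and unit conventions of the two fluxes would need to be checked before using the result.

```python
import shlex


def combine_commands(lpjg_file, nemo_file, out_file,
                     nemo_var="fgco2",
                     remapped_file="fgco2_on_lpjg_grid.nc",
                     combine_op="add"):
    """Build two cdo command lines: remap the NEMO flux, then combine it with the LPJG flux."""
    # Step 1: select the NEMO variable and remap it conservatively onto the LPJG grid.
    remap = ["cdo", f"remapcon,{lpjg_file}", f"-selname,{nemo_var}",
             nemo_file, remapped_file]
    # Step 2: combine the LPJG field with the remapped NEMO field.
    combine = ["cdo", combine_op, lpjg_file, remapped_file, out_file]
    return [remap, combine]


# Hypothetical file names, only to show the resulting command lines.
for cmd in combine_commands("fco2nat_lpjg.nc", "fgco2_nemo.nc", "fco2nat_Amon.nc"):
    print(shlex.join(cmd))
```

Keeping the remap and the combine as separate steps leaves an intermediate file that can be inspected if the two grids turn out not to line up.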
@treerink : Well, I don't know :) It should work now and this is only ever supposed to happen in CC runs, no?
The cmorisation of LPJG output fails with Lars' branch that addresses reading of fgco2 from nemo:
INFO:ece2cmor3.taskloader: Created 137 ece2cmor tasks from input variable list.
INFO:ece2cmor3.ece2cmorlib: Selected 137 LPJG tasks from 137 input tasks
INFO:ece2cmor3.lpjg2cmor: Executing 137 lpjg tasks...
INFO:ece2cmor3.lpjg2cmor: Cmorizing lpjg tasks...
Traceback (most recent call last):
File "/home/sm_wyser/cmorize/ece2cmor3/ece2cmor3/./ece2cmor.py", line 157, in <module>
main()
File "/home/sm_wyser/cmorize/ece2cmor3/ece2cmor3/./ece2cmor.py", line 146, in main
ece2cmorlib.perform_lpjg_tasks(args.datadir, args.tmpdir, args.exp, refdate)
File "/home/sm_wyser/cmorize/ece2cmor3/ece2cmor3/ece2cmorlib.py", line 244, in perform_lpjg_tasks
lpjg2cmor.execute(lpjg_tasks)
File "/home/sm_wyser/cmorize/ece2cmor3/ece2cmor3/lpjg2cmor.py", line 224, in execute
if not check_time_resolution(lpjgfile, freq):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sm_wyser/cmorize/ece2cmor3/ece2cmor3/lpjg2cmor.py", line 333, in check_time_resolution
elif freq.startswith("yr"):
^^^^^^^^^^^^^^^^^^^^^
TypeError: startswith first arg must be bytes or a tuple of bytes, not str
It seems the fgco2 error is gone, maybe a new (different) problem?
However, in the meantime something else seems to have happened with the master branch since yesterday:
INFO:ece2cmor3.taskloader: Created 139 ece2cmor tasks from input variable list.
INFO:ece2cmor3.ece2cmorlib: Selected 139 LPJG tasks from 139 input tasks
INFO:ece2cmor3.lpjg2cmor: Executing 139 lpjg tasks...
INFO:ece2cmor3.lpjg2cmor: Cmorizing lpjg tasks...
INFO:ece2cmor3.lpjg2cmor: Processing file /home/sm_wyser/optimesm/sm_wyser/ece-run/CD05/output/lpjg/001/fco2antt_monthly.out
INFO:ece2cmor3.lpjg2cmor: Creating lpjg netcdf file for variable fco2antt for year 2000
INFO:ece2cmor3.lpjg2cmor: CMORizing variable fco2antt in table Amon from fco2antt in file fco2antt.out...
[...]
INFO:ece2cmor3.lpjg2cmor: Processing file /home/sm_wyser/optimesm/sm_wyser/ece-run/CD05/output/lpjg/001/fco2nat_monthly.out
INFO:ece2cmor3.lpjg2cmor: Creating lpjg netcdf file for variable fco2nat for year 2000
INFO:ece2cmor3.lpjg2cmor: Using the following cdo version for conservative remapping
Climate Data Operators version 2.3.0 (https://mpimet.mpg.de/cdo)
System: x86_64-conda-linux-gnu
CXX Compiler: /home/conda/feedstock_root/build_artifacts/cdo_1697853988338/_build_env/bin/x86_64-conda-linux-gnu-c++ -fPIC -DPIC -g -O2 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/sm_wyser/.conda/envs/ece2cmor3/include -fdebug-prefix-map=/home/conda/feedstock_root/build_artifacts/cdo_1697853988338/work=/usr/local/src/conda/cdo-2.3.0 -fdebug-prefix-map=/home/sm_wyser/.conda/envs/ece2cmor3=/usr/local/src/conda-prefix -fopenmp -pthread
[...]
cdo add (Abort): Latitude orientation differ! First grid: N->S; second grid: S->N
ERROR:ece2cmor3.lpjg2cmor: There was a problem adding remapped fgco2 variable from nemo output file /home/sm_wyser/optimesm/sm_wyser/ece-run/CD05/output/nemo/001/CD05_1m_20000101_20001231_pisces_grid_T_2D.nc to fco2nat in table Amon...
ERROR:ece2cmor3.lpjg2cmor: There was a problem adding nemo variable fgco2 to fco2nat in table Amon... exiting ece2cmor3
To process fgco2 cdo is called for remapping (the nemo output?), but it fails. Add -invertlat? Full logfile
Should we follow Thomas' advice and open a new dedicated issue for this? In addition to this fgco2 issue (hopefully solved soon) there are still a dozen or so error messages about the new LPJGday and LPJGmon tables.
Hi @klauswyser, yes, maybe this does deserve a new issue. The "startswith()" issue is incredibly weird. It just disappeared when I checked out a new branch, and I also think that the error message is wrong if you look at the definition of it...
EDIT: Yes, the invertlat might fix it.
I wonder if this ever worked before...?
@klauswyser, I have committed two fixes: one for the "startswith" issue, and for now I have removed the "invertlat" in the cdo-command for fgco2. I am not sure, though, that this is enough. If this doesn't fix it I will open a new issue.
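For reference, Python raises exactly this TypeError when `startswith` is called on a bytes object with a str argument. The sketch below reproduces that and shows one defensive way to normalise the value. It assumes that `freq` reached `check_time_resolution` as bytes, which the thread does not confirm, and the helper `normalise_freq` is hypothetical rather than the fix that was committed.

```python
def normalise_freq(freq):
    """Return freq as str, decoding it first if it arrived as bytes."""
    if isinstance(freq, bytes):
        return freq.decode("utf-8")
    return freq


# Reproduce the error: a bytes value tested against a str prefix.
try:
    b"yr".startswith("yr")
except TypeError as err:
    print("TypeError:", err)

# After normalising, the same check works for both bytes and str input.
for value in (b"yr", "yr", "mon"):
    print(repr(value), normalise_freq(value).startswith("yr"))
```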
My general preference is one problem per issue (not an iron format), but with that it is easier to solve issues one by one.
Lars, I think you branched off from an outdated master. I tried to merge the master into your branch but meanwhile you made a commit, so I now first have to resolve the mixed-up git state.
hmm, now I start to see the downsides of git :) Shall I redo my changes and delete the branch?
No, just do another pull in your branch; I merged the master a second ago.
Ok, done!
Thanks and sorry!