EC-Earth / ece2cmor3

Post-processing and cmorization of ec-earth output
Apache License 2.0

problem with variable mrsow in LS3MIP experiments #594

Closed francocatalano closed 4 years ago

francocatalano commented 4 years ago

The ece2cmor3 computation of the LS3MIP variable mrsow requires the constant soil type field (43.128). The LS3MIP experiments were run saving this field only in the +000000 ICMGG output file. For some reason, ece2cmor3 seems to expect field 43.128 in all the ICMGG output files. This gives the following error when running ece2cmor3:

2020-02-13 11:21:18 ERROR:ece2cmor3.grib_filter: Field missing in the first day of file: code 43.128, level type 1, level 0. Dismissing task mrsow in table Eday

To solve this problem, ece2cmor3 has to be modified to read the constant field 43.128 only once, from the ICMGG+000000 file. @treerink @goord
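A minimal sketch of the requested behaviour, assuming the initial-state file can be recognised by its `+000000` suffix. The helper `pick_constant_field_source` and the example file names are hypothetical and are not part of ece2cmor3.

```python
from pathlib import Path

# Hypothetical helper, NOT ece2cmor3 code: choose the file from which a
# constant field such as 43.128 would be read exactly once.
INI_SUFFIX = "+000000"


def pick_constant_field_source(gridpoint_files):
    """Return the ICMGG initial-state file (name ending in +000000), or None."""
    for path in gridpoint_files:
        name = Path(path).name
        if name.startswith("ICMGG") and name.endswith(INI_SUFFIX):
            return path
    return None


# Example file names are assumptions modelled on the names used in this thread.
leg_files = ["ICMGGECE3+198001", "ICMGGECE3+198002", "ICMGGECE3+000000"]
source = pick_constant_field_source(leg_files)
```

With such a lookup, the monthly files would no longer need to contain the constant field at all.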

treerink commented 4 years ago

@francocatalano Just checking: did you follow this wiki recommendation for cmorising?

francocatalano commented 4 years ago

@treerink In my case I did not start from restart but used IFS initial conditions files. Actually, I used the IC files you generated (it was the beginning of november 2019) for the r1 historical run for 1980-01-01. Therefore, I am starting to cmorize at leg 001 of the simulation.

goord commented 4 years ago

Hi @francocatalano I received your files, trying to fix this issue first

goord commented 4 years ago

Hi @francocatalano I created a fix in the branch fx-from-ini. Can you test by running the code at least on leg 1 and 2, and processing all variables?

BTW when you check out the branch, don't forget to run python setup.py install again.

goord commented 4 years ago

Hi Franco, to test my fix, run in your ece2cmor3 repo

git fetch
git checkout fx-from-ini
conda activate ece2cmor3
python setup.py install

and then run your test

francocatalano commented 4 years ago

Hi @goord. It worked fine for leg 001. But when I launch it on leg 002 it gets stuck after a while and does not finish, even after increasing the maximum walltime to 1h. That is very strange, since leg 001 completes in about 20 minutes. Please also note that I tested the fix for #595 only on leg 001, so the problem might have been there already.

goord commented 4 years ago

Hi Franco can you make your leg 002 readable to me? thx

francocatalano commented 4 years ago

> Hi Franco can you make your leg 002 readable to me? thx

Hi @goord. Done. Let me know if you have problems accessing the files.

goord commented 4 years ago

Hi Franco I noticed you have ICM{GG,SH}ECE3+198012 in your leg 2 directory containing copies of fields at 1981-1-1 00:00:00. This confuses ece2cmor3, it will think that this is the actual first file of the leg. You can remove those files and the cmorization should then work properly.
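A small sketch for spotting such stray files before cmorizing, assuming the `ICM{GG,SH}<exp>+YYYYMM` naming seen above and that the leg's first month is known. The function name is hypothetical, and this is not how ece2cmor3 itself detects the first file.

```python
import re

# Assumed naming convention: ICMGG<exp>+YYYYMM or ICMSH<exp>+YYYYMM.
STAMP = re.compile(r"^ICM(GG|SH)\w+\+(\d{6})$")


def files_before_leg_start(names, leg_start_yyyymm):
    """Return the names whose +YYYYMM stamp precedes the leg's first month.

    The initial-state stamp +000000 is skipped on purpose.
    """
    stale = []
    for name in names:
        match = STAMP.match(name)
        if match is None:
            continue
        stamp = match.group(2)
        if stamp == "000000":
            continue
        # Six-digit YYYYMM strings compare correctly as plain strings.
        if stamp < leg_start_yyyymm:
            stale.append(name)
    return sorted(stale)


leg2 = ["ICMGGECE3+198012", "ICMSHECE3+198012", "ICMGGECE3+198101", "ICMGGECE3+000000"]
stray = files_before_leg_start(leg2, "198101")
```

Here `stray` holds the two `+198012` files, which are the ones to remove from the leg 2 directory.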

francocatalano commented 4 years ago

Thanks @goord. After removing the two ICM{GG,SH}ECE3+198012 files from the 002 folder the job still gets stuck and, additionally, I now get this error in the .cmor.log file:

Error: approximate time axis interval is defined as 86400.000000 seconds (1.000000 days), for value 1 we got a difference of 15854400.000000 seconds (183.500000 days), which is 18250.000000 % , seems too big, check your values
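The numbers in that message can be reproduced with a simple relative-deviation calculation. The formula below is an assumption that happens to match the quoted values; it has not been checked against the CMOR source.

```python
SECONDS_PER_DAY = 86400.0


def interval_deviation_percent(expected_s, observed_s):
    """Relative deviation of the observed time step from the expected one, in %."""
    return abs(observed_s - expected_s) / expected_s * 100.0


expected = 86400.0      # daily table: one day between consecutive time values
observed = 15854400.0   # difference reported in the log

observed_days = observed / SECONDS_PER_DAY                 # 183.5 days
deviation = interval_deviation_percent(expected, observed)  # 18250.0 %
```

A step of roughly half a year on a daily axis suggests that two time values from very different parts of the year ended up next to each other.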

goord commented 4 years ago

Hi Franco, can you send me that log file so I can check which variable is causing that error?

francocatalano commented 4 years ago

@goord here is the file: ECE3-ifs-002-20200323091745.cmor.log

goord commented 4 years ago

Hi Franco, could you also send me the standard output stream with all messages from ece2cmor3 itself, probably your job output file?

francocatalano commented 4 years ago

here they are: ECE3-ifs-002-20200323091745.log

pbs-log-for-cmorising-ECE3-ifs-002.out.txt

goord commented 4 years ago

Hi Franco, I can't reproduce your problem. I ran your leg 002 with the branch at ECMWF, and it produced 98 variables in a 14 min. job with 18 threads:

proc_ifs_ls3mip.txt

goord commented 4 years ago

...are you sure you have cleaned your tmp directory too? Leftovers from a crashed or hanging job may cause problems.

francocatalano commented 4 years ago

Hi @goord Tried again after removing temp-cmor-dir but got the same error in .cmor.log file.

francocatalano commented 4 years ago

Here is the job script, in case it helps diagnose the problem: submit-at-cca-ece2cmor-leg-job-pdLC-ssp126.sh.txt

goord commented 4 years ago

Hi Franco, could you launch one more run with only 1 thread (so change the npp option for ece2cmor3 from 18 to 1 and the EC_threads_per_task = 1 as well). This will make the log more understandable.

francocatalano commented 4 years ago

Hi @goord I now managed to get rid of the error. After a few more tests (with 1 thread as well as with 18) I realized that the error arises only if I activate ece2cmor3 manually before launching the submit script. If I do not manually activate ece2cmor3 it works, otherwise it gives that error. Perhaps this is because ece2cmor3 is already activated inside the submit script, so activating it manually beforehand somehow creates problems. Thank you for your support!
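One way to guard against this is to check, before submitting, whether a conda environment is already active in the login shell. The sketch below relies only on the `CONDA_DEFAULT_ENV` variable that `conda activate` sets; the function name is hypothetical, and whether an active environment really is the cause of the error was not established in this thread.

```python
import os


def active_conda_env(environ=None):
    """Return the name of the currently active conda environment, or None."""
    env = os.environ if environ is None else environ
    return env.get("CONDA_DEFAULT_ENV") or None


name = active_conda_env()
if name is not None:
    print(f"Warning: conda environment '{name}' is active; "
          "consider deactivating it before launching the submit script.")
```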

treerink commented 4 years ago

> I realized that the error arises only if I activate ece2cmor3 manually before launching the submit script. If I do not manually activate ece2cmor3 it works, otherwise it gives that error. Perhaps this is because ece2cmor3 is already activated inside the submit script, so activating it manually beforehand somehow creates problems. Thank you for your support!

That's weird, I often call the submit script both from the activated environment and from a plain login, and never noticed such differences. Though I am not sure I called the submit script on cca from the activated environment.

treerink commented 4 years ago

Ok great Franco that your tests are successful now. I will test once more the fx-from-ini branch for the general test-all case, before merging this branch into the master.

goord commented 4 years ago

That is quite strange. I thought it shouldn't matter since the script is being executed on a different machine with a clean environment. On the other hand, we don't know exactly how the conda activate script works, and how it may interfere with the cca modules...

Maybe something to add to the documentation

treerink commented 4 years ago

Okay will do the general tests with the latest fix. Some delay, I had to regain HPC access.

francocatalano commented 4 years ago

Quick update. I have launched all the 121 years of one of our LS3MIP experiments. When checking the results with files-per-year.sh I found that some jobs (about 20 out of 121) produced only 97 files instead of the expected 98. I checked the log files and found that an error occurred while reading the raw grib data. The good news is that when I relaunched the years with this problem, they completed and produced all 98 files as expected. Now, the random errors may be due to some cca problems (yesterday the server underwent some maintenance) or to unpredictable ece2cmor3 behaviour. Just wanted to let you know.
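For reference, the per-year count can be sketched in a few lines. This is a hypothetical Python equivalent written for illustration, not the actual files-per-year.sh script; it assumes CMIP-style file names ending in `_<start>-<end>.nc` and attributes each file to the year of its start stamp.

```python
import re
from collections import Counter

# Matches the trailing time range, e.g. _19800101-19801231.nc or _198001-198012.nc
RANGE = re.compile(r"_(\d{4})\d*-\d+\.nc$")


def years_with_unexpected_count(filenames, expected=98):
    """Map start year -> number of files, for years that deviate from `expected`."""
    counts = Counter()
    for name in filenames:
        match = RANGE.search(name)
        if match:
            counts[int(match.group(1))] += 1
    return {year: n for year, n in sorted(counts.items()) if n != expected}


demo = [
    "mrsow_Eday_EC-Earth3_amip-lfmip-pdLC_r1i1p1f1_gr_19800101-19801231.nc",
    "snc_LImon_EC-Earth3_amip-lfmip-pdLC_r1i1p1f1_gr_198001-198012.nc",
    "snc_LImon_EC-Earth3_amip-lfmip-pdLC_r1i1p1f1_gr_198101-198112.nc",
]
incomplete = years_with_unexpected_count(demo, expected=2)
```

Years that produced no files at all do not show up in the result, so the set of expected years still has to be checked separately.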

treerink commented 4 years ago

Ok yes sounds like glitches, I encountered such glitches as well during cmorising the r1 results, which were indeed solved by rerunning. So yes the checks on the cmorised data are important.

francocatalano commented 4 years ago

Just checked the results with nctime and got some overlaps and broken time series:

Number of dataset(s) with overlap(s): 53
Number of dataset(s) with broken time series: 4

I attach the nctime logfile for reference: nctcck-20200402-102615-16155.log

Any idea about these problems?

goord commented 4 years ago

Looks like 1980 has a problem, it should not contain the time point 1981-1-1...

treerink commented 4 years ago

I just see the message is everywhere totally identical:

<-- overlap from 19810101000000 to 19810101000000

where the overlap dates are the same (so no real overlap?).

For the end message:

Number of dataset(s) with broken time series: 4

Could not spot any messages in the rest of the log, did you spot them?

treerink commented 4 years ago

I finished the general test and compared the ERROR logs with the latest ones posted in #542. Though there are a few small changes which I can't fully trace back, overall it seems fine. So I actually want to merge the fx-from-ini branch into the master.

francocatalano commented 4 years ago

> I just see the message is everywhere totally identical:
>
> <-- overlap from 19810101000000 to 19810101000000
>
> where the overlap dates are the same (so no real overlap?).
>
> For the end message:
>
> Number of dataset(s) with broken time series: 4
>
> Could not spot any messages in the rest of the log, did you spot them?

Look for "broken" in the log file, lines 978, 1834, 3055, 5865.

francocatalano commented 4 years ago

Looking a bit more at the files I found another strange thing: most of the files span yyyy0101-yyyy1231 while a few of them span a shorter period, for example:

mrsow_Eday_EC-Earth3_amip-lfmip-pdLC_r1i1p1f1_gr_19800101-19810101.nc
mrsow_Eday_EC-Earth3_amip-lfmip-pdLC_r1i1p1f1_gr_20620101-20620126.nc

Further, the cmorization job log files associated with the above years apparently do not show any error.

@treerink @goord Are you sure the above problems are not due to ece2cmor3 bugs?

goord commented 4 years ago

Hi Franco, it could be still an issue with ece2cmor3, especially since it is mrsow causing these issues again. To figure out the problem I would like to have a look at the intermediate grib and nc file, but those are probably cleaned already.

What happens if you run the leg once more?

francocatalano commented 4 years ago

@goord To sum up, we currently have two problems:

1) Broken time series reported by nctime. This error occurs on the following files:

mrsow_Eday_EC-Earth3_amip-lfmip-pdLC_r1i1p1f1_gr_19800101-19810101.nc
snc_LImon_EC-Earth3_amip-lfmip-pdLC_r1i1p1f1_gr_198001-198012.nc
hfdsn_LImon_EC-Earth3_amip-lfmip-pdLC_r1i1p1f1_gr_198001-198101.nc
hfdsn_Eday_EC-Earth3_amip-lfmip-pdLC_r1i1p1f1_gr_19800101-19810101.nc

2) Files produced with weird date intervals, as reported above. The files with this problem are:

mrsow_Eday_EC-Earth3_amip-lfmip-pdLC_r1i1p1f1_gr_19800101-19810101.nc
mrsow_Eday_EC-Earth3_amip-lfmip-pdLC_r1i1p1f1_gr_20620101-20620126.nc
snc_LImon_EC-Earth3_amip-lfmip-pdLC_r1i1p1f1_gr_203101-203102.nc
hfdsn_LImon_EC-Earth3_amip-lfmip-pdLC_r1i1p1f1_gr_198001-198101.nc
hfdsn_LImon_EC-Earth3_amip-lfmip-pdLC_r1i1p1f1_gr_201701-201709.nc
hfdsn_LImon_EC-Earth3_amip-lfmip-pdLC_r1i1p1f1_gr_202701-202707.nc
hfdsn_LImon_EC-Earth3_amip-lfmip-pdLC_r1i1p1f1_gr_203101-203102.nc
hfdsn_Eday_EC-Earth3_amip-lfmip-pdLC_r1i1p1f1_gr_19800101-19810101.nc
hfdsn_Eday_EC-Earth3_amip-lfmip-pdLC_r1i1p1f1_gr_19980101-19980122.nc
hfdsn_Eday_EC-Earth3_amip-lfmip-pdLC_r1i1p1f1_gr_20100101-20100519.nc
hfdsn_Eday_EC-Earth3_amip-lfmip-pdLC_r1i1p1f1_gr_20620101-20620126.nc
hfdsn_Eday_EC-Earth3_amip-lfmip-pdLC_r1i1p1f1_gr_20650101-20650404.nc
hfdsn_Eday_EC-Earth3_amip-lfmip-pdLC_r1i1p1f1_gr_20990101-20990312.nc
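Files with such date intervals can be found automatically from their names. The sketch below assumes one-year legs and daily (`YYYYMMDD`) or monthly (`YYYYMM`) stamps as in the file names above; `is_full_year` is a hypothetical helper, and file names with other stamp formats (for example sub-daily ones) are reported as `None` rather than judged.

```python
import re

# Trailing time range with either monthly (6-digit) or daily (8-digit) stamps.
RANGE = re.compile(r"_(\d{6}|\d{8})-(\d{6}|\d{8})\.nc$")


def is_full_year(filename):
    """True if the name spans exactly Jan..Dec of one year, False if not,
    None if no daily/monthly time range could be parsed."""
    match = RANGE.search(filename)
    if match is None:
        return None
    start, end = match.groups()
    if len(start) != len(end):
        return False
    year = start[:4]
    if len(start) == 8:
        return start == year + "0101" and end == year + "1231"
    return start == year + "01" and end == year + "12"


names = [
    "mrsow_Eday_EC-Earth3_amip-lfmip-pdLC_r1i1p1f1_gr_19800101-19810101.nc",
    "snc_LImon_EC-Earth3_amip-lfmip-pdLC_r1i1p1f1_gr_198001-198012.nc",
    "hfdsn_LImon_EC-Earth3_amip-lfmip-pdLC_r1i1p1f1_gr_203101-203102.nc",
]
suspicious = [n for n in names if is_full_year(n) is False]
```

In this example `suspicious` contains the first and third names, while the snc file for 1980 passes the name check even though nctime reports a broken time series for it, so the two checks complement each other.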

Then there are the strange messages about the overlaps, which can most probably be ignored.

francocatalano commented 4 years ago

I am now re-running all the legs (as it was not clear at all which legs had problems). I'll keep you updated.

treerink commented 4 years ago

Just to make sure: take care that the EC-Earth output directories are clean, so that only files from your latest run are present and not files from older runs (though the recent ece2cmor3 should give warnings for this). Secondly, when repeating your cmorising jobs, take care that they end up in a brand new, empty directory. This is just to avoid painful digging around. You probably did all this correctly, but it is better to stress it once more.

francocatalano commented 4 years ago

All the legs completed. Two of them have this error:

cdo(3) expr: Process started
cdo(4) selcode: Process started
Error (grb_read_record): Failed to read GRIB record

I confess that I am a bit scared about these random errors. Two days ago re-running the legs solved them, but then I discovered the other problems. There must be a bug somewhere.

I attach the relevant log files:
ECE3-ifs-070-20200402192136.cmor.log
ECE3-ifs-070-20200402192136.log
ECE3-ifs-105-20200402195937.cmor.log
ECE3-ifs-105-20200402195937.log

francocatalano commented 4 years ago

> Just to make sure: take care that the EC-Earth output directories are clean, so that only files from your latest run are present and not files from older runs (though the recent ece2cmor3 should give warnings for this). Secondly, when repeating your cmorising jobs, take care that they end up in a brand new, empty directory. This is just to avoid painful digging around. You probably did all this correctly, but it is better to stress it once more.

@treerink Yes, I made sure of all the above. Thanks

francocatalano commented 4 years ago

I have re-run the legs with errors and, as expected, they completed without any error. Then I checked the results with nctime and, again, got a couple of datasets with broken time series. I attach the nctime log file: nctcck-20200406-161340-32138.log

As you can see (look for BREAK in the file), the broken time series error involves many files (all those with wrong date intervals in the name) and many legs. Furthermore, these errors did not show up in the ece2cmor3 log files. Having re-run ece2cmor3 on different days and got the same kind of random errors, I would rule out that the errors are caused by cca maintenance (the server has not been in maintenance for so many days).

goord commented 4 years ago

Hi Franco I think you are right, and I bet that you do not get these errors if you run with the 1.4 version (and leave out mrsow). In other words, I believe my changes to be able to cmorize mrsow may have caused a bug in ece2cmor3...

I observe that the BREAK statements always occur in hfdsn_* files, which is too suspicious to be file system glitches. Can you send me the ece2cmor3 log output for one of those incorrect legs to get more information?

Then there is the first leg 1980. Not sure what's going on there, it should not include the time point 1-1-1981 0:0:0. I will have a look at it on cca (you have given me permissions, right?).

goord commented 4 years ago

By the way I just discovered looking at these variables that the radiative fluxes are missing in hfdsl... see #620

goord commented 4 years ago

By the way @francocatalano the previous run you did was with the latest version of the branch right? And with 18 threads?

treerink commented 4 years ago

@goord: The fx-from-ini branch has already been merged into the master, see: #619. (And because ec-earth3.3.2.1 has been just released, ece2cmor3 v1.4 needs to be released now, so this might be urgent!).

goord commented 4 years ago

@treerink understood, will work on it today

goord commented 4 years ago

Hi Franco, I am able to recreate your problem, the issue is with the filtering of the heat flux in snow, I somehow must have introduced a bug there

francocatalano commented 4 years ago

Hi @goord Thanks. Let me know if you need more info from my side.

goord commented 4 years ago

Hi @francocatalano it turns out my changes to fix the mrsow issue have prevented the shift of accumulated variables during grib filtering, which makes your cmorized data incorrect.
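As a rough illustration of what such a shift means in general (this is not ece2cmor3's implementation): an accumulated value stamped at the end of its accumulation interval is reassigned to the start of that interval. The function name and the interval length are assumptions for the example only.

```python
from datetime import datetime, timedelta


def shift_accumulated(timestamps, interval):
    """Move each time stamp back by one accumulation interval.

    Illustrative only: assumes every value is stamped at the END of the
    interval over which it was accumulated.
    """
    return [t - interval for t in timestamps]


# Assumed 6-hourly output at the end of a 1980 leg.
stamps = [datetime(1980, 12, 31, 18), datetime(1981, 1, 1, 0)]
shifted = shift_accumulated(stamps, timedelta(hours=6))
```

After the shift the last record falls on 1980-12-31 18:00 instead of 1981-01-01 00:00. Without it, a record dated 1981-01-01 stays in the 1980 leg, which would be consistent with the `19800101-19810101` file names reported above.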

I have committed a fix on the master branch and I am currently testing it, you can do the same.

francocatalano commented 4 years ago

Ok @goord How do I get the fix?

treerink commented 4 years ago

@francocatalano :

git checkout master
git pull
python setup.py develop   # in your ece2cmor3 root directory

goord commented 4 years ago

I still get weird time bounds for hfdsn in the LImon table every now and then, it's quite strange. The grib file looks fine but the post-processed nc file is often way too short...