ifs2cmor crashes while remapping

nierad commented 4 years ago

Hi all, hope you're all still well!

I have an issue cmorizing one leg of an offline run. All other legs are fine. We have already rerun the leg, results are identical. It is an OSM run. Here is the python error:

Processing osm leg 002 of experiment LM02 Traceback (most recent call last): File "/home/x_larni/miniconda2/envs/ece2cmor3/bin/ece2cmor", line 11, in load_entry_point('ece2cmor3==1.3.0', 'console_scripts', 'ece2cmor')() File "/home/x_larni/miniconda2/envs/ece2cmor3/lib/python2.7/site-packages/ece2cmor3-1.3.0-py2.7.egg/ece2cmor3/ece2cmor.py", line 141, in main cdothreads=args.ncdo) File "/home/x_larni/miniconda2/envs/ece2cmor3/lib/python2.7/site-packages/ece2cmor3-1.3.0-py2.7.egg/ece2cmor3/ece2cmorlib.py", line 188, in perform_ifs_tasks ifs2cmor.execute(ifs_tasks + [area_task], nthreads=taskthreads) File "/home/x_larni/miniconda2/envs/ece2cmor3/lib/python2.7/site-packages/ece2cmor3-1.3.0-py2.7.egg/ece2cmor3/ifs2cmor.py", line 208, in execute pool.map(cmor_worker, proctasks) File "/home/x_larni/miniconda2/envs/ece2cmor3/lib/python2.7/multiprocessing/pool.py", line 253, in map return self.map_async(func, iterable, chunksize).get() File "/home/x_larni/miniconda2/envs/ece2cmor3/lib/python2.7/multiprocessing/pool.py", line 572, in get raise self._value TypeError: 'int' object is not iterable

I am not familiar with either IFS output nor with ifs2cmor very much, so I hope you guys will be able to quickly identify what might be causing this. Furthermore, ece2cmor3 doesn't crash, it just hangs.

Thanks a lot! Have a nice weekend!

goord commented 4 years ago

hmm this may be an error originating from the gridcell area changes I did... Lars could you run with a single thread (--npp 1) and post the log?

nierad commented 4 years ago

Hi @goord ,

here it is:

Processing osm leg 002 of experiment LM02 Traceback (most recent call last): File "/home/x_larni/miniconda2/envs/ece2cmor3/bin/ece2cmor", line 11, in load_entry_point('ece2cmor3==1.3.0', 'console_scripts', 'ece2cmor')() File "/home/x_larni/miniconda2/envs/ece2cmor3/lib/python2.7/site-packages/ece2cmor3-1.3.0-py2.7.egg/ece2cmor3/ece2cmor.py", line 141, in main cdothreads=args.ncdo) File "/home/x_larni/miniconda2/envs/ece2cmor3/lib/python2.7/site-packages/ece2cmor3-1.3.0-py2.7.egg/ece2cmor3/ece2cmorlib.py", line 188, in perform_ifs_tasks ifs2cmor.execute(ifs_tasks + [area_task], nthreads=taskthreads) File "/home/x_larni/miniconda2/envs/ece2cmor3/lib/python2.7/site-packages/ece2cmor3-1.3.0-py2.7.egg/ece2cmor3/ifs2cmor.py", line 205, in execute cmor_worker(task) File "/home/x_larni/miniconda2/envs/ece2cmor3/lib/python2.7/site-packages/ece2cmor3-1.3.0-py2.7.egg/ece2cmor3/ifs2cmor.py", line 224, in cmor_worker define_cmor_axes(task) File "/home/x_larni/miniconda2/envs/ece2cmor3/lib/python2.7/site-packages/ece2cmor3-1.3.0-py2.7.egg/ece2cmor3/ifs2cmor.py", line 450, in define_cmor_axes create_time_axes(task) File "/home/x_larni/miniconda2/envs/ece2cmor3/lib/python2.7/site-packages/ece2cmor3-1.3.0-py2.7.egg/ece2cmor3/ifs2cmor.py", line 609, in create_time_axes name=time_dim, has_bounds=(time_operator != ["point"])) TypeError: 'int' object is not iterable leg 002 for osm of experiment LM02 failed.

Thanks, Lars

goord commented 4 years ago

Hi Lars can you post the entire log output as well? I just pushed a bug fix but I'm not sure this will solve the underlying issue.

nierad commented 4 years ago

Hi Gijs,

files are attached. LM02-osm-002-20200330161110.log LM02-osm-002-20200330161110.cmor.log

goord commented 4 years ago

Hi Lars there is something wrong with your mrsol cmorization. Can you run your cmorization again with the environment variable

export ECE2CMOR3_IFS_CLEANUP=false

and after that inspect the file mrsol_Emon.nc and 42.128.111_41.128.111_40.128.111_39.128.111.3. Perhaps they are still there in the directory /tmp/cmorize/LM02/ifs/002/LM02-ifs-1851/? Ece2cmor3 reports it can't find any time stamps in those files, which means they are either incorrectly filtered somehow or missing in the grib files?

nierad commented 4 years ago

Hi Gijs, the mrsol_Emon.nc doesn't exist. I have converted the 42.128.111_41.128.111_40.128.111_39.128.111.3-file to nc using cdo -f nc copy 42...11.3 mrsol.nc which throws some warnings:

[x_larni@tetralith2 LM02-ifs-1851]$ cdo -f nc copy 42.128.111_41.128.111_40.128.111_39.128.111.3 mrsol.nc Warning (cgribexScanTimestep1) : Record 5 (id=39.128 lev1=0 lev2=7) timestep 1: Inconsistent verification time! Warning (cdfDefRgrid) : Creating a NetCDF file with data on a gaussian reduced grid. Warning (cdfDefRgrid) : The further processing of the resulting file is unsupported! cdo copy: Processed 1037627840 values from 4 variables over 2919 timesteps ( 10.62s 22MB )

There is no time-dimension. Could this have gotten lost in the cmorization process? I think it's really strange because it only happens in this very leg.

I have uploaded the two files to: http://stormbringer.nateko.lu.se/public/lars/cmor_issue612/

goord commented 4 years ago

Hi Lars, I can't find what's wrong with that filtered grib file, but for some reason cdo doesn't handle it correctly, it skippes the 2nd time step (1851-01-01 3 o'clock).

etiennesky commented 4 years ago

Hi is it possible that there is a hole in the data? The OSM output is netcdf, which was converted to grib trying to be as close as possible to the ifs output.

@nierad can you upload the raw grib files for the first 2 legs?

nierad commented 4 years ago

Hi @etiennesky I don't have enough space to upload all of the files. Which ones do you need? The zipped ones or the ICMGG????? files? Do you need the full year?

Sorry, miscalculated. I'll upload it!

http://stormbringer.nateko.lu.se/public/lars/cmor_issue612/leg1.tar http://stormbringer.nateko.lu.se/public/lars/cmor_issue612/leg2.tar

etiennesky commented 4 years ago

ok @nierad just read your (edited) message, downloading them now. (will take a bit).

etiennesky commented 4 years ago

@nierad I cannot see anything wrong with the files, but it might be possible that the problem is due to duplicates since you linked all the files as such ?

ICMGGLM02+185001 -> ICMGG2ctd+185001

nierad commented 4 years ago

@etiennesky I don't think so. We actually reran these legs because of this (and because of file I destroyed somewhere else). It is the same for the first 10 legs but it only happens in leg 2.

goord commented 4 years ago

Lars, what happens if you run

grib_copy -w,paramId=39/40/41/42 ICMGGLM02+185001 mrsol.grib

Is the output grib readable by cdo?

nierad commented 4 years ago

Hi Gijs,

it works. When I then do cdo -f nc copy mrsol.grib mrsolfrmrsol.nc the resulting nc-files look ok.

goord commented 4 years ago

ok but that file you created should be the same as your file 42.128.111_41.128.111_40.128.111_39.128.111.3 that you put on the ftp. So could you somehow diff them?

nierad commented 4 years ago

This is how they differ externally: 740154168 Apr 1 16:32 42.128.111_41.128.111_40.128.111_39.128.111.3 62823360 Apr 3 09:40 mrsol.grib

goord commented 4 years ago

O wait I'm sorry the grib file you created is just for one month, you need to do the filter for all gribs and then concatenate. On the other hand, cdo was complaining about the timestamp of record 5 so you may also filter out januari of the ece2cmor3-filtered grib and compare...

nierad commented 4 years ago

This is how they differ externally: 740154168 Apr 1 16:32 42.128.111_41.128.111_40.128.111_39.128.111.3 739694400 Apr 3 10:25 mrsol1851.grb after cdo -> nc: 2844948 Apr 1 16:43 42.128.111_41.128.111_40.128.111_39.128.111.3.nc 4150536908 Apr 3 10:26 mrsolfrmrsol1851.nc

How can I filter out January?

nierad commented 4 years ago

When I do a cdo -f nc selmon,1 on the two grib-files I get no time-dim from the 42.128...-file while the one from mrsol1851.grb is ok.

goord commented 4 years ago

Hi Lars I think I found something. When I run

grib_ls -w dataDate=18510101 -p,dataDate,dataTime,paraId,typeOfLevel,levels 42.128.111_41.128.111_40.128.111_39.128.111.3

I get

dataDate     dataTime     paraId       typeOfLevel  levels       
18510101     0            not_found    depthBelowLandLayer  0-188       
18510101     0            not_found    depthBelowLandLayer  188-240     
18510101     0            not_found    depthBelowLandLayer  240-16      
18510101     0            not_found    depthBelowLandLayer  16-156      
18510101     300          not_found    depthBelowLandLayer  0-7         
18510101     300          not_found    depthBelowLandLayer  7-28        
18510101     300          not_found    depthBelowLandLayer  28-100      
18510101     300          not_found    depthBelowLandLayer  100-MISSING 
18510101     600          not_found    depthBelowLandLayer  0-7         
18510101     600          not_found    depthBelowLandLayer  7-28        
18510101     600          not_found    depthBelowLandLayer  28-100      
18510101     600          not_found    depthBelowLandLayer  100-MISSING 
18510101     900          not_found    depthBelowLandLayer  0-7         
18510101     900          not_found    depthBelowLandLayer  7-28        
18510101     900          not_found    depthBelowLandLayer  28-100      
18510101     900          not_found    depthBelowLandLayer  100-MISSING 
18510101     1200         not_found    depthBelowLandLayer  0-7         
18510101     1200         not_found    depthBelowLandLayer  7-28        
18510101     1200         not_found    depthBelowLandLayer  28-100      
18510101     1200         not_found    depthBelowLandLayer  100-MISSING 
18510101     1500         not_found    depthBelowLandLayer  0-7         
18510101     1500         not_found    depthBelowLandLayer  7-28        
18510101     1500         not_found    depthBelowLandLayer  28-100      
18510101     1500         not_found    depthBelowLandLayer  100-MISSING 
18510101     1800         not_found    depthBelowLandLayer  0-7         
18510101     1800         not_found    depthBelowLandLayer  7-28        
18510101     1800         not_found    depthBelowLandLayer  28-100      
18510101     1800         not_found    depthBelowLandLayer  100-MISSING 
18510101     2100         not_found    depthBelowLandLayer  0-7         
18510101     2100         not_found    depthBelowLandLayer  7-28        
18510101     2100         not_found    depthBelowLandLayer  28-100      
18510101     2100         not_found    depthBelowLandLayer  100-MISSING 
32 of 11680 messages in 42.128.111_41.128.111_40.128.111_39.128.111.3

As you can see, the first time point has different levels! I believe this may cause cdo to get confused and consequently report incorrect timestamps

goord commented 4 years ago

Can you confirm that these unknown levels are there in the original gridpoint grib file?

etiennesky commented 4 years ago

This might be because those variables have a strange post-processing when converting from netcdf to grib in liblsm.sh function osm_post_output() :

## add depthBelowLandLayer to soil moisture/temp to conform to IFS standard (needed for lpjg_forcing)
--
${grib_set} -w paramId=39/139 -s indicatorOfTypeOfLevel=112,topLevel=0,bottomLevel=7 $FGOUT temp ; mv temp $FGOUT
${grib_set} -w paramId=40/170 -s indicatorOfTypeOfLevel=112,topLevel=7,bottomLevel=28 $FGOUT temp ; mv temp $FGOUT
${grib_set} -w paramId=41/183 -s indicatorOfTypeOfLevel=112,topLevel=28,bottomLevel=100 $FGOUT temp ; mv temp $FGOUT
${grib_set} -w paramId=42/236 -s indicatorOfTypeOfLevel=112,topLevel=100,bottomLevel=missing $FGOUT temp ; mv temp $FGOUT

This happens only in the first month of the second year @nierad ? The other subsequent years are fine?

nierad commented 4 years ago

@etiennesky : Yes, that's right. Only year 2 (1851).

etiennesky commented 4 years ago

Looking at the raw data @nierad made available, I don't see anything wrong with it. If I filter the codes 39/40/41/42 I am able to join 185012 with 185101 and cdo doesn't complain.

The problem lies in the filtering somehow.

goord commented 4 years ago

Mysterious. Does the issue occur over and over again? I'm downloading the leg and will debug the filtering. Lars which varlist are you using, are you able to post it perhaps?

nierad commented 4 years ago

Hi Gijs, I put the file to our ftp-server: http://stormbringer.nateko.lu.se/public/lars/cmor_issue612/cmip6-data-request-varlist-LS3MIP-land-hist-EC-EARTH-Veg.json

nierad commented 4 years ago

Hi. Sorry, I am missing the cmorised file for 1850, too. I don't know now if this has been missing from the beginning or just since I rerun it. Could that be the issue? @etiennesky:

Looking at the raw data @nierad made available, I don't see anything wrong with it. If I filter the codes 39/40/41/42 I am able to join 185012 with 185101 and cdo doesn't complain.

Does that mean that the bug could be in the 1850-files, too? I will rerun it when I am done with another cmorization (which uses v1.2).

etiennesky commented 4 years ago

Hi. Sorry, I am missing the cmorised file for 1850, too. I don't know now if this has been missing from the beginning or just since I rerun it. Could that be the issue?

would have to see the 3rd year (just the first month) to know for sure.

@etiennesky:

Looking at the raw data @nierad made available, I don't see anything wrong with it. If I filter the codes 39/40/41/42 I am able to join 185012 with 185101 and cdo doesn't complain.

Does that mean that the bug could be in the 1850-files, too? I will rerun it when I am done with another cmorization (which uses v1.2).

since the bug seems to happen on 185101010300 I thought it might be because of some discontinuity in the output, which doesn't seem to be the case

nierad commented 4 years ago

The 3rd years cmorized file looks fine. Which file do you want to look at @etiennesky?

etiennesky commented 4 years ago

The 3rd years cmorized file looks fine. Which file do you want to look at @etiennesky?

the ICMGG file fr the first month of the 3rd leg - ICMGG2ctd+185201

nierad commented 4 years ago

Ok, I uploaded it too. http://stormbringer.nateko.lu.se/public/lars/cmor_issue612/ICMGG2ctd+185201

nierad commented 4 years ago

Ok, this happens when cmorizing leg 001:

Processing osm leg 001 of experiment LM02 Traceback (most recent call last): File "/home/x_larni/miniconda2/envs/ece2cmor3/bin/ece2cmor", line 11, in load_entry_point('ece2cmor3==1.3.0', 'console_scripts', 'ece2cmor')() File "/home/x_larni/miniconda2/envs/ece2cmor3/lib/python2.7/site-packages/ece2cmor3-1.3.0-py2.7.egg/ece2cmor3/ece2cmor.py", line 141, in main cdothreads=args.ncdo) File "/home/x_larni/miniconda2/envs/ece2cmor3/lib/python2.7/site-packages/ece2cmor3-1.3.0-py2.7.egg/ece2cmor3/ece2cmorlib.py", line 188, in perform_ifs_tasks ifs2cmor.execute(ifs_tasks + [area_task], nthreads=taskthreads) File "/home/x_larni/miniconda2/envs/ece2cmor3/lib/python2.7/site-packages/ece2cmor3-1.3.0-py2.7.egg/ece2cmor3/ifs2cmor.py", line 181, in execute multi_threaded=(nthreads > 1)) File "/home/x_larni/miniconda2/envs/ece2cmor3/lib/python2.7/site-packages/ece2cmor3-1.3.0-py2.7.egg/ece2cmor3/grib_filter.py", line 321, in execute multi_threaded=False, once=True) File "/home/x_larni/miniconda2/envs/ece2cmor3/lib/python2.7/site-packages/ece2cmor3-1.3.0-py2.7.egg/ece2cmor3/grib_filter.py", line 356, in execute_tasks spectral_start_date = sorted(spectral_files.keys())[0] IndexError: list index out of range leg 001 for osm of experiment LM02 failed.

goord commented 4 years ago

ok, will fix that

etiennesky commented 4 years ago

Well I can't say that anything in the data provided is suspicious, all years have similar structure.

There might be an issue because the first year does not have data on Jan 1st at 00h, it starts at 03h. That is a feature of the OSM I am afraid.

nierad commented 4 years ago

Thanks all for helping. Another Observation: A similar run with different Landuse-settings ("LM03", net-transitions) did go through smoothly, i.e. no problems in any year. Same varlist, same metadata-file.

goord commented 4 years ago

The feature that jan. 1st 0hr fields are missing seems to causing random crashes and issues with the hfdsn in the LImon table. Lars can you check those files, do they have the correct end time each year (yyyy12)?

nierad commented 4 years ago

Hi Gijs, I am not sure if I understood correctly, but I checked the first 10 years of hfdsn using: cdo -f nc copy $x/ICMGGLM02+????12 ${x}_12.nc and they all have the same time-dimension, i.e. starting at 0 and ending in 741. Also, re "feature" : I understand Etienne's comment such that Jan 1st 0:00 values are only missing for the first year of the first leg.

nierad commented 4 years ago

Correction to my earlier post

Another Observation: A similar run with different Landuse-settings ("LM03", net-transitions) did go through smoothly, i.e. no problems in any year. Same varlist, same metadata-file.

There are two years (1962,1982) that have crashed with the same error as 6 posts earlier.

nierad commented 4 years ago

Can you confirm that these unknown levels are there in the original gridpoint grib file?

Damn, I must have missed this comment. This could be a hind! If do a

for xx in {1850..1858}; do x=${xx}01; grib_ls -w dataDate=${x}01 -p,dataDate,dataTime,paraId,typeOfLevel,levels /proj/s-cmip/users/lu/lpjg/gswp3/osm_output/LM02/output/osm/00?/ICMGGLM02+${x}; done

I get this for all years (only first lines printed here)

dataDate     dataTime     paraId       typeOfLevel  levels 
18510101     300          not_found    surface      not_found   
18510101     300          not_found    surface      not_found   
18510101     300          not_found    surface      not_found   
18510101     300          not_found    surface      not_found   
18510101     300          not_found    surface      not_found   
18510101     300          not_found    surface      not_found   
18510101     300          not_found    surface      not_found   
18510101     300          not_found    surface      not_found   
18510101     300          not_found    depthBelowLandLayer  0-7         
18510101     300          not_found    depthBelowLandLayer  7-28        
18510101     300          not_found    depthBelowLandLayer  28-100      
18510101     300          not_found    depthBelowLandLayer  100-MISSING 
18510101     300          not_found    surface      not_found   
...

Note, there is no timestep 0. In none of the grib-files. I hope, this helps.

P.S.: How can I make it scroll inside the box? Edit: Ok, seems to work after submitting;)

treerink commented 3 years ago

Seems solved, closing. Reopen if necessary.

EC-Earth / ece2cmor3

ifs2cmor crashes while remapping #612