CMORize Duveiller2018 - Githubissues

bascrezee commented 5 years ago

I am building on cmorize_obs_Landschuetzer2016.py to cmorize another obs dataset. This is my yml file:

Content of Duveiller2018.yml

---
# Common global attributes for Cmorizer output
attributes:
  dataset_id: Duveiller2018 #TODO where should I document the full reference to this dataset?
#  version: 'v2016' # There is no version.
  tier: 2
  modeling_realm: clim
  project_id: CMIP5 #TODO What to put here?
  source: 'https://www.nature.com/articles/sdata201814'
  reference: 'Duveiller2018'
  comment: ''

# Variables to cmorize
variables:
  alb:
    mip: Amon
    # Match CMOR variables with input file one
    raw: Delta_albedo
    # input file name
    file: albedo_IGBPgen.nc

I follow the script by @tomaslovato very closely. The variable 'alb' is also defined in the custom tables. However, I run into the following error:

 2019-04-30 15:08:21,986 INFO     esmvaltool.utils.cmorizers.obs.cmorize_obs_Duveiller2018,89    CMORizing var alb from file /net/exo/landclim/PROJECTS/C3S/datadir/rawobsdir/Tier2/Duveiller2018/albedo_IGBPgen.nc
Traceback (most recent call last):
  File "/net/exo/landclim/crezees/conda/envs/esmvaltool-public/bin/cmorize_obs", line 11, in <module>
    load_entry_point('ESMValTool', 'console_scripts', 'cmorize_obs')()
  File "/home/crezees/ESMValTool/esmvaltool/utils/cmorizers/obs/cmorize_obs.py", line 201, in execute_cmorize
    _cmor_reformat(config_user, obs_list)
  File "/home/crezees/ESMValTool/esmvaltool/utils/cmorizers/obs/cmorize_obs.py", line 260, in _cmor_reformat
    module_root + dataset)
  File "/home/crezees/ESMValTool/esmvaltool/utils/cmorizers/obs/cmorize_obs.py", line 122, in _run_pyt_script
    py_cmor.cmorization(in_dir, out_dir)
  File "/home/crezees/ESMValTool/esmvaltool/utils/cmorizers/obs/cmorize_obs_Duveiller2018.py", line 103, in cmorization
    extract_variable(var_info, raw_info, out_dir, glob_attrs)
  File "/home/crezees/ESMValTool/esmvaltool/utils/cmorizers/obs/cmorize_obs_Duveiller2018.py", line 50, in extract_variable
    var = var_info.short_name
AttributeError: 'NoneType' object has no attribute 'short_name'

I tried to trace the problem further, but at some stage I got lost. My own attempt at tracing it is below, in case it helps.

# PROBLEM: var_info is None, so the below function returns none. Where is this function? 
var_info = cmor_table.get_variable(vals['mip'], var)

# 
cmor_table = CFG['cmor_table']

# 
CFG = _read_cmor_config('Duveiller2018.yml')

# 
def _read_cmor_config(cmor_config):
    cfg['cmor_table'] = \
        CMOR_TABLES[cfg['attributes']['project_id']]

# So the CMOR table is defined in the YML. Makes sense. I provide it as CMIP5. So the above line reads:
CMOR_TABLES['CMIP5']

# But what exactly is the object CMOR_TABLES? It is imported at the top as:
from esmvaltool.cmor.table import CMOR_TABLES

# It starts as an empty dictionary in table.py
CMOR_TABLES = {}

# So where is this object initiated? At this stage I am lost.

Is it correct that I take CMIP5 as a project ID? Or should it indicate 'custom' since this is a custom variable? Any ideas on what goes wrong here?

mattiarighi commented 5 years ago

Since alb is a custom variable you need to read from the custom table.
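To illustrate why var_info came back as None, here is a toy sketch of a table lookup keyed by project_id. It is not the ESMValTool table code: TOY_TABLES, get_variable and describe are hypothetical names, and the table contents are made up for the example. The only point is that a variable defined in the custom table cannot be found when the lookup goes through the CMIP5 table.

```python
from types import SimpleNamespace

# Toy stand-in for a registry of CMOR tables (hypothetical, for illustration only).
TOY_TABLES = {
    "CMIP5": {
        "tas": SimpleNamespace(short_name="tas", standard_name="air_temperature"),
    },
    "custom": {
        "alb": SimpleNamespace(short_name="alb", standard_name=""),
    },
}


def get_variable(project_id, short_name):
    """Return the variable entry, or None if the table does not define it."""
    return TOY_TABLES[project_id].get(short_name)


def describe(project_id, short_name):
    """Fail with a readable message instead of an AttributeError on None."""
    var_info = get_variable(project_id, short_name)
    if var_info is None:
        raise ValueError(
            f"'{short_name}' is not defined in the '{project_id}' table"
        )
    return var_info.short_name
```

With project_id set to CMIP5 the lookup for alb returns None, which is what later surfaces as the AttributeError; with custom it returns the entry.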

bascrezee commented 5 years ago

The above error has been solved, thanks.

I ran into another error. I am quite sure that the 'standard_name' in CMOR_alb.dat is supposed to be left empty, but it raises an error. However, changing it to some random other valid standard name does not remove the error, so it seems not fully related.

2019-05-06 15:24:49,085 INFO     esmvaltool.utils.cmorizers.obs.cmorize_obs_Duveiller2018,89    CMORizing var alb from file /net/exo/landclim/PROJECTS/C3S/datadir/rawobsdir/Tier2/Duveiller2018/albedo_IGBPgen.nc
Traceback (most recent call last):
  File "/net/exo/landclim/crezees/conda/envs/esmvaltool-public/bin/cmorize_obs", line 11, in <module>
    load_entry_point('ESMValTool', 'console_scripts', 'cmorize_obs')()
  File "/home/crezees/ESMValTool/esmvaltool/utils/cmorizers/obs/cmorize_obs.py", line 201, in execute_cmorize
    _cmor_reformat(config_user, obs_list)
  File "/home/crezees/ESMValTool/esmvaltool/utils/cmorizers/obs/cmorize_obs.py", line 260, in _cmor_reformat
    module_root + dataset)
  File "/home/crezees/ESMValTool/esmvaltool/utils/cmorizers/obs/cmorize_obs.py", line 122, in _run_pyt_script
    py_cmor.cmorization(in_dir, out_dir)
  File "/home/crezees/ESMValTool/esmvaltool/utils/cmorizers/obs/cmorize_obs_Duveiller2018.py", line 103, in cmorization
    extract_variable(var_info, raw_info, out_dir, glob_attrs)
  File "/home/crezees/ESMValTool/esmvaltool/utils/cmorizers/obs/cmorize_obs_Duveiller2018.py", line 63, in extract_variable
    _fix_var_metadata(cube, var_info)
  File "/home/crezees/ESMValTool/esmvaltool/utils/cmorizers/obs/utilities.py", line 43, in _fix_var_metadata
    cube.standard_name = var_info.standard_name
  File "/net/exo/landclim/crezees/conda/envs/esmvaltool-public/lib/python3.6/site-packages/iris/_cube_coord_common.py", line 128, in standard_name
    raise ValueError('%r is not a valid standard_name' % name)
ValueError: '' is not a valid standard_name

mattiarighi commented 5 years ago

@tomaslovato or @valeriupredoi can you help?

tomaslovato commented 5 years ago

@bascrezee At first I would say that it is a problem with the standard_name definition...

I saw that the branch version2_development_cmorize_duveiller2018 exists. If it is yours or connected to this issue, could you please upload both Duveiller2018.yml and cmorize_obs_Duveiller2018.py there, so it will be much easier to reproduce the error!

bascrezee commented 5 years ago

I just staged the files and pushed them. Thanks for looking into this!

valeriupredoi commented 5 years ago

the error is a standard iris error for non-standard standard names (CF conventions) :grin:

Here is an example of a custom cmor table for a variable which will not have any standard name, since otherwise it would break CF conventions and hence trigger the iris error above

!----------------------------------
! Variable attributes:
!----------------------------------
standard_name:
units:             1
cell_methods:      time: mean

valeriupredoi commented 5 years ago

The problem here is that the custom cmor table will not contain any entry for standard_name, since it's a derived variable, so the cmorizer will always fail at the cube.standard_name = var_info.standard_name line. So we need to plug a special case into the cmorizer utilities that accounts for derived variables. That's not going to be easy, because the purpose of the cmorizer is to make cmor-compliant data that also adheres to CF standards. Is there any way you can grab the rsds and rsus datasets so alb can be derived internally in ESMValTool?

bascrezee commented 5 years ago

Thanks Valeriu, I think I kind of get what you mean.

What you suggest as a solution is not a solution here, since this observational dataset has no rsds or rsus. There are just values of (difference in) albedo.

valeriupredoi commented 5 years ago

in that case put a check in utilities.py eg

if var_info.standard_name == '':
    cube.standard_name = None

that will save the cube ok and will be ok when running it through ESMValTool since standard_name is None anyway from the custom table
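The same check can be written as a small helper that also covers whitespace-only entries. This is a sketch only; clean_standard_name is a hypothetical name and is not part of utilities.py.

```python
def clean_standard_name(raw_name):
    """Map an empty or whitespace-only table entry to None.

    Iris accepts None as "no standard name" but rejects the empty string,
    so blank entries from a custom table are converted before assignment.
    """
    if raw_name is None:
        return None
    stripped = raw_name.strip()
    return stripped or None
```

Under that assumption, the assignment in the cmorizer would read cube.standard_name = clean_standard_name(var_info.standard_name).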

mattiarighi commented 5 years ago

> What you suggest as a solution is not a solution here, since this observational dataset has no rsds or rsus. There are just values of (difference in) albedo.

That's correct. Derived variables are designed for models only, in order to compare with a variable which is only available in the OBS.

bascrezee commented 5 years ago

Solution works :) I'll keep the issue open until I have finished the CMORization :)

bascrezee commented 5 years ago

I have now arrived at taking care of the 'time' axis. This is a somewhat special case, since it is a climatological dataset. How should I deal with this within ESMValTool? (See ncdump below.) There are CF conventions describing what NetCDF files with climatological statistics should look like. However, since the original dataset does not adhere to these conventions, it would be quite involved to get there... Any guidance?

Here is the ncdump:

netcdf albedo_IGBPgen {
dimensions:
    lon = 360 ;
    lat = 180 ;
    mon = 12 ;
    iTr = 6 ;
variables:
    double lon(lon) ;
        lon:units = "degreesE" ;
        lon:long_name = "Longitude" ;
    double lat(lat) ;
        lat:units = "degreesN" ;
        lat:long_name = "Latitude" ;
    int mon(mon) ;
        mon:units = "months" ;
        mon:long_name = "Month" ;
    double iTr(iTr) ;
        iTr:long_name = "Vegetation transition code" ;
    float Delta_albedo(iTr, mon, lat, lon) ;
        Delta_albedo:_FillValue = NaNf ;
        Delta_albedo:long_name = "Difference in surface albedo for a given vegetation cover transition" ;
    float SD_Delta_albedo(iTr, mon, lat, lon) ;
        SD_Delta_albedo:_FillValue = NaNf ;
        SD_Delta_albedo:long_name = "St.Dev. on the diff. in surface albedo for a given vegetation cover transition" ;
    float N_Delta_albedo(iTr, mon, lat, lon) ;
        N_Delta_albedo:units = "samples" ;
        N_Delta_albedo:_FillValue = NaNf ;
        N_Delta_albedo:long_name = "Number of samples from which the aggregated estimate is made" ;
}

bascrezee commented 5 years ago

Climatological data are not officially supported yet by Iris (https://github.com/SciTools/iris/issues/2904). Soon it will be possible to vote for this functionality in Iris (https://github.com/SciTools/iris/issues/3307). I now wonder if it makes sense to CMORize this dataset at this moment. Is it possible to simply read and plot this dataset in a custom diagnostic without running the CMORizing script? @mattiarighi Thanks for your help :)

mattiarighi commented 5 years ago

@ledm has cmorized some climatological data from the WOA dataset, you can try to use his script as an example.

tomaslovato commented 5 years ago

@bascrezee Actually, you can define the timeline of the dataset using time instead of mon, by setting the correct reference year for the climatology, as done for the WOA data. This makes even more sense since the climatology is representative of a certain period, and it is better to have that period explicitly associated with the data.

You can add a custom variable for the reference year similarly to WOA https://github.com/ESMValGroup/ESMValTool/blob/0b4ef0e7b1f124897a75981b0c82e47153742068/esmvaltool/utils/cmorizers/obs/cmor_config/WOA.yml#L40-L43

and then read it within the cmorization function of your cmorizer script using CFG['custom']['years'], and finally set the time values on the cube, e.g. within extract_variable.
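A minimal sketch of how twelve monthly time points and their bounds could be computed for a single reference year, using only the standard library. The reference date 1950-01-01 and the choice of the 15th of each month as the time point are assumptions made for this example, and monthly_time_axis is a hypothetical helper, not a function from the WOA cmorizer.

```python
from datetime import datetime

# Assumed reference date for "days since ..." time units.
ORIGIN = datetime(1950, 1, 1)


def monthly_time_axis(year, origin=ORIGIN):
    """Return (points, bounds) in days since origin for the 12 months of year."""
    points, bounds = [], []
    for month in range(1, 13):
        start = datetime(year, month, 1)
        # First day of the following month (rolls over to January of year + 1).
        if month == 12:
            end = datetime(year + 1, 1, 1)
        else:
            end = datetime(year, month + 1, 1)
        # Assumption: use the 15th as the representative day of the month.
        points.append((datetime(year, month, 15) - origin).days)
        bounds.append(((start - origin).days, (end - origin).days))
    return points, bounds
```

The reference year would come from the configuration (CFG['custom']['years'] in the description above), and the resulting values would then be assigned to the time coordinate of the cube.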

bascrezee commented 5 years ago

Sounds like a good approach. The original data contains a monthly climatology over 4 years (2008-2012). Is my understanding correct, that with the approach you suggest, 4 files will be written, one for each year? Each file will hold exactly the same data values. Since the data is not too big, this is a fine workaround.

bascrezee commented 5 years ago

I now run into another error:

  File "/home/crezees/ESMValTool/esmvaltool/utils/cmorizers/obs/utilities.py", line 131, in save_variable
    dates = reftime.num2date(cube_time.points[[0, -1]])
  File "/net/exo/landclim/crezees/conda/envs/esmvaltool-public/lib/python3.6/site-packages/cf_units/__init__.py", line 1988, in num2date
    cdf_utime = self.utime()
  File "/net/exo/landclim/crezees/conda/envs/esmvaltool-public/lib/python3.6/site-packages/cf_units/__init__.py", line 1902, in utime
    raise ValueError(emsg.format(interval))
ValueError: Time units with interval of "months", "years" (or singular of these) cannot be processed, got 'months'.

It has been reported before (https://github.com/ESMValGroup/ESMValTool/issues/516). For @schlunma it did work when using Iris v2.2.0, but not for me. I use cf_units v2.0.2. Any ideas what might go wrong here? (I will run the ESMValTool tests just to be sure that my installation is completely fine, and will keep you updated.) Update: the tests are running fine...

tomaslovato commented 5 years ago

@bascrezee Since you have a monthly climatology, you need to set only one reference year. In this case I would suggest setting 2010 (the middle of the climatological period). Only one file has to be generated.

Note that source should point to the exact download path of the data, so https://github.com/ESMValGroup/ESMValTool/blob/0b4ef0e7b1f124897a75981b0c82e47153742068/esmvaltool/utils/cmorizers/obs/cmor_config/Duveiller2018.yml#L9 should instead report the Nature download link https://ndownloader.figshare.com/files/9969496 or the full path of the Amazon S3 archive https://s3-eu-west-1.amazonaws.com/pstorage-npg-968563215/9969496/albedo_IGBPgen.nc

bascrezee commented 5 years ago

Ok, thanks. But the start and end of the period should be included somehow as well, to describe the data correctly. I guess adding them to the global attributes makes sense?

mattiarighi commented 5 years ago

Or in the filename?

tomaslovato commented 5 years ago

So far, the time in the filename matches the data content, so in this case the final cmorized file name should contain 201001-201012. It may be a good idea to add the climatology period to the global attributes.

@bascrezee To solve the time issue that you reported, it would probably be better to use a callback function when iris loads the data, to set the cube reference time and units.
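A sketch of what such a callback could look like. Iris load callbacks take the arguments (cube, field, filename); everything else here is an assumption for illustration: the factory name make_time_callback, the 1950-01-01 reference date, the use of the 15th of each month, and the assumption that the raw month axis is reachable as a coordinate with var_name 'mon' holding the values 1 to 12.

```python
from datetime import datetime

# Assumed reference date and matching unit string.
ORIGIN = datetime(1950, 1, 1)
TIME_UNITS = "days since 1950-01-01 00:00:00"


def make_time_callback(year, origin=ORIGIN):
    """Build a load callback that turns the month index into a dated time axis."""

    def callback(cube, field, filename):
        # Assumption: the month axis was loaded as a coordinate named 'mon'.
        coord = cube.coord(var_name="mon")
        # Replace month numbers 1..12 by days since the reference date.
        coord.points = [
            (datetime(year, int(month), 15) - origin).days
            for month in coord.points
        ]
        coord.units = TIME_UNITS
        coord.standard_name = "time"
        coord.long_name = "time"
        coord.var_name = "time"

    return callback
```

It would be passed at load time, for example iris.load(path, callback=make_time_callback(2010)), so that the month-based units never reach the saving step.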

bascrezee commented 5 years ago

Thanks. This callback works fine indeed. It now ran through :)

Now I am checking the file with recipe_check_obs.yml:

# ESMValTool
# recipe_check_obs.yml
---
documentation:
  description: |
    Test recipe for OBS, no preprocessor or diagnostics are applied,
    just to check correct reading of the CMORized data.

  authors:
    - righ_ma

preprocessors:
  nopp:
    extract_levels: false
    regrid: false
    mask_fillvalues: false
    multi_model_statistics: false

diagnostics:
  Duveiller2018:
    description: Duveiller2018
    variables:
      albDiff:
        preproc: nopp
        mip: Amon
    additional_datasets:
      - {dataset: Duveiller2018, project: OBS, tier: 2, version: v2018, start_year: 2010, end_year: 2010, frequency: mon}
    scripts: null

But I run into the following error. I do not fully understand the error message. It does not find the dataset key, but it is specified in the recipe.

File "/home/crezees/ESMValTool/esmvaltool/_data_finder.py", line 117, in _replace_tags
    "your recipe entry".format(tag, variable))
KeyError: "Dataset key type must be specified for {'preproc': 'nopp', 'mip': 'Amon', 'variable_group': 'albDiff', 'short_name': 'albDiff', 'diagnostic': 'Duveiller2018', 'preprocessor': 'default', 'dataset': 'Duveiller2018', 'project': 'OBS', 'tier': 2, 'version': 'v2018', 'start_year': 2010, 'end_year': 2010, 'frequency': 'mon', 'recipe_dataset_index': 0, 'cmor_table': 'OBS', 'standard_name': '', 'long_name': 'Difference in surface albedo for a given vegetation cover transition', 'units': '1', 'modeling_realm': ['atmos']}, check your recipe entry"

Branch: https://github.com/ESMValGroup/ESMValTool/tree/version2_development_cmorize_duveiller2018

valeriupredoi commented 5 years ago

the missing key is not dataset but type - if you look at the source code for the error:

            raise KeyError("Dataset key {} must be specified for {}, check "
                           "your recipe entry".format(tag, variable))

(look at it next time :grin: ) Type can be eg type: reanalysis but that depends on your data, dunno that :beer:
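To see which key is missing before running the recipe, the tags of a file name template can be compared against the keys of the dataset entry. This is a sketch under assumptions: the template string below is made up for the example and uses Python format-style placeholders, and missing_tags is a hypothetical helper, not the _replace_tags function from _data_finder.py.

```python
from string import Formatter


def missing_tags(template, variable):
    """Return the placeholder names in template that variable does not define."""
    tags = [field for _, field, _, _ in Formatter().parse(template) if field]
    return [tag for tag in tags if tag not in variable]


# Hypothetical file name template and the dataset entry from the recipe.
TEMPLATE = "OBS_{dataset}_{type}_{version}_{mip}_{short_name}"
ENTRY = {
    "dataset": "Duveiller2018",
    "project": "OBS",
    "tier": 2,
    "version": "v2018",
    "mip": "Amon",
    "short_name": "albDiff",
}

# The entry has no 'type' key, so that tag is reported as missing.
MISSING = missing_tags(TEMPLATE, ENTRY)
```

Adding a type entry to the dataset line in the recipe makes the list of missing tags empty.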

bascrezee commented 5 years ago

Oops.. :stuck_out_tongue_closed_eyes: Yes, I will look at the source code next time.

bascrezee commented 5 years ago

My dataset has a non-standard dimension called vegetation_transition_code. So I added this to the file CMOR_albDiff.dat:

dimensions: longitude latitude time vegetation_transition_code

But I run into the following error:

  File "/home/crezees/ESMValTool/esmvaltool/cmor/table.py", line 648, in _read_table_file
    table[value] = self._read_variable(value, None)
  File "/home/crezees/ESMValTool/esmvaltool/cmor/table.py", line 520, in _read_variable
    var.coordinates[dim] = self.coords[dim]
KeyError: 'vegetation_transition_code'

It seems as if I still need to define this dimension somewhere. Maybe @jvegasbsc can help me, since I noted that CMOR_clisccp.dat includes a non-standard dimension named tau.

tomaslovato commented 5 years ago

@bascrezee You need to add the information about your new axis vegetation_transition_code in CMOR_coordinates.dat, following the structure of the already available non-standard dimension.

bascrezee commented 5 years ago

Interestingly, for all custom variable definitions we leave the standard_name blank, but not in the CMOR_coordinates.dat file. Do you have any idea why? @jvegasbsc

katjaweigel commented 5 years ago

At least in the case of the derived variables I created, the reason was that the standard name in the variable definition had to be in the list in Iris' std_names.py. Otherwise you get an error. The easiest way to avoid that is to leave the standard name blank in the derived variable file.

bascrezee commented 5 years ago

I picked up this work again today. After moving around some files due to the split into tool/core, I got back to the stage where I was. The script runs through, but the CMOR checker is not happy yet.

esmvalcore.cmor.check.CMORCheckError: There were errors in variable albDiff:
iTr: standard_name should be , not None
 time: Frequency mon does not match input data
 albDiff: does not match coordinate rank
in cube:
Difference in surface albedo for a given vegetation cover transition / (1) (Vegetation transition code: 6; time: 12; latitude: 180; longitude: 360)
     Dimension coordinates:
          Vegetation transition code                                                                  x        -             -               -
          time                                                                                        -        x             -               -
          latitude                                                                                    -        -             x               -
          longitude                                                                                   -        -             -               x
     Attributes:
          Conventions: CF-1.5
          climatology_end: 2012-12-31T23:59:59Z
          climatology_start: 2008-01-01T00:00:00Z
          comment: 
          host: exo
          mip: Amon
          modeling_realm: clim
          project_id: custom
          reference: Duveiller, G., J. Hooker, A. Cescatti, Scientific Data 5, 180014 (2018...
          source: https://ndownloader.figshare.com/files/9969496
          source_file: /net/exo/landclim/PROJECTS/C3S/datadir/obsdir/Tier2/Duveiller2018/OBS_...
          tier: 2
          title: Duveiller2018 data reformatted for ESMValTool v2.0a2
          user: crezees
          version: v2018

I hope to tackle them one-by-one.

"iTr: standard_name should be , not None": In the CMOR_coordinates.dat I left standard name blank, as usual for custom variables.

"time: Frequency mon does not match input data": Is it possible that the CMOR checker does not know how to handle climatological data? See also the discussion above.

"albDiff: does not match coordinate rank": I hope this goes away as soon as iTr has been fixed; maybe it is related to that one.

Branches:
landvariables [core repository; for custom CMIP definitions]
version2_development_cmorize_duveiller2018 [public repository; cmorize scripts]

Any help is appreciated !

bouweandela commented 5 years ago

Please ask @jvegasbsc for CMOR related issues

bascrezee commented 5 years ago

@jvegasbsc any thoughts on this?

I tried two options of fixing the custom coordinate, but both fail, see the comments in the code below. Is it possible that the CMOR checker fails in parsing correctly a custom defined CMOR coordinate? I think I am the first one adding a coordinate that does not have a valid standard name.

    for cube in cubes:
        if cube.var_name == rawvar:
            for cubecoord in cube.coords():
                if cubecoord.var_name == 'iTr':
                    # Option 1: the CMOR checker raises
                    # "iTr: standard_name should be , not None"
                    # cubecoord.standard_name = None
                    # Option 2: this script raises
                    # "ValueError: '' is not a valid standard_name"
                    cubecoord.standard_name = ''

I can trace the error back to being raised in l. 104 (the permalink does not embed, perhaps because it's a different repository?), so it must be one of the checks before it that fails.

https://github.com/ESMValGroup/ESMValCore/blob/dbcfb85715ea1ee130db7351980f719108ffabde/esmvalcore/cmor/check.py#L96-L104

bascrezee commented 5 years ago

Update: I decided to extract a certain vegetation cover transition code, after which this is not a dimension coordinate any more. This is a fine workaround for my case. But it might still be good to check if the CMOR checker allows for custom 'non-valid standard name' coordinate names.

bascrezee commented 5 years ago

Update: Time has been solved as well. CMORization done. Thanks for the support, especially to @tomaslovato. I will submit a PR early next week.

ESMValGroup / ESMValTool

CMORize Duveiller2018 #1038