ESMValGroup / ESMValCore

ESMValCore: A community tool for pre-processing data from Earth system models in CMIP and running analysis scripts.
https://www.esmvaltool.org
Apache License 2.0

Fx variable data not found at all when two Ofx directories for two different experiments exist, but one is incomplete. #1309

Closed ledm closed 2 years ago

ledm commented 2 years ago

I've found an issue where _recipe.py:_get_fx_files() fails to find an FX file for areacello.

The NCAR/CESM2-WACCM model has provided Ofx for both the historical and piControl experiments. However, areacello is only in piControl, not in historical. I think the presence of an incomplete Ofx directory in historical confuses the FX finder.

i.e.:

$ ls /badc/cmip6/data/CMIP6/CMIP/NCAR/CESM2-WACCM/*/r1i1p1f1/Ofx/

/badc/cmip6/data/CMIP6/CMIP/NCAR/CESM2-WACCM/historical/r1i1p1f1/Ofx/:
sftof

/badc/cmip6/data/CMIP6/CMIP/NCAR/CESM2-WACCM/piControl/r1i1p1f1/Ofx/:
areacello  deptho  sftof  volcello

When I run the preprocessor/diagnostic below, the fx finder is unable to locate the areacello fx files. The fx finder finds the historical Ofx directory, but it does not contain an areacello directory. The problem is that it does not continue the search for areacello in the piControl directory.
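A quick way to confirm which experiments actually provide a given fx variable is to walk the directory layout shown above. The following is a minimal sketch, not ESMValCore code: the function name `experiments_with_variable` is hypothetical, and it assumes the `<experiment>/<ensemble>/<mip>/<short_name>` layout seen in the `ls` output.

```python
from pathlib import Path


def experiments_with_variable(model_root, short_name,
                              ensemble="r1i1p1f1", mip="Ofx"):
    """List experiments that contain <exp>/<ensemble>/<mip>/<short_name>.

    Hypothetical helper: it only inspects directory names and does not
    use any ESMValCore API.
    """
    model_root = Path(model_root)
    found = []
    # Each direct subdirectory of the model directory is one experiment.
    for exp_dir in sorted(p for p in model_root.iterdir() if p.is_dir()):
        if (exp_dir / ensemble / mip / short_name).is_dir():
            found.append(exp_dir.name)
    return found
```

Called as `experiments_with_variable("/badc/cmip6/data/CMIP6/CMIP/NCAR/CESM2-WACCM", "areacello")`, this should report only piControl for the listing above, while `"sftof"` should be reported for both experiments.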

Here's my preprocessor:

  prep_profile_2:
    area_statistics:
      fx_variables:
      - {activity: CMIP,  grid: gn, mip: Ofx, short_name: areacello}
      operator: mean

And the diagnostic:

diagnostics:
  diag_timeseries_3:
    scripts:
      Model_range_polots:
        script: ocean/diagnostic_timeseries.py
    variables:
      tos_profile_hist:
        additional_datasets:
        - {dataset: CESM2-WACCM, end_year: 2010, ensemble: r1i1p1f1, exp: historical, grid: gn, mip: Omon, project: CMIP6, start_year: 2000}
        mip: Omon
        preprocessor: prep_profile_2
        short_name: thetao

Note that the full recipe contains hundreds of datasets, so I'm not going to change the preprocessor for a specific dataset.

ledm commented 2 years ago

Note, this issue was first raised here: https://github.com/ESMValGroup/ESMValCore/issues/1282#issuecomment-915980954

valeriupredoi commented 2 years ago

OK, cheers @ledm. I've also reported this to JASMIN; this is a rather poor standard for accepting data into ESGF. That historical/Ofx dir should either not exist or be fully populated with symlinks pointing to all the piControl Ofx variables.

valeriupredoi commented 2 years ago

> I think the presence of an incomplete Ofx directory in historical confuses the FX finder.

That's exactly it! And in all fairness, you can't blame the poor data finder code; it's not an AI that can understand ESGF's poor data structures :grin: We need an extra check layer, and most of the time the data is in piControl for most models, so I propose we search there before exiting and saying the data was not found. What do you think, @schlunma?

schlunma commented 2 years ago

I don't think that the data finder is supposed to change the exp automatically in this case. If you specify the fx variables like this

      fx_variables:
      - {activity: CMIP,  grid: gn, mip: Ofx, short_name: areacello}

the data finder will only look in the experiment you specified for the main variable (exp: historical). Please correct me if I'm wrong. I thought this "automated" search only works for different mips.

valeriupredoi commented 2 years ago

Oh, I thought it'd go and look in other exps if the exp that was specified did not have a data dir, but I might be thinking wishfully and it may actually not look any further. We should implement that if it's not there yet! I propose we look into piControl right before giving up, to make sure the fx data really isn't there.
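The proposal above (try the requested experiment first, then piControl before giving up) could be sketched roughly as follows. This is an illustration only, not the ESMValCore data finder: `find_fx_dir` and its parameters are hypothetical names, and the `<experiment>/<ensemble>/<mip>/<short_name>` layout is assumed from the listing earlier in the thread.

```python
from pathlib import Path


def find_fx_dir(model_root, short_name, exp, ensemble,
                mip="Ofx", fallback_exps=("piControl",)):
    """Return the first existing fx directory, or None if none is found.

    Hypothetical sketch: the requested experiment is tried first, then
    each fallback experiment in order.
    """
    model_root = Path(model_root)
    for candidate_exp in (exp, *fallback_exps):
        candidate = model_root / candidate_exp / ensemble / mip / short_name
        if candidate.is_dir():
            return candidate
    # Nothing found in the requested or the fallback experiments.
    return None
```

Note that this sketch still reuses the ensemble of the main variable for the fallback experiment, so it only helps when piControl provides that same ensemble member.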

zklaus commented 2 years ago

So to be clear, I think you should just specify the experiment here for the fx_variables as well and that should be enough.

ledm commented 2 years ago

Okay, I did that, but it's not enough.

    area_statistics:
      fx_variables:
      - {activity: CMIP,  grid: gn, mip: Ofx, short_name: areacello, exp: piControl}
      operator: mean

The problem is that ESMValTool can't figure out the ensemble name now! It is looking for the same ensemble member that was provided in additional_datasets.

At the same time, you can't just rely on the CMIP6 convention, which I thought was to provide the areacello data in: piControl, r1i1p1f1, Ofx

There are actually many different ensemble numbers in CMIP6!

I don't want to write a separate preprocessor for each ensemble member. This would be a waste of time. In most cases, models only provide a single ensemble member for their piControl, so ESMValTool should be able to figure it out. Or at the very least, I should be able to specify a list in my area_statistics preprocessor:

      - {activity: CMIP,  grid: gn, mip: Ofx, short_name: areacello, exp: piControl, ensemble: [r1i1p1f1, r1i1p1f2, r1i1p2f1, r2i1p1f1, r1i2p1f1,...]}

Aside: personally, I think that users should not have to decide, but rather ESMValTool should provide a ranked list of places to look for FX files. I'm happy to provide my ranking for CMIP6. Here it is, in terms of exp, ensemble, mip:

where Same means the same value provided in the dataset. This is how I've had to do it in my custom dataset builder scripts.

schlunma commented 2 years ago

Something similar is implemented in #1082, which allows you to use wildcards for any entry in the fx_variables dictionaries, e.g., exp: "*" or ensemble: "*". Give it a try if you like, I just solved the merge conflicts in the branch.
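To illustrate the idea of wildcard facets, the sketch below turns a dictionary of facets into a glob pattern and matches it against the directory layout shown earlier in the thread. It is not how #1082 implements the feature; `glob_fx` and the facet-to-path ordering are assumptions made for this example.

```python
from pathlib import Path


def glob_fx(model_root, facets):
    """Return all directories matching the (possibly wildcarded) facets.

    Hypothetical sketch: assumes the on-disk layout
    <exp>/<ensemble>/<mip>/<short_name> below the model directory.
    """
    keys = ("exp", "ensemble", "mip", "short_name")
    # A facet value of "*" becomes a wildcard path component.
    pattern = "/".join(facets[key] for key in keys)
    return sorted(Path(model_root).glob(pattern))
```

With `facets = {"exp": "*", "ensemble": "*", "mip": "Ofx", "short_name": "areacello"}`, every experiment and ensemble member that provides areacello is matched, so the incomplete historical directory no longer hides the piControl one.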

ledm commented 2 years ago

Thanks @schlunma, that looks like it may solve the problem! I've set both ensemble and exp to wildcards, which should hopefully mean that areacello will be found, if it exists. Cheers!

ledm commented 2 years ago

Okay, just to confirm that @schlunma's PR has fully resolved this issue! Thanks Manuel and @thomascrocker!