ESMValGroup / ESMValCore

ESMValCore: A community tool for pre-processing data from Earth system models in CMIP and running analysis scripts.
https://www.esmvaltool.org
Apache License 2.0

Speed up loading for ACCESS-ESM non-CMOR datasets #2487

Open rhaegar325 opened 4 months ago

rhaegar325 commented 4 months ago

Hi developer team, over the last few months we developed a CMORiser for raw ACCESS-ESM data in ESMValCore. However, the data are stored differently from CMORised data: CMORised data typically has a single variable per file covering the whole time range, whereas ACCESS-ESM stores one timestamp with all variables per file. If we use the default ESMValCore loading for ACCESS-ESM data, it costs a huge amount of time and memory. So I was wondering if we could build a load method for raw ACCESS-ESM data. That would be super helpful, and it wouldn't need to be complex: just a filter that selects the files within the time range specified in the recipe would be good.

I opened this issue to see if anyone has a good idea about how to do that. I am willing to implement it myself; I just need to know which approach is best and acceptable to all of us.

rbeucher commented 4 months ago

Hi @bouweandela, @valeriupredoi,

We are encountering an issue with the output of the ACCESS-ESM model, specifically with the atmospheric data. The data is stored as follows:

./atm/netCDF/:
HI-CN-05.pa-185001_mon.nc
HI-CN-05.pa-185002_mon.nc
HI-CN-05.pa-185003_mon.nc
HI-CN-05.pa-185004_mon.nc
HI-CN-05.pa-185005_mon.nc
HI-CN-05.pa-185006_mon.nc
HI-CN-05.pa-185007_mon.nc
HI-CN-05.pa-185008_mon.nc
HI-CN-05.pa-185009_mon.nc
HI-CN-05.pa-185010_mon.nc

All monthly variables are stored in a single netCDF file.

Currently, our config-developer.yml is configured as follows:

ACCESS:
  cmor_strict: false
  input_dir:
    default:
      - '{dataset}/{sub_dataset}/{exp}/{modeling_realm}/netCDF'
  input_file:
    default: '{sub_dataset}.{special_attr}-*.nc'
  output_file: '{project}_{dataset}_{mip}_{exp}_{institute}_{sub_dataset}_{special_attr}_{short_name}'
  cmor_type: 'CMIP6'
  cmor_default_table_prefix: 'CMIP6_'

This configuration results in ESMValCore analyzing all files and variables, which consumes excessive time and resources.

I have suggested the following to @rhaegar325:

  1. Modify the input_file/default to include a time facet. This change should prevent loading more data than necessary when working within a constrained time range.
  2. Utilize the timerange facet in the input_file name. We could implement the approach described here.
  3. Leverage IRIS constrained loading capabilities. Details are available here. This could potentially speed up the process.

Given that all variables are stored in a single file, we are aware that this setup is not optimal. However, we currently have no alternative.

Any advice would be greatly welcome!

Thanks, R

bouweandela commented 4 months ago

Selecting the files within the specified timerange should already work, as it does for CMIP6 etc. Did you check in the main_log_debug.txt log file which files are actually getting loaded? In order for this to work, your timerange does need to be recognized by the code here: https://github.com/ESMValGroup/ESMValCore/blob/546937f6bb3648b39eb33d4fe501594bc608e949/esmvalcore/local.py#L66-L68
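
To make the idea of time-based file selection concrete, here is a minimal standalone sketch. It is not the ESMValCore implementation linked above: the regular expression and the helper name files_in_timerange are assumptions for illustration, based only on the file names listed earlier in this thread (a YYYYMM stamp before "_mon.nc").

```python
import re
from pathlib import Path

# Hypothetical pattern for names such as "HI-CN-05.pa-185001_mon.nc":
# a YYYYMM stamp after a "-" and directly before "_mon.nc".
_STAMP = re.compile(r"-(\d{4})(\d{2})_mon\.nc$")


def files_in_timerange(paths, start_year, end_year):
    """Return the paths whose year stamp lies in [start_year, end_year]."""
    selected = []
    for path in paths:
        match = _STAMP.search(Path(path).name)
        if match is None:
            continue  # no recognisable stamp: skip rather than guess
        year = int(match.group(1))
        if start_year <= year <= end_year:
            selected.append(path)
    return sorted(selected)
```

With a filter like this, only the files for the requested years would be passed on to the load step, rather than every file in the directory.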

> Leverage IRIS constrained loading capabilities.

You could probably implement this in the fix_file method of https://github.com/ESMValGroup/ESMValCore/blob/546937f6bb3648b39eb33d4fe501594bc608e949/esmvalcore/cmor/_fixes/access/access_esm1_5.py#L11 by making it return a cube instead of a filename, and then modifying esmvalcore.preprocessor.load so that it skips the actual load step if the input is already a cube. This is similar to what I tried out in #2454. In the longer term, we would like to implement a more flexible loading mechanism (see https://github.com/ESMValGroup/ESMValCore/issues/2371), but we will first need to find funding for that.
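
The "skip the load step if the input is already a cube" idea can be sketched in isolation as follows. This is not the actual ESMValCore load function; load_or_passthrough and its loader argument are hypothetical names used only to show the control flow.

```python
from pathlib import Path


def load_or_passthrough(item, loader):
    """Call ``loader`` for a file path; return anything else unchanged."""
    if isinstance(item, (str, Path)):
        return loader(item)
    # Assumed to be an object that a fix has already loaded
    # (for example an iris cube), so there is nothing left to load.
    return item
```

In this sketch a fix that returns a cube would bypass the loader entirely, while plain file names would still go through the usual path.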

rbeucher commented 3 months ago

Thanks @bouweandela, that is really useful. We are going to look into this.

valeriupredoi commented 3 months ago

Time gating is one side of the problem, as Bouwe points out. Another is variable selection, which we no longer do at load point (we used to have an iris Constraint at the raw load point, though). What you can do about it is overload it with a constraint, see load_raw and its usage. If this is a bit too much of a hassle, you can perform the single-variable loading via a fix, so that it runs ahead of everything else: a rather agricultural solution, but a fairly hassle-free one in me books :beer: