ESMValGroup / ESMValCore

ESMValCore: A community tool for pre-processing data from Earth system models in CMIP and running analysis scripts.
https://www.esmvaltool.org
Apache License 2.0

Preprocessor memory usage is excessive and error messages unclear. #51

Closed ledm closed 1 year ago

ledm commented 5 years ago

This issue was raised in pr #763 and was also raised by Rob Parker, from the UKESM project.

The memory usage of the recipe recipe_ocean_bgc.yml explodes when the time range is changed to a more realistic one. The time range is currently 2001-2004, but we may be interested in, for instance, 1950-2010.

The second problem is that the warning and error messages related to memory problems in ESMValTool are unclear. It's not obvious which part of the recipe causes the memory problems.

bouweandela commented 5 years ago

Thank you for opening this issue. Can you please attach the exact recipe you are using to this issue and, if possible, specify how much memory it requires or how much it is using when it crashes?

valeriupredoi commented 5 years ago

This issue was raised in pr #763 and was also raised by Rob Parker, from the UKESM project.

Please don't mix things up: Rob was complaining about a memory issue that is completely inexplicable, given that the recipe he ran (with the exact same settings) was run by both Ranjini and me and no memory issues were found. It's a trivial recipe with trivial computations (no multimodel or other memory-heavy operations).

I am starting to look at this specific recipe recipe_ocean_bgc.yml right now; can you please tell me which recipe settings I should use to reproduce the memory issue?

valeriupredoi commented 5 years ago

The second problem is that the warning and error messages related to memory problems in ESMValTool are unclear. It's not obvious which part of the recipe causes the memory problems.

This is easily diagnosable if you run in debug mode: when a MemoryError stops the run, just backtrack in stdout and see what the last completed step was. OK, it is only broadly diagnosable, since the actual operation that ran out of memory is hard to find, but at least you'll know which module produced the memory spill.

ledm commented 5 years ago

Change the dataset line to:

  - {dataset: HadGEM2-ES,    project: CMIP5,  exp: historical,  ensemble: r1i1p1,  start_year: 1950,  end_year: 2004}

i.e. change start_year: 2001 to start_year: 1950, and you should start to see some high memory usage.

valeriupredoi commented 5 years ago

cool, cheers, man! will have a play now

valeriupredoi commented 5 years ago

Can you please advise where to get the WOA obs4mips files from on jasmin? Also, are you using the obs4mips data in the multimodel? Because if not, having just a single CMIP5 dataset means no multimodel computations will be done.

ledm commented 5 years ago

I can copy over the WOA files now. Where shall I put them?

mattiarighi commented 5 years ago

Is this OBS or obs4mips? If OBS, it would be good if you could also provide the cmorize_obs script used to generate them.

ledm commented 5 years ago

OBS, I would guess. What's the difference? Happy to provide the script, but it's not pretty.

valeriupredoi commented 5 years ago

It's obs4mips, but it's actually not playing any role in this issue; found the memory leak... hold on, will post the result.

valeriupredoi commented 5 years ago

OK guys, found the memory thief: I used Lee's recipe that I trimmed heavily but kept the 1950-2004 data range...and the bugger is fix_data; here's the postmortem:

Recipe:
------
datasets:
  - {dataset: HadGEM2-ES,    project: CMIP5,  exp: historical,  ensemble: r1i1p1,  start_year: 1950,  end_year: 2004}
preprocessors:
  # Global 3D Volume-weighted Average
  prep_timeseries_global_volume_average:
    custom_order: true
    average_volume:
      coord1: longitude
      coord2: latitude
      coordz: depth
      use_fx_files: true
diagnostics:
  diag_timeseries_volume_average_vs_WOA:
    description: Global volume average time series (WOA observations)
    variables:
      thetao: # Temperature 3D
        preprocessor: prep_timeseries_global_volume_average
        field: TO3M
        mip: Omon
        fx_files: [volcello,]
    scripts: Null

steps:
------
2019-01-29 13:10:21,599 UTC [19173] DEBUG   esmvaltool.preprocessor:197 Running preprocessor step fix_file
2019-01-29 13:10:21,612 UTC [19173] DEBUG   esmvaltool.preprocessor:197 Running preprocessor step load
2019-01-29 13:10:22,341 UTC [19173] DEBUG   esmvaltool.preprocessor:197 Running preprocessor step fix_metadata
2019-01-29 13:10:23,158 UTC [19173] DEBUG   esmvaltool.preprocessor:197 Running preprocessor step concatenate
2019-01-29 13:10:23,192 UTC [19173] DEBUG   esmvaltool.preprocessor:197 Running preprocessor step cmor_check_metadata
2019-01-29 13:10:23,748 UTC [19173] DEBUG   esmvaltool.preprocessor:197 Running preprocessor step extract_time
2019-01-29 13:10:23,778 UTC [19173] DEBUG   esmvaltool.preprocessor:197 Running preprocessor step fix_data

resource usage:
----------------
Date and time (UTC)     Real time (s)   CPU time (s)    CPU (%) Memory (GB)     Memory (%)      Disk read (GB)  Disk write (GB)
2019-01-29 13:10:22.471385      1.1     3.6     0       0.1     0       0.005   0.0
2019-01-29 13:10:23.552753      2.2     4.6     92      0.1     0       0.005   0.0
2019-01-29 13:10:24.658557      3.3     5.4     66      0.2     0       0.005   0.0
2019-01-29 13:10:25.728394      4.4     6.3     84      0.2     1       0.005   0.0
2019-01-29 13:10:26.796565      5.4     7.0     66      0.3     1       0.005   0.0
...
2019-01-29 13:10:47.331882      26.0    20.7    77      1.9     6       0.005   0.0
2019-01-29 13:10:48.390630      27.0    21.7    95      2.3     7       0.005   0.0
2019-01-29 13:10:49.496519      28.1    22.7    85      2.9     9       0.005   0.0
2019-01-29 13:10:50.592636      29.2    23.7    92      3.1     10      0.005   0.0
2019-01-29 13:10:51.818237      30.4    24.8    92      1.8     6       0.005   0.0
2019-01-29 13:10:52.856665      31.5    25.1    33      1.9     6       0.005   0.0
...
2019-01-29 13:12:29.858572      128.5   99.2    66      9.4     30      0.005   0.0
2019-01-29 13:12:30.912098      129.5   100.0   74      9.7     31      0.005   0.0
2019-01-29 13:12:31.949134      130.6   101.0   94      10.0    32      0.005   0.0
2019-01-29 13:12:32.991596      131.6   101.9   92      9.7     31      0.005   0.0
2019-01-29 13:12:34.046581      132.7   103.0   101     10.1    32      0.005   0.0
2019-01-29 13:12:35.140844      133.8   104.1   94      9.7     31      0.005   0.0
2019-01-29 13:12:36.206012      134.8   105.1   100     9.7     31      0.005   0.0
2019-01-29 13:12:37.245032      135.9   106.1   99      9.7     31      0.005   0.0

Traceback:
----------
2019-01-29 13:12:37,879 UTC [19173] ERROR   esmvaltool.preprocessor:191 Failed to run fix_data(sea_water_potential_temperature / (K) (time: 660; depth: 40; latitude: 216; longitude: 360)
     Dimension coordinates:
          time                             x           -             -               -
          depth                            -           x             -               -
          latitude                         -           -             x               -
          longitude                        -           -             -               x
     Auxiliary coordinates:
          day_of_month                     x           -             -               -
          day_of_year                      x           -             -               -
          month_number                     x           -             -               -
          year                             x           -             -               -
     Attributes:
          Conventions: CF-1.4
          associated_files: baseURL: http://cmip-pcmdi.llnl.gov/CMIP5/dataLocation gridspecFile: gridspec_ocean_fx_HadGEM2-ES_historical_r0i0p0.nc...
          branch_time: 0.0
          cmor_version: 2.7.1
          contact: chris.d.jones@metoffice.gov.uk, john.hughes@metoffice.gov.uk
          experiment: historical
          experiment_id: historical
          forcing: GHG, Oz, SA, LU, Sl, Vl, BC, OC, (GHG = CO2, N2O, CH4, CFCs)
          frequency: mon
          initialization_method: 1
          institute_id: MOHC
          institution: Met Office Hadley Centre, Fitzroy Road, Exeter, Devon, EX1 3PB, UK, (h...
          mo_runid: ajhoh
          model_id: HadGEM2-ES
          modeling_realm: ocean
          original_name: mo: m02s00i101 + 273.15
          parent_experiment: pre-industrial control
          parent_experiment_id: piControl
          parent_experiment_rip: r1i1p1
          physics_version: 1
          product: output
          project_id: CMIP5
          realization: 1
          references: Bellouin N. et al, (2007) Improved representation of aerosols for HadGEM2....
          source: HadGEM2-ES (2009) atmosphere: HadGAM2 (N96L38); ocean: HadGOM2 (lat: 1.0-0.3...
          table_id: Table Omon (27 April 2011) 340eddd4fd838d90fa9ffe1345ecbd73
          title: HadGEM2-ES model output prepared for CMIP5 historical
     Cell methods:
          mean: time, {'project': 'CMIP5', 'dataset': 'HadGEM2-ES', 'short_name': 'thetao', 'cmor_table': 'CMIP5', 'mip': 'Omon'})
2019-01-29 13:12:38,247 UTC [19173] ERROR   esmvaltool._main:238 Program terminated abnormally, see stack trace below for more information
Traceback (most recent call last):
  File "/home/users/valeriu/anaconda3/envs/esmvaltool_v2_dev/lib/python3.6/site-packages/ESMValTool-2.0a1-py3.6.egg/esmvaltool/_main.py", line 230, in run
    conf = main(args)
  File "/home/users/valeriu/anaconda3/envs/esmvaltool_v2_dev/lib/python3.6/site-packages/ESMValTool-2.0a1-py3.6.egg/esmvaltool/_main.py", line 158, in main
    process_recipe(recipe_file=recipe, config_user=cfg)
  File "/home/users/valeriu/anaconda3/envs/esmvaltool_v2_dev/lib/python3.6/site-packages/ESMValTool-2.0a1-py3.6.egg/esmvaltool/_main.py", line 208, in process_recipe
    recipe.run()
  File "/home/users/valeriu/anaconda3/envs/esmvaltool_v2_dev/lib/python3.6/site-packages/ESMValTool-2.0a1-py3.6.egg/esmvaltool/_recipe.py", line 1060, in run
    self.tasks, max_parallel_tasks=self._cfg['max_parallel_tasks'])
  File "/home/users/valeriu/anaconda3/envs/esmvaltool_v2_dev/lib/python3.6/site-packages/ESMValTool-2.0a1-py3.6.egg/esmvaltool/_task.py", line 583, in run_tasks
    _run_tasks_sequential(tasks)
  File "/home/users/valeriu/anaconda3/envs/esmvaltool_v2_dev/lib/python3.6/site-packages/ESMValTool-2.0a1-py3.6.egg/esmvaltool/_task.py", line 594, in _run_tasks_sequential
    task.run()
  File "/home/users/valeriu/anaconda3/envs/esmvaltool_v2_dev/lib/python3.6/site-packages/ESMValTool-2.0a1-py3.6.egg/esmvaltool/_task.py", line 224, in run
    self.output_files = self._run(input_files)
  File "/home/users/valeriu/anaconda3/envs/esmvaltool_v2_dev/lib/python3.6/site-packages/ESMValTool-2.0a1-py3.6.egg/esmvaltool/preprocessor/__init__.py", line 393, in _run
    product.apply(step, self.debug)
  File "/home/users/valeriu/anaconda3/envs/esmvaltool_v2_dev/lib/python3.6/site-packages/ESMValTool-2.0a1-py3.6.egg/esmvaltool/preprocessor/__init__.py", line 260, in apply
    self.cubes = preprocess(self.cubes, step, **self.settings[step])
  File "/home/users/valeriu/anaconda3/envs/esmvaltool_v2_dev/lib/python3.6/site-packages/ESMValTool-2.0a1-py3.6.egg/esmvaltool/preprocessor/__init__.py", line 206, in preprocess
    result.append(_run_preproc_function(function, item, settings))
  File "/home/users/valeriu/anaconda3/envs/esmvaltool_v2_dev/lib/python3.6/site-packages/ESMValTool-2.0a1-py3.6.egg/esmvaltool/preprocessor/__init__.py", line 189, in _run_preproc_function
    return function(items, **kwargs)
  File "/home/users/valeriu/anaconda3/envs/esmvaltool_v2_dev/lib/python3.6/site-packages/ESMValTool-2.0a1-py3.6.egg/esmvaltool/cmor/fix.py", line 145, in fix_data
    checker(cube).check_data()
  File "/home/users/valeriu/anaconda3/envs/esmvaltool_v2_dev/lib/python3.6/site-packages/ESMValTool-2.0a1-py3.6.egg/esmvaltool/cmor/check.py", line 167, in check_data
    self._check_data_range()
  File "/home/users/valeriu/anaconda3/envs/esmvaltool_v2_dev/lib/python3.6/site-packages/ESMValTool-2.0a1-py3.6.egg/esmvaltool/cmor/check.py", line 233, in _check_data_range
    if self._cube.data.min() < valid_min:
  File "/home/users/valeriu/anaconda3/envs/esmvaltool_v2_dev/lib/python3.6/site-packages/numpy/ma/core.py", line 5587, in min
    result = self.filled(fill_value).min(
  File "/home/users/valeriu/anaconda3/envs/esmvaltool_v2_dev/lib/python3.6/site-packages/numpy/ma/core.py", line 3718, in filled
    result = self._data.copy('K')
MemoryError
2019-01-29 13:12:38,286 UTC [19173] INFO    esmvaltool._main:240 If you suspect this is a bug or need help, please open an issue on https://github.com/ESMValGroup/ESMValTool/issues and attach the run/recipe_*.yml and run/main_log_debug.txt files from the output directory.

The run doesn't even get into the meat of the preprocessor: it spends about 2.5 minutes inside fix_data, with memory increasing almost linearly with time while in that module, starting at a mere 300 MB and finally reaching about 10 GB when jasmin-sci2 finally pulls the plug and kills my process; note that I used single-threading.

valeriupredoi commented 5 years ago

Also, FYI @jvegasbsc: please stop dissing multimodel - that module is probably one of the best optimized for large data volumes :grin:

valeriupredoi commented 5 years ago

Also, this memory spill does not scale with the number of variables (same behaviour for 5, 2 and a single variable), which is pretty obvious (since the data fix on a single variable's cube is already consuming loads of memory) but worth reporting.

valeriupredoi commented 5 years ago

Fortunately, getting onto jasmin-sci3 (which has 2 TB of RAM!!!) I managed to run this to completion:

Running the same example but on sci3 (crapload of RAM)
-------------------------------------------------------

steps:
------
Date and time (UTC)     Real time (s)   CPU time (s)    CPU (%) Memory (GB)     Memory (%)      Disk read (GB)  Disk write (GB)
2019-01-29 13:48:05.416822      2.8     13.1    0       0.1     0       0.279   0.0

2019-01-29 13:48:02,854 UTC [11773] DEBUG   esmvaltool.preprocessor:197 Running preprocessor step fix_file
2019-01-29 13:48:02,863 UTC [11773] DEBUG   esmvaltool.preprocessor:197 Running preprocessor step load
2019-01-29 13:48:03,629 UTC [11773] DEBUG   esmvaltool.preprocessor:197 Running preprocessor step fix_metadata
2019-01-29 13:48:04,690 UTC [11773] DEBUG   esmvaltool.preprocessor:197 Running preprocessor step concatenate
2019-01-29 13:48:04,719 UTC [11773] DEBUG   esmvaltool.preprocessor:197 Running preprocessor step cmor_check_metadata
2019-01-29 13:48:05,301 UTC [11773] DEBUG   esmvaltool.preprocessor:197 Running preprocessor step extract_time
2019-01-29 13:48:05,365 UTC [11773] DEBUG   esmvaltool.preprocessor:197 Running preprocessor step fix_data
2019-01-29 13:48:10.626232      8.1     18.5    102     0.1     0       0.279   0.0
2019-01-29 14:05:35.937827      1053.4  1150.4  105     17.3    1       0.279   0.0
2019-01-29 14:05:37.044060      1054.5  1151.5  98      9.7     0       0.279   0.0
2019-01-29 14:05:36,898 UTC [11773] DEBUG   esmvaltool.preprocessor:197 Running preprocessor step average_volume
2019-01-29 14:05:37.044060      1054.5  1151.5  98      9.7     0       0.279   0.0
2019-01-29 14:12:47.022805      1484.5  1581.8  100     19.3    1       0.279   0.0
stays constant at 19.3G
2019-01-29 14:22:12.721874      2050.2  2194.9  110     19.3    1       0.279   0.0
2019-01-29 14:22:14.639604      2052.1  2196.8  103     0.1     0       0.279   0.0
2019-01-29 14:22:14,484 UTC [11773] DEBUG   esmvaltool.preprocessor:197 Running preprocessor step cmor_check_data
2019-01-29 14:22:14,489 UTC [11773] DEBUG   esmvaltool.preprocessor:197 Running preprocessor step save
2019-01-29 14:22:14,542 UTC [11773] DEBUG   esmvaltool.preprocessor:197 Running preprocessor step cleanup

So the two big culprits: fix_data at 17 GB and average_volume at 19 GB. No wonder this was dying out of memory.

valeriupredoi commented 5 years ago

It is important to define a data vs process ratio R (peak memory divided by input data size): the data comes in 6 files = 5 x 1.5 GB + 1 x 1 GB = 8.5 GB of netCDF data; fix_data has R = 2 and average_volume has R = 2.3, which means, at the very least, that both these modules work with two copies of the data, which is very inefficient!

valeriupredoi commented 5 years ago

And this R is constant, i.e. the memory intake scales with the amount of data used: I ran the same test with half the data (1977-2004):

and now...
Run with half data
-------------------
Date and time (UTC)     Real time (s)   CPU time (s)    CPU (%) Memory (GB)     Memory (%)      Disk read (GB)  Disk write (GB)
2019-01-29 15:17:11.356386      6.3     23.5    0       0.1     0       0.0     0.0
2019-01-29 15:17:14.291903      9.2     26.3    95      0.1     0       0.0     0.0

2019-01-29 15:17:05,361 UTC [42359] DEBUG   esmvaltool.preprocessor:197 Running preprocessor step fix_file
2019-01-29 15:17:05,367 UTC [42359] DEBUG   esmvaltool.preprocessor:197 Running preprocessor step load
2019-01-29 15:17:05,864 UTC [42359] DEBUG   esmvaltool.preprocessor:197 Running preprocessor step fix_metadata
2019-01-29 15:17:06,687 UTC [42359] DEBUG   esmvaltool.preprocessor:197 Running preprocessor step concatenate
2019-01-29 15:17:06,799 UTC [42359] DEBUG   esmvaltool.preprocessor:197 Running preprocessor step cmor_check_metadata
2019-01-29 15:17:07,223 UTC [42359] DEBUG   esmvaltool.preprocessor:197 Running preprocessor step extract_time
2019-01-29 15:17:07,269 UTC [42359] DEBUG   esmvaltool.preprocessor:197 Running preprocessor step fix_data
2019-01-29 15:30:20.306300      795.2   831.8   108     8.9     0       0.0     0.0
2019-01-29 15:30:21.396659      796.3   832.5   62      5.0     0       0.0     0.0
2019-01-29 15:30:20,810 UTC [42359] DEBUG   esmvaltool.preprocessor:197 Running preprocessor step average_volume
constant at 9.9G
2019-01-29 15:34:08,468 UTC [42359] DEBUG   esmvaltool.preprocessor:197 Running preprocessor step cmor_check_data
2019-01-29 15:34:08,471 UTC [42359] DEBUG   esmvaltool.preprocessor:197 Running preprocessor step save
2019-01-29 15:34:08,505 UTC [42359] DEBUG   esmvaltool.preprocessor:197 Running preprocessor step cleanup

valeriupredoi commented 5 years ago

Note that the memory carried over between the two modules is 9.7 GB and 5.0 GB respectively, which is slightly more than the equivalent total amount of data loaded from disk (extra: some fx files, Python modules, etc.), so this is normal.

valeriupredoi commented 5 years ago

I also ran with a simple two-model setup so I could test the multimodel module (no average_volume this time, just multimodel: mean in my preprocessor): the data are about 5 + 7 = 12 GB, and sure enough fix_data used 9 GB then 14 GB of RAM, and multimodel had a peak of 25 GB, which gives an R index of ~ 2.1. So yeah, really, multimodel is not great, but it could be much worse :grin: The problem with multimodel is when we need to churn through 20 models, each worth 100 years.

valeriupredoi commented 5 years ago

We can actually establish a very rough order-of-magnitude behaviour of the total memory needed by ESMValTool with respect to the input netCDF data; take fix_data as an example:

For a preprocessor module that has R ~ 2 and is not multimodel (fix_data and average_volume are about the same, give or take): maxRAM = R x F + (N - 1) x F = (R + N) x F - F, where F is the average file size (in GB), R is the module's memory efficiency multiplier and N is the total number of files.

For multimodel it's a bit better than expected, because maxRAM = 2 x R x F + (N - 2) x F = (2R + N) x F - 2F.

So for a typical R = 2, F = 1 GB, N = 4, you'd expect a preprocessor module to take maxRAM = 5 GB and a multimodel module maxRAM = 6 GB.

not too bad!
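
As a quick numerical check of these estimates (plain Python arithmetic, not ESMValCore code):

R, F, N = 2, 1.0, 4                             # efficiency factor, avg file size (GB), number of files
max_ram_single = (R + N) * F - F                # (R + N) x F - F = 5.0 GB
max_ram_multimodel = (2 * R + N) * F - 2 * F    # (2R + N) x F - 2F = 6.0 GB
print(max_ram_single, max_ram_multimodel)       # 5.0 6.0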

valeriupredoi commented 5 years ago

Note that the only quantity we can optimize in these equations is R, the module's computational efficiency; even in the ideal case R = 0 we still have a maximum memory limited by the number of files N and the average file size F:

maxRAM_ideal = (N - 1) x F for a single file module and maxRAM_ideal = (N - 2) x F for a multimodel module

valeriupredoi commented 5 years ago

We can generalize these equations to the cases where the preprocessing module receives data that has been chopped (level extraction, area/volume subsetting) and where the data is not fully realized:

Ms = (R + N) x F_eff - F_eff
Mm = (2R + N) x F_eff - 2F_eff

where

Ms: maximum memory for a non-multimodel module
Mm: maximum memory for a multimodel module
R: computational efficiency of the module; R is typically 2-3
N: number of datasets
F_eff: effective average size of data per dataset, where F_eff = e x f x F;
e is the factor that describes how lazy the data is (e = 1 for fully realized data),
and f describes how much the data was shrunk by the immediately preceding module, e.g.
time extraction, area selection or level extraction; note that for fix_data f relates only to the time extraction, so if the data is exact in time (no time selection), f = 1 for fix_data

So for cases where we deal with a lot of datasets (R + N ~ N), with fully realized data, and assuming an average size of 1.5 GB for 10 years of 3D netCDF data, N datasets will require roughly

Ms = 1.5 x (N - 1) GB
Mm = 1.5 x (N - 2) GB

So, interestingly enough, multimodel will actually require less maximum memory than a non-multimodel module (not by much, given that N is large, but still) :smiley:
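
A minimal sketch of these generalized estimates in Python (helper names are mine, not part of ESMValCore):

def effective_size(avg_file_size_gb, laziness=1.0, shrink=1.0):
    """F_eff = e x f x F: e = 1 for fully realized data, f = fraction of the
    data kept by the immediately preceding step (time/area/level subsetting)."""
    return laziness * shrink * avg_file_size_gb

def max_ram(n_datasets, f_eff_gb, multimodel=False, r=2.5):
    """Ms = (R + N) x F_eff - F_eff; Mm = (2R + N) x F_eff - 2 x F_eff."""
    copies = 2 if multimodel else 1
    return (copies * r + n_datasets) * f_eff_gb - copies * f_eff_gb

# Fully realized 3D data (e = 1, f = 1), ~1.5 GB per dataset, 20 datasets:
# 32.25 GB exact; the large-N approximation above gives 1.5 x (N - 1) = 28.5 GB.
print(max_ram(20, effective_size(1.5)))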

valeriupredoi commented 5 years ago

@ledm you may want to try https://github.com/ESMValGroup/ESMValTool/pull/816 to improve your multimodel life once it gets merged.

valeriupredoi commented 5 years ago

having

Ms = (R + N) x F_eff - F_eff
Mm = (2R + N) x F_eff - 2F_eff

we can use R = 3 to set an upper limit on the maximum memory any given preprocessor should take, so that we can detect memory leaks; a memory leak doesn't necessarily mean memory intake going through the roof - a lot of bad coding may lead to a memory intake 30-60% higher than what's actually needed, and we can detect this fairly robustly.
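
A minimal sketch of that heuristic (the helper name is mine, not an ESMValCore API): compute the R = 3 upper bound for a step and flag measured peaks that clearly exceed it.

def upper_bound_gb(n_files, avg_file_gb, multimodel=False, r=3.0):
    """Generous upper limit on the peak memory of a preprocessor step."""
    copies = 2 if multimodel else 1
    return (copies * r + n_files) * avg_file_gb - copies * avg_file_gb

# e.g. 4 datasets of 1 GB each: a non-multimodel step peaking well above
# ~6 GB, or a multimodel step well above ~8 GB, is worth investigating.
print(upper_bound_gb(4, 1.0), upper_bound_gb(4, 1.0, multimodel=True))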

valeriupredoi commented 5 years ago

can we close this? @ledm

valeriupredoi commented 5 years ago

Added the memory equations to the documentation in #867 - apart from that, we can't really do much more about this issue.

bouweandela commented 5 years ago

Please keep it open, because it's not fixed. What we can and should do, once we've switched to iris 2, is go over all preprocessor functions carefully and try to change them so they work with dask arrays instead of numpy arrays. This should solve most memory usage issues.
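
To illustrate the idea (a sketch only, not the actual ESMValCore code): reductions on a dask array are evaluated chunk by chunk, so a range check like the one in _check_data_range in the traceback above does not need to realize the whole cube.

import dask.array as da

# Same shape as the HadGEM2-ES thetao cube in the traceback above
# (time: 660, depth: 40, latitude: 216, longitude: 360), ~16 GB as float64.
data = da.random.random((660, 40, 216, 360), chunks=(12, 40, 216, 360))

valid_min = 0.0
# Lazy reduction: only a few ~300 MB chunks are in memory at any time,
# instead of the full-array copy that numpy.ma made in the MemoryError above.
below_min = bool((data.min() < valid_min).compute())
print(below_min)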

valeriupredoi commented 5 years ago

Can do; I was thinking about creating a new issue about that, which would be slightly cleaner, i.e. you don't have to scroll all the way down past all sorts of theoretical gibberish.

mattiarighi commented 5 years ago

This looks like a duplicate of #32, can we close?

bouweandela commented 5 years ago

I think not; there are more ways to reduce memory usage than using iris and dask. Some things will have to be implemented in iris too, e.g. lazy regridding/vertical level interpolation. For multimodel statistics we could consider saving data to disk more often to reduce memory use, though that would be another performance problem.

npgillett commented 4 years ago

It looks like someone has optimised the memory usage of the preprocessor in the last few months - my code with ~1500 input files used to run out of memory and crash when I ran it in one go in the spring; now it works! Thanks!

schlunma commented 1 year ago

It looks like most of the issues discussed here (lazy regridding, lazy vertical level interpolation, lazy multimodel statistics, an overview of memory usage, etc.) have been fixed in previous releases. Remaining problems can be discussed in https://github.com/ESMValGroup/ESMValCore/issues/674 or in dedicated issues.