Thank you for opening this issue. Can you please attach the precise recipe you are using to this issue and if possible, specify how much memory it requires or how much it uses when it crashes?
This issue was raised in pr #763 and was also raised by Rob Parker, from the UKESM project.
Please don't mix things up: Rob was complaining about a memory issue that is completely inexplicable, given that the recipe he ran (with the exact same settings) was run by both Ranjini and me and no memory issues were found. It's a trivial recipe with trivial computations (no multimodel or any other memory-heavy operations).
I am starting to look at this specific recipe, recipe_ocean_bgc.yml, right now. Can you please tell me what recipe settings I should use to reproduce the memory issue?
The second problem is that the warning and error messages related to memory problems in ESMValTool are unclear. It's not obvious which part of this recipe causes the memory problems.
This is easily diagnosable if you run in debug mode: when MemoryError stops the run, just backtrack in stdout and see what the last completed step was. OK, it is broadly diagnosable, since the actual operation that ran out of memory is hard to find, but at least you'll know which module produced the memory spill.
Change the dataset line to:
- {dataset: HadGEM2-ES, project: CMIP5, exp: historical, ensemble: r1i1p1, start_year: 1950, end_year: 2004}
i.e. change start_year: 2001 to start_year: 1950, and you should start to see some high memory usage.
cool, cheers, man! will have a play now
can you please advise where to get the WOA obs4mips files from on jasmin? Also, are you using the obs4mips data in the multimodel? Because if not, having just a single CMIP5 dataset means no multimodel computations will be done.
I can copy over the WOA files now. Where shall I put them?
Is this OBS or obs4mips? If OBS, it would be good if you could also provide the cmorize_obs script used to generate them.
Obs, I would guess. What's the difference? Happy to provide the script, but it's not pretty.
it's obs4mips, but it's not actually playing any role in this issue; found the memory leak... hold on, will post the result
OK guys, found the memory thief: I used Lee's recipe that I trimmed heavily but kept the 1950-2004 data range... and the bugger is fix_data; here's the postmortem:
Recipe:
------
datasets:
  - {dataset: HadGEM2-ES, project: CMIP5, exp: historical, ensemble: r1i1p1, start_year: 1950, end_year: 2004}

preprocessors:
  # Global 3D Volume-weighted Average
  prep_timeseries_global_volume_average:
    custom_order: true
    average_volume:
      coord1: longitude
      coord2: latitude
      coordz: depth
      use_fx_files: true

diagnostics:
  diag_timeseries_volume_average_vs_WOA:
    description: Global volume average time series (WOA observations)
    variables:
      thetao:  # Temperature 3D
        preprocessor: prep_timeseries_global_volume_average
        field: TO3M
        mip: Omon
        fx_files: [volcello,]
    scripts: Null
steps:
------
2019-01-29 13:10:21,599 UTC [19173] DEBUG esmvaltool.preprocessor:197 Running preprocessor step fix_file
2019-01-29 13:10:21,612 UTC [19173] DEBUG esmvaltool.preprocessor:197 Running preprocessor step load
2019-01-29 13:10:22,341 UTC [19173] DEBUG esmvaltool.preprocessor:197 Running preprocessor step fix_metadata
2019-01-29 13:10:23,158 UTC [19173] DEBUG esmvaltool.preprocessor:197 Running preprocessor step concatenate
2019-01-29 13:10:23,192 UTC [19173] DEBUG esmvaltool.preprocessor:197 Running preprocessor step cmor_check_metadata
2019-01-29 13:10:23,748 UTC [19173] DEBUG esmvaltool.preprocessor:197 Running preprocessor step extract_time
2019-01-29 13:10:23,778 UTC [19173] DEBUG esmvaltool.preprocessor:197 Running preprocessor step fix_data
resource usage:
----------------
Date and time (UTC) Real time (s) CPU time (s) CPU (%) Memory (GB) Memory (%) Disk read (GB) Disk write (GB)
2019-01-29 13:10:22.471385 1.1 3.6 0 0.1 0 0.005 0.0
2019-01-29 13:10:23.552753 2.2 4.6 92 0.1 0 0.005 0.0
2019-01-29 13:10:24.658557 3.3 5.4 66 0.2 0 0.005 0.0
2019-01-29 13:10:25.728394 4.4 6.3 84 0.2 1 0.005 0.0
2019-01-29 13:10:26.796565 5.4 7.0 66 0.3 1 0.005 0.0
...
2019-01-29 13:10:47.331882 26.0 20.7 77 1.9 6 0.005 0.0
2019-01-29 13:10:48.390630 27.0 21.7 95 2.3 7 0.005 0.0
2019-01-29 13:10:49.496519 28.1 22.7 85 2.9 9 0.005 0.0
2019-01-29 13:10:50.592636 29.2 23.7 92 3.1 10 0.005 0.0
2019-01-29 13:10:51.818237 30.4 24.8 92 1.8 6 0.005 0.0
2019-01-29 13:10:52.856665 31.5 25.1 33 1.9 6 0.005 0.0
...
2019-01-29 13:12:29.858572 128.5 99.2 66 9.4 30 0.005 0.0
2019-01-29 13:12:30.912098 129.5 100.0 74 9.7 31 0.005 0.0
2019-01-29 13:12:31.949134 130.6 101.0 94 10.0 32 0.005 0.0
2019-01-29 13:12:32.991596 131.6 101.9 92 9.7 31 0.005 0.0
2019-01-29 13:12:34.046581 132.7 103.0 101 10.1 32 0.005 0.0
2019-01-29 13:12:35.140844 133.8 104.1 94 9.7 31 0.005 0.0
2019-01-29 13:12:36.206012 134.8 105.1 100 9.7 31 0.005 0.0
2019-01-29 13:12:37.245032 135.9 106.1 99 9.7 31 0.005 0.0
Traceback:
----------
2019-01-29 13:12:37,879 UTC [19173] ERROR esmvaltool.preprocessor:191 Failed to run fix_data(sea_water_potential_temperature / (K) (time: 660; depth: 40; latitude: 216; longitude: 360)
Dimension coordinates:
time x - - -
depth - x - -
latitude - - x -
longitude - - - x
Auxiliary coordinates:
day_of_month x - - -
day_of_year x - - -
month_number x - - -
year x - - -
Attributes:
Conventions: CF-1.4
associated_files: baseURL: http://cmip-pcmdi.llnl.gov/CMIP5/dataLocation gridspecFile: gridspec_ocean_fx_HadGEM2-ES_historical_r0i0p0.nc...
branch_time: 0.0
cmor_version: 2.7.1
contact: chris.d.jones@metoffice.gov.uk, john.hughes@metoffice.gov.uk
experiment: historical
experiment_id: historical
forcing: GHG, Oz, SA, LU, Sl, Vl, BC, OC, (GHG = CO2, N2O, CH4, CFCs)
frequency: mon
initialization_method: 1
institute_id: MOHC
institution: Met Office Hadley Centre, Fitzroy Road, Exeter, Devon, EX1 3PB, UK, (h...
mo_runid: ajhoh
model_id: HadGEM2-ES
modeling_realm: ocean
original_name: mo: m02s00i101 + 273.15
parent_experiment: pre-industrial control
parent_experiment_id: piControl
parent_experiment_rip: r1i1p1
physics_version: 1
product: output
project_id: CMIP5
realization: 1
references: Bellouin N. et al, (2007) Improved representation of aerosols for HadGEM2....
source: HadGEM2-ES (2009) atmosphere: HadGAM2 (N96L38); ocean: HadGOM2 (lat: 1.0-0.3...
table_id: Table Omon (27 April 2011) 340eddd4fd838d90fa9ffe1345ecbd73
title: HadGEM2-ES model output prepared for CMIP5 historical
Cell methods:
mean: time, {'project': 'CMIP5', 'dataset': 'HadGEM2-ES', 'short_name': 'thetao', 'cmor_table': 'CMIP5', 'mip': 'Omon'})
2019-01-29 13:12:38,247 UTC [19173] ERROR esmvaltool._main:238 Program terminated abnormally, see stack trace below for more information
Traceback (most recent call last):
File "/home/users/valeriu/anaconda3/envs/esmvaltool_v2_dev/lib/python3.6/site-packages/ESMValTool-2.0a1-py3.6.egg/esmvaltool/_main.py", line 230, in run
conf = main(args)
File "/home/users/valeriu/anaconda3/envs/esmvaltool_v2_dev/lib/python3.6/site-packages/ESMValTool-2.0a1-py3.6.egg/esmvaltool/_main.py", line 158, in main
process_recipe(recipe_file=recipe, config_user=cfg)
File "/home/users/valeriu/anaconda3/envs/esmvaltool_v2_dev/lib/python3.6/site-packages/ESMValTool-2.0a1-py3.6.egg/esmvaltool/_main.py", line 208, in process_recipe
recipe.run()
File "/home/users/valeriu/anaconda3/envs/esmvaltool_v2_dev/lib/python3.6/site-packages/ESMValTool-2.0a1-py3.6.egg/esmvaltool/_recipe.py", line 1060, in run
self.tasks, max_parallel_tasks=self._cfg['max_parallel_tasks'])
File "/home/users/valeriu/anaconda3/envs/esmvaltool_v2_dev/lib/python3.6/site-packages/ESMValTool-2.0a1-py3.6.egg/esmvaltool/_task.py", line 583, in run_tasks
_run_tasks_sequential(tasks)
File "/home/users/valeriu/anaconda3/envs/esmvaltool_v2_dev/lib/python3.6/site-packages/ESMValTool-2.0a1-py3.6.egg/esmvaltool/_task.py", line 594, in _run_tasks_sequential
task.run()
File "/home/users/valeriu/anaconda3/envs/esmvaltool_v2_dev/lib/python3.6/site-packages/ESMValTool-2.0a1-py3.6.egg/esmvaltool/_task.py", line 224, in run
self.output_files = self._run(input_files)
File "/home/users/valeriu/anaconda3/envs/esmvaltool_v2_dev/lib/python3.6/site-packages/ESMValTool-2.0a1-py3.6.egg/esmvaltool/preprocessor/__init__.py", line 393, in _run
product.apply(step, self.debug)
File "/home/users/valeriu/anaconda3/envs/esmvaltool_v2_dev/lib/python3.6/site-packages/ESMValTool-2.0a1-py3.6.egg/esmvaltool/preprocessor/__init__.py", line 260, in apply
self.cubes = preprocess(self.cubes, step, **self.settings[step])
File "/home/users/valeriu/anaconda3/envs/esmvaltool_v2_dev/lib/python3.6/site-packages/ESMValTool-2.0a1-py3.6.egg/esmvaltool/preprocessor/__init__.py", line 206, in preprocess
result.append(_run_preproc_function(function, item, settings))
File "/home/users/valeriu/anaconda3/envs/esmvaltool_v2_dev/lib/python3.6/site-packages/ESMValTool-2.0a1-py3.6.egg/esmvaltool/preprocessor/__init__.py", line 189, in _run_preproc_function
return function(items, **kwargs)
File "/home/users/valeriu/anaconda3/envs/esmvaltool_v2_dev/lib/python3.6/site-packages/ESMValTool-2.0a1-py3.6.egg/esmvaltool/cmor/fix.py", line 145, in fix_data
checker(cube).check_data()
File "/home/users/valeriu/anaconda3/envs/esmvaltool_v2_dev/lib/python3.6/site-packages/ESMValTool-2.0a1-py3.6.egg/esmvaltool/cmor/check.py", line 167, in check_data
self._check_data_range()
File "/home/users/valeriu/anaconda3/envs/esmvaltool_v2_dev/lib/python3.6/site-packages/ESMValTool-2.0a1-py3.6.egg/esmvaltool/cmor/check.py", line 233, in _check_data_range
if self._cube.data.min() < valid_min:
File "/home/users/valeriu/anaconda3/envs/esmvaltool_v2_dev/lib/python3.6/site-packages/numpy/ma/core.py", line 5587, in min
result = self.filled(fill_value).min(
File "/home/users/valeriu/anaconda3/envs/esmvaltool_v2_dev/lib/python3.6/site-packages/numpy/ma/core.py", line 3718, in filled
result = self._data.copy('K')
MemoryError
2019-01-29 13:12:38,286 UTC [19173] INFO esmvaltool._main:240 If you suspect this is a bug or need help, please open an issue on https://github.com/ESMValGroup/ESMValTool/issues and attach the run/recipe_*.yml and run/main_log_debug.txt files from the output directory.
the run doesn't even get into the meat of the preprocessor: it spends about 2.5 minutes inside fix_data, during which the memory increases almost linearly with time, starting at a mere 300 MB and finally reaching about 10 GB, at which point jasmin-sci2 pulls the plug and kills my process; note that I used single-threading
also FYI @jvegasbsc
also, please stop dissing multimodel - that module is probably one of our modules best optimized for large data volumes :grin:
also, this memory spill does not scale with the number of variables (same behaviour for 5, 2 and a single variable), which is pretty obvious (since the data fix on a single variable's cube is what consumes loads of memory) but worth reporting
and fortunately, getting onto jasmin-sci3 (which has 2 TB of RAM!!!) I managed to run this recipe to completion:
Running the same example but on sci3 (crapload of RAM)
-------------------------------------------------------
steps:
------
Date and time (UTC) Real time (s) CPU time (s) CPU (%) Memory (GB) Memory (%) Disk read (GB) Disk write (GB)
2019-01-29 13:48:05.416822 2.8 13.1 0 0.1 0 0.279 0.0
2019-01-29 13:48:02,854 UTC [11773] DEBUG esmvaltool.preprocessor:197 Running preprocessor step fix_file
2019-01-29 13:48:02,863 UTC [11773] DEBUG esmvaltool.preprocessor:197 Running preprocessor step load
2019-01-29 13:48:03,629 UTC [11773] DEBUG esmvaltool.preprocessor:197 Running preprocessor step fix_metadata
2019-01-29 13:48:04,690 UTC [11773] DEBUG esmvaltool.preprocessor:197 Running preprocessor step concatenate
2019-01-29 13:48:04,719 UTC [11773] DEBUG esmvaltool.preprocessor:197 Running preprocessor step cmor_check_metadata
2019-01-29 13:48:05,301 UTC [11773] DEBUG esmvaltool.preprocessor:197 Running preprocessor step extract_time
2019-01-29 13:48:05,365 UTC [11773] DEBUG esmvaltool.preprocessor:197 Running preprocessor step fix_data
2019-01-29 13:48:10.626232 8.1 18.5 102 0.1 0 0.279 0.0
2019-01-29 14:05:35.937827 1053.4 1150.4 105 17.3 1 0.279 0.0
2019-01-29 14:05:37.044060 1054.5 1151.5 98 9.7 0 0.279 0.0
2019-01-29 14:05:36,898 UTC [11773] DEBUG esmvaltool.preprocessor:197 Running preprocessor step average_volume
2019-01-29 14:05:37.044060 1054.5 1151.5 98 9.7 0 0.279 0.0
2019-01-29 14:12:47.022805 1484.5 1581.8 100 19.3 1 0.279 0.0
stays constant at 19.3G
2019-01-29 14:22:12.721874 2050.2 2194.9 110 19.3 1 0.279 0.0
2019-01-29 14:22:14.639604 2052.1 2196.8 103 0.1 0 0.279 0.0
2019-01-29 14:22:14,484 UTC [11773] DEBUG esmvaltool.preprocessor:197 Running preprocessor step cmor_check_data
2019-01-29 14:22:14,489 UTC [11773] DEBUG esmvaltool.preprocessor:197 Running preprocessor step save
2019-01-29 14:22:14,542 UTC [11773] DEBUG esmvaltool.preprocessor:197 Running preprocessor step cleanup
so the two big culprits: fix_data at 17 GB and average_volume at 19 GB. No wonder this was dying out of memory.
it is important to define a data vs process ratio (R): the data comes in 6 files = 5 x 1.5 GB + 1 x 1 GB = 8.5 GB of netCDF data; fix_data has R ~ 2 and average_volume has R ~ 2.3, which means, at the very least, that both these modules work with two copies of the data in memory, which is very inefficient!
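To make those R numbers concrete, here is a trivial sketch of how they are obtained (the function name is mine; the peak-memory values are the ones from the resource-usage logs above):

```python
# Rough "data vs process" ratio R: peak resident memory of a preprocessor
# step divided by the total size of the input netCDF data it operates on.
input_data_gb = 5 * 1.5 + 1 * 1.0   # 6 files on disk -> 8.5 GB

def data_process_ratio(peak_memory_gb, input_gb=input_data_gb):
    """Return R, the memory multiplier of a preprocessor step."""
    return peak_memory_gb / input_gb

print(data_process_ratio(17.3))   # fix_data       -> ~2.0
print(data_process_ratio(19.3))   # average_volume -> ~2.3
```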
and this R is constant, i.e. the memory intake scales linearly with the amount of data used: I ran the same test with half the data (1977-2004):
and now...
Run with half data
-------------------
Date and time (UTC) Real time (s) CPU time (s) CPU (%) Memory (GB) Memory (%) Disk read (GB) Disk write (GB)
2019-01-29 15:17:11.356386 6.3 23.5 0 0.1 0 0.0 0.0
2019-01-29 15:17:14.291903 9.2 26.3 95 0.1 0 0.0 0.0
2019-01-29 15:17:05,361 UTC [42359] DEBUG esmvaltool.preprocessor:197 Running preprocessor step fix_file
2019-01-29 15:17:05,367 UTC [42359] DEBUG esmvaltool.preprocessor:197 Running preprocessor step load
2019-01-29 15:17:05,864 UTC [42359] DEBUG esmvaltool.preprocessor:197 Running preprocessor step fix_metadata
2019-01-29 15:17:06,687 UTC [42359] DEBUG esmvaltool.preprocessor:197 Running preprocessor step concatenate
2019-01-29 15:17:06,799 UTC [42359] DEBUG esmvaltool.preprocessor:197 Running preprocessor step cmor_check_metadata
2019-01-29 15:17:07,223 UTC [42359] DEBUG esmvaltool.preprocessor:197 Running preprocessor step extract_time
2019-01-29 15:17:07,269 UTC [42359] DEBUG esmvaltool.preprocessor:197 Running preprocessor step fix_data
2019-01-29 15:30:20.306300 795.2 831.8 108 8.9 0 0.0 0.0
2019-01-29 15:30:21.396659 796.3 832.5 62 5.0 0 0.0 0.0
2019-01-29 15:30:20,810 UTC [42359] DEBUG esmvaltool.preprocessor:197 Running preprocessor step average_volume
constant at 9.9G
2019-01-29 15:34:08,468 UTC [42359] DEBUG esmvaltool.preprocessor:197 Running preprocessor step cmor_check_data
2019-01-29 15:34:08,471 UTC [42359] DEBUG esmvaltool.preprocessor:197 Running preprocessor step save
2019-01-29 15:34:08,505 UTC [42359] DEBUG esmvaltool.preprocessor:197 Running preprocessor step cleanup
note that the memory carried over between the two steps (fix_data to average_volume) is 9.7 GB and 5.0 GB for the full and half runs respectively, which is slightly more than the total amount of data loaded from disk (the extra being fx files, Python modules etc.), so this is normal
I also ran a simple two-model setup so I could test the multimodel module (no average_volume this time, just multimodel: mean in my preprocessor): the data is about 5 + 7 = 12 GB, and sure enough fix_data used 9 GB and then 14 GB of RAM, and multimodel peaked at 25 GB, which gives R ~ 2.1. So yeah, really, multimodel is not great, but it could be much worse :grin: The problem with multimodel is when we need to churn through 20 models, each worth 100 years.
we can actually establish a very rough (order of magnitude) estimate of the total memory needed by ESMValTool with respect to the input netCDF data; take for example fix_data:
So for a preprocessor module that has R ~ 2 and is not multimodel (fix_data and average_volume are about the same, give or take): maxRAM = R x F + (N - 1) x F = (R + N) x F - F, where F is the average file size (in GB), R is the module's memory multiplier factor and N is the total number of files.
for multimodel it's a bit better than expected because maxRAM = 2 x F x R + (N - 2) x F = (2R + N) x F - 2F
so for a typical R = 2, F = 1 GB, N = 4 you'd expect a preprocessor module to take maxRAM = 5 GB, and for multimodel maxRAM = 6 GB
not too bad!
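As a quick sanity check of these back-of-the-envelope formulas, here is a small sketch (the function names are mine, not ESMValTool code) reproducing the worked example above:

```python
def max_ram_single(R, F, N):
    """maxRAM = R*F + (N - 1)*F = (R + N)*F - F for a non-multimodel module."""
    return (R + N) * F - F

def max_ram_multimodel(R, F, N):
    """maxRAM = 2*F*R + (N - 2)*F = (2*R + N)*F - 2*F for the multimodel module."""
    return (2 * R + N) * F - 2 * F

# the worked example above: R = 2, F = 1 GB, N = 4
print(max_ram_single(2, 1, 4))      # 5 GB
print(max_ram_multimodel(2, 1, 4))  # 6 GB
```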
note that the only bit we can optimize, given these equations, is R - the module's computational efficiency - and even in the ideal case, R = 0, we still have a maximum memory limited by the number of files N and the average file size F:
maxRAM_ideal = (N - 1) x F for a single file module and maxRAM_ideal = (N - 2) x F for a multimodel module
we can generalize these equations for cases where the preprocessor module receives data that has been chopped (level extraction, area/volume subsetting) or data that is not fully realized:
Ms = (R + N) x F_eff - F_eff
Mm = (2R + N) x F_eff - 2F_eff
where
Ms: maximum memory for non-multimodel module
Mm: maximum memory for multimodel module
R: computational efficiency of module; R is typically 2-3
N: number of datasets
F_eff: average size of data per dataset where F_eff = e x f x F
where e is a factor that describes how lazy the data is (e = 1 for fully realized data) and f describes how much the data was shrunk by the immediately preceding step, e.g. time extraction, area selection or level extraction; note that for fix_data, f relates only to time extraction, so if the data is exact in time (no time selection) then f = 1 for fix_data
so for cases where we deal with a lot of datasets (R + N ~ N) and the data is fully realized, assuming an average size of 1.5 GB per 10 years of 3D netCDF data, N datasets will require
Ms = 1.5 x (N - 1) GB
Mm = 1.5 x (N - 2) GB
so interestingly enough, the multimodel will actually require less maximum memory than a non-multimodel module (not by much given that N is large, but yeah) :smiley:
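The generalized version, with F_eff = e x f x F, is just as easy to write down; again, only a sketch of the equations above (my own helper names), not library code:

```python
def effective_file_size(F, e=1.0, f=1.0):
    """F_eff = e * f * F: e = laziness factor (1 = fully realized),
    f = shrink factor from the previous step (1 = no subsetting)."""
    return e * f * F

def ms(R, N, F_eff):
    """Maximum memory for a non-multimodel module: (R + N)*F_eff - F_eff."""
    return (R + N) * F_eff - F_eff

def mm(R, N, F_eff):
    """Maximum memory for the multimodel module: (2*R + N)*F_eff - 2*F_eff."""
    return (2 * R + N) * F_eff - 2 * F_eff

# many fully realized datasets (R + N ~ N, so drop R), 1.5 GB per dataset:
F_eff = effective_file_size(1.5)
print(ms(0, 20, F_eff))  # 1.5 * (N - 1) = 28.5 GB for N = 20
print(mm(0, 20, F_eff))  # 1.5 * (N - 2) = 27.0 GB for N = 20
```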
@ledm you may wanna try this https://github.com/ESMValGroup/ESMValTool/pull/816 to improve your multimodel life once it gets merged
having
Ms = (R + N) x F_eff - F_eff
Mm = (2R + N) x F_eff - 2F_eff
we can use R = 3 to set an upper limit on the maximum memory any given preprocessor should take, so we can detect memory leaks; i.e. a memory leak doesn't necessarily mean memory intake going through the roof - a lot of bad coding may lead to memory intakes 30-60% higher than what's actually needed, and we can detect this fairly robustly.
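A sketch of what such a check could look like (hypothetical helper, not existing ESMValTool code), using R = 3 as the upper bound for a non-multimodel step:

```python
def exceeds_memory_bound(peak_gb, n_files, avg_file_gb, r_max=3.0):
    """True if a non-multimodel preprocessor step used more memory than
    the (R_max + N)*F - F upper bound derived above allows."""
    upper_bound = (r_max + n_files) * avg_file_gb - avg_file_gb
    return peak_gb > upper_bound

# with F = 1 GB and N = 4, the R = 3 bound is 6 GB:
print(exceeds_memory_bound(5.0, 4, 1.0))  # False - within budget
print(exceeds_memory_bound(9.0, 4, 1.0))  # True  - 50% over the bound, likely a leak
```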
can we close this? @ledm
added the memory equations to the documentation in #867 - apart from that we can't really do much more about this issue
Please keep it open, because it's not fixed. What we can and should do, once we've switched to iris 2, is go over all preprocessor functions carefully and try to change them so they work with dask arrays instead of numpy arrays. This should solve most memory usage issues.
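To make that concrete: the failure in the traceback above comes from cube.data.min() realizing the full masked array. A minimal sketch of the lazy equivalent, assuming iris 2 and a hypothetical input file path:

```python
import iris

# hypothetical path to one of the HadGEM2-ES thetao files
cube = iris.load_cube('thetao_Omon_HadGEM2-ES_historical_r1i1p1.nc')

# Eager: cube.data pulls the whole masked array into memory before min()
# runs -- this is what blows up in _check_data_range above.
# minimum = cube.data.min()

# Lazy: core_data() returns the underlying dask array if the data has not
# been realized yet, so the minimum is computed chunk by chunk.
if cube.has_lazy_data():
    minimum = cube.core_data().min().compute()
else:
    minimum = cube.data.min()
print(minimum)
```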
can do, I was thinking about creating a new issue about that - slightly cleaner, i.e. you don't have to scroll all the way down past all sorts of theoretical gibberish
This looks like a duplicate of #32, can we close?
I think not: there are more ways to reduce memory usage than using iris and dask. Some things will have to be implemented in iris too, e.g. lazy regridding/vertical level interpolation. For multimodel statistics we could consider saving data to disk more to reduce memory use, though that would create another performance problem.
It looks like someone has optimised the memory usage of the preprocessor in the last few months - my recipe with ~1500 input files used to run out of memory and crash when I ran it in one go in the spring; now it works! Thanks!
It looks like most of the issues discussed here (lazy regridding, lazy vertical level interpolation, lazy multimodel statistics, an overview of memory usage, etc.) have been fixed in previous releases. Remaining problems can be discussed in https://github.com/ESMValGroup/ESMValCore/issues/674 or in dedicated issues.