ESMValGroup / ESMValCore

ESMValCore: A community tool for pre-processing data from Earth system models in CMIP and running analysis scripts.
https://www.esmvaltool.org
Apache License 2.0

Possibly running out of memory when using a lot of preprocessors. #1915

Open · malininae opened this issue 1 year ago

malininae commented 1 year ago

Hi all,

In the July 2022 meeting, I raised the issue that I can't process data with a lot of preprocessors (see https://github.com/ESMValGroup/Community/discussions/33#discussioncomment-3121419). When I tried it back then, everything seemed to work just fine; however, I have finally run into a problem.

I was running this recipe (it looks ugly since it's still being developed, so please no judgement), and it works fine for the groups 'all', 'obs_abs' and 'obs_ano', but it can't process 'nat' and 'ssp245'. I thought, OK, I'll split the recipe, process 'nat' and 'ssp245' separately in their own recipes and recombine them for the diagnostic, but no, I can't seem to process them that way either. The computers I am working on allow only 6-hour jobs, but there are few limits on the number of processors one can check out. For this job, I checked out 27 CPUs and allocated 180 GB of memory, with no more than one process per CPU. (I also tried 20 CPUs, and it didn't work.) My hunch is that the finer-resolution models are not being processed and just keep hanging.

A small note: I think the culprit here is my preprocessor rolling_window_statistics, because the iris function it uses appears to realize the data. Without rolling_window_statistics, everything processed quite well on 20 CPUs.
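
For anyone who wants to check the laziness claim, here is a minimal iris sketch (the file name is a placeholder; I believe rolling_window_statistics calls Cube.rolling_window under the hood, and whether that keeps the data lazy depends on the iris version):

import iris
import iris.analysis

# Hypothetical daily input cube standing in for the recipe's tasmax data.
cube = iris.load_cube("tasmax_day_example.nc")
print(cube.has_lazy_data())  # True: the data is still a lazy dask array

# Roughly what the rolling_window_statistics preprocessor does for a
# 3-step running mean along time.
rolled = cube.rolling_window("time", iris.analysis.MEAN, 3)
print(rolled.has_lazy_data())  # False here would confirm the data was realized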

I'm not sure what the best way of handling this is. For now, I will process the data the way I did before, but someone might be interested in this. main_log_debug_tx3x.txt

Here are the specifications of the computers, in case that matters: hpcs

malininae commented 1 year ago

A small update: I decided to be creative and shortened my data preprocessor to:

  preproc_txx: &prep_txx
    custom_order: True
    regrid:
      target_grid: 0.5x0.5
      lon_offset: 0.25
      lat_offset: 0.25
      scheme: linear    
    extract_shape:
      shapefile: british_columbia.shp
      method: contains
      crop: True
    area_statistics:
      operator: mean
    rolling_window_statistics:
      coordinate: time
      operator: mean
      window_length: 3
    annual_statistics:
      operator: max

and applied it only to the 20-50 year chunks I was actually interested in, instead of the 'overblown' 100+ years of data that I only needed in order to subtract the anomaly. To subtract the anomaly I just added an extra group, anomaly, which is basically the mean over the reference period, and I simply subtract that mean in the diagnostic. That approach worked like a charm: I processed the data within 3 hours on the same computers with the same requested resources. For a different recipe, I used the non-working preprocessor, but only on 73 years of data, and it works just fine. The problem seems to be the length of the time series: somewhere between 80 and 120 years the data becomes too much to handle.

I was thinking: if one computes anomalies relative to a reference period far away from the period of interest, maybe the anomalies preprocessor could be re-evaluated so that one doesn't need to process 120 years of data when one only needs 40 of them (20 at the beginning and 20 at the end of the whole period)? Otherwise one chokes the poor computers for nothing.
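
For illustration, the subtraction I do in the diagnostic is essentially the following sketch (file names are placeholders; after area_statistics and annual_statistics each group is a 1-D time series, and the extra anomaly group reduces to a single reference-period mean):

import iris

# Hypothetical preprocessor outputs of the two variable groups.
tx3x = iris.load_cube("tx3x_period_of_interest.nc")    # annual maxima time series
ref = iris.load_cube("tx3x_reference_period_mean.nc")  # time mean over the reference period

# Subtract the scalar reference mean; the cube metadata is preserved.
anomaly = tx3x - float(ref.data)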

valeriupredoi commented 1 year ago

@malininae one quick question before I find more time to look deeper into this - why don't you extract_shape before regridding? That would shrink the data on the xy plane and the finely-gridded regridding will have an easier time

bouweandela commented 10 months ago

> why don't you extract_shape before regridding? That would shrink the data on the xy plane and the finely-gridded regridding will have an easier time

Unfortunately, that wouldn't help because the target grid constructed by the regrid preprocessor always covers the globe completely because it has no way of knowing which bit of it you want.

malininae commented 10 months ago

> why don't you extract_shape before regridding? That would shrink the data on the xy plane and the finely-gridded regridding will have an easier time
>
> Unfortunately, that wouldn't help because the target grid constructed by the regrid preprocessor always covers the globe completely because it has no way of knowing which bit of it you want.

Plus, since I am looking at extremes, the edges are super important for me.

malininae commented 10 months ago

Following the discussion during the November 2023 monthly meeting, here's the monster recipe I was talking about. I'm analyzing high-resolution HighResMIP sub-daily data. I have access to a monster computer with up to 445 GB of memory per process, but only 3 hours per job. What I ended up doing is running each of the variable groups in a separate recipe, since it takes 2 hours 26 minutes and an astonishing 248 GB to run a single variable group. You can see an example log file for one of the groups here. I tried combining the outputs using the --resume_from option, but since the recipes were slightly different, I got an error. I ended up just manually modifying my settings.yml.

To answer some questions upfront: the wind derivation is done lazily; I double-checked this when I created the function. The only option I see for easing the wind derivation is to create separate variables uas_sq = uas**2 and vas_sq = vas**2. Here is the link to the pull request with the branch containing the wind derivation function.
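
The actual derivation lives in the linked pull request; as a minimal sketch of the idea (file names are placeholders), iris cube arithmetic keeps the result lazy as long as the inputs are lazy:

import iris

# Hypothetical 3-hourly input cubes for the wind components.
uas = iris.load_cube("uas_3hr_example.nc")
vas = iris.load_cube("vas_3hr_example.nc")

# Wind speed from the components; the computation stays lazy.
sfcwind = (uas ** 2 + vas ** 2) ** 0.5
print(sfcwind.has_lazy_data())  # expect True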

bouweandela commented 10 months ago

Thanks for sharing, I'll have a look! Could you also upload the shapefile here so I can run the recipe?

malininae commented 10 months ago

> Thanks for sharing, I'll have a look! Could you also upload the shapefile here so I can run the recipe?

Oops, sorry! I couldn't attach it here, so I put it on my Google Drive and set the permissions so that everyone with the link can view it; let me know if you can't access it.

bouweandela commented 9 months ago

> What I ended up doing is running each of the variable groups in a separate recipe, since it takes 2 hours 26 minutes and an astonishing 248 GB to run a single variable group. You can see an example log file for one of the groups here.

I would recommend using the Dask Distributed scheduler to run this recipe and setting max_parallel_tasks: 1 in config-user.yml. That will give you more control over how much memory Dask uses and will likely also make the run faster. Could you give that a try and let me know how it goes?

A bit of background to that advice: I ran the variable group wind_fut from the attached log file on Levante, and the job completed in about 40 minutes. Here are the settings I used:

SLURM job script

#!/bin/bash -l 

#SBATCH --partition=interactive
#SBATCH --time=08:00:00 
#SBATCH --mem=32G

set -eo pipefail 
unset PYTHONPATH 

. /work/bd0854/b381141/mambaforge/etc/profile.d/conda.sh
conda activate esmvaltool

esmvaltool run recipe_extremes_wind_3h.yml

and the following ~/.esmvaltool/dask.yml file:

cluster:
  type: dask_jobqueue.SLURMCluster
  queue: compute
  cores: 128
  memory: 256GiB
  processes: 32
  interface: ib0
  local_directory: /scratch/b/b381141/dask-tmp
  n_workers: 32
  walltime: '8:00:00'
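
For reference, that dask.yml corresponds roughly to the following dask_jobqueue setup (an illustrative sketch rather than the exact code ESMValCore runs):

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# One 128-core / 256 GiB SLURM job split into 32 worker processes
# of 4 threads each, matching the dask.yml above.
cluster = SLURMCluster(
    queue="compute",
    cores=128,
    memory="256GiB",
    processes=32,
    interface="ib0",
    local_directory="/scratch/b/b381141/dask-tmp",
    walltime="8:00:00",
)
cluster.scale(32)  # ask for enough jobs to provide 32 workers (a single job here)
client = Client(cluster)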

Considering that the input data is about 700 GB, that means we are processing roughly 300 MB per second (about 700,000 MB in roughly 2,400 seconds). I profiled the run with py-spy, and it looks like half of the time is spent on data crunching and the other half on loading the iris cubes from file and some other things. Thus, using more workers in the Dask cluster may make it faster, provided that the disk you are loading the data from can keep up. There is probably some room to improve the non-data-crunching parts too, but that will require changes in iris. The profiling result is available here; you can load the file into speedscope.app if you're interested. The 'save' function is where the computation on the Dask cluster happens.