ClimateImpactLab / downscaleCMIP6

Downscaling & bias correction of CMIP6 tasmin, tasmax, and pr for the R/CIL GDPCIR project
MIT License

large EC-Earth data fails at CMIP6 cleaning due to OOM error #574

Closed emileten closed 2 years ago

emileten commented 2 years ago

Blocks progress on #263 and #266.

Workflow : https://argo.cildc6.org/archived-workflows/default/28a83ec8-998f-4d61-96f1-73f57387d3e7

See the standardize_gcm step in the cleaning stage. Every retry failed with an OOM error.

I took the input of one of these failed pods and reproduced the OOM on JupyterHub with a 48 GB server, which is the resource limit specified for this pod in our argo workflow.

In standardize_gcm we load the data into memory. EC-Earth3 pr is 256 × 512 × time, so it has a higher resolution than other models, but that is still only ~16 GB for the future data. The problem, I think, is that some operations in standardize_gcm make the memory usage blow up.
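
The ~16 GB figure can be checked with a back-of-the-envelope calculation. The sketch below assumes daily float32 data on a standard calendar covering 2015 through 2100; none of these details are stated above, so treat the result as an order of magnitude only.

```python
from datetime import date

# Assumptions (not stated in the thread): daily data, float32 values,
# standard calendar, future period 2015-01-01 through 2100-12-31.
N_LAT, N_LON = 256, 512
BYTES_PER_VALUE = 4  # float32; float64 would double the estimate

# Number of daily time steps in the assumed future period.
n_days = (date(2101, 1, 1) - date(2015, 1, 1)).days

# Size of a single in-memory copy of the pr array.
n_bytes = N_LAT * N_LON * n_days * BYTES_PER_VALUE
size_gb = n_bytes / 1e9

print(f"{n_days} daily steps, about {size_gb:.1f} GB for one in-memory copy")
```

Under these assumptions, three or four arrays of this size alive at the same time would already be at or beyond the 48 GB limit, which would be consistent with operations that return full-size copies.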

Another model of this family, with a lower resolution, that is running here https://argo.cildc6.org/workflows/default/e2e-ec-earth3-veg-lr-pr-t8stn?tab=workflow&nodeId=e2e-ec-earth3-veg-lr-pr-t8stn-1904431442, nearly crashed at the same steps for the same reason, but survived thanks to retries.

emileten commented 2 years ago

An additional detail. In standardize_cmip6, the two culprits are :

  1. The precip unit conversion : ds_cleaned['pr'] * 24 * 60 * 60
  2. xclim_remove_leapdays(ds_cleaned)
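
To make the two operations concrete, here is a minimal plain-Python stand-in on a toy one-year series. It is not the dodola or xclim implementation: the helper names `to_mm_per_day` and `remove_leap_days` are hypothetical, and the input flux is assumed to be in kg m-2 s-1, the usual CMIP6 unit for pr. Note that both helpers return new objects instead of modifying their input, so input and output exist side by side. On a full in-memory array, that pattern would be one plausible source of the extra memory.

```python
from datetime import date, timedelta

SECONDS_PER_DAY = 24 * 60 * 60


def to_mm_per_day(pr_kg_m2_s):
    """Convert a precipitation flux (kg m-2 s-1) to mm/day, returning a new list."""
    return [v * SECONDS_PER_DAY for v in pr_kg_m2_s]


def remove_leap_days(dates, values):
    """Drop every 29 February, returning new (dates, values) lists."""
    kept = [
        (d, v)
        for d, v in zip(dates, values)
        if not (d.month == 2 and d.day == 29)
    ]
    return [d for d, _ in kept], [v for _, v in kept]


# Toy series: every day of the leap year 2016 with a constant flux.
dates = [date(2016, 1, 1) + timedelta(days=i) for i in range(366)]
pr = [1e-5] * len(dates)

pr_mm = to_mm_per_day(pr)
noleap_dates, noleap_pr = remove_leap_days(dates, pr_mm)
print(len(dates), "->", len(noleap_dates), "days")
```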

If we're willing to spend time on this, I see only one acceptable option: split the standardize_cmip6 step so that argo works on a few spatial chunks. We'd also avoid changing anything in dodola. standardize_cmip6 is spatially independent, so that would be fine.
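
As a rough illustration of the spatial split, the sketch below cuts the 256 latitude rows into contiguous slices, each of which could be handed to a separate cleaning task. The helper `latitude_slices` and the choice of four chunks along latitude are assumptions for illustration, not part of the existing workflow.

```python
def latitude_slices(n_lat, n_chunks):
    """Split range(n_lat) into n_chunks contiguous, near-equal slices."""
    if not 1 <= n_chunks <= n_lat:
        raise ValueError("n_chunks must be between 1 and n_lat")
    base, extra = divmod(n_lat, n_chunks)
    slices, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks get one additional row each.
        stop = start + base + (1 if i < extra else 0)
        slices.append(slice(start, stop))
        start = stop
    return slices


# Hypothetical split: four chunks along the 256-row latitude axis.
chunks = latitude_slices(n_lat=256, n_chunks=4)
for s in chunks:
    print(f"lat[{s.start}:{s.stop}] -> one independent cleaning task")
```

With four chunks, each task would hold roughly a quarter of the full array, and the results could be concatenated along latitude afterwards because the step does not mix grid cells.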

Two other options, increasing the resource limits or restructuring standardize_cmip6 in dodola, won't work. The former fails because the problem is too severe. So does the latter, which would also require a lot of rewriting.

Note that fixing this issue would allow us to include data from 4 models from this consortium.

[Edit : updated some information and clarified]

emileten commented 2 years ago

Oh ok. I think I understand better what happened here.

The only step requiring the absence of temporal chunks is the 360-day calendar conversion, though. Therefore, I suggest we move the data loading to that specific place in the code. It's a very easy change and it fixes the backward compatibility broken by that PR. The only downside is that it introduces chunking concerns in dodola.core. We already have some there, though...

emileten commented 2 years ago

As I expected, two additional EC-Earth models failed because of this (EC-Earth3-AerChem and EC-Earth3-CC).