ArcticSnow / TopoPyScale

TopoPyScale: a Python library to perform simplistic climate downscaling at the hillslope scale
https://topopyscale.readthedocs.io
MIT License

Memory blowout on large domains #67

Closed · joelfiddes closed this issue 1 year ago

joelfiddes commented 1 year ago

I'm still getting the memory blowout even with the IO:split option turned off and for a single year. This is admittedly a big domain, but one I ran with no problem several versions ago (pre-implementation of IO:split). What confuses me is that if split is turned off then it should behave the same as the older version, no? Yet something seems to be fundamentally different in memory use: it blows through 15 GB of memory in about 15 s! So something is scaling up pretty fast......

ArcticSnow commented 1 year ago

Oh no, sorry. Can you confirm that the version prior to commit e30c720693fc489ff21960cff7490f950ef0a23a does not show the same behavior? Using split or not does not influence the use of multithreading and multicore. Are you able to pinpoint at which step in the code you see the leak? FYI, I am currently running a project of 4000 clusters, split into 5-year chunks for a total of 70 years, on a server. It has been running smoothly over multiple days. I am quite puzzled by this problem.

How big is the DEM file? I may try with a large DEM too, so that I can reproduce the bug.

joelfiddes commented 1 year ago

I've gone back to release 0.1.7 and it is running fine now. As before, I think this is not related to cluster number or timeseries length but to ERA5 domain size. I don't think it depends on DEM size: in the last use case where we saw this (Naryn, KG), it was a big ERA5 domain with a small DEM, and once I cropped the ERA5 data to the DEM it was fine. Here the ERA5 domain is 21 x 17 grid cells and the DEM is 625 x 555 (500 m cells). Are you specifically interested in the version prior to commit https://github.com/ArcticSnow/TopoPyScale/commit/e30c720693fc489ff21960cff7490f950ef0a23a ?
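For context, "cropping ERA5 to the DEM" just means spatially subsetting each climate file to the DEM bounding box before downscaling. The history attribute of the files dumped further down shows this was done with cdo sellonlatbox; a rough xarray equivalent (file name and bounds are made up for illustration) would be:

import xarray as xr

# Hypothetical DEM bounding box in degrees, padded so the interpolation
# still has ERA5 grid cells around the DEM edges.
lon_min, lon_max, lat_min, lat_max = 72.7, 78.7, 40.0, 42.6
pad = 0.5

ds = xr.open_dataset('inputs/climate/PLEV_199909.nc')  # hypothetical file name

# ERA5 latitude is stored in descending order, hence the reversed slice.
ds_crop = ds.sel(longitude=slice(lon_min - pad, lon_max + pad),
                 latitude=slice(lat_max + pad, lat_min - pad))
ds_crop.to_netcdf('inputs/climate/PLEV_199909_cropped.nc')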

ArcticSnow commented 1 year ago

That commit is prior to the merge with the parallelization branch. I wonder if it has to do with the change from opening the ERA5 data with open_mfdataset() to an xr.concat() of a list of datasets opened from the file names.

ealonsogzl commented 1 year ago

Hey guys, I was reading this thread. There are some memory leaks reported here and there with netCDF at the C level, which the Python garbage collector cannot handle. I personally found one some time ago in MFDataset() that forced me to open the files with a loop instead. Try to keep the netCDF dependencies you are using up to date; maybe that helps.
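For illustration, the loop-based workaround looks roughly like this (file and variable names are hypothetical); the point is to open each file individually and close it explicitly so the C-level handle is released, rather than relying on MFDataset():

import netCDF4 as nc

files = ['PLEV_199909.nc', 'PLEV_199910.nc']  # hypothetical file names
temps = []
for path in files:
    ds = nc.Dataset(path)                # open one file at a time instead of nc.MFDataset(files)
    temps.append(ds.variables['t'][:])   # copy only the data you need into memory
    ds.close()                           # release the underlying C-level file handle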

This may be relevant https://github.com/pydata/xarray/issues/3200

joelfiddes commented 1 year ago

I think the issue is here:

def _open_dataset_climate(flist):

    ds__list = []
    for file in flist:
        ds__list.append(xr.open_dataset(file))

where all the ERA5 files are opened and appended to a list. In my case, with a large domain, this blows up the memory. How can we make this scale?
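One way to keep this loop but stop it from loading everything into memory up front would be to open the files lazily with dask chunks. A minimal sketch of that idea (the chunk size is arbitrary, and this is not necessarily compatible with the parallelized code path):

import xarray as xr

def _open_dataset_climate(flist):
    # With chunks= set, xarray wraps the variables in dask arrays instead of
    # reading everything into memory when the file is opened.
    ds__list = []
    for file in flist:
        ds__list.append(xr.open_dataset(file, chunks={'time': 720}))

    # Concatenating lazy datasets stays lazy; data is only read when computed.
    ds_ = xr.concat(ds__list, dim='time')
    return ds_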

joelfiddes commented 1 year ago

The code above is at l.149 of topo_scale.py.

joelfiddes commented 1 year ago

I think this is why it occurs whether or not you specify the time split option: in both cases downscale_climate() is the same and contains the function above.

Basically what you already said above, Simon, I think: https://github.com/ArcticSnow/TopoPyScale/issues/67#issuecomment-1480739691

joelfiddes commented 1 year ago

@ArcticSnow can we go back to using open_mfdataset() or is that not working with the new split timeseries code?

joelfiddes commented 1 year ago

New split timeseries code:

In [19]: flist = flist_PLEV

In [20]: ds__list = []
    ...: for file in flist:
    ...:     ds__list.append(xr.open_dataset(file))
    ...:
    ...: ds_ = xr.concat(ds__list, dim='time')

In [21]: ds_
Out[21]: 
<xarray.Dataset>
Dimensions:    (time: 1464, longitude: 24, latitude: 11, level: 8)
Coordinates:
  * time       (time) datetime64[ns] 1999-09-01 ... 1999-10-31T23:00:00
  * longitude  (longitude) float32 72.9 73.15 73.4 73.65 ... 78.15 78.4 78.65
  * latitude   (latitude) float32 42.55 42.3 42.05 41.8 ... 40.55 40.3 40.05
  * level      (level) float64 300.0 500.0 600.0 700.0 800.0 850.0 900.0 1e+03
Data variables:
    z          (time, level, latitude, longitude) float32 9.28e+04 ... 2.27e+03
    t          (time, level, latitude, longitude) float32 235.7 235.7 ... 286.5
    u          (time, level, latitude, longitude) float32 19.76 19.63 ... 0.6399
    v          (time, level, latitude, longitude) float32 11.27 11.9 ... -0.6594
    r          (time, level, latitude, longitude) float32 37.86 51.65 ... 63.77
    q          (time, level, latitude, longitude) float32 0.0001289 ... 0.003548
Attributes:
    CDI:          Climate Data Interface version 1.9.9rc1 (https://mpimet.mpg...
    Conventions:  CF-1.6
    history:      Thu Mar 09 22:37:48 2023: cdo sellonlatbox,72.6960777169421...
    CDO:          Climate Data Operators version 1.9.9rc1 (https://mpimet.mpg...

Original open_mfdataset:

In [24]: ds_plev = xr.open_mfdataset(project_directory + 'inputs/climate/PLEV*.nc', parallel=True)

In [25]: ds_plev
Out[25]: 
<xarray.Dataset>
Dimensions:    (time: 1464, longitude: 24, latitude: 11, level: 8)
Coordinates:
  * time       (time) datetime64[ns] 1999-09-01 ... 1999-10-31T23:00:00
  * longitude  (longitude) float32 72.9 73.15 73.4 73.65 ... 78.15 78.4 78.65
  * latitude   (latitude) float32 42.55 42.3 42.05 41.8 ... 40.55 40.3 40.05
  * level      (level) float64 300.0 500.0 600.0 700.0 800.0 850.0 900.0 1e+03
Data variables:
    z          (time, level, latitude, longitude) float32 dask.array<chunksize=(720, 8, 11, 24), meta=np.ndarray>
    t          (time, level, latitude, longitude) float32 dask.array<chunksize=(720, 8, 11, 24), meta=np.ndarray>
    u          (time, level, latitude, longitude) float32 dask.array<chunksize=(720, 8, 11, 24), meta=np.ndarray>
    v          (time, level, latitude, longitude) float32 dask.array<chunksize=(720, 8, 11, 24), meta=np.ndarray>
    r          (time, level, latitude, longitude) float32 dask.array<chunksize=(720, 8, 11, 24), meta=np.ndarray>
    q          (time, level, latitude, longitude) float32 dask.array<chunksize=(720, 8, 11, 24), meta=np.ndarray>
Attributes:
    CDI:          Climate Data Interface version 1.9.9rc1 (https://mpimet.mpg...
    Conventions:  CF-1.6
    history:      Thu Mar 09 22:37:48 2023: cdo sellonlatbox,72.6960777169421...
    CDO:          Climate Data Operators version 1.9.9rc1 (https://mpimet.mpg...

The results seem to be identical, so a change back should work?
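A quick sanity check on that claim could be (a sketch, reusing the ds_ and ds_plev objects from above):

import xarray as xr

# Compares dimensions, coordinates and values (attributes are ignored); the
# dask-backed variables from open_mfdataset are computed on the fly for the comparison.
xr.testing.assert_equal(ds_, ds_plev)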

joelfiddes commented 1 year ago

Another small point: why is the time subset done for SURF but not for PLEV? Before, it was done for both:

[screenshot]
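The actual line is in the screenshot, but for PLEV the fix would presumably be the same kind of time selection already applied to SURF, something along these lines (the time window variable names are assumptions):

# apply the same requested time window to the pressure-level data as to the surface data
ds_plev = ds_plev.sel(time=slice(start_date, end_date))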

joelfiddes commented 1 year ago

Edits:

[screenshot]

and

[screenshot]

These seem to work so far.

ArcticSnow commented 1 year ago

I had changed open_mfdataset() to the other method because open_mfdataset() did not work with the parallelization system I ended up using. So we could have both options, then. Can you check whether, when using your edit, you can still spread the computational load across multiple cores?
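One possible shape for that option (just a sketch; the flag name and default are assumptions, not the final implementation):

import xarray as xr

def _open_dataset_climate(flist, method='mfdataset'):
    # method='mfdataset': lazy, dask-backed open, memory-friendly for large ERA5 domains.
    # method='concat':    eager per-file open + concat, as used by the parallelized code path.
    if method == 'mfdataset':
        return xr.open_mfdataset(flist, parallel=True)
    ds__list = [xr.open_dataset(file) for file in flist]
    return xr.concat(ds__list, dim='time')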

Good catch on line 189. No idea why that happened; strange.

ArcticSnow commented 1 year ago

OK, I tested with the edits above, and now the downscaling does not parallelize anymore. I'll include it again and add an option, then.

joelfiddes commented 1 year ago

I'm getting as many point jobs launched as the number of cores specified (6), and a shedload of processes launched during downscaling. Do you mean you just have a single process running?

[screenshot]

ArcticSnow commented 1 year ago

Never mind, I had my config file set to one core. Sorry, it works great!

Should we close the topic then?