CCI-Tools / cate

ESA CCI Toolbox (Cate)

Coregistration gets stuck using more and more memory #799

Open HelenClifton opened 6 years ago

HelenClifton commented 6 years ago

Expected behavior

Expect to be able to coregister an SST dataset (regionally subsetted) with a global cloud dataset.

Actual behavior

@forman @JanisGailis The coregister operation starts and uses more and more memory. Eventually the laptop freezes. The only way to stop it is to cancel the coregistration operation or end the process using Task Manager. The same happens when the global cloud dataset is the master and the SST dataset is the slave.

Steps to reproduce the problem

  1. Start Windows Task Manager, select the Processes tab, and sort by the Memory column in descending order.

  2. Open the cate GUI and download the following dataset: esacci.SST.day.L4.SSTdepth.multi-sensor.multi-platform.OSTIA.1-1.r1 (Time: 2004-01-01 to 2005-12-31; Region: lat=[-10,10], lon=[-175,-115]; no variable constraints).

  3. Download the following dataset: esacci.CLOUD.mon.L3C.CLD_PRODUCTS.multi-sensor.multi-platform.ATSR2-AATSR.2-0.r1 (Time: 2004-01-01 to 2005-01-01; no regional constraints; no variable constraints).

  4. Select the coregister operation: ds_master = ds_1 (SST), ds_slave = ds_2 (cloud), with method_us and method_ds left at their defaults.

  5. Click "Add Step". Look on Task Manager. python.exe process uses more and more memory.

Note

The SST and cloud datasets are the same ones used by @kjpearson in issue #733. In that case he reported (using cate-2.0.0-dev.16) that the operation completed, but the data for the whole globe in the cloud dataset had been remapped down to the subregion of the SST dataset.

Specifications

cate-2.0.0-dev.20
Windows 7 Professional

HelenClifton commented 6 years ago

@forman @JanisGailis @kjpearson The same problem is seen when coregistering the following cloud datasets (without any regional subsetting):

esacci.CLOUD.mon.L3C.CLD_PRODUCTS.multi-sensor.multi-platform.ATSR2-AATSR.2-0.r1 [2004-01-01, 2005-01-01]
esacci.CLOUD.mon.L3C.CLD_PRODUCTS.AVHRR.multi-platform.AVHRR-PM.2-0.r1 [2004-01-01, 2004-05-01]

JanisGailis commented 5 years ago

@forman I have investigated this. The problem is with the use of the gridtools library. Apparently, dask doesn't work the way we thought it does: you cannot take slices, turn each one into a new NumPy array, and then stitch those results together. dask only does out-of-core processing for computations that are applied to the dask array itself (a minimal sketch of the difference follows the list below).

As it currently stands, it looks like this from xarray land:

  1. There's an xarray dataset using dask as the backend.
  2. A tiny slice of it is loaded into memory as a NumPy array and thrown into a black hole (gridtools).
  3. Out of the black hole comes a new NumPy array.
  4. A new set of xr.DataArrays is constructed, in memory, from these NumPy arrays coming out of the black hole!
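To make the contrast concrete, here is a minimal sketch (not the actual cate or gridtools code; variable names and `remap_block` are hypothetical) of the pattern described above versus a dask-native one:

```python
import numpy as np
import xarray as xr

ds = xr.open_dataset("sst.nc", chunks={"time": 1})  # dask-backed, lazy

# Pattern described above: each slice is materialised as a NumPy array, fed to a
# black-box function (standing in for gridtools), and the results are stitched
# together in memory. dask never sees the computation, so nothing stays out of core.
pieces = []
for t in range(ds.sizes["time"]):
    block = ds["analysed_sst"].isel(time=t).values  # pulls the slice into memory
    pieces.append(remap_block(block))               # hypothetical resampling function
result_in_memory = np.stack(pieces)                 # whole result held in RAM

# dask-native pattern: apply the function to the dask array itself, so blocks are
# computed lazily and only when needed. (remap_block must here accept whatever block
# shape dask hands it, and real coregistration changes the grid shape, which needs
# extra output-size handling not shown in this sketch.)
result_lazy = xr.apply_ufunc(
    remap_block,
    ds["analysed_sst"],
    dask="parallelized",
    output_dtypes=[ds["analysed_sst"].dtype],
)
```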

There are two possible solutions I have come up with:

  1. Rewrite coregistration without relying on gridtools, i.e., use native xarray and dask capabilities. xarray now has resampling implemented that does nearest-neighbour and bilinear resampling, which could be used for upsampling. For downsampling, aggregated rolling operations across dimensions can be used to do tricky things; I've implemented a preliminary non-weighted mean downsampler with this (a sketch of the idea follows this list). There's an exploratory branch jg-799-coreg-memhog.
     1.1. xarray's built-in resampling doesn't know how to work across dask chunks. E.g., you cannot upsample a large subset (or an unlucky subset) of a finely grained dataset, such as SST, using this method.
     1.2. Handling NaN values is tricky, as many np operations meant for working with masked arrays call np.copy, which would result in an 'in-memory' dataset again.
     1.3. Re-implementing all the functionality we currently get from gridtools won't be fast, and in some cases will be impossible.

  2. Use a dirty hack: make np.memmap the underlying array structure for coregistered datasets (https://stackoverflow.com/questions/44733067/do-xarray-or-dask-really-support-memory-mapping). I haven't tried this yet, but it 'could' work.
     2.1. Some time in the future, changes in xarray could easily break the undocumented features needed for this to work.
     2.2. We would have to implement additional temp-file handling. What happens when we save the workflow? Do we save the temp file too? What happens when we save the dataset to a netCDF? Do we get rid of the temp file? Where do we put the temp file across platforms, etc.
     2.3. Something might not work, or might work in an unexpected way, due to using an undocumented feature.
     2.4. The coregistered dataset will take space on disk, so coregistering a dataset that spans a long time period onto a very fine grid will silently eat away the available disk space.
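For illustration, a minimal sketch of what option 1 could look like using xarray's built-in coarsening. This is only an illustration of the idea, not the code in the jg-799-coreg-memhog branch; file names and the downsampling factor are hypothetical:

```python
import xarray as xr

ds = xr.open_dataset("cloud.nc", chunks={"time": 1})  # dask-backed

# Non-weighted mean downsampling by an integer factor along lat/lon.
# The reduction is applied lazily to the dask chunks rather than in memory.
factor = 4  # hypothetical downsampling factor
ds_coarse = ds.coarsen(lat=factor, lon=factor, boundary="trim").mean()

# mean() skips NaNs for float data by default, which matches a simple
# non-weighted mean; weighted or area-aware schemes would need more work.
ds_coarse.to_netcdf("cloud_downsampled.nc")  # computation happens here
```

And a rough sketch of the np.memmap idea from option 2 (untried, as noted above; whether xarray keeps the data memory-mapped through subsequent operations is exactly the undocumented part):

```python
import os
import tempfile
import numpy as np
import xarray as xr

# Back the coregistered result with a memory-mapped temp file instead of RAM.
shape = (24, 720, 1440)  # hypothetical (time, lat, lon) of the coregistered grid
path = os.path.join(tempfile.gettempdir(), "cate_coreg_tmp.dat")
backing = np.memmap(path, dtype="float32", mode="w+", shape=shape)

# ... fill `backing` slice by slice during coregistration ...

da = xr.DataArray(backing, dims=("time", "lat", "lon"))
# The open questions from 2.1-2.4 still apply: temp-file lifetime, saving to
# netCDF, cross-platform temp locations, and disk usage all need handling.
```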

In either case, a definitive fix will not be trivial and will require significant effort.

forman commented 5 years ago

See also https://github.com/pydata/xarray/issues/486