corteva / rioxarray

geospatial xarray extension powered by rasterio
https://corteva.github.io/rioxarray

Memory leak when looping through data variables of a dataset loaded from a VRT #774

Open amaissen opened 2 months ago

amaissen commented 2 months ago

Code Sample, a copy-pastable example if possible

A "Minimal, Complete and Verifiable Example" will make it much easier for maintainers to help you: http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

import rioxarray as rxr
import xarray as xr
import gc

PATH = "path_to_multi_band_vrt.vrt"

def memory_leak():
  raster = rxr.open_rasterio(PATH, band_as_variable=True, chunks={"x": -1, "y": -1})
  bands = list(raster.data_vars)

  for band in bands:
    data = raster[band].copy(deep=True).load()

    del data
    gc.collect()

Problem description

The allocated memory increases after each iteration, even though the loaded band is deleted and garbage collection is forced.
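
For reference, a minimal sketch (not part of the original report) of one way to observe the growth, assuming psutil is available to read the process's resident set size:

import gc

import psutil
import rioxarray as rxr

PATH = "path_to_multi_band_vrt.vrt"

process = psutil.Process()  # current Python process
raster = rxr.open_rasterio(PATH, band_as_variable=True, chunks={"x": -1, "y": -1})

for band in raster.data_vars:
    data = raster[band].copy(deep=True).load()
    del data
    gc.collect()
    # RSS keeps growing between iterations when the leak is present
    print(band, round(process.memory_info().rss / 1e6), "MB")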

Expected Output

The memory is released after each iteration, so one can process multi-band datasets that do not fit in memory.

Environment Information

rioxarray (0.15.5) deps:
  rasterio: 1.3.10
    xarray: 2024.3.0
      GDAL: 3.8.4
      GEOS: 3.11.1
      PROJ: 9.3.1
 PROJ DATA: /opt/conda/envs/some-env/share/proj
 GDAL DATA: /opt/conda/envs/some-env/share/gdal

Other python deps:
     scipy: 1.13.0
    pyproj: 3.6.1

System:
    python: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]
executable: /opt/conda/envs/some-env/bin/python
   machine: Linux-5.15.0-101-generic-x86_64-with-glibc2.35

Conda environment information (if you installed with conda):


Environment (conda list):

```
gdal       3.8.5     py310h3b926b6_2  conda-forge
libgdal    3.8.5     hf9625ee_2       conda-forge
rasterio   1.3.10    pypi_0           pypi
rioxarray  0.15.5    pypi_0           pypi
xarray     2024.3.0  pypi_0           pypi
```
snowman2 commented 2 months ago

Add these kwargs to open_rasterio to disable caching:

lock=False,  # disable internal caching
cache=False,  # don't keep data loaded in memory. pull from disk every time
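
Applied to the reproduction above, this would look like (a sketch for reference, not from the original comment; the follow-up below reports it did not resolve the leak):

import gc

import rioxarray as rxr

PATH = "path_to_multi_band_vrt.vrt"

raster = rxr.open_rasterio(
    PATH,
    band_as_variable=True,
    chunks={"x": -1, "y": -1},
    lock=False,   # disable internal caching
    cache=False,  # don't keep data loaded in memory. pull from disk every time
)

for band in raster.data_vars:
    data = raster[band].copy(deep=True).load()
    del data
    gc.collect()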
amaissen commented 2 months ago

@snowman2 , thanks for pointing out these options. I tried the options you suggested, but they did not help release the memory.

However, when I write the entire raster to Zarr storage with to_zarr() and load it back with raster = xr.open_zarr(...), I don't see any memory leak when iterating through the data variables. This would look like:

import rioxarray as rxr
import xarray as xr
import gc

PATH = "path_to_multi_band_vrt.vrt"
some_temp_dataset = "path_to_temp_store.zarr"  # placeholder path for the temporary Zarr store

def no_memory_leak():
  # Read from VRT and save to zarr (one chunk per band)
  rxr.open_rasterio(PATH, band_as_variable=True, chunks={"x": -1, "y": -1}).to_zarr(some_temp_dataset)

  # Open zarr and iterate over data vars.
  raster = xr.open_zarr(some_temp_dataset, chunks={"x": -1, "y": -1})
  bands = list(raster.data_vars)

  for band in bands:
    data = raster[band].copy(deep=True).load()

    del data
    gc.collect()
GregoryPetrochenkov-NOAA commented 3 weeks ago

I have experienced similar issues with memory leaks. I ran an experiment with rioxarray loading GeoTIFFs and, for comparison, with xarray loading NetCDF files.

The tests use two GeoTIFF files (for rioxarray) and two NetCDF files (for xarray), opened as candidate/benchmark pairs.

I ran each operation 5 times and memory-profiled it with memray as follows (run in a Jupyter notebook):

import time
import subprocess
import os
import gc
from functools import partial

import memray
import rioxarray as rxr
import xarray as xr

%load_ext memray

def run_test(func):
    """
    Driver to run memory accumulation test by running 5 times simulating a batch process
    """

    for x in range(5):
        func()

    time.sleep(1)

I also set up these arguments in advance:

xarray_kwargs = {
    "cand_file": "./subsample_benchmark_mean.nc",
    "bench_file": "./subsample_candidate_mean.nc",
    "cache": False,
    "lock": False
}

rio_kwargs = {
    "cand_file": "./c_uint8.tif",
    "bench_file": "./b_uint8.tif",
    "cache": False,
    "lock": False
}

The first snippet shows rioxarray loading the two GeoTIFFs in context managers, which automatically call the close method when finished:

%%memray_flamegraph --temporal

def run_context_load(cand_file, bench_file, cache=False, lock=False):
    """
    Loads with context wrappers
    """

    with (rxr.open_rasterio(cand_file, cache=cache, lock=lock) as ds,
            rxr.open_rasterio(bench_file, cache=cache, lock=lock) as ds2):

        # Pure load
        ds.load()
        ds2.load()

run_test(partial(run_context_load, **rio_kwargs))

[memray temporal flamegraph: rioxarray GeoTIFF runs — allocated memory accumulates across the five iterations and is never released]

As can be seen, the memory is never released. In practice, over larger numbers of iterations this leads to very high memory consumption as the allocations accumulate.

The same method is done with xarray and NetCDF files as follows:

%%memray_flamegraph --temporal

def run_context_load_xr(cand_file, bench_file, cache=False, lock=False):
    """
    Loads with context wrappers
    """

    with (xr.open_dataset(cand_file, cache=cache, lock=lock) as ds,
             xr.open_dataset(bench_file, cache=cache, lock=lock) as ds2):

        # Pure load
        ds.load()
        ds2.load()

run_test(partial(run_context_load_xr, **xarray_kwargs))

[memray temporal flamegraph: xarray NetCDF runs — memory is released by the end of each iteration]

As can be seen, all memory is released by the end of the operation, as expected.

I tried changing the cache and lock arguments to no avail; I could not get rioxarray to behave similarly. The only way to fully release the memory is to explicitly delete the objects and run garbage collection:

%%memray_flamegraph --temporal

def run_context_load_delete_gc(cand_file, bench_file, cache=False, lock=False):
    """
    Loads with context wrappers and deletes objects
    """

    with (rxr.open_rasterio(cand_file, cache=cache, lock=lock) as ds,
            rxr.open_rasterio(bench_file, cache=cache, lock=lock) as ds2):

        # Pure load
        ds.load()
        ds2.load()

    del ds, ds2
    gc.collect()

run_test(partial(run_context_load_delete_gc, **rio_kwargs))

[memray temporal flamegraph: rioxarray GeoTIFF runs with explicit del and gc.collect() — memory is released]

While this does work, it is not a clean solution and would require prescribing that users do the same. I would suggest relabeling this issue as a bug, because it takes extra work for a user to diagnose; a user would not expect this behavior when loading GeoTIFFs with rioxarray. rioxarray is a dependency of a package I am supporting, and the memory accumulation caused issues for workflows, as can be seen in this memory profiling example:

[memory profile from the downstream workflow showing the accumulation]

Also, I apologize for not being able to simply paste the data, as would be preferable, but if you like I can provide my notebooks and the data, which total about 100 MB.

snowman2 commented 2 weeks ago

The GDAL cache settings may be worth looking into: https://gdal.org/user/configoptions.html#performance-and-caching
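
For example, a sketch (untested against this leak) of capping GDAL's raster block cache by passing the GDAL_CACHEMAX config option through rasterio's environment:

import gc

import rasterio
import rioxarray as rxr

PATH = "path_to_multi_band_vrt.vrt"

# GDAL_CACHEMAX limits GDAL's raster block cache; small integer values are
# interpreted as megabytes. Whether this changes the behaviour reported above is untested.
with rasterio.Env(GDAL_CACHEMAX=64):
    raster = rxr.open_rasterio(PATH, band_as_variable=True, chunks={"x": -1, "y": -1})
    for band in raster.data_vars:
        data = raster[band].copy(deep=True).load()
        del data
        gc.collect()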