OSGeo / gdal

GDAL is an open source MIT licensed translator library for raster and vector geospatial data formats.
https://gdal.org

Caching issue when reading multiple rasters #10853

Open MTh3399 opened 1 month ago

MTh3399 commented 1 month ago

What is the bug?

When the same rasters are read more than once, memory stays cached and is not released, even after the script ends.

Initially, while working with a large VRT, I noticed that memory usage grew considerably even though I was reading relatively small tiles (1024 px). I managed to reproduce the problem without the VRT by reading the same tile content twice: the first time, memory goes up and then comes back down once I close the dataset; the second time, the data stays in cache and I don't know why.

Steps to reproduce the issue

I've made a simplified script of my use case: for a list of TIFF files, it reads each one, performs some operations, and closes it.

import os
import numpy as np
from osgeo import gdal
from tqdm import tqdm

def colorimetric_global_raster_check_tile_gdal(raster_path: str) -> float:
    """
    Function which checks for a given raster the percentage of padded pixels in the image using GDAL

    Args:
        raster_path (str): Input raster path to analyze

    Returns:
        float: Percentage of padded pixels in the image
    """

    # Open the raster using GDAL
    raster = gdal.Open(raster_path)
    if raster is None:
        raise ValueError(f"Unable to open the raster {raster_path}.")

    amount_pixels_raster = raster.RasterXSize * raster.RasterYSize
    amount_of_white_padded_pixels = 0
    amount_of_black_padded_pixels = 0

    block_size = 1024

    for x in range(0, raster.RasterXSize, block_size):
        for y in range(0, raster.RasterYSize, block_size):

            gray_tile = raster.ReadAsArray(
                x,
                y,
                min(block_size, raster.RasterXSize - x),
                min(block_size, raster.RasterYSize - y),
            )

            # If the VRT/raster has multiple bands, average them to create a grayscale image
            if len(gray_tile.shape) == 3:
                gray_tile = gray_tile.mean(axis=0)

            # Collect amount of white/black pixels
            amount_of_white_padded_pixels += np.sum(
                gray_tile > 245
            )
            amount_of_black_padded_pixels += np.sum(
                gray_tile < 10
            )

            gray_tile = None

    padded_percentage = (
        max(amount_of_black_padded_pixels, amount_of_white_padded_pixels)
        / amount_pixels_raster
    )

    raster.Close()

    return padded_percentage

if __name__ == "__main__":

    pcrs_list = [
        os.path.join("/data/rasters_dummy", tile) for tile in os.listdir("/data/rasters_dummy")
    ]

    res = []

    # FIRST RUN (EVERYTHING FINE)
    for path in tqdm(pcrs_list):
        res.append(colorimetric_global_raster_check_tile_gdal(raster_path=path))

    # SECOND RUN (EVERYTHING IS CACHED)
    for path in tqdm(pcrs_list):
        res.append(colorimetric_global_raster_check_tile_gdal(raster_path=path))
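To distinguish memory held by the process itself from memory counted by machine-level graphs (which may include the kernel page cache), the script's resident set size can be sampled before and after each run. A minimal Linux-only sketch reading `/proc`, using only the standard library:

```python
import os

def current_rss_mb() -> float:
    """Current resident set size of this process in MB (Linux only)."""
    with open("/proc/self/statm") as f:
        # second field of /proc/self/statm is resident pages
        resident_pages = int(f.read().split()[1])
    return resident_pages * os.sysconf("SC_PAGE_SIZE") / (1024 * 1024)

print(f"RSS at start: {current_rss_mb():.1f} MB")
# ... run the first loop over pcrs_list, then print again ...
# ... run the second loop over pcrs_list, then print again ...
```

If RSS is flat after the second run but the system monitor still shows high usage, the memory is being held by the OS page cache, not by the Python process.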

As I said, I was initially working with a fairly large VRT referencing approximately 1000 tiles (4000 px × 4000 px) of ~45 MB each.
I cannot share the data I'm working with, but the issue can be reproduced with dummy data generated by the following script:

import os
import numpy as np
from osgeo import gdal, osr
from tqdm import tqdm  # For progress tracking

# Directory to save the rasters
output_dir = '/data/rasters_dummy'
os.makedirs(output_dir, exist_ok=True)

# Constants
width, height = 4000, 4000  # Dimensions of the rasters (full image size)
block_size = 1024  # Tile block size
num_rasters = 997  # Number of rasters to generate
num_bands = 3  # Number of bands
crs_epsg = 2154  # Coordinate Reference System (EPSG:2154)

# Affine transform parameters (dummy values, adjust as needed)
geotransform = [0, 1, 0, 0, 0, -1]  # Equivalent to rasterio's from_origin(0, 0, 1, 1)

# Generate rasters
for i in tqdm(range(num_rasters), desc="Generating rasters"):
    raster_filename = os.path.join(output_dir, f'raster_{i+1}.tif')

    # Create the driver for GeoTIFF format
    driver = gdal.GetDriverByName('GTiff')

    # Create the raster dataset with the specified width, height, and number of bands
    dataset = driver.Create(
        raster_filename,
        width,
        height,
        num_bands,
        gdal.GDT_Byte,
        options=['TILED=YES', 'BLOCKXSIZE=1024', 'BLOCKYSIZE=1024', 'INTERLEAVE=PIXEL'],
    )

    # Set CRS (Coordinate Reference System)
    srs = osr.SpatialReference()
    srs.ImportFromEPSG(crs_epsg)
    dataset.SetProjection(srs.ExportToWkt())

    # Set the affine transformation (geotransform)
    dataset.SetGeoTransform(geotransform)

    # Create a 3D array filled with zeros (for 3 bands)
    data = np.zeros((num_bands, height, width), dtype='uint8')

    # Write each band
    for band_idx in range(num_bands):
        band = dataset.GetRasterBand(band_idx + 1)
        band.WriteArray(data[band_idx, :, :])

        # Set NoData value
        band.SetNoDataValue(0)

    # Close the dataset (flushes to disk)
    dataset.FlushCache()
    dataset = None

print(f"{num_rasters} rasters created in '{output_dir}' directory.")

Versions and provenance

I'm running my code on an Amazon EC2 instance inside a Docker container. Here is the Dockerfile used to build the image:

FROM continuumio/miniconda3:4.12.0

RUN apt-get --allow-releaseinfo-change update -y && apt-get install -y \
    cmake \
    build-essential \
    pip
RUN apt-get install -y libgl1

RUN conda install -c conda-forge gdal==3.9.2

COPY . /src
WORKDIR /src

RUN pip install -e .

Here are the dependencies I'm using: "geopandas==0.14.3", "pandarallel==1.6.5", "numpy==1.26.4", "pandas==2.2.0", "tqdm==4.66.5"

Additional context

Here is a graph of the memory usage on my machine. The first bump at ~14:52 is the first run; right after, at 14:53:30, comes the second run, which keeps everything in cache, and even when it finishes nothing is released. To free the memory manually I have to rewrite the rasters.

(screenshot: memory usage graph)

rouault commented 1 month ago

@MTh3399 I don't reproduce any RAM consumption increase when running your script locally on my Ubuntu 20.04 machine with 32 GB of RAM using GDAL master. RAM consumption remains stable at 0.4% (slightly below 1 GB) throughout both runs.

rouault commented 1 month ago

@MTh3399 I don't believe much in the following theory, but who knows... It would be good to check whether this might be a RAM fragmentation issue (sometimes observed with multithreaded usage, though I'm not aware of it for single-threaded ones): https://gdal.org/en/latest/user/multithreading.html#ram-fragmentation-and-multi-threading . So basically, if you can find a way to run your process against libtcmalloc, see if that makes a difference. You could also try changing the type of EC2 instance (in particular, trying different distributions and Linux kernel versions).
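A quick way to probe the fragmentation theory without preloading tcmalloc is to ask glibc to return freed arena memory to the OS with `malloc_trim`. This is only a diagnostic, not a fix, and it assumes Linux with glibc; a sketch via `ctypes`:

```python
import ctypes

def trim_glibc_heap() -> int:
    """Ask glibc to release free heap memory back to the OS.

    Returns 1 if some memory was released, 0 otherwise. Linux/glibc
    only; on other platforms loading libc.so.6 will fail.
    """
    libc = ctypes.CDLL("libc.so.6")
    return libc.malloc_trim(0)

# Call this between the two runs: if RSS drops afterwards, the retained
# memory was free-but-unreturned heap (fragmentation), not a GDAL leak.
print("malloc_trim returned:", trim_glibc_heap())
```

The tcmalloc check itself can then be run by preloading the library when launching Python, e.g. via `LD_PRELOAD` pointing at the libtcmalloc shared object (the path is distribution-dependent).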