OSGeo / gdal

GDAL is an open source MIT licensed translator library for raster and vector geospatial data formats.
https://gdal.org
Other
4.93k stars 2.56k forks source link

Rasterio / GDALOpenEx performance slowdown in loops #11164

Open EmilRybergAkson opened 3 weeks ago

EmilRybergAkson commented 3 weeks ago

What is the bug?

Using rasterio 1.3.11 compiled against GDAL 3.9.1 and later results in slowdown when creating datasets in a loop. It does not seem to be an issue in 3.9.0 and earlier. I have tried debugging to figure out what the problem is, but I have not been able to pinpoint it exactly, and what i have found has confused me even more...

The problem is line 317 in rasterio _io.pyx in the _delete_dataset_if_exists which is called when a dataset is opened in "w" mode. The call to open_dataset is what slows down, specifically GDALOpenEx in _base.pyx line 219.

In some mock code i wrote, calling GDALOpenEx in both C++ and Python in a loop, the problem does not seem to occur. As far as I can tell there have also not been any changes to GDALOpenEx between 3.9.0 and 3.9.1, but could there be some error handling or other overhead that rasterio uses that is the root problem?

Steps to reproduce the issue

Running the following test code should show the problem:

import rasterio
import numpy as np
import os
import time
import shutil

dest_np = np.random.rand(1000, 1000)
crs = "EPSG:32633"
dest_transform = rasterio.transform.from_origin(0, 0, 1, 1)

save_dir = "test_tiffs"
if os.path.exists(save_dir):
    shutil.rmtree(save_dir)

os.makedirs(save_dir, exist_ok=True)
with rasterio.Env():
    for i in range(1000):
        img_save_path = f"{save_dir}/test_{i}.tif"
        start_time = time.time()
        with rasterio.open(img_save_path, mode="w", driver="GTiff", compress="LZW", dtype=rasterio.float32,
                           crs=rasterio.crs.CRS.from_string(crs), blockysize=256, transform=dest_transform,
                           width=dest_np.shape[1], height=dest_np.shape[0],
                           count=1) as dataset_out:
            dataset_out.write(dest_np, 1)
        end_time = time.time()
        print(f"Time to save: {end_time - start_time}s")

With versions including and after GDAL 3.9.1 it should gradually become slower each iteration. This is not the case with GDAL 3.9.0 and earlier.

Sample from my PC with GDAL 3.9.1:

Time to save: 0.05449843406677246s
Time to save: 0.04797816276550293s
Time to save: 0.04688286781311035s
Time to save: 0.0468904972076416s
Time to save: 0.04593157768249512s
Time to save: 0.04598069190979004s
Time to save: 0.04586148262023926s
Time to save: 0.04588150978088379s
Time to save: 0.04582929611206055s
Time to save: 0.045999765396118164s
Time to save: 0.04484057426452637s
Time to save: 0.04634523391723633s
Time to save: 0.04588055610656738s
Time to save: 0.04595613479614258s
Time to save: 0.04711031913757324s
Time to save: 0.04685401916503906s
Time to save: 0.046930789947509766s
Time to save: 0.04686570167541504s
.....
Time to save: 0.05614948272705078s
Time to save: 0.0571291446685791s
Time to save: 0.05603289604187012s
Time to save: 0.05613136291503906s
Time to save: 0.0571291446685791s
Time to save: 0.058188438415527344s
Time to save: 0.05613589286804199s
Time to save: 0.05608987808227539s
Time to save: 0.056015968322753906s
Time to save: 0.0596466064453125s
Time to save: 0.06095004081726074s
Time to save: 0.05870771408081055s
Time to save: 0.05717587471008301s
Time to save: 0.057085514068603516s
Time to save: 0.057097434997558594s
Time to save: 0.05931401252746582s
......

Versions and provenance

Python 3.12, but problem is also present in other Python versions (tested on 3.9 also) Windows 11. Rasterio version 1.3.11. Tested GDAL versions:

3.7.3 - Works 3.8.5 - Works 3.9.0 - Works 3.9.1 - Slowdown 3.9.3 - Slowdown

Additional context

No response

rouault commented 3 weeks ago

I've tried with rasterio at commit 7d1409318433934660cf98fef028e9097dc2ee45 and GDAL master and can't really a slow down. The timings you provide aren't super conclusive either about a slowdown (on my own machine this is quite noisy) One potential explanation is that in your test script you don't clean files, so they accumulate in the same directory, and when opening a file, the ReadDir() operation might get slower and slower, but this shouldn't be GDAL specific. I can't spot anything related in the GDAL 3.9.0 to 3.9.1 diff You may try setting the GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR environment variable / configuration option or adding a os.unlink(img_save_path) at the end of each iteration

EmilRybergAkson commented 3 weeks ago

Thank you for the quick reply Even.

I retested from same rasterio commit as you, and noticed that 3.8.5 seems to have no slowdown, 3.9.0 have a bit of slowdown actually (starts out around 40ms per iteration and is at 50ms at the end of the 1000 iterations). With 3.9.1 it starts at around 40ms per iteration and at the end is above 120ms per iteration.

Setting GDAL_DISABLE_READDIR_ON_OPEN to TRUE or EMPTY_DIR makes no difference. It is a bit weird if you can't replicate it with the sample script i provided. There is a bit of noise, but especially with 3.9.1 it becomes significantly slower.

It should be noted that I'm using GDAL from conda-forge for all the tested versions, and rasterio from the Github repo.

I debugged / tested by adding timing to the rasterio Cython code for the files and lines i mentioned, which is how i found out that it seems to be the call to GDALOpenEx that is the cause of the slowdown. It seems like the debugger for Cython is a pain in the *** to get working, so I haven't delved deeper.

rouault commented 3 weeks ago

Setting GDAL_DISABLE_READDIR_ON_OPEN to TRUE or EMPTY_DIR makes no difference.

and what about adding a os.unlink(img_save_path) at the end of each iteration ?

EmilRybergAkson commented 3 weeks ago

Same slowdown 🤔

This is the updated test script for reference:

import os
os.environ["GDAL_DISABLE_READDIR_ON_OPEN"] = "TRUE"

import rasterio
import numpy as np

import time
import shutil

dest_np = np.random.rand(1000, 1000)
crs = "EPSG:32633"
dest_transform = rasterio.transform.from_origin(0, 0, 1, 1)

save_dir = "test_tiffs"
if os.path.exists(save_dir):
    shutil.rmtree(save_dir)

os.makedirs(save_dir, exist_ok=True)
with rasterio.Env(GDAL_DISABLE_READDIR_ON_OPEN="TRUE"):
    for i in range(1000):
        img_save_path = f"{save_dir}/test_{i}.tif"
        start_time = time.time()
        with rasterio.open(img_save_path, mode="w", driver="GTiff", compress="LZW", dtype=rasterio.float32,
                           crs=rasterio.crs.CRS.from_string(crs), blockysize=256, transform=dest_transform,
                           width=dest_np.shape[1], height=dest_np.shape[0],
                           count=1) as dataset_out:
            dataset_out.write(dest_np, 1)
        end_time = time.time()
        os.unlink(img_save_path)
        print(f"Time to save: {end_time - start_time}s")

I don't know if it would make sense to create an issue in the rasterio repo? I'm just assuming since the only change between the tests are different GDAL target versions when building rasterio, that the issue could be in GDAL.

FYI this is being tested on Windows 11 PC with Ryzen 5900X and NVME SSD.

rouault commented 3 weeks ago

My tests are done on Linux Ubuntu 20.04

| I don't know if it would make sense to create an issue in the rasterio repo?

Maybe... ?

If you are in a position to build GDAL from source, then doing a "git bisect" session could help a lot spotting the actual change that introduced the issue.

sgillies commented 3 weeks ago

@EmilRybergAkson It won't be useful to open a Rasterio issue. I don't access to a Windows computer and won't be able to reproduce this any more than @rouault can.

jratike80 commented 3 weeks ago

Behaves like this on Windows in the OSGeo4W shell

image

But wouldn't it be better to save into memory instead of physical file and eliminate all the stuff that the operating system is doing with the file system?

EmilRybergAkson commented 3 weeks ago

Behaves like this on Windows in the OSGeo4W shell

image

But wouldn't it be better to save into memory instead of physical file and eliminate all the stuff that the operating system is doing with the file system?

Seems like the same problem I'm seeing. Good idea to plot it 😉

In the real use case it is processing 5k+ images that is later mosaiced into one orthophoto, so storing in RAM is not really feasible. We have had the same code running fine with 3.7.3, but upgrading to 3.9.1+ (3.9.3) we are encountering the issue.

@rouault I just finished testing with compiling GDAL from tag 3.9.1 + 3.9.3, with only Proj as dependency, and using that to compile rasterio. The problem is non-existent with self compiled version of GDAL, so I guess it could be some weird issue with the GDAL versions on conda-forge. I will do some more testing on another Windows PC to see if the problem exists there.

rouault commented 3 weeks ago

Could that possibly be related to the modularized GDAL builds on conda-forge that have started with GDAL 3.9.1 ... ? Do you have libgdal-XXXX conda packages installed ? If so perhaps try to uninstall them in case there would be a culprit among them... ?

jratike80 commented 3 weeks ago

Why I suggested using memory output for debugging is to find out what is the slow part. Now Even and Howard have not seen that issue on Linux but you and me have found it on Windows. Maybe the reason is somehow related to the Windows file system. How to test the hypothesis? Remove the file system from the play and write into memory. Doesn't it feel worth having a try? Do I see or just imagine that there are some steps in the graph? For sure there are some much slower executions, what happens then? If this was Java I would think that they could show some carbage collection checkpoints. Because I cannot program I won't test that myself.

Notice that my graph is from OSGeo4W build, not from conda.

EmilRybergAkson commented 3 weeks ago

Could that possibly be related to the modularized GDAL builds on conda-forge that have started with GDAL 3.9.1 ... ? Do you have libgdal-XXXX conda packages installed ? If so perhaps try to uninstall them in case there would be a culprit among them... ?

I had uninstalled the libgdal-core (only one installed) in the conda environment where i tested with local build of GDAL. In my other conda environment where i tested local rasterio install + conda-forge gdal only has libgdal-core.

I did some profiling with py-spy and discovered that the problem actually seems to be something with either GDALRegister_COG or in tiff.dll. I attached 2 profiling sessions, both looping for 100 iterations. "rasterio_with_gdal_conda_forge.speedscope.json" is the version with slowdown from conda forge. "rasterio_with_gdal_local_build.speedscope.json" is GDAL built locally, and is running fine. Both are using GDAL 3.9.3.

rasterio_with_gdal_conda_forge.speedscope.json rasterio_with_gdal_local_build.speedscope.json

You can open the files with https://www.speedscope.app/ (in browser)

EmilRybergAkson commented 3 weeks ago

We could replicate it on another Windows PC. It exhibits the exact same slowdown. Hopefully the provided profiling files can give you some ideas on what could be the problem. Let me know if there is anything else I can try.

Time per iteration over 1000 iterations:

image

EmilRybergAkson commented 3 weeks ago

Changing the provided minimum reproducible example to save to the same file instead of a new file each iteration, the slowdown does not occur. Is there perhaps something keeping track of each unique file path that is iterated for each dataset open causing the slowdown?

rouault commented 3 weeks ago

I couldn't make sense of the profile file. Can you try disabling drivers with GDAL_SKIP=foo,bar,baz where foo / bar /baz are driver short name to see if one driver might be the culprit ?

The list keeping only the GTiff driver can be obtained with:

python -c "from osgeo import gdal; print(','.join([gdal.GetDriver(i).ShortName for i in range(gdal.GetDriverCount())]).replace('GTiff,',''))"
EmilRybergAkson commented 3 weeks ago

I realised there are stack frames missing from the profiling files, so they are not entirely accurate. I tried compiling rasterio with debugging info, but it does not seem to work with py-spy, so I'm a bit clueless on how to debug it further... Using GDAL_SKIP to only use GTiff driver doesn't fix the problem.

It is a bit hard to debug when i don't have the PDB files for the build on conda-forge, and I can't replicate it with local minimal build of GDAL. But using Visual Studio profiler it also shows calls to tiff.dll that are slow. As far as I can see in the documentation tiff is included in the GDAL repo itself, so why would it call into tiff.dll in conda-forge GDAL and not my local build?

rouault commented 3 weeks ago

As far as I can see in the documentation tiff is included in the GDAL repo itself, so why would it call into tiff.dll in conda-forge GDAL and not my local build?

You need to check in your GDAL's CMakeLists.txt the value of the GDAL_USE_TIFF_INTERNAL variable If GDAL_USE_TIFF_INTERNAL is ON, then the copy in frmts/gtiff/libtiff will be used. If it is OFF, then external tiff.dll will be used.

EmilRybergAkson commented 3 weeks ago

Building using external tiff, using the same tiff.dll that comes from conda-forge build does not reproduce the problem with local GDAL build. Where can i see how gdal and libgdal-core is built for conda-forge? Hopefully if I can replicate my local build to match the conda-forge build, I can reproduce and properly debug the culprit.

rouault commented 3 weeks ago

Where can i see how gdal and libgdal-core is built for conda-forge?

GDAL: https://github.com/conda-forge/gdal-feedstock/blob/main/recipe/bld.bat and https://github.com/conda-forge/gdal-feedstock/blob/main/recipe/meta.yaml libtiff: https://github.com/conda-forge/libtiff-feedstock/blob/main/recipe/bld.bat and https://github.com/conda-forge/libtiff-feedstock/blob/main/recipe/meta.yaml

EmilRybergAkson commented 3 weeks ago

I have compiled GDAL using the same CMake flags in the bld.bat, but it still works fine with my locally built GDAL dll, but if i switch it out for the one from conda-forge the problem is there, i am using the 3rd party DLLs shipped with libcore-gdal from conda-forge to build GDAL locally, to keep that the same. I am building from tag v.3.9.3 FYI, and using libgdal-core 3.9.3 from conda-forge.

Any idea to what could be different yet?