Open EmilRybergAkson opened 3 weeks ago
I've tried with rasterio at commit 7d1409318433934660cf98fef028e9097dc2ee45 and GDAL master and can't really a slow down. The timings you provide aren't super conclusive either about a slowdown (on my own machine this is quite noisy)
One potential explanation is that in your test script you don't clean files, so they accumulate in the same directory, and when opening a file, the ReadDir() operation might get slower and slower, but this shouldn't be GDAL specific. I can't spot anything related in the GDAL 3.9.0 to 3.9.1 diff
You may try setting the GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR environment variable / configuration option or adding a os.unlink(img_save_path)
at the end of each iteration
Thank you for the quick reply Even.
I retested from same rasterio commit as you, and noticed that 3.8.5 seems to have no slowdown, 3.9.0 have a bit of slowdown actually (starts out around 40ms per iteration and is at 50ms at the end of the 1000 iterations). With 3.9.1 it starts at around 40ms per iteration and at the end is above 120ms per iteration.
Setting GDAL_DISABLE_READDIR_ON_OPEN to TRUE or EMPTY_DIR makes no difference. It is a bit weird if you can't replicate it with the sample script i provided. There is a bit of noise, but especially with 3.9.1 it becomes significantly slower.
It should be noted that I'm using GDAL from conda-forge for all the tested versions, and rasterio from the Github repo.
I debugged / tested by adding timing to the rasterio Cython code for the files and lines i mentioned, which is how i found out that it seems to be the call to GDALOpenEx that is the cause of the slowdown. It seems like the debugger for Cython is a pain in the *** to get working, so I haven't delved deeper.
Setting GDAL_DISABLE_READDIR_ON_OPEN to TRUE or EMPTY_DIR makes no difference.
and what about adding a os.unlink(img_save_path)
at the end of each iteration ?
Same slowdown 🤔
This is the updated test script for reference:
import os
os.environ["GDAL_DISABLE_READDIR_ON_OPEN"] = "TRUE"
import rasterio
import numpy as np
import time
import shutil
dest_np = np.random.rand(1000, 1000)
crs = "EPSG:32633"
dest_transform = rasterio.transform.from_origin(0, 0, 1, 1)
save_dir = "test_tiffs"
if os.path.exists(save_dir):
shutil.rmtree(save_dir)
os.makedirs(save_dir, exist_ok=True)
with rasterio.Env(GDAL_DISABLE_READDIR_ON_OPEN="TRUE"):
for i in range(1000):
img_save_path = f"{save_dir}/test_{i}.tif"
start_time = time.time()
with rasterio.open(img_save_path, mode="w", driver="GTiff", compress="LZW", dtype=rasterio.float32,
crs=rasterio.crs.CRS.from_string(crs), blockysize=256, transform=dest_transform,
width=dest_np.shape[1], height=dest_np.shape[0],
count=1) as dataset_out:
dataset_out.write(dest_np, 1)
end_time = time.time()
os.unlink(img_save_path)
print(f"Time to save: {end_time - start_time}s")
I don't know if it would make sense to create an issue in the rasterio repo? I'm just assuming since the only change between the tests are different GDAL target versions when building rasterio, that the issue could be in GDAL.
FYI this is being tested on Windows 11 PC with Ryzen 5900X and NVME SSD.
My tests are done on Linux Ubuntu 20.04
| I don't know if it would make sense to create an issue in the rasterio repo?
Maybe... ?
If you are in a position to build GDAL from source, then doing a "git bisect" session could help a lot spotting the actual change that introduced the issue.
@EmilRybergAkson It won't be useful to open a Rasterio issue. I don't access to a Windows computer and won't be able to reproduce this any more than @rouault can.
Behaves like this on Windows in the OSGeo4W shell
But wouldn't it be better to save into memory instead of physical file and eliminate all the stuff that the operating system is doing with the file system?
Behaves like this on Windows in the OSGeo4W shell
But wouldn't it be better to save into memory instead of physical file and eliminate all the stuff that the operating system is doing with the file system?
Seems like the same problem I'm seeing. Good idea to plot it 😉
In the real use case it is processing 5k+ images that is later mosaiced into one orthophoto, so storing in RAM is not really feasible. We have had the same code running fine with 3.7.3, but upgrading to 3.9.1+ (3.9.3) we are encountering the issue.
@rouault I just finished testing with compiling GDAL from tag 3.9.1 + 3.9.3, with only Proj as dependency, and using that to compile rasterio. The problem is non-existent with self compiled version of GDAL, so I guess it could be some weird issue with the GDAL versions on conda-forge. I will do some more testing on another Windows PC to see if the problem exists there.
Could that possibly be related to the modularized GDAL builds on conda-forge that have started with GDAL 3.9.1 ... ? Do you have libgdal-XXXX conda packages installed ? If so perhaps try to uninstall them in case there would be a culprit among them... ?
Why I suggested using memory output for debugging is to find out what is the slow part. Now Even and Howard have not seen that issue on Linux but you and me have found it on Windows. Maybe the reason is somehow related to the Windows file system. How to test the hypothesis? Remove the file system from the play and write into memory. Doesn't it feel worth having a try? Do I see or just imagine that there are some steps in the graph? For sure there are some much slower executions, what happens then? If this was Java I would think that they could show some carbage collection checkpoints. Because I cannot program I won't test that myself.
Notice that my graph is from OSGeo4W build, not from conda.
Could that possibly be related to the modularized GDAL builds on conda-forge that have started with GDAL 3.9.1 ... ? Do you have libgdal-XXXX conda packages installed ? If so perhaps try to uninstall them in case there would be a culprit among them... ?
I had uninstalled the libgdal-core (only one installed) in the conda environment where i tested with local build of GDAL. In my other conda environment where i tested local rasterio install + conda-forge gdal only has libgdal-core.
I did some profiling with py-spy
and discovered that the problem actually seems to be something with either GDALRegister_COG or in tiff.dll
. I attached 2 profiling sessions, both looping for 100 iterations. "rasterio_with_gdal_conda_forge.speedscope.json" is the version with slowdown from conda forge. "rasterio_with_gdal_local_build.speedscope.json" is GDAL built locally, and is running fine. Both are using GDAL 3.9.3.
rasterio_with_gdal_conda_forge.speedscope.json rasterio_with_gdal_local_build.speedscope.json
You can open the files with https://www.speedscope.app/ (in browser)
We could replicate it on another Windows PC. It exhibits the exact same slowdown. Hopefully the provided profiling files can give you some ideas on what could be the problem. Let me know if there is anything else I can try.
Time per iteration over 1000 iterations:
Changing the provided minimum reproducible example to save to the same file instead of a new file each iteration, the slowdown does not occur. Is there perhaps something keeping track of each unique file path that is iterated for each dataset open causing the slowdown?
I couldn't make sense of the profile file. Can you try disabling drivers with GDAL_SKIP=foo,bar,baz where foo / bar /baz are driver short name to see if one driver might be the culprit ?
The list keeping only the GTiff driver can be obtained with:
python -c "from osgeo import gdal; print(','.join([gdal.GetDriver(i).ShortName for i in range(gdal.GetDriverCount())]).replace('GTiff,',''))"
I realised there are stack frames missing from the profiling files, so they are not entirely accurate. I tried compiling rasterio with debugging info, but it does not seem to work with py-spy, so I'm a bit clueless on how to debug it further... Using GDAL_SKIP to only use GTiff driver doesn't fix the problem.
It is a bit hard to debug when i don't have the PDB files for the build on conda-forge, and I can't replicate it with local minimal build of GDAL. But using Visual Studio profiler it also shows calls to tiff.dll that are slow. As far as I can see in the documentation tiff is included in the GDAL repo itself, so why would it call into tiff.dll in conda-forge GDAL and not my local build?
As far as I can see in the documentation tiff is included in the GDAL repo itself, so why would it call into tiff.dll in conda-forge GDAL and not my local build?
You need to check in your GDAL's CMakeLists.txt the value of the GDAL_USE_TIFF_INTERNAL variable If GDAL_USE_TIFF_INTERNAL is ON, then the copy in frmts/gtiff/libtiff will be used. If it is OFF, then external tiff.dll will be used.
Building using external tiff, using the same tiff.dll that comes from conda-forge build does not reproduce the problem with local GDAL build. Where can i see how gdal and libgdal-core is built for conda-forge? Hopefully if I can replicate my local build to match the conda-forge build, I can reproduce and properly debug the culprit.
Where can i see how gdal and libgdal-core is built for conda-forge?
GDAL: https://github.com/conda-forge/gdal-feedstock/blob/main/recipe/bld.bat and https://github.com/conda-forge/gdal-feedstock/blob/main/recipe/meta.yaml libtiff: https://github.com/conda-forge/libtiff-feedstock/blob/main/recipe/bld.bat and https://github.com/conda-forge/libtiff-feedstock/blob/main/recipe/meta.yaml
I have compiled GDAL using the same CMake flags in the bld.bat, but it still works fine with my locally built GDAL dll, but if i switch it out for the one from conda-forge the problem is there, i am using the 3rd party DLLs shipped with libcore-gdal from conda-forge to build GDAL locally, to keep that the same. I am building from tag v.3.9.3 FYI, and using libgdal-core 3.9.3 from conda-forge.
Any idea to what could be different yet?
What is the bug?
Using rasterio 1.3.11 compiled against GDAL 3.9.1 and later results in slowdown when creating datasets in a loop. It does not seem to be an issue in 3.9.0 and earlier. I have tried debugging to figure out what the problem is, but I have not been able to pinpoint it exactly, and what i have found has confused me even more...
The problem is line 317 in rasterio
_io.pyx
in the_delete_dataset_if_exists
which is called when a dataset is opened in "w" mode. The call toopen_dataset
is what slows down, specificallyGDALOpenEx
in_base.pyx
line 219.In some mock code i wrote, calling GDALOpenEx in both C++ and Python in a loop, the problem does not seem to occur. As far as I can tell there have also not been any changes to GDALOpenEx between 3.9.0 and 3.9.1, but could there be some error handling or other overhead that rasterio uses that is the root problem?
Steps to reproduce the issue
Running the following test code should show the problem:
With versions including and after GDAL 3.9.1 it should gradually become slower each iteration. This is not the case with GDAL 3.9.0 and earlier.
Sample from my PC with GDAL 3.9.1:
Versions and provenance
Python 3.12, but problem is also present in other Python versions (tested on 3.9 also) Windows 11. Rasterio version 1.3.11. Tested GDAL versions:
3.7.3 - Works 3.8.5 - Works 3.9.0 - Works 3.9.1 - Slowdown 3.9.3 - Slowdown
Additional context
No response