Open MTh3399 opened 1 month ago
@MTh3399 I can't reproduce any RAM consumption increase when running your script locally on my Ubuntu 20.04 machine with 32 GB of RAM using GDAL master. RAM consumption remains stable at 0.4 % (slightly below 1GB) the whole time, for both runs.
@MTh3399 I don't believe much in the following theory, but who knows... It would be good to check whether this might be a RAM fragmentation issue (which is sometimes observed in multithreaded usage, but I'm not aware of it in single-threaded usage): https://gdal.org/en/latest/user/multithreading.html#ram-fragmentation-and-multi-threading . So basically, if you can find a way to run your process against libtcmalloc, see if that makes a difference. You could also check whether changing the type of EC2 instance makes a difference (in particular, trying different distributions and Linux kernel versions).
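To compare runs with and without libtcmalloc, it may help to log the resident set size of the process before and after each run. The following is only a minimal sketch, assuming a Linux host that exposes `/proc/<pid>/status`; the helper names `rss_kib` and `current_rss_kib` are hypothetical and not part of GDAL.

```python
import os


def rss_kib(status_text):
    """Extract the VmRSS value (in KiB) from /proc/<pid>/status-style text."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # Line looks like: "VmRSS:     123456 kB"
            return int(line.split()[1])
    return None  # field not present (e.g. kernel thread or non-Linux text)


def current_rss_kib(pid="self"):
    """Return the resident set size of a process in KiB, or None if unavailable."""
    path = f"/proc/{pid}/status"
    if not os.path.exists(path):  # assumption: Linux procfs is mounted
        return None
    with open(path) as f:
        return rss_kib(f.read())
```

Calling `current_rss_kib()` before the first run, between the two runs, and after the datasets are closed gives numbers that can be compared across allocators and instance types.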
What is the bug?
When reading multiple rasters multiple times, the memory stays in cache even after the script ends.
I was initially working with a large VRT and noticed that memory usage increased a lot while reading relatively small tiles (1024px). I tried to reproduce the problem without the VRT, and I managed to do so by reading the same tile content twice. The first time, memory goes up, then back down once I close the dataset. The second time, the data stays cached in memory and I don't know why.
Steps to reproduce the issue
I've made a simplified script of my use case which, for a list of tif files, reads them, performs some operations and closes them.
As I said, I was initially working with a fairly large VRT linking approximately 1000 tiles (4000px,4000px) of ~45 MB.
I cannot share the data I'm working with, but the issue can be reproduced with dummy data generated by the following script:
Versions and provenance
I'm running my code on an Amazon instance inside a Docker container. Here is the Dockerfile used to build the image I'm using:
Here are the dependencies that I'm using: "geopandas==0.14.3", "pandarallel==1.6.5", "numpy==1.26.4", "pandas==2.2.0", "tqdm==4.66.5"
Additional context
Here is a graph of the memory usage on my machine. The first bump, at ~14:52, is the first run. Right after it, at 14:53:30, comes the second run, which keeps everything in cache; even once it has finished, nothing is released. To free the memory manually, I have to rewrite the rasters.
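Because the memory stays in use after the script has ended, one thing that could be checked is whether it is accounted as kernel page cache rather than as process memory. This is an assumption to verify, not a diagnosis. A minimal sketch, assuming a Linux host with `/proc/meminfo`; `parse_meminfo` and `cache_summary` are hypothetical helper names:

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB for most keys)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "key: value" shape
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def cache_summary(info):
    """Return page-cache and available memory in MiB (missing keys count as 0)."""
    return {
        "cached_mib": info.get("Cached", 0) / 1024,
        "available_mib": info.get("MemAvailable", 0) / 1024,
    }


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        print(cache_summary(parse_meminfo(f.read())))
```

Running this before the first run, after the second run, and after rewriting the rasters would show whether the `Cached` figure follows the same pattern as the graph.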