Booritas / slideio

BSD 3-Clause "New" or "Revised" License

Initial memory consumption #12

Closed ChristianMarzahl closed 1 year ago

ChristianMarzahl commented 1 year ago

Dear @Booritas,

Thank you very much for this helpful repository.

I have a question: why does the initial memory consumption for SVS images range from 1 MB to over 100 MB per image?

Environment:

Python 3.8.16
pip 22.3.1
slideio 2.0.2
Ubuntu 20.04

Code to reproduce the behaviour.

pip install -U memory_profiler

import slideio
from memory_profiler import profile

@profile(precision=4)
def image_a_svs(path):
    image = slideio.open_slide(str(path), driver="SVS")
    return image

@profile(precision=4)
def image_b_svs(path):
    image = slideio.open_slide(str(path), driver="SVS")
    return image

a = image_a_svs("***_a.svs")
b = image_b_svs("***_b.svs")

Results:

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    69 156.2344 MiB 156.2344 MiB           1   @profile(precision=4)
    70                                         def image_a_svs(path):
    71 255.5977 MiB  99.3633 MiB           1       image = slideio.open_slide(str(path), driver="SVS")
    72 255.5977 MiB   0.0000 MiB           1       return image

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    74 255.5977 MiB 255.5977 MiB           1   @profile(precision=4)
    75                                         def image_b_svs(path):
    76 258.0898 MiB   2.4922 MiB           1       image = slideio.open_slide(str(path), driver="SVS")
    77 258.0898 MiB   0.0000 MiB           1       return image

Both images are roughly the same size and contain three levels, but opening image A adds 99.3633 MiB to the memory footprint while opening image B adds only 2.4922 MiB. Do you know what causes this and how to avoid loading 100 MB into memory?

While 100 MB for one image is fine, loading hundreds of them for training deep learning models quickly adds up. The SCN driver shows similar behaviour, but the GDAL driver for TIFF images does not.

I really appreciate any help you can provide.

With kind regards, Christian

Booritas commented 1 year ago

Hi Christian,

Thank you for your message. I will investigate the behavior and let you know about the results.

I have a question about your workflow. Do you open 100 different images at the same time? The library should release the memory as soon as an object (slide, scene) is released. What is a normal scenario in your workflow? Do you have 1 slide and multiple scenes, or multiple slides and multiple scenes at the same time? It will help me to understand what problem you encounter.

I will check the initial memory consumption anyway and let you know.

Best regards, Stanislav

ChristianMarzahl commented 1 year ago

Dear Stanislav,

Thank you very much for your quick reply.

Our workflow is to keep the WSI objects in a dict. This speeds up tile access significantly because the initial load of a WSI is quite slow. We normally have one scene per slide.

An oversimplified example would look like the following:

import slideio
from tqdm import tqdm

slide_cache = {}

# files: iterable of slide paths
for file in tqdm(files):
    slide_cache[file] = slideio.open_slide(str(file), driver="SCN")

So during training, we can quickly access tiles like:

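# dict views are not subscriptable in Python 3, so convert to a list first
random_slide = list(slide_cache.keys())[random_number]
slide = slide_cache[random_slide]
slideio_scene = slide.get_scene(0)
# rect is (x, y, width, height)
tile = slideio_scene.read_block(rect=(x, y, width, height))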

I hope this helps. If you have more questions or want some example files, please let me know.

With kind regards, Christian
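
One way to keep the speed benefit of cached slide handles while bounding memory is to cap how many slides stay open at once. The sketch below is a minimal least-recently-used cache. `BoundedSlideCache`, `opener`, and `max_open` are hypothetical names that are not part of slideio, and the sketch assumes that dropping the last reference to a slide lets the library release its memory, as stated earlier in this thread.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable


class BoundedSlideCache:
    """Keep at most `max_open` opened slides; evict the least recently used."""

    def __init__(self, opener: Callable[[Any], Any], max_open: int = 32):
        if max_open < 1:
            raise ValueError("max_open must be at least 1")
        self._opener = opener      # hypothetical: callable that opens one slide
        self._max_open = max_open  # hypothetical: cap on simultaneously open slides
        self._slides: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, path):
        # Cache hit: mark the slide as most recently used and return it.
        if path in self._slides:
            self._slides.move_to_end(path)
            return self._slides[path]
        # Cache miss: open the slide, then evict the oldest entry if over the cap.
        slide = self._opener(path)
        self._slides[path] = slide
        if len(self._slides) > self._max_open:
            self._slides.popitem(last=False)
        return slide

    def __len__(self):
        return len(self._slides)
```

With slideio, the opener could be `lambda p: slideio.open_slide(str(p), driver="SCN")`, and a training loop would call `cache.get(file).get_scene(0)` instead of indexing a plain dict. The trade-off is that an evicted slide pays the slow initial load again the next time it is requested.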

Booritas commented 1 year ago

Dear Christian,

Thanks for your clarification. I did some profiling. The amount of allocated memory depends on the image. According to the profiling, most of the memory is allocated for the TIFF structures of the images. Currently, the library opens the file multiple times: for the main image and for the auxiliary images (thumbnails, macro, labels, etc.). This is done to avoid concurrent access to the same file structures from different scenes.

I'm optimizing this behavior. Postponing the opening of the auxiliary images until they are requested has already reduced memory consumption. I want to investigate further and try to reduce the memory footprint more. I expect to publish a new version with reduced memory consumption at the end of this week or the beginning of next week. If you want to test the new version (current state), you can download wheel files from here.

Best regards, Stanislav
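
The idea of postponing the opening of auxiliary images can be illustrated with a small lazy-loading sketch. This is not slideio's actual implementation (the library is written in C++). `LazySlide` and `open_image` are hypothetical names used only to show the pattern: the main image is opened immediately, while the thumbnail and label are opened on first access and then reused.

```python
from functools import cached_property


class LazySlide:
    """Toy stand-in for a slide: main image opened eagerly, auxiliaries on demand."""

    def __init__(self, path, open_image):
        self._path = path
        # Hypothetical callable: (path, image_name) -> opened image handle.
        self._open_image = open_image
        # The main image is always needed, so it is opened right away.
        self.main = open_image(path, "main")

    @cached_property
    def thumbnail(self):
        # Opened on first access only; the result is cached on the instance.
        return self._open_image(self._path, "thumbnail")

    @cached_property
    def label(self):
        # Same pattern for the label image.
        return self._open_image(self._path, "label")
```

A slide whose thumbnail or label is never requested does not pay for opening those images, which matches the reduction described above. `functools.cached_property` requires Python 3.8 or later.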

ChristianMarzahl commented 1 year ago

Dear Stanislav,

Thank you very much for your reply and for looking into it. I will check it out tomorrow.

With kind regards, Christian

Booritas commented 1 year ago

Hi Christian, I just published a new version, 2.0.4. It contains the fix described in the previous message. As I wrote, the memory consumption depends on the image. Most of the memory is allocated by the libtiff library; SlideIO adds just a few kB. I hope the fix will help. Please let me know if you have any problems with the library. If you like the library, please consider giving the repository a star. Best regards and thanks for your help! Stanislav

ChristianMarzahl commented 1 year ago

Dear Stanislav,

Thank you very much for your excellent work. It is interesting that the libtiff library shows such different memory consumption for images that are roughly the same size.

The repository star for your lib is more than deserved.

With kind regards, Christian