Openslide python speed?

slobodaapl commented 2 years ago

Hey there! Recently found your openslide modification to use in Python, and I'm stoked there's something other than python bioformats. But I noticed that for reading regions (I need to do a sliding window so I can't avoid it), it's significantly slower than python bioformats, which surprised me.

Do you have any tips on how to make large-region reading faster with your Python library, or an idea of what may be causing the slowness? I simply opened the file with the Python context manager, as the brief documentation in your repo describes, and loaded several successive regions within that context, each about 4000x4000, but it was significantly slower than bioformats.

Nico-Curti commented 2 years ago

Hi @slobodaapl! I'm still working on the updated version of openslide, so there might be some performance issues... However, I/O performance depends strongly on the WSI format you are using: in my fork I contributed only to the Olympus (.vsi, .ets, .ome-tiff) formats. A known weak point, already pointed out by other authors, is the decompression of the JPEG2K format: it is particularly slow in openslide and very hard to improve.

As for the Python wrapper of the library, I'm currently working on several points that could improve performance:

- The wrapper aims to mimic the original C functions as closely as possible, but a good improvement could come from wrapping several steps into a single C function (called from the Cython module), minimizing data transfer between Python and C.
- Inside the read_region function I use NumPy arrays to store and manipulate the memory buffers, but I haven't yet tested where copies are made instead of views.
- A computationally efficient Python wrapper should build the NumPy arrays through NumPy's C API and not just through a naïve Cython definition.

However, regarding the problem you describe: if your goal is to run a sliding window, I suggest you rethink your implementation. I/O is always particularly slow and is the bottleneck of many applications, so a better approach is to minimize the number of reads and perform all the manipulations on memory buffers. The patch sizes you describe are very large, so I don't know whether this is a valid option for you, but since my openslide version integrates with NumPy, I suggest you take a look at NumPy's as_strided function: by reading one larger region of the image you can easily mimic a sliding-window approach with it, without making copies in memory.
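A minimal sketch of that suggestion, assuming the larger region has already been read into a single (height, width, channels) array. The helper name `sliding_windows` and the synthetic `region` array are illustrative only and not part of the library:

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided


def sliding_windows(image, window, step):
    """Return a read-only view of shape (ny, nx, window, window, channels)."""
    h, w, c = image.shape
    sy, sx, sc = image.strides
    ny = (h - window) // step + 1  # number of window positions along rows
    nx = (w - window) // step + 1  # number of window positions along columns
    return as_strided(
        image,
        shape=(ny, nx, window, window, c),
        strides=(sy * step, sx * step, sy, sx, sc),
        writeable=False,  # windows overlap, so writing through the view is unsafe
    )


# Synthetic stand-in for one large region obtained from a single read.
region = np.arange(8 * 8 * 3, dtype=np.uint8).reshape(8, 8, 3)
patches = sliding_windows(region, window=4, step=2)
print(patches.shape)
```

Each `patches[i, j]` is a view onto `region`, so no pixel data is copied until a patch is explicitly copied or modified downstream. `as_strided` does no bounds checking, which is why the shape and strides are derived from the array itself here.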

Let me know if I can help you further!

P.S. For issues or PRs about the library, please open an issue/PR on the library repo: that way our conversation can also be useful to other users who run into the same problems. Thanks!

slobodaapl commented 2 years ago

Thanks for your thorough reply!

To be more specific about my problem: I'm opening '.vsi' files, which happen to be one of the slower formats. My application is deep learning, and given the amount of data I have, converting the files is infeasible.

I will definitely look into the use of NumPy's as_strided function, it looks very promising, as I'm currently inefficiently making copies of array slices.

Since your library version is parallelizable, I'll also attempt to make use of that, which is a massive benefit over python-bioformats, which runs through a Java bridge.
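One way such parallel reading could be organized is sketched below. `read_tile` is a hypothetical stand-in that returns synthetic data; in a real pipeline each worker would open its own slide handle and call read_region there. Whether threads or processes give a speed-up depends on whether the wrapper releases the GIL during reads, which is not established in this thread.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def read_tile(x, y, size):
    # Hypothetical stand-in for a region read; returns synthetic pixels.
    return np.full((size, size, 3), (x // size + y // size) % 256, dtype=np.uint8)


def tile_origins(width, height, size):
    """Top-left corners of non-overlapping tiles that fit fully inside the image."""
    return [
        (x, y)
        for y in range(0, height - size + 1, size)
        for x in range(0, width - size + 1, size)
    ]


origins = tile_origins(width=1024, height=512, size=256)
with ThreadPoolExecutor(max_workers=4) as pool:
    # pool.map preserves the order of `origins` in its results.
    tiles = list(pool.map(lambda o: read_tile(o[0], o[1], 256), origins))
print(len(tiles), tiles[0].shape)
```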

Cheers and looking forward to future updates to your library!

Nico-Curti commented 2 years ago

Let me know if you do some upgrades on the VSI format support! I'm planning to commit a relatively big update of the library asap.

slobodaapl commented 2 years ago

@Nico-Curti I've been working on it some and I do believe I found a bug with the library. I went to your repository for openslide, however, the Issues tab isn't available there, so I'm posting here as a reply instead.

I'm able to use most functionality of your openslide version just fine, however, I can't seem to save it as a numpy array representation:

import numpy as np
from openslide import Openslide, BRIGHTFIELD
filearray = Openslide('data/1_HE.vsi', dtype=BRIGHTFIELD)
filearray = np.array(filearray, copy=False)

Produces the error:

Traceback (most recent call last):
  File "debug.py", line 16, in <module>
    filearray = np.array(filearray, copy=False)
  File "openslide.pyx", line 279, in openslide.Openslide.__array__
  File "openslide.pyx", line 475, in openslide.Openslide.read_region
  File "openslide.pyx", line 409, in openslide.Openslide._read_brightfield_region
AttributeError: 'openslide.Openslide' object has no attribute 'get_level_dimensions'

The same happens if we attempt to take a slice, which seems even slower than read_region, and ultimately produces the same error:

print(filearray[0:64, 0:64, 0].shape)

I'm also unsure how the buffering is handled for massive images (30+ GB), and whether interpreting it as a NumPy array would make it attempt to load the image into memory or whether it remains only a buffer that reads from the file only under certain conditions, same as with the as_strided function you mentioned.

Nico-Curti commented 2 years ago

Hi @slobodaapl! Thanks for the bug report. I noticed that I left an old version of the code inside a check in the read_region function (ref. here). However, this means that when you try to read the image you are falling into that if-branch, i.e. the read itself has failed.

It is hard for me to reproduce this error, since all the tests I have run on my .vsi files seem to work fine. Please check that the directory tree belonging to your .vsi file is correct. In particular, for a file called data/1_HE.vsi you must have the following directory tree:

data/1_HE.vsi
|__    _1_HE_
  |___    stack1
    |___   frame_t.ets 
    |___   frame_t1.ets
  |___    stack1000
  |___    stackN

Note also that the current version of the library supports only the first .ets (or .tif) file found in the first directory, i.e. the frame_t.ets file in the case above! I'm working on it...
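A quick way to check that layout from Python is sketched below. It assumes the `_1_HE_` folder sits next to the .vsi file and that the data files match `stack*/frame_t*.ets`; the helper name `find_ets_files` is illustrative and not part of the library:

```python
import tempfile
from pathlib import Path


def find_ets_files(vsi_path):
    """List the .ets files expected to accompany a .vsi file (assumed layout)."""
    vsi = Path(vsi_path)
    data_dir = vsi.parent / f"_{vsi.stem}_"  # e.g. data/_1_HE_ for data/1_HE.vsi
    return sorted(data_dir.glob("stack*/frame_t*.ets"))


# Build a throwaway copy of the layout to demonstrate the check.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "1_HE.vsi").touch()
    stack = root / "_1_HE_" / "stack1"
    stack.mkdir(parents=True)
    (stack / "frame_t.ets").touch()
    found = find_ets_files(root / "1_HE.vsi")
    print([p.name for p in found])
```

An empty list for a real file would suggest that the companion folder is missing or named differently than assumed here.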

slobodaapl commented 2 years ago

Hey @Nico-Curti, thanks for your reply! I see what you mean, and it's surprising to me that there's a NULL read on the data, because my directory layout seems to be exactly as you describe, or more specifically:

data/1_HE.vsi
|__    _1_HE_
  |___    stack1
    |___   frame_t.ets

All my .vsi files only have one stack associated with them with the exact structure, and interestingly enough, I am able to change the internal resolution and save the downscaled image to a file using opencv just fine. It seems just that direct manipulations of the Openslide() class instance via array operations fail with the error described above.

Interestingly, I also found that .shape in your library gives different width and height dimensions than python-bioformats for the same .vsi file at the same level. I verified the dimensions, and openslide does seem to overestimate the bounds of the .vsi image. If we read a region near the very edge of the reported bounds, in other words a region that shouldn't exist, we are indeed given a zero array:

from openslide import Openslide, BRIGHTFIELD

with Openslide('data/1_HE.vsi', dtype=BRIGHTFIELD) as slide:
    print(slide.shape)
    print(slide.header)
    print(slide.read_region(103000, 105000, 0, 400, 400)[0:5, 0:5, 0])

This prints:

(105472, 103424)

{'COMMENT': None, 'VENDOR': 'olympus', 'QUICKHASH1': 'e9f1377f545caaa085a4b78e47518a0263ce4c49ba20818071434b15c9e1ace7', 'BACKGROUND_COLOR': 'FFFFFF', 'OBJECTIVE_POWER': None, 'MPP_X': '352.77777777777777', 'MPP_Y': '352.77777777777777', 'BOUNDS_X': '0', 'BOUNDS_Y': '0', 'BOUNDS_WIDTH': '105472', 'BOUNDS_HEIGHT': '103424'}

[[0 0 0 0 0]  
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]]
# This shouldn't be possible in my images, this is definitely incorrect

It seems openslide doesn't read the OME-XML data correctly and comes up with incorrect image dimensions.

I wouldn't be opposed to sending you an example of the .vsi file data I'm using so I can help you track down the issue, if it's not on my side. Just let me know if you're interested and I'll arrange to send you a link to your university email (or where you prefer).

Nico-Curti commented 2 years ago

I know that openslide tends to overestimate the image dimensions. This happens because the real level dimensions are not an exact multiple of the patch size stored in the OME-XML header (typically 512). I intentionally set the level shape to an "overestimated" dimension so that cairo and all the other (original) openslide APIs behave correctly.
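The arithmetic behind that overestimate can be sketched as rounding each dimension up to the next multiple of the patch size. The helper `padded_size` and the value 105000 are hypothetical; only the reported shape (105472, 103424) and the typical patch size of 512 come from this thread:

```python
def padded_size(real_size, tile=512):
    """Round real_size up to the next multiple of tile (ceiling division)."""
    return -(-real_size // tile) * tile


# Both reported dimensions are exact multiples of 512, as padding would produce.
print(105472 % 512, 103424 % 512)

# A hypothetical true width of 105000 pixels would be reported as 105472.
print(padded_size(105000))
```

Under this reading, pixels between the true size and the padded size belong to no real tile, which would be consistent with the zero-filled region shown earlier.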

NumPy support for the Openslide object is provided by the special method __array__, which lets np.asarray convert the object into a NumPy array. If that doesn't work, you can take a look at that method.
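How the __array__ hook works in general can be illustrated with a toy class. `LazyImage` below is not the library's class, and whether the real Openslide object loads the whole image at this point is not established here; the sketch only shows that NumPy calls the hook and receives whatever array it returns:

```python
import numpy as np


class LazyImage:
    """Toy stand-in (not the library's class) for an object exposing __array__."""

    def __init__(self, height, width):
        self.shape = (height, width)
        self.reads = 0  # counts how often NumPy asked for the data

    def __array__(self, dtype=None, copy=None):
        # NumPy calls this hook to obtain a real ndarray. In this toy class the
        # full image is materialized in memory at that moment.
        self.reads += 1
        data = np.zeros(self.shape, dtype=np.uint8)
        return data if dtype is None else data.astype(dtype)


img = LazyImage(4, 6)
arr = np.asarray(img)
print(type(arr).__name__, arr.shape, img.reads)
```

If a hook is written like this one, converting a 30+ GB slide would allocate the full image, so for very large files it is safer to read bounded regions explicitly and convert only those.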

Nico-Curti commented 2 years ago

Hi @slobodaapl! I have just updated the library, fixing most of the issues you pointed out. The Cython wrapper should now work properly, as should the dimensions of the low-resolution levels of the image. You can now also pass the .ets or .tiff file directly to the Openslide object, avoiding the use of the general .vsi file.

Let me know if there is something else that doesn't work in your application.

slobodaapl commented 2 years ago

Thanks for the update! I will go test it out and use it in my code and give you feedback. Excited for it :)

Nico-Curti / openslide
