Closed · slobodaapl closed this issue 2 years ago
Hi @slobodaapl! I'm still working on the updated version of openslide, so there might be some computational issues... However, the I/O performance strongly depends on the WSI format that you are using: in my forked version I contributed only to the Olympus (.vsi, .ets, .ome-tiff) format. One criticality, already pointed out by other authors, is the decompression of the JPEG2K format: it is particularly slow in openslide and very hard to improve.
About the Python wrap of the library, there are several aspects I'm currently working on which could improve the computational performance.
However, regarding the problem that you describe, if your goal is to run a sliding window, I suggest you rethink your implementation: I/O steps are always particularly slow and represent the bottleneck of many applications; a better approach would be to minimize the I/O and perform all manipulations in memory buffers. The patch sizes that you describe are very large, so I don't know if this is a valid use case for you, but since my openslide version provides integration with numpy, I suggest you take a look at numpy's as_strided function: by reading a larger region of the image you can easily mimic a sliding window (avoiding copies in memory) using this function.
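To illustrate the idea, here is a minimal sketch (not library code; the region and window sizes are made up) of building a zero-copy sliding window over an already-read buffer with as_strided:

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

# Stand-in for a large region already read into memory in one I/O call
region = np.arange(16, dtype=np.uint8).reshape(4, 4)

win, step = 2, 1  # window size and stride, both hypothetical
h, w = region.shape
shape = ((h - win) // step + 1, (w - win) // step + 1, win, win)
strides = (region.strides[0] * step, region.strides[1] * step,
           *region.strides)

# windows[i, j] is the win x win patch at offset (i*step, j*step);
# no pixel data is copied, every window is a view into `region`
windows = as_strided(region, shape=shape, strides=strides)
```

Note that as_strided views must be treated as read-only unless you know the overlaps are safe; for most cases numpy's sliding_window_view offers the same result with fewer footguns.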
Let me know if I can help you further!
P.s. for issues or PRs about the library, please open an issue/PR on the library repo: that way our conversation could also be useful for other users who run into the same problems. Thanks!
Thanks for your thorough reply!
To be more specific about my problem: I'm opening '.vsi' files, which happen to be one of the slower formats, my application is deep learning, and due to the amount of data I have, conversion is unfeasible.
I will definitely look into NumPy's as_strided function; it looks very promising, as I'm currently making inefficient copies of array slices.
Since your library version is parallelizable, I'll also attempt to make use of that, which is a massive benefit over python-bioformats, which runs through a Java bridge.
Cheers and looking forward to future updates to your library!
Let me know if you do some upgrades on the VSI format support! I'm planning to commit a relatively big update of the library asap.
@Nico-Curti I've been working on it some and I do believe I found a bug with the library. I went to your repository for openslide, however, the Issues tab isn't available there, so I'm posting here as a reply instead.
I'm able to use most functionality of your openslide version just fine; however, I can't seem to convert it to a numpy array representation:
from openslide import Openslide, BRIGHTFIELD
import numpy as np

filearray = Openslide('data/1_HE.vsi', dtype=BRIGHTFIELD)
filearray = np.array(filearray, copy=False)
Produces the error:
Traceback (most recent call last):
  File "debug.py", line 16, in <module>
    filearray = np.array(filearray, copy=False)
  File "openslide.pyx", line 279, in openslide.Openslide.__array__
  File "openslide.pyx", line 475, in openslide.Openslide.read_region
  File "openslide.pyx", line 409, in openslide.Openslide._read_brightfield_region
AttributeError: 'openslide.Openslide' object has no attribute 'get_level_dimensions'
The same happens if we attempt to take a slice, which seems even slower than read_region, and ultimately produces the same error:
print(filearray[0:64, 0:64, 0].shape)
I'm also unsure how the buffering is handled for massive images (30+ GB): would interpreting it as a NumPy array make it attempt to load the whole image into memory, or does it remain a buffer that reads from the file only under certain conditions, the same as with the as_strided function you mentioned?
Hi @slobodaapl! Thanks for the bug report. I noticed that I left an old version of the code inside a check in the read_region function (ref. here).
However, this means that when you try to read the image, you are falling into that if-branch, i.e. your attempt to read fails.
It is hard for me to reproduce this error, since all the tests that I have done on my .vsi files seem to work fine.
Please check the directory tree related to your .vsi file. In particular, for a file called data/1_HE.vsi you must have the following directory tree:
data/
|__ 1_HE.vsi
|__ _1_HE_/
    |__ stack1/
    |   |__ frame_t.ets
    |   |__ frame_t1.ets
    |__ stack1000/
    |__ stackN/
Note also that the current version of the library supports managing only the first .ets (or .tif) file found in the first directory, i.e. the frame_t.ets file in the case above! I'm working on it...
Hey @Nico-Curti, thanks for your reply! I see what you mean, and it's surprising to me that there's a NULL read on the data, because my layout is exactly as you describe, or more specifically:
data/
|__ 1_HE.vsi
|__ _1_HE_/
    |__ stack1/
        |__ frame_t.ets
All my .vsi files have only one stack associated with them, with exactly this structure, and interestingly enough, I am able to change the internal resolution and save the downscaled image to a file using opencv just fine. It seems that only direct manipulations of the Openslide() class instance via array operations fail with the error described above.
I also found that, interestingly, your library's .shape gives different width and height dimensions than python-bioformats for the same .vsi file at the same level. I verified the dimensions and indeed, openslide seems to over-estimate the bounds of the .vsi image. If we attempt to read a region near the very edge of the file, in other words a region that shouldn't exist, we are indeed given a zero array:
from openslide import Openslide, BRIGHTFIELD

with Openslide('data/1_HE.vsi', dtype=BRIGHTFIELD) as slide:
    print(slide.shape)
    print(slide.header)
    print(slide.read_region(103000, 105000, 0, 400, 400)[0:5, 0:5, 0])
This prints:
(105472, 103424)
{'COMMENT': None, 'VENDOR': 'olympus', 'QUICKHASH1': 'e9f1377f545caaa085a4b78e47518a0263ce4c49ba20818071434b15c9e1ace7', 'BACKGROUND_COLOR': 'FFFFFF', 'OBJECTIVE_POWER': None, 'MPP_X': '352.77777777777777', 'MPP_Y': '352.77777777777777', 'BOUNDS_X': '0', 'BOUNDS_Y': '0', 'BOUNDS_WIDTH': '105472', 'BOUNDS_HEIGHT': '103424'}
[[0 0 0 0 0]
[0 0 0 0 0]
[0 0 0 0 0]
[0 0 0 0 0]
[0 0 0 0 0]]
# This shouldn't be possible in my images, this is definitely incorrect
It seems openslide doesn't read the OME XML data correctly and comes up with incorrect image dimensions.
I wouldn't be opposed to sending you an example of the .vsi file data I'm using to help you track down the issue, if it's not on my side. Just let me know if you're interested and I'll arrange to send a link to your university email (or wherever you prefer).
I know that openslide tends to overestimate the image dimensions; this is due to an imperfect division between the patch size stored in the OME XML header (typically 512) and the real level dimensions. I intentionally set the level shape to an "overestimated" dimension to allow the correct behavior of cairo and all the other (original) openslide APIs.
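As a sanity check on this explanation, the dimensions printed from the header above (105472 x 103424) are exact multiples of 512, consistent with each level being rounded up to a whole number of tiles. A sketch of that rounding (an assumed mechanism inferred from the description, not the library's code):

```python
import math

TILE = 512  # patch size typically stored in the OME XML header

def padded(dim, tile=TILE):
    """Round a level dimension up to a whole number of tiles."""
    return math.ceil(dim / tile) * tile

# 105472 == 206 * 512 and 103424 == 202 * 512, so any true width in
# (105472 - 511, 105472] would be reported as 105472; the pixels past
# the real edge come back as zeros, matching the zero array above.
```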
The numpy support for the Openslide object is provided via the special method __array__, which allows the object to be re-interpreted by np.asarray. If it doesn't work, you can take a look at that function.
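For reference, the __array__ protocol works like this (a minimal standalone sketch with a fake class, not the actual Openslide implementation):

```python
import numpy as np

class FakeSlide:
    """Toy stand-in: any object defining __array__ can be consumed
    by np.asarray / np.array, which call this method to get the data."""
    def __array__(self, dtype=None):
        data = np.zeros((4, 4, 3), dtype=np.uint8)  # pretend pixel data
        return data if dtype is None else data.astype(dtype)

arr = np.asarray(FakeSlide())  # numpy invokes FakeSlide.__array__
```

In the real library the method would presumably call read_region internally, which is why the traceback above passes through __array__ before failing.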
Hi @slobodaapl! I have just updated the library, fixing most of the issues you pointed out. Now the Cython wrap should work properly, as should the dimensions of the low-resolution levels of the image.
Now you can also provide the .ets or .tiff file directly to the Openslide object, avoiding the use of the general .vsi file.
Let me know if there is something else that doesn't work in your application.
Thanks for the update! I will go test it out and use it in my code and give you feedback. Excited for it :)
Hey there! I recently found your openslide modification to use in Python, and I'm stoked there's something other than python-bioformats. But I noticed that for reading regions (I need to do a sliding window, so I can't avoid it), it's significantly slower than python-bioformats, which surprised me.
Do you have any tips on how to make large region reading faster with your Python library, or on what may be causing it to be slow? I simply opened the file using the Python context manager, as the brief documentation in your repo describes, and attempted to load several successive regions within that context, all of them about 4000x4000, though with significantly less speed than bioformats.