Reading time differs for compressed and uncompressed documents

jurekkow commented 10 months ago

Describe the bug There is a significant difference in ROI reading speed, depending on the method of compression.

To Reproduce The following Python script:

import timeit

from pylibCZIrw import czi

file_paths = ["uncompressed.czi", "jpeg-xr-compressed.czi", "zstd-compressed.czi"]

roi_x = -126748
roi_y = 46095
z = 0
t = 0
c = 0
scene = 0

num_read = 1

for file_path in file_paths:
    reader = czi.CziReader(file_path)
    roi = (roi_x, roi_y, 2000, 2000)
    read_time = timeit.timeit(
        lambda: reader.read(roi=roi, plane={"T": t, "Z": z, "C": c}, scene=scene),
        number=num_read,
        globals=globals(),
    )
    print(f"{file_path}: {read_time}s")

Outputs:

uncompressed.czi: 0.01224821200594306s
jpeg-xr-compressed.czi: 0.46801461999712046s
zstd-compressed.czi: 0.051155578999896534s

Expected behavior Reading times of compressed and uncompressed documents don't differ, or differ less than 2x.

Desktop (please complete the following information):

OS: Ubuntu 20.04.4 LTS
Python Version: 3.9.11
pyLibCZI Version: 3.5.1

Additional context The original, JPEGXR-compressed document, was decompressed and ZSTD-compressed again using czicompress tool.

Test document will be delivered separately.

ptahmose commented 10 months ago

Hi @jurekkow,

Thank you for the detailed report on the ROI reading speed issue. I've taken some time to analyze the situation based on the numbers you've provided and have some insights to share.

I found that decompressing a single tile (2040x2040 pixels, 16-bit grayscale, which amounts to approximately 8MB) takes around 45ms; so we're looking at a decompression throughput of about 200MB/s. In the scenario you've described, there are actually 9 tiles which must be read and decompressed in order to compose the requested result. The tiling structure and the requested rectangle are illustrated in the following diagram: So - the timing you gave for the JPGXR-case (~500ms) seems reasonable and is explicable by the above findings.

In contrast, with the uncompressed CZI, the datarate we can see is around ~2000MB/s assuming an SSD-storage (and potentially even higher if the file resides in cache.).

Taking these observations into account, the results you're encountering appear to be in line with "expected system behavior". At this juncture, there doesn't seem to be a malfunction or bug in the process.

I'd have little hope that the performance of the JPGXR-decoding (or, zstd-decoding for that matter) can be improved here. I compared the decoding speed of libCZI's bundled jxr-codec and Windows-WIC-JPGXR-codec, and they came out about the same. Note that (on Windows) libCZI can work with both codecs (e.g. with CZICmd, the option --jpgxrcodec allows to choose between them).

However - surely there are options to improve performance of operation in this case.

The first idea is about "reduce the number of tiles which are to be read/decoded" here. This is about - in our case, the query ROI intersects with 9 tiles, but it might be that some tile is completely covered by other tiles (which are layered on top). So, some tiles may not be visible at all, and we could spare reading/decoding them. I prototyped this approach here - however, again in this case, it would only result in a reduction from 9 tiles to 8 tiles. Nevertheless - since it is rather easy to integrate and there are no drawbacks, I'd think this should be the first thing to do.
Next idea would be about parallelizing the decoding. So far, libCZI-operation within pylibczi is strictly single-threaded and performance has not been in focus much. I'd reckon that doing the decompression concurrently (at tile granularity) here should be feasible and give the expected performance gain. Of course, the computational load would remain the same, so it will only be faster if enough cores are available.
Finally, but this would not help for the case you gave here: I'd think about introducing caching. I mean caching of decompressed tiles - which would only be relevant if you'd need to read many ROIs from the same reader-object. And - the pattern in which you request the ROIs has to be "favorable" here, so that cached tiles can be made use of. E.g. when requesting the whole plane in small rectangles, starting on the top left, then move the ROI to the right till end of line, then start next line.

So, the first two bullets seem feasible and reasonably straightforward to me. The third bullet should have an enormous effect, or I'd think the speed-up can be orders of magnitude provided that your access pattern is favorable. Maybe you can give some insight about whether this would be desirable/applicable for your application.

Other than those ideas - I have to confess, the best option for the time being is to decompress the document before operation. At least I came to the conclusion, that there is no obvious flaw here, and the decompression-performance is as fast as we can expect it.

jurekkow commented 10 months ago

Hi @ptahmose.

Thanks for the thorough investigation. After analyzing suggestions from three bullets, the third one seems most applicable.

I do read many ROIs from the same reader object, and the ROIs are requested, in a particular order starting from top left moving to right, and then to the following line, exactly how you described. My only worry is that it may cause OOM issues, for reading large documents. To avoid that I'd need to add some cache invalidation logic. Anyway, since this might be worth trying here's a question: is there any built-in cache machine mechanism baked into libCZI/pylibCZIrw? If not then it's not quite clear to me how I could introduce caching just decompressed tiles, and not just ROIs read?

ptahmose commented 10 months ago

Anyway, since this might be worth trying here's a question: is there any built-in cache machine mechanism baked into libCZI/pylibCZIrw?

At this point, the answer is a clear "no". From the top of my head, I'd propose this approach:

create a new object "subblock_image_cache". Its job is to maintain a map "subblock-index <-> uncompressed bitmap". It needs to keep track of the memory usage and allow for cache eviction (e.g. by a LRU-mechanism).
this cache object is (optionally) passed to the accessor's composition-methods (i.e. here).
this composition-methods ("Get") now will check first the cache, and use the bitmap if present there. If not, it will decompress the image and put it into the cache.
I am undecided whether this cache should be maintained opaquely and internally with the reader-object, or whether it should be maintained at the application level. Maybe both ways can be implemented.
In any case, it needs to be ensure that the size of the cache is "under control", e.g. by imposing an upper limit to the memory used.
After this preliminary work, the functionality needs to be made available via the pylibczi-layer.

Other than that, with the current state of the pylibczi-API, I'd think the following would be possible:

When accessing the document, one would get a ROI which is "as large as possible" (i.e. a bitmap which would have a size determined by the "amount of RAM one is willing to spare for the operation").
This bitmap is then retrieved from pylibCZI, and this bitmap would then be sub-divided as needed by the application.

Performance-wise this idea should be on the same level as "introduce caching to libCZI", and it would benefit greatly from leveraging concurrency with the decoding, however the latency would obviously be rather bad (=the time it takes until the first small ROI would be available). Of course, it will introduce some complexity at the application-/Python-level. But, it would not require any changes to pylibczi/libczi I reckon.

Next steps from my side:

I'll add three new tickets for the "features described in the three bullets" (i.e. "ignore completely covered tiles", "allow for concurrent loading/decoding" and "allow for caching for small-ROI-access-pattern").
I'll try to work on those topics as time permits.

ptahmose commented 10 months ago

Another idea which crossed my mind - instead of using a "ROI-based access" (i.e. where the application is requesting an arbitrary ROI), maybe reading the data "tile by tile" could be worth considering. I.e. there would be no multi-tile-composition, the application would just read the existing tiles, one after the other. If there is no need on the application side for tile-composition, this should be the easiest approach. An additional benefit would be - parallelization could then take place on the Python-layer. I am not sure whether there is an API on Python-level which would allow for "tile-based access"?

jurekkow commented 10 months ago

I just would like to clarify this part:

I'll add three new tickets for the "features described in the three bullets" (i.e. "ignore completely covered tiles", "allow for concurrent loading/decoding" and "allow for caching for small-ROI-access-pattern").

I understood the original idea as caching of subblocks not ROIs. Caching ROIs might be as well implemented in the application logic, and we may consider it as an alternative to upfront uncompression.

Another idea which crossed my mind - instead of using a "ROI-based access" (i.e. where the application is requesting an arbitrary ROI), maybe reading the data "tile by tile"

This would require a rather big change in the application logic on our end. Also if here by "tile" you mean subblock, I don't think that pylibCZIrw currently supports that.

ptahmose commented 10 months ago

I understood the original idea as caching of subblocks not ROIs.

Ja, idea is to cache a "the bitmap contained in a subblock" once it is decompressed. This will improve performance in cases like this (where the red rectangle would be the query-ROI): In best case, each subblock will then be decoded only once (as opposed to - at least one subblock being decompressed for each Get-operation). Obviously, this may not bring things down to "each subblock is decompressed exactly once" without further ado - e.g. in the case depicted above, when the ROI continues in the second line, we may have exceeded the cache size (and the subblocks are then decompressed a second time). So, further optimization could still be possible then (e.g. adapting the scan-pattern), but I am sure this will be huge improvement and reduce "how many times the subblocks are actually decompressed" by a huge factor.

ptahmose commented 10 months ago

I created new tickets for the three ideas we came up with in the course of this discussion. As far as this ticket is concerned - I guess there is not much left "to be done/discussed" here for now.

ptahmose commented 9 months ago

wrt to #76 - I'd give this idea a lower priority at the moment, so I'd plan to conclude activities in this context with the two optimizations which have been done so far for the time being.

ZEISS / libczi

Reading time differs for compressed and uncompressed documents #73