ZEISS / libczi

libCZI is an Open Source Cross-Platform C++ library to read and write CZI.
https://ZEISS.github.io/libczi/
GNU Lesser General Public License v3.0

Reading time differs for compressed and uncompressed documents #73

Open jurekkow opened 10 months ago

jurekkow commented 10 months ago

Describe the bug: There is a significant difference in ROI reading speed, depending on the compression method.

To Reproduce: Run the following Python script:

import timeit

from pylibCZIrw import czi

file_paths = ["uncompressed.czi", "jpeg-xr-compressed.czi", "zstd-compressed.czi"]

roi_x = -126748
roi_y = 46095
z = 0
t = 0
c = 0
scene = 0

num_read = 1

for file_path in file_paths:
    reader = czi.CziReader(file_path)
    roi = (roi_x, roi_y, 2000, 2000)
    read_time = timeit.timeit(
        lambda: reader.read(roi=roi, plane={"T": t, "Z": z, "C": c}, scene=scene),
        number=num_read,
        globals=globals(),
    )
    print(f"{file_path}: {read_time}s")

Outputs:

uncompressed.czi: 0.01224821200594306s
jpeg-xr-compressed.czi: 0.46801461999712046s
zstd-compressed.czi: 0.051155578999896534s

Expected behavior: Reading times of compressed and uncompressed documents don't differ, or differ by less than 2x.

Additional context: The original JPEGXR-compressed document was decompressed and ZSTD-compressed again using the czicompress tool.

Test document will be delivered separately.

ptahmose commented 10 months ago

Hi @jurekkow,

Thank you for the detailed report on the ROI reading speed issue. I've taken some time to analyze the situation based on the numbers you've provided and have some insights to share.

I found that decompressing a single tile (2040x2040 pixels, 16-bit grayscale, which amounts to approximately 8MB) takes around 45ms, so we're looking at a decompression throughput of about 200MB/s. In the scenario you've described, there are actually 9 tiles which must be read and decompressed in order to compose the requested result. The tiling structure and the requested rectangle are illustrated in the following diagram:

[image]

So the timing you gave for the JPGXR case (~500ms) seems reasonable and is explained by the above findings.
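The estimate above can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch using the figures quoted in this comment (tile size, bit depth, 45ms per tile, 9 tiles); the variable names are made up for illustration, and it ignores file I/O and the cost of composing the result.

```python
# Back-of-the-envelope check using only the numbers quoted in this thread.
TILE_WIDTH = 2040
TILE_HEIGHT = 2040
BYTES_PER_PIXEL = 2      # 16-bit grayscale
DECODE_TIME_S = 0.045    # ~45 ms to decode one tile, as measured above
TILES_TOUCHED = 9        # tiles that intersect the requested ROI

tile_bytes = TILE_WIDTH * TILE_HEIGHT * BYTES_PER_PIXEL
throughput_mb_s = tile_bytes / DECODE_TIME_S / 1e6
estimated_decode_s = TILES_TOUCHED * DECODE_TIME_S

print(f"tile size: {tile_bytes / 1e6:.1f} MB")
print(f"decode throughput: {throughput_mb_s:.0f} MB/s")
print(f"estimated decode time for the ROI: {estimated_decode_s:.3f} s")
```

This gives roughly 8.3 MB per tile, about 185 MB/s, and about 0.405 s of pure decoding for the 9 tiles, which is in the same range as the 0.468 s reported for the JPEG XR file.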

In contrast, with the uncompressed CZI, the data rate we see is around 2000MB/s assuming SSD storage (and potentially even higher if the file resides in cache).

Taking these observations into account, the results you're encountering appear to be in line with "expected system behavior". At this juncture, there doesn't seem to be a malfunction or bug in the process.

I'd have little hope that the performance of the JPGXR-decoding (or zstd-decoding, for that matter) can be improved here. I compared the decoding speed of libCZI's bundled jxr-codec and the Windows-WIC-JPGXR-codec, and they came out about the same. Note that (on Windows) libCZI can work with both codecs (e.g. with CZICmd, the option --jpgxrcodec lets you choose between them).

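However, there are surely options to improve performance in this case:

- ignore completely covered tiles
- allow for concurrent loading/decoding
- allow for caching for small-ROI-access-pattern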

So, the first two bullets seem feasible and reasonably straightforward to me. The third bullet should have an enormous effect; I'd think the speed-up can be orders of magnitude, provided that your access pattern is favorable. Maybe you can give some insight into whether this would be desirable/applicable for your application.

Other than those ideas, I have to confess that the best option for the time being is to decompress the document before operation. At least I came to the conclusion that there is no obvious flaw here, and the decompression performance is as fast as we can expect.

jurekkow commented 10 months ago

Hi @ptahmose.

Thanks for the thorough investigation. After analyzing the suggestions from the three bullets, the third one seems the most applicable.

I do read many ROIs from the same reader object, and the ROIs are requested in a particular order: starting from the top left, moving to the right, and then on to the following line, exactly as you described. My only worry is that it may cause OOM issues when reading large documents. To avoid that, I'd need to add some cache invalidation logic. Anyway, since this might be worth trying, here's a question: is there any built-in caching mechanism baked into libCZI/pylibCZIrw? If not, then it's not quite clear to me how I could cache the decompressed tiles themselves, rather than just the ROIs read.

ptahmose commented 10 months ago

Anyway, since this might be worth trying, here's a question: is there any built-in caching mechanism baked into libCZI/pylibCZIrw?

At this point, the answer is a clear "no". Off the top of my head, I'd propose this approach:

Other than that, with the current state of the pylibczi-API, I'd think the following would be possible:

Performance-wise, this idea should be on the same level as "introduce caching to libCZI", and it would benefit greatly from leveraging concurrency with the decoding. However, the latency (the time it takes until the first small ROI is available) would obviously be rather bad. Of course, it will introduce some complexity at the application/Python level, but it would not require any changes to pylibczi/libczi, I reckon.

Next steps from my side:

ptahmose commented 10 months ago

Another idea which crossed my mind - instead of using "ROI-based access" (i.e. where the application requests an arbitrary ROI), maybe reading the data "tile by tile" could be worth considering. That is, there would be no multi-tile composition; the application would just read the existing tiles, one after the other. If there is no need on the application side for tile composition, this should be the easiest approach. An additional benefit would be that parallelization could then take place on the Python layer. I am not sure whether there is an API on the Python level which would allow for "tile-based access"?
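A minimal sketch of what "parallelization on the Python layer" could look like, under loud assumptions: the thread does not establish that pylibCZIrw offers tile-based access, so `decode_tile` below is a hypothetical stand-in that uses zlib from the standard library in place of a real JPEG XR or zstd subblock decoder, and the tiles are synthetic byte buffers. Threads only help here if the underlying decoder releases the GIL while it works.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def decode_tile(compressed: bytes) -> bytes:
    """Hypothetical stand-in for decoding one subblock (zlib instead of JPEG XR / zstd)."""
    return zlib.decompress(compressed)


def decode_tiles_parallel(compressed_tiles, max_workers=4):
    """Decode independent tiles concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(decode_tile, compressed_tiles))


# Synthetic "tiles": nine small buffers, each compressed independently.
raw_tiles = [bytes([i]) * 4096 for i in range(9)]
compressed_tiles = [zlib.compress(tile) for tile in raw_tiles]

decoded = decode_tiles_parallel(compressed_tiles)
print(decoded == raw_tiles)
```

Because every tile is decoded independently, no composition step is needed and the work splits cleanly across workers; the application would then be responsible for placing each decoded tile itself.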

jurekkow commented 10 months ago

I just would like to clarify this part:

I'll add three new tickets for the "features described in the three bullets" (i.e. "ignore completely covered tiles", "allow for concurrent loading/decoding" and "allow for caching for small-ROI-access-pattern").

I understood the original idea as caching of subblocks, not ROIs. Caching ROIs might as well be implemented in the application logic, and we may consider it as an alternative to upfront decompression.
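For comparison, caching whole ROIs in the application logic could be sketched roughly as follows. This assumes the `reader.read(roi=..., plane=..., scene=...)` call used in the script at the top of this issue; `make_cached_roi_reader` and `FakeReader` are hypothetical names introduced only for illustration, and `FakeReader` stands in for `czi.CziReader` so the snippet runs without a CZI file.

```python
from functools import lru_cache


def make_cached_roi_reader(reader, max_entries=32):
    """Wrap reader.read so that repeated identical ROI requests hit a bounded LRU cache."""

    @lru_cache(maxsize=max_entries)
    def read_roi(roi, t, z, c, scene):
        # roi is an (x, y, width, height) tuple, so every argument is hashable.
        return reader.read(roi=roi, plane={"T": t, "Z": z, "C": c}, scene=scene)

    return read_roi


class FakeReader:
    """Hypothetical stand-in for czi.CziReader that counts calls to read()."""

    def __init__(self):
        self.calls = 0

    def read(self, roi, plane, scene):
        self.calls += 1
        return (roi, tuple(sorted(plane.items())), scene)


fake = FakeReader()
read_roi = make_cached_roi_reader(fake, max_entries=2)
read_roi((0, 0, 2000, 2000), 0, 0, 0, 0)
read_roi((0, 0, 2000, 2000), 0, 0, 0, 0)  # identical request, served from the cache
print(fake.calls)
```

The `max_entries` bound keeps memory use limited, but a cache like this only helps when exactly the same ROI is requested again. Neighbouring ROIs that merely share subblocks still trigger a fresh decode, which is why caching at the subblock level is a different proposal.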

Another idea which crossed my mind - instead of using a "ROI-based access" (i.e. where the application is requesting an arbitrary ROI), maybe reading the data "tile by tile"

This would require a rather big change in the application logic on our end. Also, if by "tile" you mean subblock, I don't think that pylibCZIrw currently supports that.

ptahmose commented 10 months ago

I understood the original idea as caching of subblocks not ROIs.

Yes, the idea is to cache "the bitmap contained in a subblock" once it is decompressed. This will improve performance in cases like this (where the red rectangle would be the query-ROI):

[image]

In the best case, each subblock will then be decoded only once (as opposed to at least one subblock being decompressed for each Get-operation). Obviously, this may not bring things down to "each subblock is decompressed exactly once" without further ado - e.g. in the case depicted above, when the ROI continues in the second line, we may have exceeded the cache size (and the subblocks are then decompressed a second time). So, further optimization could still be possible then (e.g. adapting the scan pattern), but I am sure this will be a huge improvement and reduce "how many times the subblocks are actually decompressed" by a huge factor.
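The effect described above can be illustrated with a small simulation. This is not libCZI code, just a sketch under simplifying assumptions: tiles form a regular, non-overlapping 2040-pixel grid starting at the origin (real CZI documents may have overlapping or irregular subblocks), ROIs are 2000x2000 and requested in raster order, and the cache is a plain LRU holding a fixed number of decoded tiles. All names are made up for illustration.

```python
from collections import OrderedDict

TILE = 2040  # tile edge length in pixels, as in the example in this thread
ROI = 2000   # ROI edge length in pixels


def tiles_for_roi(x, y, w, h, tile=TILE):
    """Return the (column, row) indices of all grid tiles intersecting the ROI."""
    cols = range(x // tile, (x + w - 1) // tile + 1)
    rows = range(y // tile, (y + h - 1) // tile + 1)
    return [(c, r) for r in rows for c in cols]


def count_decodes(rois, cache_capacity):
    """Count tile decodes for a sequence of ROIs, given an LRU cache of decoded tiles."""
    cache = OrderedDict()
    decodes = 0
    for roi in rois:
        for key in tiles_for_roi(*roi):
            if key in cache:
                cache.move_to_end(key)  # mark as most recently used
                continue
            decodes += 1  # cache miss: the tile has to be decoded
            if cache_capacity > 0:
                cache[key] = True  # placeholder for the decoded bitmap
                if len(cache) > cache_capacity:
                    cache.popitem(last=False)  # evict the least recently used tile
    return decodes


# Raster scan: left to right, then the next line, over a 20000 x 4000 pixel area.
rois = [(x, y, ROI, ROI) for y in range(0, 4000, ROI) for x in range(0, 20000, ROI)]
for capacity in (0, 4, 64):
    print(capacity, count_decodes(rois, capacity))
```

For this toy layout the script prints 57 decodes without a cache, 30 with room for 4 tiles, and 20 (one per tile) with room for 64. The 4-tile case shows the second-line effect mentioned above: the tiles of the first row have already been evicted by the time the scan returns to them, so they are decoded a second time.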

ptahmose commented 10 months ago

I created new tickets for the three ideas we came up with in the course of this discussion. As far as this ticket is concerned - I guess there is not much left "to be done/discussed" here for now.

ptahmose commented 9 months ago

With respect to #76 - I'd give this idea a lower priority at the moment, so for the time being I'd plan to conclude activities in this context with the two optimizations which have been done so far.