amspath / libisyntax

BSD 2-Clause "Simplified" License

API discussion #4

Open Falcury opened 1 year ago

Falcury commented 1 year ago

BGRA vs RGBA

Right now we are returning pixels in BGRA format, but we could just as easily return them in RGBA. Should we allow specifying a desired pixel format in the API? And maybe for grayscale too (i.e. returning the Y channel without further color space conversion)?
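As a loose sketch of what that could look like (the enum values and the exact signature below are purely hypothetical, not the current API):

// Hypothetical sketch; none of these names are final.
typedef enum isyntax_pixel_format_t {
    LIBISYNTAX_PIXEL_FORMAT_BGRA,
    LIBISYNTAX_PIXEL_FORMAT_RGBA,
    LIBISYNTAX_PIXEL_FORMAT_Y,   // grayscale: Y channel only, no further color space conversion
} isyntax_pixel_format_t;

// The tile read call could then take the desired output format as an extra parameter.
isyntax_error_t libisyntax_tile_read(isyntax_t* isyntax, isyntax_cache_t* cache,
                                     int32_t level, int64_t tile_x, int64_t tile_y,
                                     uint32_t* pixels_buffer, isyntax_pixel_format_t pixel_format);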

Role of wsi_image_t

Currently we are exposing wsi_image_t handles in the API. Can we get away with hiding this structure as an implementation detail, or is that a bad idea? And probably we should add accessors for the label and macro images (maybe mapping to the OpenSlide counterparts)?

Role of isyntax_cache_t

Can this also be set as part of the internal state of isyntax_t? I can see why separating it could be useful because the cache might be shared across multiple isyntaxes. But there is also some messiness (which I am not yet sure is necessary). For example, right now internally I end up needing to pass in the isyntax_t's own ll_coeff_block_allocator in isyntax_load_tile(), which seems a bit superfluous:

u32* tile_pixels = isyntax_load_tile(isyntax, wsi, scale, tile_x, tile_y, isyntax->ll_coeff_block_allocator, true);

Another thing: I guess that right now resources associated with a WSI are not freed upon calling isyntax_destroy(), as long as the cache is still alive?

'reader' vs. 'streamer' mode

Should we provide ways to 'hint' how the WSI will be accessed, in order to give optimal performance in each scenario? In a 'streaming' / interactive viewing situation you might want to proactively/greedily precache substantial parts of the WSI (as is being done in Slidescape), in order to counteract the high latency that may result from needing to perform multiple successive IDWTs. Also, in that case it might make sense to not evict caches too aggressively (as RAM or VRAM allows). However, if you are only interested in getting the pixels of a single specific region, this is probably overkill and not what you want.

read_region() implementation

I would like for libisyntax to (eventually) supply its own read_region() implementation, also correcting for the offset shift in successive levels. (Maybe this is homework for me to do.)

Custom I/O functions

Allowing for custom file access procedures, as mentioned by @bgilbert.

Custom allocators

Should we allow custom allocators, also e.g. for libisyntax_tile_read/ libisyntax_tile_free_pixels?

jonasteuwen commented 1 year ago

For the first question: if you want to be consistent with OpenSlide, it might make more sense to return RGBA; the Python API converts to RGBA anyway. What you are doing now in the example seems fine.

As far as streamer/reader, one big use case is obviously computational pathology models requiring random access to tiles. In such a case it might make most sense to only cache the header (or the tile locations for that matter).

bgilbert commented 1 year ago

OpenSlide actually uses premultiplied ARGB, addressed as 32-bit values, because that's what Cairo uses. When addressed byte-wise, this becomes BGRA on little-endian systems. In practice, that seems to be a less common pixel format in imaging libraries, both the channel ordering and the use of premultiplied alpha. (The latter doesn't matter here if libisyntax produces only opaque pixels.) So ARGB would be convenient for OpenSlide but probably not for other users.

Right now we are returning pixels in BGRA format, but we could just as easily return them in RGBA.

I looked at convert_ycocg_to_bgra_block, and while I didn't carefully study the fast path, the ycocg_to_bgr fallback path produces ARGB on little-endian systems and BGRA on big-endian systems when considered as 32-bit values. If you intend the returned values to be byte-addressed, they should be uint8_t in the API instead of uint32_t.
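To make the byte-order pitfall concrete, here's a minimal standalone example (not libisyntax code) showing how the same 32-bit value is laid out differently in memory:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    // One pixel stored as a 32-bit "ARGB" value: A=0xff, R=0x11, G=0x22, B=0x33.
    uint32_t argb = 0xff112233u;
    uint8_t bytes[4];
    memcpy(bytes, &argb, sizeof(bytes));
    // Little-endian prints: 33 22 11 ff (B G R A in memory).
    // Big-endian prints:    ff 11 22 33 (A R G B in memory).
    printf("%02x %02x %02x %02x\n", bytes[0], bytes[1], bytes[2], bytes[3]);
    return 0;
}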

(It's true that ~nobody uses big-endian anymore, but OpenSlide is built and shipped on big-endian systems (s390x) and we do try to be correct there.)

Falcury commented 1 year ago

I looked at convert_ycocg_to_bgra_block, and while I didn't carefully study the fast path, the ycocg_to_bgr fallback path produces ARGB on little-endian systems and BGRA on big-endian systems when considered as 32-bit values. If you intend the returned values to be byte-addressed, they should be uint8_t in the API instead of uint32_t.

(It's true that ~nobody uses big-endian anymore, but OpenSlide is built and shipped on big-endian systems (s390x) and we do try to be correct there.)

Right, I hadn't considered that the endianness of the system could change the interpretation of the returned pixels. On the point of the return type, I think I mainly saw returning as uint32_t as convenient because it self-documents the number of bytes per pixel. But I also see no problem with returning as uint8_t. I also agree it would be good to be mindful of endianness in general, and to be precise in our documentation (while causing the least possible confusion...). What do you think, @alex-virodov?

Falcury commented 1 year ago

For the first question: if you want to be consistent with OpenSlide, it might make more sense to return RGBA; the Python API converts to RGBA anyway. What you are doing now in the example seems fine.

Right, maybe that should be the default then. And to avoid the need for conversion altogether I think we can provide multiple variants for you to choose from.

As far as streamer/reader, one big use case is obviously computational pathology models requiring random access to tiles. In such a case it might make most sense to only cache the header (or the tile locations for that matter).

You're right, this would be ideal; however, without considerable caching I think the problem quickly becomes latency. I think you also ran into this for the TIFF converter example you posted. Decoding a single tile can require tens to hundreds of prerequisite tiles to first be decoded. The number of prerequisites balloons quickly because we need not only all the levels above, but also the 8 adjacent neighbors of each tile, because otherwise the edges of the tile will not be reconstructed properly. (There may be a way to mitigate this somewhat, but we won't be able to solve it completely.)
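As a very rough back-of-the-envelope model of how fast this balloons (assuming each tile needs its 3x3 neighborhood at its own level, plus that region's parents and their neighbors at every level above; this is not actual libisyntax code):

#include <stdio.h>

// Crude upper-bound estimate of the number of prerequisite tiles for one requested tile.
static int count_prerequisites(int num_levels_above) {
    int total = 0;
    int width = 3;                       // 3x3 neighborhood around the requested tile
    for (int level = 0; level <= num_levels_above; ++level) {
        total += width * width;
        width = (width + 1) / 2 + 2;     // parent region (halved, rounded up) plus one neighbor on each side
    }
    return total;
}

int main(void) {
    for (int levels = 0; levels <= 7; ++levels)
        printf("%d level(s) above: ~%d prerequisite tiles\n", levels, count_prerequisites(levels));
    return 0;
}

With 7 levels above the requested tile, even that crude model already lands at over a hundred prerequisite tiles.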

alex-virodov commented 1 year ago

Right now we are returning pixels in BGRA format, but we could just as easily return them in RGBA. Should we allow specifying a desired pixel format in the API?

I like this idea, especially using some out_pixel_format instead of return_rgb (with the option of out_pixel_format=None). I will do the API change.

Currently we are exposing wsi_image_t handles in the API. Can we get away with hiding this structure as an implementation detail, or is that a bad idea?

Do you know of a case where there would be more than one wsi_image_t? Are we using wsi_image_t for label/macro? From this code, I'd guess yes, but I have not worked with label/macro using libisyntax yet. https://github.com/amspath/libisyntax/blob/main/src/isyntax/isyntax.c#L1183

// We started parsing a new image (which will be either a WSI, LABELIMAGE or MACROIMAGE).
parser->current_image = isyntax->images + isyntax->image_count;

And probably we should add accessors for the label and macro images (maybe mapping to the OpenSlide counterparts)?

We should; any reason not to use wsi_image_t for that? Or are label/macro images represented differently in the isyntax_t structure? Sorry, I have not touched this part yet.

Role of isyntax_cache_t: Can this also be set as part of the internal state of isyntax_t?

If we make it part of isyntax_t, e.g. with a shared_ptr equivalent, this will require cache uninitialization when the last isyntax file closes, and reinitialization when the next one opens. I guess I have a slight preference to keep the global cache around, even if there are no isyntax files open currently. That will require exposing some cache handle.

But I can be easily convinced that keeping an external global cache even if there are no isyntax currently open is a premature/bad optimization :)

If we keep it, I agree that the current internal implementation is messy and can be improved.

Another thing: I guess that right now resources associated with a WSI are not freed upon calling isyntax_destroy(), as long as the cache is still alive?

Correct. I added libisyntax_cache_flush(), but this is a separate call that the user needs to make. This one is a good point, I need to think about how to make sure tiles are released on close. I agree cache management right now is clumsy, and needs more thought.

Should we provide ways to 'hint' how the WSI will be accessed, in order to give optimal performance in each scenario?

I suspect a hint may not be enough :) Maybe 2 different APIs with their own rules. This way, for example, we can have callbacks in the streaming API but not in the tile-reading API. I'd say this stems from the very different use cases, and thus justifies different APIs. Under the hood we can try to merge/reuse code as much as we can.

Should we allow custom allocators, also e.g. for libisyntax_tile_read/ libisyntax_tile_free_pixels?

I'd suggest that for some later Phase N. We can also make a variant of libisyntax_tile_read that takes a buffer in, as long as the user knows how to compute needed buffer size (tile width x tile height x 4, and we have getters for those).
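Roughly what I have in mind, as a sketch (the getter names are hypothetical, and the real signature of libisyntax_tile_read may differ):

#include <stdint.h>
#include <stdlib.h>
#include "libisyntax.h"   // the library's public header (name assumed)

// Sketch of caller-allocated-buffer usage; the libisyntax_* names here are illustrative only.
static void read_one_tile(isyntax_t* isyntax, isyntax_cache_t* cache,
                          int32_t level, int64_t tile_x, int64_t tile_y) {
    int32_t tile_width  = libisyntax_get_tile_width(isyntax);            // getters for the buffer size
    int32_t tile_height = libisyntax_get_tile_height(isyntax);
    uint32_t* pixels = malloc((size_t)tile_width * tile_height * 4);     // 4 bytes per pixel
    libisyntax_tile_read(isyntax, cache, level, tile_x, tile_y, pixels); // fill the caller-owned buffer
    // ... use the pixels ...
    free(pixels);                                                        // the caller also frees the buffer
}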

jonasteuwen commented 1 year ago

The labelimage and macroimage are simple base64-encoded JPEG images. I think we can simplify the API a bit here, and use '.wsi' instead of .images[0]. We can have .label_image and .macro_image with a pointer and length to the base64-encoded string that holds this value (so we can read it when needed and keep the in-memory footprint small). What about that?
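Something along these lines, as a sketch (the struct and field names are just illustrative):

#include <stddef.h>

// Sketch only: a lightweight reference to the base64-encoded JPEG data from the header.
typedef struct isyntax_associated_image_t {
    const char* base64_jpeg;    // points into the parsed header data, not a copy
    size_t      base64_length;  // length of the base64-encoded string
} isyntax_associated_image_t;

// isyntax_t could then expose .wsi, .label_image and .macro_image directly,
// and the JPEG would only be read/decoded when the user actually asks for it.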

alex-virodov commented 1 year ago

Quick update on isyntax_cache_t - I think it can all go under the API, which will also give us flexibility to adjust it as time goes on. I will work on a PR for that.

I will still want some cache size control - we can either expose it via a proper API, or make libisyntax read environment variables (which is what I do currently in OpenSlide). A proper API may be specific, e.g. libisyntax_set_cache_size(int32_t size), or generic like libisyntax_hints(int32_t num_hints, isyntax_hint_t* hints). I think I prefer specific - new functions are not expensive :) I'll see how the code looks as I work on it.

jonasteuwen commented 1 year ago

Having a specific function to set the cache size should indeed be fine, and there is no need to expose all the rest.

How do you intend to combine it with openslide cache?

alex-virodov commented 1 year ago

How do you intend to combine it with openslide cache?

Note that we have 2 different caches:

  1. OpenSlide caches RGB(A) pixels, and I use that to avoid repeated requests for the same tile (here).

  2. We need to cache wavelet coefficients, as they are required for child/neighbor tile decoding.

So both of these caches are necessary. Without the OpenSlide cache, we would do repeated requests for a tile, and thus repeated decoding of wavelet->rgb(a). Without the wavelet coefficient cache, we would need to re-read them from the file every time (very inefficient).
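In rough pseudocode, the read path then looks like this (illustrative only; none of these function names are real OpenSlide or libisyntax calls):

// Sketch of the read path with both caches in place.
uint32_t* read_tile(int32_t level, int64_t tile_x, int64_t tile_y) {
    // Cache 1: decoded RGB(A) pixels (this is what the OpenSlide tile cache does for us).
    uint32_t* pixels = rgba_tile_cache_lookup(level, tile_x, tile_y);
    if (pixels)
        return pixels;                                    // repeated request: no decoding work at all
    // Cache 2: wavelet coefficients, which are also needed later for child/neighbor tiles.
    // On a miss they must be re-read from the file, which is the expensive part.
    coeff_block_t* coeffs = coeff_cache_lookup_or_load(level, tile_x, tile_y);
    pixels = idwt_and_color_convert(coeffs);              // wavelet -> RGB(A)
    rgba_tile_cache_insert(level, tile_x, tile_y, pixels);
    return pixels;
}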

Hope that helps, please ask again if I didn't answer your question.

jonasteuwen commented 1 year ago

Yes, so OpenSlide caches tiles, which don't need to be decoded anymore. At the same time your wavelet-coefficients cache will keep these in memory, while they are not necessarily needed anymore, as OpenSlide already has them in its cache. Or do you expect this to be minor?

For random access, it also seems a good idea to cache the header itself, so you don't need to parse that horrendous base64 string to get the seek table. Random access is also likely often at the same resolution, so you might be able to keep a few of the lower levels in that cache, but that doesn't seem to be very easy to predict.

Falcury commented 1 year ago

Yes, so OpenSlide caches tiles, which don't need to be decoded anymore. At the same time your wavelet-coefficients cache will keep these in memory, while they are not necessarily needed anymore, as OpenSlide already has them in its cache. Or do you expect this to be minor?

It only holds for level 0 that the LL coefficients (i.e. the actual color channel values, not yet transformed to proper RGBA or BGRA) do not need to be reused, because in that case there are no 'children' that depend on them. For levels > 0, you can't rule out that somebody will also want a tile of a lower level. The LL coefficients will be required for that, so you can't discard them without running the risk of needing to redo all the work (which affects performance in a very major way).

Note that in Slidescape, LL (and the other) coefficients of higher levels will currently get discarded if their children are also loaded. But that's only reasonable because currently in Slidescape there are no situations where you would load the same tile twice (there is no cache eviction scheme as of yet).

For random access, it also seems a good idea to cache the header itself, so you don't need to parse that horrendous base64 string to get the seek table. Random access is also likely often at the same resolution, so you might be able to keep a few of the lower levels in that cache, but that doesn't seem to be very easy to predict.

We're only parsing the header and seektable once in isyntax_open(), and never looking at it again after that. So essentially, we are already doing what you describe.

Precaching the topmost levels (i.e. the codeblocks contained in the topmost data clusters in the iSyntax file) is an additional step we could do after that to speed up any subsequent read request. Slidescape already does this.

jonasteuwen commented 1 year ago

Alright. For the isyntax_open: if you have many files in your ML workflow, you will want to close them once you've read (a few?) tiles, as otherwise this will lead to too many open file handles, and your OS will complain. So each time you'd need to read the header. This is also how it is implemented in dlup: https://github.com/NKI-AI/dlup/blob/52a65d06a4f636b94cb142d00ccfc2f98b6a966b/dlup/data/dataset.py#L317 - you keep an LRU cache for the file objects, but at some point you'd need to evict that to ensure the number of file handles remains low.

So, my argument would at least be to try to optimize the reading of the header.

Falcury commented 1 year ago

So, my argument would at least be to try to optimize the reading of the header.

Ah, I think I see what you're saying. In that case, I think the best strategy would be to store something like a binary index file separately, that contains all the information we need from the header (plus, maybe the (decompressed?) codeblocks for the topmost levels).
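As a very loose sketch of what such an index file could contain (nothing about this is designed or decided):

#include <stdint.h>

// Loose sketch of a possible binary index file header; everything here is hypothetical.
typedef struct isyntax_index_file_header_t {
    char     magic[8];                // file type identifier, e.g. "ISYNIDX\0"
    uint32_t version;                 // layout version, so the format can evolve
    uint32_t wsi_level_count;         // number of pyramid levels of the WSI
    uint64_t codeblock_count;         // number of entries in the codeblock table
    uint64_t codeblock_table_offset;  // file offset of the per-codeblock offset/size table
    uint64_t top_level_coeff_offset;  // optional: precached coefficients for the topmost levels
} isyntax_index_file_header_t;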

Unfortunately, the iSyntax header is not really designed in such a way that it can be read very efficiently. Its information density is rather low and there is a lot of parsing work. I already did my best to optimize it (reading it cold from disk); if there is any way it could be improved further, then I am of course very much interested in doing that!

jonasteuwen commented 1 year ago

I was developing my own version before you relicensed this library. I think I made the header reading a bit faster than the current one, as I had the random-access application in mind. I can also look into it when libisyntax has the features you're interested in. OpenSlide's requirement that they handle that themselves might be a bit tricky to merge, though.

As far as I understand, the OpenSlide cache would allow keeping tiles in memory across slides; doing something like that with the header would be useful there (no need to keep the file handle around).

Falcury commented 1 year ago

Do you know of a case where there would be more than one wsi_image_t? Are we using wsi_image_t for label/macro? From this code, I'd guess yes, but I have not worked with label/macro using libisyntax yet. https://github.com/amspath/libisyntax/blob/main/src/isyntax/isyntax.c#L1183

And probably we should add accessors for the label and macro images (maybe mapping to the OpenSlide counterparts)?

We should; any reason not to use wsi_image_t for that? Or are label/macro images represented differently in the isyntax_t structure? Sorry, I have not touched this part yet.

@alex-virodov The label and macro images are also represented in the XML header as separate images. Realistically though, the file only ever contains one WSI, as well as one label and one macro image (although the label image may be blanked out/replaced with a placeholder if anonymized).

I think for the user-facing API it might be most elegant to simply query the isyntax_t for the label or macro pixels, without going through the trouble of needing to specify an isyntax_image_t. To handle the JPEG decoding we could use stb_image.h as the default (decoded upon calling the API). If for some reason somebody wants to handle the JPEG decoding themselves in another way, we could also provide a way in the API to get the JPEG-compressed image directly.
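For example, something like this (the names and signatures are only a sketch, not an existing API):

// Sketch only. Decoded variant: the JPEG is decoded internally (e.g. via stb_image.h),
// and the caller frees the returned pixel buffer.
isyntax_error_t libisyntax_read_label_image(isyntax_t* isyntax,
                                            uint8_t** pixels, int32_t* width, int32_t* height);
// Raw variant: returns a pointer to the JPEG-compressed data, for users who want to decode it themselves.
isyntax_error_t libisyntax_get_label_image_jpeg(isyntax_t* isyntax,
                                                const uint8_t** jpeg_data, size_t* jpeg_size);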

Role of isyntax_cache_t: Can this also be set as part of the internal state of isyntax_t?

If we make it part of isyntax_t, e.g. with a shared_ptr equivalent, this will require cache uninitialization when the last isyntax file closes, and reinitialization when the next one opens. I guess I have a slight preference to keep the global cache around, even if there are no isyntax files open currently. That will require exposing some cache handle.

But I can be easily convinced that keeping an external global cache even if there are no isyntax currently open is a premature/bad optimization :)

Maybe we should see it as two separate problems?

  1. The block allocator requires its own housekeeping, and having separate block allocators for each isyntax_t wastes too much memory.

  2. There is a need for a resource eviction scheme for each isyntax_t, and the lifetime of resources associated with a single isyntax_t should correspond to the lifetime of that single isyntax_t.

For (1) I agree an external structure would likely be appropriate, for (2) I expect we should be able to simplify and make it part of the isyntax_t.

I think I would also prefer to set up the LRU scheme for each isyntax_t individually, because which blocks you want to keep/evict is individual to each slide and won't match well with the global least recently used order if you're working on multiple slides in sequence. If there is a need to actively shrink the cache on slides that were not recently accessed, then maybe there could be an isyntax_resize_cache() or something that would make sure that caches are evicted in a way that is strategic for that individual slide.
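Something like this, purely as a hypothetical signature:

// Hypothetical: evict cached blocks belonging to this slide until its cache fits the target size.
isyntax_error_t isyntax_resize_cache(isyntax_t* isyntax, int64_t target_cache_size_bytes);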

I suspect hint may not be enough :) Maybe 2 different APIs with their own rules. This way for example, we can have callbacks in the streaming api but not tile reading api. I'd say this stems from the very different usecases, and thus justifies different apis. Under the hood we can try to merge/reuse code as much as we can.

Yeah, I agree with your thought process. Maybe it would be most elegant if we could still bring them together under the same roof at some point. I guess we need more time to figure this out.

Should we allow custom allocators, also e.g. for libisyntax_tile_read/ libisyntax_tile_free_pixels?

I'd suggest that for some later Phase N. We can also make a variant of libisyntax_tile_read that takes a buffer in, as long as the user knows how to compute needed buffer size (tile width x tile height x 4, and we have getters for those).

Now that you mention it, I think that's actually a good thing to be the default.

alex-virodov commented 1 year ago

I think I would also prefer to set up the LRU scheme for each isyntax_t individually,

Right now you can. Set global_cache = 0, and that's the behavior. However, I don't think we ever should, see below.

because which blocks you want to keep/evict is individual to each slide and won't match well with the global least recently used order if you're working on multiple slides in sequence.

Consider this scenario:

  1. Open 1000 slides
  2. Read 5 random tiles from each of 1000 slides
  3. Read another 5 random tiles from each of 1000 slides
  4. close 1000 slides.

If you don't have a global cache, you will have to set the cache very small to survive 1000 slides each holding a cache of LL coefficients that won't be useful while the 999 other slides are being read. Having a global cache enables slide 2 to evict tiles of slide 1, because we really don't know when slide 1 will be read again, while allowing slide 2 to use that same memory to handle a burst of reads.

An example of this scenario: https://github.com/MSKCC-Computational-Pathology/MIL-nature-medicine-2019/blob/master/MIL_train.py#L167

This is happening because people assume that both opening a slide and reading a tile are cheap O(1) memory operations. This is not the case for Philips iSyntax unfortunately.

If there is a need to actively shrink the cache on slides that were not recently accessed, then maybe there could be an isyntax_resize_cache() or something that would make sure that caches are evicted in a way that is strategic for that individual slide.

And I will call it from OpenSlide after each call to libisyntax_tile_read(); because I don't know the use case, we may as well put it under the API.


Hope that helps, let's keep exploring. If there's a more elegant solution, I'm all for it.

alex-virodov commented 1 year ago

Should we allow custom allocators, also e.g. for libisyntax_tile_read/ libisyntax_tile_free_pixels?

I'd suggest that for some later Phase N. We can also make a variant of libisyntax_tile_read that takes a buffer in, as long as the user knows how to compute needed buffer size (tile width x tile height x 4, and we have getters for those).

Now that you mention it, I think that's actually a good thing to be the default.

Proposed implementation in https://github.com/amspath/libisyntax/pull/24. I think that's nicer: the user has more control over the buffers, allowing custom allocators, etc.

alex-virodov commented 1 year ago

@Falcury re isyntax_cache_t, I'm trying to push it under the API and it keeps being ugly, mostly because we can't init the allocators until we know the block size. One way to simplify it is to always have the cache shared, and I wanted your take on this. Specifically, do you ever work with more than one slide at a time in Slidescape? Or any other scenario? If not, I don't see the downside of always shared cache (with size INT_MAX).

For OpenSlide, even if we keep supporting non-shared cache and allocators, I expect to always enable the shared cache, as from within OpenSlide I don't know if this is a 1-slide or a 1000-slide scenario unfortunately.

Maybe we can have a zoom/teams meeting about this, just to see if we can figure out a better way.

Falcury commented 1 year ago

@Falcury re isyntax_cache_t, I'm trying to push it under the API and it keeps being ugly, mostly because we can't init the allocators until we know the block size. One way to simplify it is to always have the cache shared, and I wanted your take on this. Specifically, do you ever work with more than one slide at a time in Slidescape? Or any other scenario? If not, I don't see the downside of always shared cache (with size INT_MAX).

@alex-virodov In Slidescape you can currently only work with one slide at a time (actually two, if you load a second slide as an overlay). However, I'd say it's indeed a realistic scenario to want to view 2 or more (say, up to 8) slides side-by-side interactively. As an example, the Philips IMS and Sectra viewers allow splitting the display into multiple viewports, and as a clinical pathologist it's really useful to be able to compare images side-by-side (e.g. an H&E and an immuno stain).

Even so, always having the cache shared does seem like a good idea to me. It 'feels like' it should be possible to figure out a good way to have the allocator shared, but the housekeeping/eviction scheme (partially?) separate. One loose idea in my head is to have a garbage-collector-like 'freeze the world' lock where we do a cleanup operation for all loaded slides (either when necessary because the cache size is exceeded, or at a strategic time on a separate thread when an API call has finished). Another loose idea is to think about replacing the linked list with timestamps that could be sorted both globally and on a per-isyntax basis during periodic clean-up (which may give more flexibility). Still, these are only loose ideas.
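To make the second loose idea slightly more concrete (purely a sketch, not existing code):

#include <stdint.h>
#include <stddef.h>

// Sketch: each cached block carries a last-used timestamp instead of sitting in one global linked list.
typedef struct cached_block_t {
    isyntax_t* owner;           // which slide this block belongs to
    uint64_t   last_used;       // monotonically increasing counter, updated on every access
    size_t     size_in_bytes;   // so the total cache size can be tracked
    // ... coefficient data ...
} cached_block_t;

// Eviction during a periodic clean-up pass then becomes: sort candidate blocks by last_used
// (over all slides for a global clean-up, or filtered by owner for a per-slide clean-up)
// and free the oldest ones until the cache is back under its size budget.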

For OpenSlide, even if we keep supporting non-shared cache and allocators, I expect to always enable the shared cache, as from within OpenSlide I don't know if this is a 1-slide or a 1000-slide scenario unfortunately.

Yeah, I think it might be good to roll with this while we think of a more general solution.

Maybe we can have a zoom/teams meeting about this, just to see if we can figure out a better way.

Sure!