DigitalSlideArchive / digital_slide_archive

The official deployment of the Digital Slide Archive and HistomicsTK.
https://digitalslidearchive.github.io
Apache License 2.0

Bug: Nuclei detection on cellSens VSI fails #255

Open andreped opened 1 year ago

andreped commented 1 year ago

I just tried to run the Nuclei Detection algorithm that exists in HistomicsTK, and it failed. I observed this for OS-1.vsi, which is openly available as part of the OpenSlide test data (see here).

I don't know if this only affects the nuclei detection algorithm, but I would guess "all" algorithms process this format the same way. I also observed the same behaviour with my own custom plugin, but since it was heavily inspired by the nuclei detection method, it is hard to say. I also observed the same on different WSIs in the same cellSens VSI format.

In the log, I can see this:

>> Reading input image ... 

2023-02-25 22:47:21,991 - distributed.core - INFO - Starting established connection to tcp://172.18.0.5:59936
INFO:distributed.core:Starting established connection to tcp://172.18.0.5:59936
INFO:large_image:Using memcached for large_image caching
INFO:large_image:Started JVM for Bioformats tile source.
22:47:24.117 [Thread-0] WARN loci.formats.FormatHandler - Missing expected .ets files in /mnt/girder_worker/678342f3d939490db8b0ce5e893c23a5/_OS-1_
{
  "levels": 1,
  "sizeX": 262,
  "sizeY": 512,
  "tileWidth": 262,
  "tileHeight": 512,
  "magnification": null,
  "mm_x": null,
  "mm_y": null
}

First of all, both DSA and HistomicsTK render the WSI just fine. It is just inference/analysis with the .vsi format that fails.

Second, it seems like DSA fails to parse the magnification/resolution information of the WSI. That is not that surprising; we have the same problem in FastPathology with this format. There we just assume the WSI is 40x if we are unable to parse it, and hope for the best.

However, since you are using bioformats to parse the image, which QuPath also uses, I would think you should be able to parse the magnification level, no?
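
For illustration, a minimal sketch of such a 40x fallback using large_image's metadata dictionary (the keys match the metadata printed in the log above; the 40x default is purely an assumption on the caller's side, not something DSA does):

import large_image

# open the slide and read its metadata (same keys as printed in the log above)
ts = large_image.getTileSource("/path/to/OS-1.vsi")
meta = ts.getMetadata()

# fall back to an assumed 40x when the source cannot report a magnification
magnification = meta.get("magnification")
if magnification is None:
    magnification = 40.0  # assumed default; "hope for the best"
print(magnification, meta.get("mm_x"), meta.get("mm_y"))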

Lastly, since the format handler fails to find the .ets files, maybe it assumes the wrong folder structure during analysis? From reading the logs, I would guess it assumed that the .ets files would be directly within _OS-1_/ and not one level deeper.

The correct structure is like so:

    └── WSI
        ├── OS-1.vsi
        └── _OS-1_/
               ├── stack1/
               |      └── frame_t.ets
               └── stack10001/
                      └── frame_t.ets

which I believe is standard for the Olympus format.

Alternatively, perhaps it fails because there are two stacks? If I recall correctly, the two correspond to the WSI (512 MB) and thumbnail (9 MB) images. Could it be that HistomicsTK accidentally chooses the wrong one? In this case, I believe stack10001/ corresponds to the WSI.
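
For illustration, a small sketch of how one could discover the .ets sidecar files for a given .vsi file and pick the largest stack as the probable full-resolution pyramid (the naming conventions below are assumptions based on the structure shown above, not taken from the DSA/large_image code):

import glob
import os

def find_ets_stacks(vsi_path):
    # the sidecar directory is named _NAME_ next to NAME.vsi
    base = os.path.splitext(os.path.basename(vsi_path))[0]
    sidecar_dir = os.path.join(os.path.dirname(vsi_path), "_%s_" % base)
    # each stackN/ subdirectory holds a frame_t.ets file
    ets_files = glob.glob(os.path.join(sidecar_dir, "stack*", "*.ets"))
    # sort by file size, largest first; the biggest stack is likely the WSI
    return sorted(ets_files, key=os.path.getsize, reverse=True)

stacks = find_ets_stacks("/path/to/WSI/OS-1.vsi")
print(stacks[0] if stacks else "no .ets sidecar files found")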

andreped commented 1 year ago

I would think this would fail: https://github.com/girder/large_image/blob/master/large_image/tilesource/base.py#L754

andreped commented 1 year ago

Just mentioning that I am now able to run inference with a deep learning model using my own custom plugin (https://github.com/andreped/FP-dsa-plugin), but the end users mostly have their images in the VSI format. Hence, good support for the format is essential.

Screenshot from 2023-02-26 13-02-46

manthey commented 1 year ago

Basically, the worker is passed "the" image file, but VSI is a multi-file format, and the worker doesn't have all of it to work with inside the Docker container. As most of our use to date has been on single-file WSI formats, this hasn't been addressed -- it probably isn't that much work, but the worker would need to be updated to properly pull references to the other files used in the WSI.

We are using the largest of the multiple stacks through the bioformats reader.

I'd have to hunt through the metadata that bioformats exposes to find where it exposes the microns per pixel or the magnification. Sadly, bioformats is not as consistent between formats as one could wish, so this still ends up having some code for each format read.
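
For reference, a hedged sketch of one way to pull the microns-per-pixel values out of the OME-XML that Bio-Formats exposes, using python-bioformats and xmltodict; the PhysicalSizeX/Y attributes may simply be absent for a given format, and the Bio-Formats JVM must already be running:

import bioformats as bf
import xmltodict

def get_physical_pixel_size(wsi_path):
    # parse the OME-XML that Bio-Formats generates for the file
    metadata = xmltodict.parse(bf.get_omexml_metadata(wsi_path))
    images = metadata["OME"]["Image"]
    if not isinstance(images, list):
        images = [images]  # single series: xmltodict returns a dict
    pixels = images[0]["Pixels"]
    # PhysicalSizeX/Y are optional in the OME schema, so fall back to None
    size_x = pixels.get("@PhysicalSizeX")
    size_y = pixels.get("@PhysicalSizeY")
    return (float(size_x) if size_x else None,
            float(size_y) if size_y else None)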

andreped commented 1 year ago

> Basically, the worker is passed "the" image file, but VSI is a multi-file format, and the worker doesn't have all of it to work with inside the Docker container. As most of our use to date has been on single-file WSI formats, this hasn't been addressed -- it probably isn't that much work, but the worker would need to be updated to properly pull references to the other files used in the WSI.

That explains it. There exist other WSI formats that also store their data in this kind of multi-file folder structure, sadly, even among formats supported by OpenSlide. Hence, it is a feature I guess others would also welcome, at least if they wish to use the power of HistomicsTK to run algorithms on their own images through the web UI.

> I'd have to hunt through the metadata that bioformats exposes to find where it exposes the microns per pixel or the magnification. Sadly, bioformats is not as consistent between formats as one could wish, so this still ends up having some code for each format read.

As I mentioned, we were also unable to find the magnification level in the cellSens VSI format. However, we made our own custom reader in C++, as BioFormats is poorly supported in C++. QuPath uses BioFormats, and I would think one should be able to find out from its source how to get the correct magnification level from the cellSens VSI format.

andreped commented 1 year ago

Just tried running the Nuclei Detection algorithm which is part of HistomicsTK on the full CMU-1.tiff WSI from the OpenSlide test data, with default settings and without setting a bounding box or anything.

I observed the same issue as with my own plugin (which is heavily inspired by the built-in NucleiDetection method). The job succeeds, but the annotations are not made available as annotations, nor are they rendered. The analysis took about 7-8 minutes to run.

This was observed using the Chromium browser on Ubuntu 18.04, on the same machine where DSA is running.

I have attached the full terminal logs from the DSA docker-compose up, from the start of running the nuclei detection method to success (albeit with no stored annotations). I also refreshed to see if the annotations would show up (which is also part of the logs at the end). nuclei_detection_full_wsi_logs.txt

manthey commented 1 year ago

Can you look in the actual Girder log file (info.log) rather than the Docker logs (which are the stdout/stderr of the Docker container)? If you used the reference deployment, it will be at devops/dsa/logs/info.log. Otherwise, the web API can be used to fetch parts of the log without knowing where it is stored.
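
For reference, a hedged sketch of fetching the tail of the log over the web API with girder_client; the endpoint name and parameters here are assumptions from memory (an admin-only GET system/log endpoint) and may differ between Girder versions:

import girder_client

# connect and authenticate as an admin user (the log endpoint is admin-only)
gc = girder_client.GirderClient(apiUrl="http://localhost:8080/api/v1")
gc.authenticate("admin", "password")

# request the last 64 KB of the info log as raw text rather than JSON
resp = gc.get("system/log",
              parameters={"log": "info", "bytes": 65536},
              jsonResp=False)
print(resp.text)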

andreped commented 1 year ago

> Can you look in the actual girder log file (info.log)

I have shared the Girder log file below. It seems like a GirderException was raised due to "Read exceeds maximum allowed size". info.log

From this it looks like there is a 16 MB limit when reading from a single JSON file. Right now all annotations are stored in a single JSON file (from the plugin). Same applies to the NucleiDetection example. Was I supposed to store it differently?

manthey commented 1 year ago

No -- it is an unintended limitation. I've made an issue to track it: https://github.com/girder/large_image/issues/1073, but may not get to it until next week.

manthey commented 1 year ago

See https://github.com/girder/large_image/pull/1075 for a fix but not a total solution. Apparently that restriction wasn't there before we made a change to better support S3 assetstores -- it could just use too much memory, though.
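
The general idea of reading in chunks could look roughly like this sketch (not the actual large_image code; the 16 MB figure just mirrors the limit discussed above, and memory use still scales with the file size):

import orjson

CHUNK_SIZE = 16 * 1024 * 1024  # cap each individual read at 16 MB

def read_annotation_json(path):
    parts = []
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(CHUNK_SIZE)
            if not chunk:
                break
            parts.append(chunk)
    # concatenate the chunks and decode the full document with orjson
    return orjson.loads(b"".join(parts))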

andreped commented 1 year ago

> See girder/large_image#1075 for a fix but not a total solution. Apparently that restriction wasn't there before we made a change to better support S3 assetstores -- it could just use too much memory, though.

The solution you suggested makes sense to me. Looks like you are reading the annotations in chunks. If the annotation file itself is extremely large there could still be memory issues, but it should work fine for my application. Thanks a lot!

Just tried pulling the latest DSA image and it seems to work well! At least for the built-in nuclei detection method in HistomicsTK :] Will try my other custom pipelines now, but I believe they should work. Of course the method itself does funny stuff in some regions, but I believe this method is just there to demonstrate a pipeline.


The large annotation reading problem was actually more related to a separate issue: https://github.com/DigitalSlideArchive/HistomicsTK/issues/977#issuecomment-1453073744

I therefore closed that issue, but this issue remains open until we are able to parse the magnification level information from the cellSens VSI format.

Screenshot from 2023-03-03 07-59-02

andreped commented 1 year ago

Oh, note that after the job is done, it takes quite a long time until the annotation becomes available under Annotations. Is the JSON format thread-safe? Would it be possible to speed up loading by using multiple workers?

Anyways, I ran a deep learning model I have on an image of about 160k x 120k pixels packed with nuclei, and at least the annotations are stored and I can render the result :]

Screenshot from 2023-03-03 08-19-10

manthey commented 1 year ago

If you look at the logs you'll see the time it takes for the various parts of the annotation json parsing and ingest. Currently, we use the orjson package to read the json. This means that the file and the decoded json are in memory at the same time. Using something like ijson would reduce the memory footprint, but would actually be substantially slower (maybe by a factor of 3 to 5 in this stage). We validate the json against the schema -- this step could be parallelized. We then insert the data into the database; this step IS parallelized based on the CPU count.

On my faster desktop, where I ingested ~320,000 nuclei polygons, decoding takes 16s, validation less than 10s, and ingest 115s. On a slower machine, decoding is 38s, validation 20s, and ingest 141s. The ingest is limited by mongo's speed (we should have used Postgres, but that is a big refactor). I think if an iterative json parser were used, its slowness would be amortized over the multiple threads, but because mongo is currently the limiting factor, looking into how to speed up those inserts would have the most impact on speed (but we already do unsorted inserts, so I'm not sure how to improve things).
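
To make the ingest step concrete, a hedged sketch of chunked, unordered bulk inserts spread over a thread pool with pymongo (collection and database names are made up for illustration; this is not the actual girder/large_image ingest code):

from concurrent.futures import ThreadPoolExecutor
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
coll = client["demo"]["annotationelement"]

def insert_chunk(chunk):
    # unordered bulk write: the server may reorder/parallelize the inserts
    coll.insert_many(chunk, ordered=False)

def ingest(elements, chunk_size=10000, workers=4):
    # split the decoded annotation elements into chunks and insert concurrently
    chunks = [elements[i:i + chunk_size]
              for i in range(0, len(elements), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(insert_chunk, chunks))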

andreped commented 1 year ago

Thank you for sharing your performance benchmarks. They seem similar to what I am observing on the desktop I was testing this on.

If I understand it correctly, are all the objects inserted into the database - that is, every single annotation object with its corresponding boundaries? I guess that is done to be able to easily modify single objects and so on, but wouldn't it be a lot faster to simply store the JSON file externally and refer to it from the database, similar to what is done when importing WSIs to a collection? But I guess for practical reasons, such as annotation work and versioning, it is better to have all objects stored in the database, or am I missing something?

Anyways, I'm fine with the current delay. I'm more interested in having good support for the cellSens VSI format.

dgutman commented 1 year ago

As you said, every object can be versioned as well as edited or relabeled; that was by design.


andreped commented 1 year ago

> I'd have to hunt through the metadata that bioformats exposes to find where it exposes the microns per pixel or the magnification. Sadly, bioformats is not as consistent between formats as one could wish, so this still ends up having some code for each format read.

@manthey I found a way to parse the magnification level from the cellSens VSI format. See gist.

I parse the OME-XML metadata of the VSI file to get the NominalMagnification values of all the objective lenses used. However, I was surprised to find that for two lenses, the final value was the total magnification level, which in this case was 20 (at least according to what QuPath is reporting). I was expecting the values to be 2 and 10, such that the total magnification was 2 x 10 = 20. However, I guess that's what they mean by "nominal": each lens seems to report the cumulative magnification given the lenses before it. Hence, taking the magnification of the final lens (or simply the maximum magnification) should yield the magnification used to capture the image.

For convenience, I also shared the implementation below:

import javabridge as jb
import bioformats as bf
import xmltodict


def get_vsi_magnification(wsi_path):
    # get the OME-XML metadata from the VSI file via Bio-Formats
    ome_xml = bf.get_omexml_metadata(wsi_path)

    # convert the XML into a nested dict for easier access
    metadata = xmltodict.parse(ome_xml)

    # extract objective information (there can be more than one lens/objective);
    # xmltodict returns a single dict rather than a list when there is only one
    objectives = metadata["OME"]["Instrument"]["Objective"]
    if not isinstance(objectives, list):
        objectives = [objectives]

    # extract the nominal magnification of each objective
    magnifications = [float(obj["@NominalMagnification"]) for obj in objectives]

    # assumption: the highest magnification value is the total magnification
    return max(magnifications)


if __name__ == "__main__":
    # start the Java virtual machine needed by Bio-Formats
    jb.start_vm(class_path=bf.JARS, max_heap_size="4G")

    magn = get_vsi_magnification("/path/to/image.vsi")
    print(magn)

    # shut down the virtual machine
    jb.kill_vm()

manthey commented 1 year ago

See https://github.com/girder/large_image/pull/1074 which uses data exposed a different way but should work.

andreped commented 1 year ago

> See girder/large_image#1074 which uses data exposed a different way but should work.

Oh, OK. I was not aware that you had found a fix for that issue. Then I guess the final thing that remains is being able to somehow upload the corresponding _ID_/ directory together with the ID.vsi file.

andreped commented 1 year ago

What's the status of this issue? That is, having proper cellSens VSI support during compute with HistomicsTK and the girder worker.

@markusdrange, who created issue https://github.com/DigitalSlideArchive/digital_slide_archive/issues/257, also needs this to work for his Master's thesis project.

andreped commented 1 year ago

Minor bump @manthey @dgutman.

This is a feature that is of great interest to our project, as most of our WSIs are stored in the .vsi format.