DIAGNijmegen / pathology-whole-slide-data

A package for working with whole-slide data including a fast batch iterator that can be used to train deep learning models.
https://diagnijmegen.github.io/pathology-whole-slide-data/
Apache License 2.0

read images and annotation directly from s3 bucket #24

Closed · rj678 closed this issue 2 years ago

rj678 commented 2 years ago

Thank you for releasing this super helpful repo! I think it would be great to be able to read all of the data directly from an S3 bucket, so users don't have to download it first. I've made a notebook available on a forked branch that shows an example:

https://github.com/rj678/pathology-whole-slide-data/blob/test_s3_read/notebooks/s3_image_annotation.ipynb

The good folks at Bayer have released an OpenSlide replacement that is fsspec-compatible, so we can read images from S3 buckets: https://github.com/bayer-science-for-a-better-life/tiffslide
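For example, here is a minimal sketch of the kind of direct read this enables (the bucket and key below are placeholders, and s3fs needs to be installed for fsspec to handle the s3:// protocol):

```python
import fsspec
import tiffslide

# Placeholder bucket/key; requires s3fs for the s3:// protocol.
with fsspec.open("s3://my-bucket/slides/example.svs") as f:
    slide = tiffslide.TiffSlide(f)
    print(slide.dimensions, slide.level_downsamples)
    # read_region mirrors the openslide API and returns a PIL image
    region = slide.read_region(location=(0, 0), level=0, size=(512, 512))
```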

I've pushed a few changes to the test_s3_read branch on the forked repo, where I've added a tiffslide backend and made a few changes to read annotations from an S3 bucket.

Please let me know if you think this functionality would be useful to add to the repo; I can clean up the commits and the code and open a PR to get your feedback. I also plan on making more changes so a model can be trained without downloading any raw data.

thanks again,

martvanrijthoven commented 2 years ago

Dear Rishi

Thank you for your kind words about this package! Adding a new image backend is always great, and I think what you propose would be a very nice addition to this package.

Here are just some ideas that I had after I peeked into your branch:

I noticed that the current openslide backend and the new tiffslide backend have some redundancy; it would be great if we could reduce that a bit. I also think we could add the S3 loading of the annotation files in a subclass of WholeSlideAnnotationFile so that it stays a bit more general, though I am not yet sure how best to implement it.
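As a rough illustration of the idea only (this is not the actual WholeSlideAnnotationFile interface, whose hooks may differ), such a subclass could fetch the annotation file from S3 into a local temporary path and then let the existing parsing machinery take over:

```python
import tempfile
from pathlib import Path

import boto3


class S3WholeSlideAnnotationFile:
    """Sketch only: in the real package this would subclass WholeSlideAnnotationFile.

    Downloads an annotation file from S3 to a local temp path and exposes that
    local path so the regular annotation parsers can work on it unchanged.
    """

    def __init__(self, s3_uri: str):
        # Split "s3://bucket/key/to/file.xml" into bucket and key.
        bucket, _, key = s3_uri.replace("s3://", "", 1).partition("/")
        self.path = Path(tempfile.mkdtemp()) / Path(key).name
        boto3.client("s3").download_file(bucket, key, str(self.path))
```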

One other point I would already like to make: it would be costly and inefficient (in training time) to train by reading directly from S3 on machines that are not running inside AWS (e.g., not on services like SageMaker). So I think it would be good to clarify this somehow, maybe with a warning.
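For example, something along these lines (just a sketch of the idea):

```python
import warnings

warnings.warn(
    "Reading whole-slide data directly from S3 can be slow when the training "
    "machine does not run inside AWS; consider downloading or caching the data "
    "locally, or training on an instance close to the bucket.",
    UserWarning,
)
```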

Please feel free to open a pull request, and we can discuss this addition in more detail.

Best wishes, Mart

rj678 commented 2 years ago

Thanks so much for looking into this and for your feedback. I'll follow up on all of your points and push new commits later this week for your review.

rj678 commented 2 years ago

Hello Mart,

Sorry for the delay in getting back to you on this. I've cleaned up the code a bit and created a draft PR with TODOs based on your feedback; I'll follow up on those TODOs.

I added a notebook that can be run on Colab, where users can read images directly from the S3 bucket, and I added a note that reading and training directly from S3 is slow. For training I was planning to use this S3 plugin for PyTorch: https://github.com/aws/amazon-s3-plugin-for-pytorch , which is described as a high-performance library and has been upstreamed into the torchdata package.
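For context, here is a rough sketch of how the upstreamed datapipes could be used (assuming a torchdata build with the native S3 extension enabled; the bucket and prefix are placeholders):

```python
from torchdata.datapipes.iter import IterableWrapper, S3FileLister, S3FileLoader

# Placeholder prefix; requires torchdata built with the native S3 extension.
urls = IterableWrapper(["s3://my-bucket/training-patches/"])
files = S3FileLister(urls)      # expands the prefix into individual object URLs
streams = S3FileLoader(files)   # yields (url, byte stream) pairs

for url, stream in streams:
    payload = stream.read()     # raw object bytes, e.g. an image patch
    break
```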

Thanks for being open to this contribution. Looking forward to committing more code to your standards so the PR gets accepted :)

martvanrijthoven commented 2 years ago

Dear Rishi

Thank you for your work! Yes, the s3 plugin looks nice. Please let me know if you want me to review something. Looking forward to this nice addition.

Best wishes, Mart

rj678 commented 2 years ago

Thanks, Mart. I have a quick question about the backend for the annotation parser:

I don't have ASAP installed on my laptop (I use a conda env on WSL2). When I try to use the ASAP image backend I get the expected error (Registrant is not found...), but I am still able to use the asap annotation parser: wsa = WholeSlideAnnotation(path_to_xml, parser='asap').

On Colab, though, when I try to use the asap annotation parser, I get the error: Registrant 'asap' is not found in the register of class 'AnnotationParser' with registrant names ('wsa', 'mask', 'virtum-asap').

Is there some library that wholeslidedata checks for before registering the asap annotation parser? Once I identify it, I will install it on Colab as well.

thank you,

martvanrijthoven commented 2 years ago

Dear Rishi, no additional software needs to be installed for the asap annotation parser, and I cannot reproduce your issue on Colab. Could you please share the code that you tried on Colab? Then I will look into it.

rj678 commented 2 years ago

Hi Mart, this notebook can be imported into Colab and reproduces the error in the last cell:

https://github.com/rj678/pathology-whole-slide-data/blob/test_s3_read/notebooks/colab_s3_image_annotation.ipynb

thank you for helping me resolve this error

martvanrijthoven commented 2 years ago

I think you did not install boto3, which makes the import of the AsapAnnotationParser fail, and therefore the parser cannot be found.

I think running the following should solve your problem: !pip install boto3
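As a quick sanity check (just a minimal snippet), you can run this in a fresh cell before creating the parser:

```python
# If this import fails, the module that defines the asap parser raises on
# import and the parser never gets registered.
import boto3

print(boto3.__version__)
```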

Please let me know if you still have problems.

Best wishes, Mart

rj678 commented 2 years ago

thanks so much, Mart - that fixed it! I'll follow up on the feedback on the PR.