DIAGNijmegen / pathology-whole-slide-data

A package for working with whole-slide data including a fast batch iterator that can be used to train deep learning models.
https://diagnijmegen.github.io/pathology-whole-slide-data/
Apache License 2.0

read images from s3 bucket #28

Closed rj678 closed 2 years ago

rj678 commented 2 years ago

Draft PR to add functionality to read images directly from an S3 bucket.

Features:

Additional Libraries required:

TODOs based on initial feedback:
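
To illustrate the kind of direct-from-S3 read this PR enables, here is a minimal, hypothetical sketch using boto3; the bucket and key names are placeholders, not code from this branch:

```python
# Hypothetical illustration only: the bucket and key names are
# placeholders, not part of this PR.
import boto3

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-slide-bucket", Key="slides/case_001.tif")
slide_bytes = obj["Body"].read()  # raw bytes of the whole-slide image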

martvanrijthoven commented 2 years ago

Thank you for creating this pull request and the cool TODO list.

One comment already: I think it would be nice to add this feature in its own dedicated accessories folder. The accessories folder contains code for external software/packages etc. to keep everything well organized. I think it would be great to have an s3 folder in accessories where we can put all your work.

Best wishes, Mart

rj678 commented 2 years ago

Hi Mart, I've separated out the code to read images and annotations into the accessories folder (with separate subfolders for images and annotations). I'll clean up the code, but I thought I'd get your feedback first and check whether this is along the lines of what you recommended. Thanks!

martvanrijthoven commented 2 years ago

This already looks pretty cool!

I have some comments, though. I don't think we need to change the WholeSlideAnnotation class by introducing a storage_source; the only problem is that the path does not exist locally. I think we can solve this by checking what kind of parser we are given. For example, we can create a CloudAnnotationParser class, check if the given parser is of that type, and then verify that the path exists in another way, maybe using the requests library. We can make a class method for that on CloudAnnotationParser.

Also, I think it would be nice to make an S3AnnotationParser. For now we can start with an S3AsapAnnotationParser that inherits from CloudAnnotationParser and AsapAnnotationParser. We have to change AsapAnnotationParser a bit: I think we can add a method _open_annotation that opens the annotations. For S3AsapAnnotationParser, we then only need to override this method with the code you already wrote. What do you think about this? A rough sketch of what I mean is below.
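
Just a sketch: the class and method names follow the discussion above, but the signatures are assumptions, and the AsapAnnotationParser shown here is a simplified stand-in for the real class in the package.

```python
# Sketch of the proposed hierarchy. CloudAnnotationParser and
# S3AsapAnnotationParser follow the discussion; AsapAnnotationParser is a
# simplified stand-in, not the package's actual implementation.
import boto3
from botocore.exceptions import ClientError


class CloudAnnotationParser:
    @classmethod
    def path_exists(cls, path: str) -> bool:
        """Check remotely whether the annotation file exists."""
        raise NotImplementedError


class AsapAnnotationParser:  # simplified stand-in
    def _open_annotation(self, path: str) -> bytes:
        # the local parser would open the ASAP XML from disk here
        with open(path, "rb") as f:
            return f.read()


class S3AsapAnnotationParser(CloudAnnotationParser, AsapAnnotationParser):
    @classmethod
    def path_exists(cls, path: str) -> bool:
        bucket, _, key = path.replace("s3://", "", 1).partition("/")
        try:
            boto3.client("s3").head_object(Bucket=bucket, Key=key)
            return True
        except ClientError:
            return False

    def _open_annotation(self, path: str) -> bytes:
        # override: fetch the annotation XML from S3 instead of from disk
        bucket, _, key = path.replace("s3://", "", 1).partition("/")
        obj = boto3.client("s3").get_object(Bucket=bucket, Key=key)
        return obj["Body"].read()
```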

I can also add some code if you like. You can allow me to do so as described here: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/allowing-changes-to-a-pull-request-branch-created-from-a-fork

Either way is fine with me :).

Please let me know if you have any questions.

rj678 commented 2 years ago

Thanks so much for the detailed feedback, Mart; I'll follow up on it. You should be able to commit code to this PR's branch, so it'd be amazing if you could add some code. Looking forward to collaborating on this!

rj678 commented 2 years ago

Hi Mart,

I've pushed the following changes:

- create CloudAnnotationParser, with a class method that checks whether the S3 file exists
- create a method _open_annotation in the existing AsapAnnotationParser
- create and register S3AsapAnnotationParser, which inherits from CloudAnnotationParser and AsapAnnotationParser and overrides the _open_annotation method

I've tested the changes. The code can still be cleaned up, but it'd be great to get your feedback on whether this is along the lines of what you expected. Thank you!
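
For reference, this is roughly how I've been exercising it. The import path, registered parser name, and keyword arguments below are assumptions about the work in this branch, not necessarily the final API:

```python
# Illustrative usage only; the registered parser name and the constructor
# arguments are assumptions, and the path is a placeholder.
from wholeslidedata import WholeSlideAnnotation

wsa = WholeSlideAnnotation(
    "s3://my-bucket/annotations/case_001.xml",  # placeholder path
    parser="s3_asap",  # assumed key for the registered S3AsapAnnotationParser
)
print(len(wsa.annotations))
```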

martvanrijthoven commented 2 years ago

Dear Rishi,

Thank you so much for your work!

I made some small changes and pushed them. Please check them, and if you agree, I think this feature can go into the main branch. We might reduce the redundancy between the OpenSlide backend and the TiffSlide backend, but we can look at this later. I think it is fine for now.

rj678 commented 2 years ago

Hi Mart, thanks so much for the commits! Please let me know if you'd like any further changes for now. Next, I plan to use this to train a model by reading data directly from S3. Thanks again!

martvanrijthoven commented 2 years ago

Dear Rishi,

I think this is a great contribution. I tested your notebook locally, and loading images from S3 took quite some time. Please note that training locally with the new image backend is pretty slow; I highly recommend reading the data directly from AWS only when you run on something like SageMaker. Furthermore, I noticed that on Google Colab you used Python 3.7. Please also note that the batch iterator requires at least Python 3.8, because it uses shared memory to speed up data loading.
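
To illustrate why: multiprocessing.shared_memory only exists from Python 3.8 onward. A minimal, self-contained example of the mechanism (the shapes and names here are illustrative, not the batch iterator's internals):

```python
# Minimal illustration of multiprocessing.shared_memory (added in Python
# 3.8). Shapes and names are illustrative, not the batch iterator's code.
import numpy as np
from multiprocessing import shared_memory

# a worker allocates a shared block and writes a patch-sized array into it
shm = shared_memory.SharedMemory(create=True, size=256 * 256 * 3)
patch = np.ndarray((256, 256, 3), dtype=np.uint8, buffer=shm.buf)
patch[:] = 0  # decoded pixels would be written here

# another process can attach by name and read without copying the pixels
attached = shared_memory.SharedMemory(name=shm.name)
view = np.ndarray((256, 256, 3), dtype=np.uint8, buffer=attached.buf)

attached.close()
shm.close()
shm.unlink()
```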

I will merge your pull request now. Thanks again for your contribution!

rj678 commented 2 years ago

Thanks so much! I'll keep these points in mind. Looking forward to using this library some more!