bioio-devs / bioio

Image reading, metadata management, and image writing for Microscopy images in Python

https://bioio-devs.github.io/bioio/OVERVIEW.html

BSD 3-Clause "New" or "Revised" License

58 stars 4 forks source link

HTTP URL for ZARR fails to be validated by `determine_plugin` for bioio-ome-zarr #41

Closed pgarrison closed 7 months ago

pgarrison commented 7 months ago

Description: see title

Code to reproduce

from bioio import BioImage
path = "https://allencell.s3.amazonaws.com/aics/nuc_morph_data/data_for_analysis/baseline_colonies/20200323_09_small/raw.ome.zarr"
image = BioImage(path)  # FileNotFoundError
print(image.get_image_dask_data())

Workaround

If the OME ZARR reader is specified explicitly, determine_plugin is bypassed and the image can be loaded. bioio.BioImage(path, reader=bioio_ome_zarr.Reader).

Environment

I'm testing against bioio-ome-zarr PR 17 which improves support for reading from S3.

toloudis commented 7 months ago

So here's my take: pathlike_to_fs is being called with enforce_exists True. The filesystem is correctly deduced to be HTTPFileSystem.

However, the path to the zarr is a directory, which fails the fs.exists function call. If somehow we could append .zgroup or .zattrs then the existence check would pass and we'd be in good shape.

Currently this function call is not reader-specific. The readers don't get a chance to modify the URL or add extra kwargs before pathlike_to_fs is called. This seems to be a remnant of the assumption that the url is always a file and not a directory - we'd be able to globally check for the existence independent of any Reader implementation.

So, one question is: should we loop over each Reader, and let the ones that pass our test of potentially supporting the file type, each do their own existence check? (With maybe a default fallback in bioio-base.Reader?)

evamaxfield commented 7 months ago

So here's my take: pathlike_to_fs is being called with enforce_exists True. The filesystem is correctly deduced to be HTTPFileSystem.

However, the path to the zarr is a directory, which fails the fs.exists function call. If somehow we could append .zgroup or .zattrs then the existence check would pass and we'd be in good shape.

Is this because fs.exists() only returns True if the path is a file and not a directory? (I sort of assume so because blob storage is all fake directories anyway.)

In which case completely agree that we should check for a subfile. Completely forgot that fsspec interacts this way.

evamaxfield commented 7 months ago

So, one question is: should we loop over each Reader, and let the ones that pass our test of potentially supporting the file type, each do their own existence check? (With maybe a default fallback in bioio-base.Reader?)

I think this is good use case for having a base Reader impl and allowing for custom Reader impls that we just call to.

pgarrison commented 7 months ago

So here's my take: pathlike_to_fs is being called with enforce_exists True. The filesystem is correctly deduced to be HTTPFileSystem. However, the path to the zarr is a directory, which fails the fs.exists function call. If somehow we could append .zgroup or .zattrs then the existence check would pass and we'd be in good shape.

Is this because fs.exists() only returns True if the path is a file and not a directory? (I sort of assume so because blob storage is all fake directories anyway.)

Yes, at least when accessing S3 via the HTTPFileSystem. I think s3fs does not have this issue.

pgarrison commented 7 months ago

So, one question is: should we loop over each Reader, and let the ones that pass our test of potentially supporting the file type, each do their own existence check? (With maybe a default fallback in bioio-base.Reader?)

I notice that bioio already loops over each Reader to call is_supported_image, so it seems redundant to have two loops, one to check if the file exists and another to check if it's supported.

toloudis commented 7 months ago

So, one question is: should we loop over each Reader, and let the ones that pass our test of potentially supporting the file type, each do their own existence check? (With maybe a default fallback in bioio-base.Reader?)

I notice that bioio already loops over each Reader to call is_supported_image, so it seems redundant to have two loops, one to check if the file exists and another to check if it's supported.

It did occur to me to have enforce_exists false in the first check, and then in the per-reader check we can do special things like appending .zattrs or .zgroup to the path.