Open tcompa opened 3 days ago
Hi @tcompa,
Thanks for the example. I have played a bit with it, and it seems that fully supporting fsspec
stores in ngio is not going to be too hard:
It took only a few minor changes #10
from ngio import NgffImage
import matplotlib.pyplot as plt
import fsspec.implementations.http
fs = fsspec.implementations.http.HTTPFileSystem(client_kwargs={})
url = (
"https://raw.githubusercontent.com/tcompa/hosting-ome-zarr-on-github/refs/heads/main/"
"20200812-CardiomyocyteDifferentiation14-Cycle1_mip.zarr/B/03/0/"
)
store = fs.get_mapper(url)
ngff_image = NgffImage(store)
print(f"list of images: {ngff_image.levels_paths}")
image = ngff_image.get_image(path="2")
print(f'list labels: {ngff_image.label.list()}')
nuclei = ngff_image.label.get_label("nuclei")
print(f"nuclei: {nuclei.get_array(mode='dask').shape}")
print(f"image: {image.get_array(mode='dask').shape}")
Should produce:
list of images: ['0', '1', '2', '3', '4']
list labels: ['nuclei']
nuclei: (1, 540, 1280)
image: (3, 1, 540, 1280)
The only part where support would require a bit more work is the tables.
The reason is that I rely on the Zarr.Group.groups
methods to validate the coherency between metadata and disk.
This does not work on remote storage, so I must change my approach.
Concretely, if I try:
import fsspec.implementations.http
import zarr
import dask.array as da
fs = fsspec.implementations.http.HTTPFileSystem(client_kwargs={})
url = (
"https://raw.githubusercontent.com/tcompa/hosting-ome-zarr-on-github/refs/heads/main/"
"20200812-CardiomyocyteDifferentiation14-Cycle1_mip.zarr/B/03/0/"
)
store = fs.get_mapper(url)
group_zarr = zarr.open_group(store)
print("list of subgroups:", list(group_zarr.groups()))
print("list of arrays:", list(group_zarr.arrays()))
the subgroups and sub-arrays are not found:
list of subgroups: []
list of arrays: []
I can see two strategies:
It took only a few minor changes https://github.com/fractal-analytics-platform/ngio/pull/10
This is very encouraging!
Without any deep knowledge of ngio from my side, it seems nice to be able to "inject" some fsspec native object without ngio knowing too much about fsspec itself.
The only part where support would require a bit more work is the tables. The reason is that I rely on the Zarr.Group.groups methods to validate the coherency between metadata and disk. This does not work on remote storage, so I must change my approach.
If you look at https://zarr.readthedocs.io/en/stable/_modules/zarr/hierarchy.html#Group.groups, you'll see that there is a different behavior for zarr v2 and v3.
* One branch calls the zarr.storage.listdir function (https://zarr.readthedocs.io/en/stable/_modules/zarr/storage.html#listdir) directly.
* The other branch goes through group_keys (which eventually still calls listdir, but it's within a more complex code block and I did not dig too deep).
Just to make sure: is this another instance of https://github.com/zarr-developers/zarr-python/issues/1568?
In that case, there is no obvious way out - see these quotes from that issue:
Since HTTP can't really do file listing (except a few special cases derived from FTP), it can only be used with datasets that have consolidated metadata. Without [consolidated] metadata, zarr needs listing to know what arrays are contained in a group.
does this store have a consolidated metadata object (.zmetadata) at its root? Without it, it won't be able to list members.
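To make the role of consolidated metadata concrete, here is a minimal sketch that uses only the Python standard library. `NonListableStore` and `children_from_consolidated` are hypothetical names invented for this illustration; they are not part of zarr or ngio. The toy store mimics an HTTP-backed store (keys can be fetched but not listed), and the `.zmetadata` layout follows the zarr v2 consolidated format (a `metadata` dictionary keyed by the path of each `.zgroup`/`.zarray` entry).

```python
import json
from collections.abc import Mapping


class NonListableStore(Mapping):
    """Toy stand-in for an HTTP-backed store: keys can be fetched, not listed."""

    def __init__(self, data):
        self._data = dict(data)

    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        raise NotImplementedError("this store cannot list its keys")

    def __len__(self):
        raise NotImplementedError("this store cannot list its keys")


def children_from_consolidated(store, path=""):
    """Return (groups, arrays) directly under `path`, reading only `.zmetadata`."""
    consolidated = json.loads(store[".zmetadata"])
    clean = path.strip("/")
    prefix = f"{clean}/" if clean else ""
    groups, arrays = set(), set()
    for key in consolidated["metadata"]:
        if not key.startswith(prefix):
            continue
        parts = key[len(prefix):].split("/")
        # A direct child looks like "<name>/.zgroup" or "<name>/.zarray".
        if len(parts) != 2:
            continue
        name, marker = parts
        if marker == ".zgroup":
            groups.add(name)
        elif marker == ".zarray":
            arrays.add(name)
    return sorted(groups), sorted(arrays)


# Toy hierarchy: two arrays at the root and a "labels/nuclei" group.
toy = NonListableStore({
    ".zmetadata": json.dumps({
        "zarr_consolidated_format": 1,
        "metadata": {
            ".zgroup": {"zarr_format": 2},
            "0/.zarray": {},
            "1/.zarray": {},
            "labels/.zgroup": {"zarr_format": 2},
            "labels/nuclei/.zgroup": {"zarr_format": 2},
        },
    }).encode()
})

print(children_from_consolidated(toy))
print(children_from_consolidated(toy, "labels"))
```

In zarr-python v2 the `.zmetadata` key is written by `zarr.consolidate_metadata` and read by `zarr.open_consolidated`; the sketch only shows why a single fetchable key is enough to enumerate members without any listing support.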
To rephrase it as a more concrete comment: the groups issue probably requires understanding what listdir(store, path) does, at least for a couple of stores (the str one and the http one).
Thanks for the resources!
I did not know about consolidated metadata, but it is a great way to group all metadata in a single place. We should have ngio call consolidate every time we create a new element in the Zarr hierarchy. This would make parsing the metadata of large plates much more efficient.
I think, for now, it's ok just to avoid relying on Zarr internals to discover groups and arrays. This logic will be heavily refactored when we switch to v3 anyway.
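One way to avoid listing altogether is to let the OME-Zarr attributes say which children exist, and then fetch those keys directly. The sketch below is an assumption about how this could look, not ngio's actual implementation: `read_json`, `multiscale_paths` and `label_names` are hypothetical helpers, and the attribute layout (`multiscales[0].datasets[*].path` in `.zattrs`, and a `labels` list in `labels/.zattrs`) follows the OME-NGFF v0.4 convention for zarr v2 stores.

```python
import json


def read_json(store, key):
    """Fetch one key from a mapping-like store and decode it as JSON."""
    return json.loads(store[key])


def multiscale_paths(store):
    """Resolution-level paths declared in the image's `.zattrs`."""
    attrs = read_json(store, ".zattrs")
    return [dataset["path"] for dataset in attrs["multiscales"][0]["datasets"]]


def label_names(store):
    """Label names declared in `labels/.zattrs`.

    Returns an empty list if that key, or its `labels` field, is absent.
    """
    try:
        return list(read_json(store, "labels/.zattrs")["labels"])
    except KeyError:
        return []


# A plain dict plays the role of the store; any mapping that supports
# store[key] would be used the same way.
toy_store = {
    ".zattrs": json.dumps(
        {"multiscales": [{"datasets": [{"path": "0"}, {"path": "1"}]}]}
    ),
    "labels/.zattrs": json.dumps({"labels": ["nuclei"]}),
}

print(multiscale_paths(toy_store))
print(label_names(toy_store))
```

Only key lookups are used, so the same functions would apply to a store that cannot enumerate its keys, provided that a missing key raises `KeyError`.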
I have only a small additional question: should ngio be agnostic to auth? I can foresee two cases:
* ngio is instantiated with an authenticated store (like in the example) and knows nothing about auth.
* ngio deals with the authentication internally
In my opinion, at first I would stick with option 1 (ngio knows nothing about authentication, but it can use an arbitrary fsspec store).
The complex part of option 2, in my view, would be the following: In order to set the authentication parameters from within ngio, you'd need to implement logic to decide which fsspec model to use (HTTPFileSystem? other?). If I remember correctly, this is also done in other libraries (zarr and/or dask), meaning there would be a lot of room for either redundant or conflicting logic. It's much easier if ngio can be agnostic and just send the "store" (either a simple path/URL or a full-fledged fsspec object) to the loader.
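Option 1 can be sketched as a small normalization step at the entry point. `as_store` is a hypothetical helper name, not an existing ngio function. It assumes that any `MutableMapping` (for example an fsspec `FSMap` built by the caller with whatever credentials it needs) is passed through untouched, and that only plain paths or URLs are wrapped with `fsspec.get_mapper`.

```python
from collections.abc import MutableMapping
from pathlib import Path


def as_store(store):
    """Hypothetical helper: pass mappings through, wrap plain paths/URLs."""
    if isinstance(store, MutableMapping):
        # Already a store, possibly configured with authentication by the
        # caller: use it as-is, without inspecting how it was built.
        return store
    if isinstance(store, (str, Path)):
        # Lazy import: callers that pass a mapping do not trigger it.
        import fsspec

        return fsspec.get_mapper(str(store))
    raise TypeError(f"unsupported store type: {type(store).__name__}")


# A dict is the simplest MutableMapping and is returned unchanged.
in_memory = {".zgroup": b'{"zarr_format": 2}'}
print(as_store(in_memory) is in_memory)
```

With this shape, the decision about which filesystem class to use and how to authenticate stays entirely on the caller's side.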
To put the question in a broader context: where will it be relevant for ngio to use specific fsspec objects (e.g. the HTTPFileSystem one)? This question is independent of the specific case of auth-related additional parameters, as other configuration parameters could exist as well.
Relevant use cases:
* s3:// URL? Tests in https://github.com/fractal-analytics-platform/localstack-aws-zarr suggested that an s3:// URL is handled fine by both zarr and dask.
Understanding these use cases better would help you decide whether it's relevant for ngio to integrate the creation of fsspec objects.
The main goal of this exploratory issue is to read remote zarrs over HTTP, when these HTTP calls require some authentication/authorization. I would postpone thinking about supporting write operations, especially because I cannot say whether it's a relevant use case (would someone really operate over HTTP, apart from the use case of reading existing datasets?)
The simplest example I can come up with is inspired by, e.g., https://github.com/zarr-developers/zarr-python/issues/1568, https://github.com/zarr-developers/zarr-python/issues/993, and https://github.com/pangeo-forge/pangeo-forge-recipes/issues/222 (the pangeo one is an interesting view into more integrated use cases of globus).
Starting from fsspec.implementations.http.HTTPFileSystem, we can include a client_kwargs argument which is then passed to the underlying aiohttp.ClientSession calls. An example from the fsspec docs is
To use HTTPFileSystem for a zarr array either via zarr-python or dask.array, we can proceed as in
with output
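As an illustration of the client_kwargs mechanism described above, here is a hedged sketch. `bearer_client_kwargs` and `open_remote_group` are hypothetical helper names, and the bearer-token scheme is an assumption (a given server may expect a different header or auth method). The sketch relies on `headers` being a keyword accepted by `aiohttp.ClientSession`, and it assumes zarr-python v2, as in the rest of this thread.

```python
def bearer_client_kwargs(token):
    """Build a `client_kwargs` dict carrying an Authorization header.

    The "Bearer" scheme is an assumption; adapt it to the actual server.
    """
    return {"headers": {"Authorization": f"Bearer {token}"}}


def open_remote_group(url, token):
    """Open a remote zarr group read-only over authenticated HTTP."""
    # Imported lazily: only needed when a remote dataset is actually opened.
    import fsspec.implementations.http
    import zarr

    fs = fsspec.implementations.http.HTTPFileSystem(
        client_kwargs=bearer_client_kwargs(token)
    )
    # The mapper carries the authenticated session, so the consumer of the
    # store does not need to know anything about the token.
    return zarr.open_group(fs.get_mapper(url), mode="r")


# Building the kwargs needs no network access or third-party packages.
print(bearer_client_kwargs("my-token"))
```

Calling `open_remote_group(url, token)` performs real HTTP requests, so it is defined here but not executed.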
Given such a minimal example, the question is whether this could fit anywhere in ngio. To phrase it differently: is it relevant or worthwhile for ngio to integrate fsspec? I do not know ngio well enough to answer.
Next steps, in my understanding: