fractal-analytics-platform / ngio

NGIO is a Python library to streamline OME-Zarr image analysis workflows.
https://fractal-analytics-platform.github.io/ngio/
BSD 3-Clause "New" or "Revised" License

Support reading remote zarrs via authenticated HTTP calls #9

Open tcompa opened 3 days ago

tcompa commented 3 days ago

The main goal of this exploratory issue is to read remote zarrs over HTTP when the HTTP calls require some authentication/authorization. I would postpone thinking about write support, especially because I cannot say whether it is a relevant use case (would someone really operate over HTTP for anything other than reading existing datasets?).


The simplest example I can come up with is inspired by https://github.com/zarr-developers/zarr-python/issues/1568, https://github.com/zarr-developers/zarr-python/issues/993, and https://github.com/pangeo-forge/pangeo-forge-recipes/issues/222 (the pangeo one offers an interesting view into more integrated use cases of Globus).

Starting from fsspec.implementations.http.HTTPFileSystem, we can pass a client_kwargs argument, which is then forwarded to the underlying aiohttp.ClientSession. An example from the fsspec docs is

client_kwargs = {'auth': aiohttp.BasicAuth('user', 'pass')}
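
As a rough sketch of how this could look for a token-protected server: any keyword accepted by aiohttp.ClientSession (for instance headers) can go into client_kwargs. The helper names and the token value below are hypothetical and are not part of fsspec or ngio.

```python
def bearer_client_kwargs(token: str) -> dict:
    """Build client_kwargs that add a bearer token to every request."""
    # `headers` is a regular aiohttp.ClientSession argument, so the session
    # sends this header with each call made by the filesystem.
    return {"headers": {"Authorization": f"Bearer {token}"}}


def open_remote_store(url: str, client_kwargs: dict):
    """Return a key-value mapper over `url`, usable by zarr or dask."""
    # Imported lazily so that the helper above works without fsspec installed.
    import fsspec.implementations.http

    fs = fsspec.implementations.http.HTTPFileSystem(client_kwargs=client_kwargs)
    return fs.get_mapper(url)


# Placeholder token, for illustration only.
client_kwargs = bearer_client_kwargs("my-secret-token")
print(client_kwargs)
```

The mapper returned by open_remote_store can then be used in the same way as the store in the example that follows.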

To use HTTPFileSystem for a zarr array either via zarr-python or dask.array, we can proceed as in

import fsspec.implementations.http
import zarr
import dask.array as da

fs = fsspec.implementations.http.HTTPFileSystem(client_kwargs={})
url = (
    "https://raw.githubusercontent.com/tcompa/hosting-ome-zarr-on-github/refs/heads/main/"
    "20200812-CardiomyocyteDifferentiation14-Cycle1_mip.zarr/B/03/0/0"
)
store = fs.get_mapper(url)

array_zarr = zarr.open_array(store)
print(f"{array_zarr=}")

array_dask = da.from_zarr(store)
print(f"{array_dask=}")

with output

array_zarr=<zarr.core.Array (3, 1, 2160, 5120) uint16>

/somewhere/venv/lib/python3.10/site-packages/zarr/creation.py:614: UserWarning: ignoring keyword argument 'read_only'
  compressor, fill_value = _kwargs_compat(compressor, fill_value, kwargs)

array_dask=dask.array<from-zarr, shape=(3, 1, 2160, 5120), dtype=uint16, chunksize=(1, 1, 2160, 2560), chunktype=numpy.ndarray>

Given this minimal example, the question is whether it could fit anywhere in ngio. To phrase it differently: is it worthwhile for ngio to integrate fsspec? I do not know ngio well enough to answer.


Next steps, in my understanding:

  1. Understand how much ngio is (or can be) integrated with fsspec, and how costly it would be to have an abstraction that propagates user-provided kwargs to the fsspec store.
  2. If we want to proceed, set up a simple test server that serves zarrs over HTTP with some authentication required (see upcoming issue).
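
For step 2, a minimal sketch of such a test server using only the Python standard library could look like the following. It assumes a static bearer token is enough for testing; the class name, function name, and token handling are hypothetical and not taken from any existing test setup.

```python
import functools
import http.server


class TokenAuthHandler(http.server.SimpleHTTPRequestHandler):
    """Serve files only to requests that carry the expected bearer token."""

    def _authorized(self) -> bool:
        expected = f"Bearer {self.server.token}"
        return self.headers.get("Authorization") == expected

    def do_GET(self):
        if not self._authorized():
            self.send_error(401, "Missing or invalid token")
            return
        super().do_GET()

    def do_HEAD(self):
        if not self._authorized():
            self.send_error(401, "Missing or invalid token")
            return
        super().do_HEAD()

    def log_message(self, format, *args):
        pass  # keep test output quiet


def make_server(directory: str, token: str, port: int = 0):
    """Create (but do not start) a server for `directory`; port 0 picks a free port."""
    handler = functools.partial(TokenAuthHandler, directory=directory)
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    server.token = token  # read by TokenAuthHandler._authorized
    return server


# Usage (blocking), with a placeholder directory and token:
#   server = make_server("/path/to/zarr/parent", token="my-secret-token")
#   server.serve_forever()
```

In a test suite, serve_forever would typically run in a background thread, and the client side would pass the same token through client_kwargs.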

lorenzocerrone commented 2 days ago

Hi @tcompa,

Thanks for the example. I have played with it a bit, and it seems that fully supporting fsspec stores in ngio is not going to be too hard. It took only a few minor changes (#10):

import fsspec.implementations.http

from ngio import NgffImage

fs = fsspec.implementations.http.HTTPFileSystem(client_kwargs={})
url = (
    "https://raw.githubusercontent.com/tcompa/hosting-ome-zarr-on-github/refs/heads/main/"
    "20200812-CardiomyocyteDifferentiation14-Cycle1_mip.zarr/B/03/0/"
)
store = fs.get_mapper(url)

ngff_image = NgffImage(store)
print(f"list of images: {ngff_image.levels_paths}")
image = ngff_image.get_image(path="2")

print(f'list labels: {ngff_image.label.list()}')
nuclei = ngff_image.label.get_label("nuclei")

print(f"nuclei: {nuclei.get_array(mode='dask').shape}")
print(f"image: {image.get_array(mode='dask').shape}")

This should produce:

list of images: ['0', '1', '2', '3', '4']
list labels: ['nuclei']
nuclei: (1, 540, 1280)
image: (3, 1, 540, 1280)

The only part where support would require a bit more work is the tables. The reason is that I rely on the zarr.Group.groups method to validate the coherency between metadata and disk. This does not work on remote storage, so I must change my approach.

Concretely, if I try:

import fsspec.implementations.http
import zarr

fs = fsspec.implementations.http.HTTPFileSystem(client_kwargs={})
url = (
    "https://raw.githubusercontent.com/tcompa/hosting-ome-zarr-on-github/refs/heads/main/"
    "20200812-CardiomyocyteDifferentiation14-Cycle1_mip.zarr/B/03/0/"
)
store = fs.get_mapper(url)

group_zarr = zarr.open_group(store)
print("list of subgroups:", list(group_zarr.groups()))
print("list of arrays:", list(group_zarr.arrays()))

I don't find the subgroups and sub-arrays:

list of subgroups: []
list of arrays: []

I can see two strategies:

  1. rely on metadata only (simple)
  2. rely on metadata only for remote stores, and keep validating coherency on disk for local stores
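
A minimal sketch of the metadata-only strategy, assuming the store is any key-value mapping (such as the fsspec mapper above) and that containers list their children in their attributes, as OME-Zarr does with "multiscales" for image levels and "labels" for the labels group. The function names are hypothetical and this is not how ngio currently does it.

```python
import json


def read_attrs(store, group_path: str = "") -> dict:
    """Read and parse the .zattrs of a group from a key-value store."""
    key = f"{group_path}/.zattrs" if group_path else ".zattrs"
    return json.loads(store[key])  # raises KeyError if the group has no attributes


def image_level_paths(image_attrs: dict) -> list[str]:
    """List the resolution-level paths declared in the first multiscale entry."""
    datasets = image_attrs["multiscales"][0]["datasets"]
    return [dataset["path"] for dataset in datasets]


def listed_children(container_attrs: dict, key: str) -> list[str]:
    """List the children a container group declares under `key` (e.g. "labels")."""
    return list(container_attrs.get(key, []))
```

Since this only needs direct key lookups (no directory listing), it behaves the same on local and HTTP stores, at the price of trusting the metadata.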

tcompa commented 2 days ago

It took only a few minor changes https://github.com/fractal-analytics-platform/ngio/pull/10

This is very encouraging!

Without any deep knowledge of ngio from my side, it seems nice to be able to "inject" some fsspec native object without ngio knowing too much about fsspec itself.

tcompa commented 2 days ago

The only part where support would require a bit more work is the tables. The reason is that I rely on the Zarr.Group.groups methods to validate the coherency between metadata and disk. This does not work on remote storage, so I must change my approach.

If you look at https://zarr.readthedocs.io/en/stable/_modules/zarr/hierarchy.html#Group.groups, you'll see that there is a different behavior for zarr v2 and v3.


Just to make sure: is this another instance of https://github.com/zarr-developers/zarr-python/issues/1568?

In that case, there is no obvious way out - see these quotes from that issue:

Since HTTP can't really do file listing (except a few special cases derived from FTP), it can only be used with datasets that have consolidated metadata. Without [consolidated] metadata, zarr needs listing to know what arrays are contained in a group.

does this store have a consolidated metadata object (.zmetadata) at its root? Without it, it won't be able to list members.
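
To illustrate why consolidated metadata helps: in zarr v2 the .zmetadata file is a single JSON document whose "metadata" entry maps keys such as "labels/.zgroup" or "0/.zarray" to their contents, so members can be listed without any directory listing. The function below is a hypothetical sketch of that idea; in zarr-python itself the relevant entry points are zarr.consolidate_metadata and zarr.open_consolidated.

```python
def direct_children(zmetadata: dict, group_path: str = "") -> tuple[list[str], list[str]]:
    """Return (subgroups, arrays) directly under `group_path`.

    `zmetadata` is the parsed content of a zarr v2 `.zmetadata` file.
    """
    clean_path = group_path.strip("/")
    prefix = f"{clean_path}/" if clean_path else ""
    groups, arrays = set(), set()
    for key in zmetadata["metadata"]:
        if not key.startswith(prefix):
            continue
        parts = key[len(prefix):].split("/")
        if len(parts) != 2:
            continue  # the group itself, or a deeper descendant
        name, marker = parts
        if marker == ".zgroup":
            groups.add(name)
        elif marker == ".zarray":
            arrays.add(name)
    return sorted(groups), sorted(arrays)


# Hand-written example with the same layout as a consolidated v2 hierarchy.
example = {
    "zarr_consolidated_format": 1,
    "metadata": {
        ".zgroup": {"zarr_format": 2},
        "0/.zarray": {},
        "1/.zarray": {},
        "labels/.zgroup": {"zarr_format": 2},
        "labels/nuclei/.zgroup": {"zarr_format": 2},
        "labels/nuclei/0/.zarray": {},
    },
}
print(direct_children(example))
print(direct_children(example, "labels"))
```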


To rephrase it as a more concrete comment:

lorenzocerrone commented 1 day ago

Thanks for the resources!

I did not know about consolidated metadata, but it is a great way to group all metadata in a single place. We should have ngio call consolidate every time we create a new element in the Zarr hierarchy. This would make parsing the metadata of large plates much more efficient.

I think, for now, it's ok just to avoid relying on Zarr internals to discover groups and arrays. This logic will be heavily refactored when we switch to v3 anyway.

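I have only a small additional question: should ngio be agnostic to auth? I can foresee two cases:

* ngio is instantiated with an authenticated store (like in the example) and knows nothing about auth.
* ngio deals with the authentication internally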

tcompa commented 1 day ago

I have only a small additional question: should ngio be agnostic to auth? I can foresee two cases:

* ngio is instantiated with an authenticated store (like in the example) and knows nothing about auth.
* ngio deals with the authentication internally

In my opinion, at first I would stick with option 1 (ngio knows nothing about authentication, but it can use an arbitrary fsspec store).

The complex part of option 2, in my view, would be the following: In order to set the authentication parameters from within ngio, you'd need to implement logic to decide which fsspec model to use (HTTPFileSystem? other?). If I remember correctly, this is also done in other libraries (zarr and/or dask), meaning there would be a lot of room for either redundant or conflicting logic. It's much easier if ngio can be agnostic and just send the "store" (either a simple path/URL or a full-fledged fsspec object) to the loader.
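
A minimal sketch of what "agnostic" could mean in practice, assuming the loader accepts either a path/URL or an already configured mapping (fsspec mappers are MutableMapping instances). The function name is hypothetical and is not part of ngio.

```python
from collections.abc import MutableMapping
from pathlib import Path


def normalize_store(store):
    """Accept a path/URL or a ready-made key-value store, and nothing else."""
    if isinstance(store, (str, Path)):
        # Plain location: the caller wants default behavior, no extra options.
        return str(store)
    if isinstance(store, MutableMapping):
        # Pre-configured store (e.g. with auth): pass it through untouched.
        return store
    raise TypeError(f"Unsupported store type: {type(store).__name__}")
```

With this shape, all authentication details stay with whoever builds the store, and the library never needs to know which fsspec class is behind it.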


To put the question in a broader context: where will it be relevant for ngio to use specific fsspec objects (e.g. the HTTPFileSystem one)? This question is independent of the specific case of auth-related additional parameters, as there could be other configuration parameters.

Relevant use cases:

  1. Loading data over HTTP, for a viewer plugin.
  2. Writing data over HTTP? I don't think this will ever be relevant.
  3. Accessing data over s3:
  4. Any other?

Understanding these use cases better would help you decide whether it's relevant for ngio to integrate the creation of fsspec objects.