dandi / dandi-cli

DANDI command line client to facilitate common operations
https://dandi.readthedocs.io/
Apache License 2.0

Remotely read/stream embargoed Zarr #1491

Open kabilar opened 3 weeks ago

kabilar commented 3 weeks ago

Hi team, @aaronkanzer and I are trying to read the metadata and chunks of an embargoed Zarr (on DANDI and LINC) but are unable to. What would be the best approach to remotely access a Zarr that is part of an embargoed Dandiset?

For the code snippet below, I can get the zarr_path in the dandiarchive S3 bucket from the File Browser of a Dandiset using the View Asset Metadata button, but s3fs also requires AWS credentials.

import s3fs
import zarr

# AWS credentials (placeholders)
access_key = 'your-access-key-id'
secret_key = 'your-secret-access-key'
session_token = 'your-session-token'  # optional, if using temporary credentials

s3 = s3fs.S3FileSystem(key=access_key, secret=secret_key, token=session_token)

# Zarr location within the dandiarchive bucket (taken from View Asset Metadata)
bucket_name = 'dandiarchive'
zarr_path = 'path/to/your/zarr/data.zarr'

# Map the S3 prefix as a Zarr store and open it read-only
store = s3fs.S3Map(root=f'{bucket_name}/{zarr_path}', s3=s3, check=False)
zarr_array = zarr.open_array(store, mode='r')

For reference, I am also hitting a blocker when using the DANDI API to access a public or private Zarr; I am not sure if this is related to my use case. Using the code snippet below (a derivative of the OpenScope Databook streaming section), I receive a Response [400]. I presume this is because the asset is a Zarr and the response is set in lines 143-148. Perhaps this is related to https://github.com/dandi/dandi-cli/issues/1455.

from dandi import dandiapi

dataset = "000026"
filepath = "sub-I58/ses-Hip-CT/micr/sub-I58_sample-01_chunk-01_hipCT.ome.zarr"
dandi_api_key = "<dandi_api_key>"  # placeholder for your DANDI API key

client = dandiapi.DandiAPIClient(api_url="https://api.dandiarchive.org/api", token=dandi_api_key)

my_dandiset = client.get_dandiset(dandiset_id=dataset, version_id="draft")

file = my_dandiset.get_asset_by_path(filepath)

# HEAD request against the asset's base download URL; this is where the Response [400] comes back
base_url = file.client.session.head(file.base_download_url)

Thank you.

jwodder commented 3 weeks ago

How are you even creating a Zarr in an embargoed Dandiset? dandi-cli currently doesn't support that, and — last time I checked — neither does the Archive.

As to your second snippet, base_download_url and download_url are useless for Zarrs, as they normally point to a single file, but a Zarr is many files. What exactly were you expecting to happen there?
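
(For a public Zarr, one workaround is to take the S3 URL out of the asset's contentUrl metadata rather than a download URL. A minimal sketch, assuming get_content_url() surfaces an S3 prefix of the form https://dandiarchive.s3.amazonaws.com/zarr/<zarr_id>/ for Zarr assets:)

from dandi import dandiapi

client = dandiapi.DandiAPIClient()
asset = client.get_dandiset("000026", "draft").get_asset_by_path(
    "sub-I58/ses-Hip-CT/micr/sub-I58_sample-01_chunk-01_hipCT.ome.zarr"
)

# The S3 entry in contentUrl points at a prefix (a directory of many objects),
# not a single downloadable file.
print(asset.get_content_url(regex="s3.amazonaws.com"))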

yarikoptic commented 3 weeks ago

Indeed, I thought zarrbargo is yet to be implemented, correct @jjnesbitt ?

jjnesbitt commented 3 weeks ago

> Indeed, I thought zarrbargo is yet to be implemented, correct @jjnesbitt ?

Correct, it is not yet implemented.

kabilar commented 3 weeks ago

Thanks team. Sorry, I was a bit loose with my explanation. The first code snippet does not work on LINC, which requires authentication for all requests since the platform is private. (LINC does allow for upload and download of private Zarrs.) The second code snippet does not work on either DANDI or LINC. And as you mentioned, this may not be needed for my use case.

Overall I am just looking for advice on how we should provide LINC users read/streaming access to private Zarrs. We were thinking about creating a helper function (get_read_only_credentials) to work with s3fs as shown below.

import s3fs
import lincbrain  # the proposed helper would live in the LINC client package

aws_credentials = lincbrain.get_read_only_credentials(lincbrain_api_key="<lincbrain_api_key>")

s3 = s3fs.S3FileSystem(key=aws_credentials['access_key'],
                       secret=aws_credentials['secret_key'],
                       token=aws_credentials['session_token'])

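For what it's worth, a hypothetical sketch of what such a helper could look like on the client side; the endpoint path, default API URL, and response field names below are assumptions, not an existing LINC API:

import requests

def get_read_only_credentials(lincbrain_api_key, api_url="https://api.lincbrain.org/api"):
    # Hypothetical endpoint that would mint temporary, read-only STS credentials
    # for the authenticated user; the path and JSON keys are assumptions.
    response = requests.get(
        f"{api_url}/auth/s3-credentials/",
        headers={"Authorization": f"token {lincbrain_api_key}"},
        timeout=30,
    )
    response.raise_for_status()
    data = response.json()
    return {
        "access_key": data["accessKeyId"],
        "secret_key": data["secretAccessKey"],
        "session_token": data["sessionToken"],
    }
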
kabilar commented 3 weeks ago

And thanks for clarifying. I forgot that zarrbargo has not yet been implemented.

satra commented 3 weeks ago

In the 000108 examples in the example-notebooks repository, the code reads and evaluates Zarr objects on dandiarchive.
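
(A minimal sketch of how a public Zarr on dandiarchive can be streamed with s3fs and zarr, assuming the asset's contentUrl includes an S3 URL under the dandiarchive bucket; this is only an illustration of the general pattern, not the exact code in those notebooks:)

import s3fs
import zarr
from dandi import dandiapi

client = dandiapi.DandiAPIClient()
asset = client.get_dandiset("000026", "draft").get_asset_by_path(
    "sub-I58/ses-Hip-CT/micr/sub-I58_sample-01_chunk-01_hipCT.ome.zarr"
)

# e.g. "https://dandiarchive.s3.amazonaws.com/zarr/<zarr_id>/"
s3_url = asset.get_content_url(regex="s3.amazonaws.com")
s3_path = s3_url.replace("https://dandiarchive.s3.amazonaws.com/", "dandiarchive/")

# Anonymous access works for public buckets; embargoed data would still need credentials.
fs = s3fs.S3FileSystem(anon=True)
store = s3fs.S3Map(root=s3_path, s3=fs, check=False)
z = zarr.open(store, mode="r")  # Zarr group or array, depending on the hierarchy
print(z)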

kabilar commented 3 weeks ago

Thank you. I will take a look at these examples.