Closed: @yarikoptic closed this issue 2 years ago
@dchiquito Questions on the zarr upload API:

- The docs for POST /api/zarr/{zarr_id}/upload/ say "The number of files being uploaded must be less than some experimentally defined limit". What is the current/recommended limit?
- For POST /zarr/, what exactly should the "name" field in the request body be set to?
- The field name in the request body for POST /zarr/{zarr_id}/upload/ is "etag", not "md5". Which digest is supposed to be used? Similarly, the docs for POST /api/zarr/{zarr_id}/upload/ say, "Requires a list of file paths and ETags (md5 checksums)". E-tags and MD5 checksums are not the same thing.
- Unlike POST /uploads/initialize/, POST /zarr/{zarr_id}/upload/ does not return information on different parts of files. Does this mean that each file in a zarr directory is to be uploaded in a single request? What if a file exceeds S3's part size limit of 5 GB?
- The Swagger docs don't describe the request body format for DELETE /api/zarr/{zarr_id}/files/.
use the zarr python library to open the nested directory store (i think to start with we will only support this backend). it will check consistency. files will not necessarily have zarr extension. in fact ngff uses .ngff extension. also ngff validator not in place at the moment, so zarr is the closest. i've posted this issue for seeking additional clarity: https://github.com/zarr-developers/zarr-python/issues/912
@dchiquito I'm trying to upload a trivial test zarr with the following code:
```python
#!/usr/bin/env python3
import json
import os
from pathlib import Path
import sys

from dandi.dandiapi import DandiAPIClient, RESTFullAPIClient
from dandi.support.digests import get_dandietag
from dandi.utils import find_files

dandiset_id = sys.argv[1]
zarrdir = Path(sys.argv[2])
if zarrdir.suffix != ".zarr" or not zarrdir.is_dir():
    sys.exit(f"{zarrdir} is not a zarr directory")

with DandiAPIClient.for_dandi_instance(
    "dandi-staging", token=os.environ["DANDI_API_KEY"]
) as client:
    r = client.post("/zarr/", json={"name": zarrdir.name})
    zarr_id = r["zarr_id"]
    zfiles = list(map(Path, find_files(r".*", str(zarrdir))))
    upload_body = []
    for zf in zfiles:
        upload_body.append({
            "path": zf.relative_to(zarrdir).as_posix(),
            "etag": get_dandietag(zf).as_str(),
        })
    r = client.post(f"/zarr/{zarr_id}/upload/", json=upload_body)
    with RESTFullAPIClient("http://nil.nil") as storage:
        for upspec in r:
            with (zarrdir / upspec["path"]).open("rb") as fp:
                storage.put(upspec["upload_url"], data=fp, json_resp=False)
    r = client.post(f"/zarr/{zarr_id}/upload/complete/")
    #print(json.dumps(r, indent=4))
    d = client.get_dandiset(dandiset_id, "draft", lazy=False)
    r = client.post(
        f"{d.version_api_path}assets/",
        json={"metadata": {"path": zarrdir.name}, "zarr_id": zarr_id},
    )
    print(json.dumps(r, indent=4))
```
but I'm getting a 403 response when PUTting the files to S3:
```xml
<Error><Code>AccessDenied</Code><Message>Access Denied</Message><RequestId>Y16M1QN13AM6769S</RequestId><HostId>4uYw6UzqgIdTUIhqRXqW/PDmD6lXmIv53YYPHJFYv/u2rKi62895bV6jSCqrJCKF3qhJZJI1RCw=</HostId></Error>
```
@satra

> files will not necessarily have zarr extension

Then how do we tell whether something is a zarr directory or not?
> Then how do we tell whether something is a zarr directory or not?
open it as a NestedDirectoryStore in read mode with the zarr-python library. it should be able to give you information about groups and shapes, and metadata. we would essentially validate based on those and for the moment write our own requirements for the ngff files till the ome-zarr-py library implements those. but since this is generic zarr to start with, i would use the xarray or zarr python libraries to read the dataset.
@satra So we just try to open every single directory with the zarr library and see it if succeeds?
@satra:

> use the zarr python library to open the nested directory store (i think to start with we will only support this backend). it will check consistency. files will not necessarily have zarr extension. in fact ngff uses .ngff extension. also ngff validator not in place at the moment, so zarr is the closest. i've posted this issue for seeking additional clarity: zarr-developers/zarr-python#912
@jwodder

> So we just try to open every single directory with the zarr library and see if it succeeds?
Thus I think we should formalize "supported zarr directory name extensions", e.g. to a set of .zarr and .ngff. Note that in the dandi organize'd scheme (sub-*/{files}) we have only 1 level down, while in BIDS (sub-*/ses-*/{modality}/{files}/potentially-onemore) we have 4.

edit 1: I think it is ok to add zarr as a dependency.
> do formalize "supported zarr directory name extensions" e.g. to a set of .zarr and .ngff
i think for the moment this would be fine.
zarr will also require compression codecs to be added.
@yarikoptic @satra Note that zarr seems to ignore files with invalid/unexpected names, and a *.zarr directory containing only such files with no .zarray or .zgroup is treated as an empty group. How should directories like this be handled?
> a *.zarr directory containing only such files with no .zarray or .zgroup is treated as an empty group. How should directories like this be handled?
for the moment, let's say no empty groups allowed.
@yarikoptic Should --allow-any-path have any effect on the treatment of Zarr directories, particularly invalid ones? If not, should there be a way to force processing of an invalid Zarr directory?
FWIW, pretty much we need zarr_validate (as if to complement pynwb_validate) which would do all those checks. Then it would be interfaced in validate and upload.
> Should --allow-any-path have any effect on the treatment of Zarr directories, particularly invalid ones? If not, should there be a way to force processing of an invalid Zarr directory?
if per above we just add zarr_validate, and since we do allow uploading without validation -- a uniform way would be to "allow" users to upload invalid zarrs if they say so via --validation [skip|ignore]
@yarikoptic So, in the absence of or prior to validation, would we just treat any directory with a .zarr or .ngff extension as a Zarr directory?
Also, how should metadata be determined for Zarr assets? (cc @satra)
> So, in the absence of or prior to validation, would we just treat any directory with a .zarr or .ngff extension as a Zarr directory?
Yes
for the moment i would limit metadata extraction similar to the bids data, so based on names rather than internal metadata. in the future once we get better ngff metadata we will write additional extractors. i can help with the metadata extraction beyond basic metadata (size, encoding type, subject id). for dandi this would be part of bids datasets, so we will have additional info available for sticking into the asset metadata from participants.tsv and samples.tsv files.
@yarikoptic FYI: I'm implementing the metadata & validation for Zarr by giving NWB, Zarr, and generic files their own classes with metadata and validation methods; however, fscacher doesn't currently support caching of instance methods, so some caching is going to have to be disabled for now.
hm, do you see how fscacher could gain support for bound methods? if not, I wonder if we shouldn't just concentrate logic in @staticmethods of such classes which would be explicitly passed a path instead of an instance?
@yarikoptic We would have to add a variant of memoize_path() that gets the affected path from a given attribute of the decorated method's self parameter.
right -- sounds good! somehow it didn't occur to me ;)
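As a rough illustration of what such a variant might look like (a pure-stdlib sketch; `memoize_path_attr`, the attribute name, and the (mtime, size) fingerprinting are hypothetical stand-ins, not fscacher's actual API, which also persists the cache and handles many more edge cases):

```python
import os
from functools import wraps

def memoize_path_attr(attr):
    """Memoize a method, keyed on the path stored in ``getattr(self, attr)``.

    The cached value is invalidated whenever the file's (mtime, size)
    fingerprint changes -- a crude stand-in for fscacher's real logic.
    """
    def decorator(method):
        cache = {}

        @wraps(method)
        def wrapper(self, *args):
            path = getattr(self, attr)
            st = os.stat(path)
            # Key on the path, its fingerprint, and positional args
            # (kwargs omitted from the key for brevity).
            key = (path, st.st_mtime_ns, st.st_size, args)
            if key not in cache:
                cache[key] = method(self, *args)
            return cache[key]

        return wrapper

    return decorator

class Asset:
    def __init__(self, filepath):
        self.filepath = filepath
        self.calls = 0  # only to observe caching in this demo

    @memoize_path_attr("filepath")
    def get_metadata(self):
        self.calls += 1
        return {"size": os.path.getsize(self.filepath)}
```

Repeated calls on the same unchanged file then hit the cache; touching or rewriting the file produces a new fingerprint and recomputes.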
The docs for POST /api/zarr/{zarr_id}/upload/ say "The number of files being uploaded must be less than some experimentally defined limit". What is the current/recommended limit?
Currently there is no limit. Now that we have the code up on a real server I need to do some experimentation to determine that value. The limit is mostly there to enforce that no request takes longer than 30 seconds, since Heroku will forcibly cancel any requests that exceed that timeout.
Number 3 in "Requirements" states "The CLI uses some kind of tree hashing scheme to compute a checksum for the entire zarr archive," but the "Upload flow" section makes no mention of such a checksum. Does the client have to compute a checksum or not, and, if it does, where is it used?
The client would need to calculate the checksum to do integrity checking of the zarr upload. I wrote https://github.com/dandi/dandi-api/blob/master/dandiapi/api/zarr_checksums.py to handle all of that logic, and I put it in dandi-api for now for quicker iteration. It should be moved to a common location soon so that the CLI can use it as well.
The document says "The asset metadata will contain the information required to determine if an asset is a normal file blob or a zarr file." Exactly what information is that, and is it added by the client or the server?
That seems ambiguous to me at the moment. The server is not adding that metadata right now. @satra any preference on where/how asset metadata is updated?
For POST /zarr/, what exactly should the "name" field in the request body be set to?
I would assume that would be the name of the directory containing the zarr data, unless zarr metadata contains a more descriptive name.
".checksum files format" says "Each zarr file and directory in the archive has a path and a checksum (md5). For files, this is simply the ETag." So is the checksum for a file an md5 or a DANDI e-tag? If the latter, why are we using different digests for files on their own versus in directories? "Upload flow" states "The client calculates the MD5 of each file." and "The client sends the paths+MD5s to ...", but the field name in the request body for POST /zarr/{zarr_id}/upload/ is "etag", not "md5". Which digest is supposed to be used? Similarly, the docs for POST /api/zarr/{zarr_id}/upload/ say, "Requires a list of file paths and ETags (md5 checksums)". E-tags and MD5 checksums are not the same thing.
I would say we are using simple MD5, and that for these purposes MD5 and ETag are the same thing. Files in the zarr archive are limited to 5GB so that they can use the simple upload endpoint, and for files uploaded simply (as opposed to multipart), their ETag is the MD5 of the file. DANDI Etags are defined to be the ETags of multipart uploads, which are MD5s with a suffix relating to the number of parts.
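For concreteness, the difference can be sketched with hashlib alone (the multipart construction below follows S3's widely documented ETag behavior -- the MD5 of the concatenated binary part digests plus a part-count suffix -- and is illustrative, not dandi code):

```python
from hashlib import md5

data = b"x" * (10 * 1024 * 1024)  # a 10 MiB object

# Simple (single-request) upload: the S3 ETag is just the MD5 hex digest.
simple_etag = md5(data).hexdigest()

# Multipart upload with 5 MiB parts: the ETag is the MD5 of the
# concatenated *binary* MD5 digests of each part, plus "-<part count>".
part_size = 5 * 1024 * 1024
parts = [data[i:i + part_size] for i in range(0, len(data), part_size)]
multipart_etag = (
    md5(b"".join(md5(p).digest() for p in parts)).hexdigest()
    + f"-{len(parts)}"
)
```

Since zarr component files are capped at 5 GB and uploaded in one request, the plain `simple_etag` form is what the `"etag"` field expects here.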
Unlike POST /uploads/initialize/, POST /zarr/{zarr_id}/upload/ does not return information on different parts of files. Does this mean that each file in a zarr directory is to be uploaded in a single request? What if a file exceeds S3's part size limit of 5 GB?
Correct. Multipart upload requires substantially more engineering, so the executive decision was made to cap zarr component files at 5GB. Zarr is relatively easy to simply rechunk smaller if that is exceeded.
The Swagger docs don't describe the request body format for DELETE /api/zarr/{zarr_id}/files/.
Sorry, my bad. https://github.com/dandi/dandi-api/issues/648
How do you download a Zarr?
That is done directly from S3 using the zarr library. You should be able to pass the S3 path of the zarr directory and zarr will read that directly from S3.
> That seems ambiguous to me at the moment. The server is not adding that metadata right now. @satra any preference on where/how asset metadata is updated?
this should be done as is done currently by the CLI for other types of files. the exact details of encodingformat can be worked out in the relevant CLI PR.
@jwodder I confirm that I am encountering the same error with your script. I will look closer next week.
@jwodder I got the script working against dandi-api-local-docker-tests by changing the ETag from get_dandietag(zf).as_str() to md5(blob).hexdigest(). I'm still getting the 403 errors in staging though, so I am still investigating.
@jwodder I missed a spot with AWS permissions; the staging API has been updated. You need to specify X-Amz-ACL: bucket-owner-full-control as a request header for the upload.
My adaptation of your example looks like this:
```python
# (assumes the same imports as the script above, plus: from hashlib import md5)
with DandiAPIClient.for_dandi_instance(
    "dandi-staging", token=os.environ["DANDI_API_KEY"]
) as client:
    r = client.post("/zarr/", json={"name": zarrdir.name})
    zarr_id = r["zarr_id"]
    zfiles = list(map(Path, find_files(r".*", str(zarrdir))))
    upload_body = []
    for zf in zfiles:
        with open(zf, "rb") as f:
            blob = f.read()
        upload_body.append({
            "path": zf.relative_to(zarrdir).as_posix(),
            "etag": md5(blob).hexdigest(),  # Simple MD5 instead of dandi-etag
        })
    r = client.post(f"/zarr/{zarr_id}/upload/", json=upload_body)
    with RESTFullAPIClient("http://nil.nil") as storage:
        for upspec in r:
            with (zarrdir / upspec["path"]).open("rb") as fp:
                # The X-Amz-ACL header is required
                storage.put(
                    upspec["upload_url"],
                    data=fp,
                    json_resp=False,
                    headers={"X-Amz-ACL": "bucket-owner-full-control"},
                )
    r = client.post(f"/zarr/{zarr_id}/upload/complete/")
```
@dchiquito
> The client would need to calculate the checksum to do integrity checking of the zarr upload.
Could you elaborate on how this "integrity checking" fits into the upload flow? Exactly what does the client compare its computed value against?
@dchiquito could you please clarify above question of @jwodder ? In the example above you posted I do not see zarr "file" integrity checking anywhere. Is providing it optional???
My example is just John's snippet with some corrections, I did not include integrity checking in it.
After uploading the entire zarr archive, the CLI would query https://api.dandiarchive.org/api/zarr/{zarr_id}/ and compare the checksum value to the checksum calculated against the local files. The file contents were checked as part of the upload process, but this checksum check verifies that all of the files were uploaded into the correct places within the archive.
If it fails, I'm not sure what the best path toward recovery would be. Recursively traversing the archive and identifying where in the tree the checksums begin to deviate would be challenging and potentially a lot of information to present. Aborting and retrying the upload would be simple.
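The general shape of such a tree checksum can be sketched as follows (an illustrative simplification; the actual algorithm and its serialization are defined in zarr_checksums.py, so treat the aggregation details here as assumptions):

```python
import os
from hashlib import md5

def file_md5(path):
    """MD5 hex digest of a file, read in chunks."""
    h = md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def tree_checksum(dirpath):
    """Aggregate per-file MD5s into one digest over sorted
    (relative path, md5) pairs, so both the contents and the
    placement of every file affect the result."""
    entries = []
    for root, _dirs, files in os.walk(dirpath):
        for name in files:
            full = os.path.join(root, name)
            rel = os.path.relpath(full, dirpath).replace(os.sep, "/")
            entries.append((rel, file_md5(full)))
    h = md5()
    for rel, digest in sorted(entries):
        h.update(f"{rel}:{digest}\n".encode())
    return h.hexdigest()
```

Because the digest is deterministic for a given set of (path, content) pairs, the client and server can each compute it independently and compare, without transferring any data.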
the nested directory structure on a filesystem is just one way to store zarr data. and it was adopted to reduce the number of files at a given level by using the / separator vs the . separator which used to be the default. at the end of the day, zarr groups (i.e. n-dimensional arrays) are still simply a list of files. so each group should have a fixed depth and pattern.
@dchiquito Your modified code still isn't working for me; now the PUT fails with a 501 and the response:
```xml
<Error><Code>NotImplemented</Code><Message>A header you provided implies functionality that is not implemented</Message><Header>Transfer-Encoding</Header><RequestId>31Q78EZ8BQKMTNBK</RequestId><HostId>IZBDMAB3lKARg+2grxlKWqjOrv3m6wFHf6X3TdUpicJ+Fidrbh3vwHgFGrPBs65E5BprOM7iFcY=</HostId></Error>
```
EDIT: The problem goes away if I delete an empty file I added to the ZARR dir for testing purposes.
@dchiquito Observations after being able to upload a Zarr to staging:

- The asset structures returned by /dandisets/{versions__dandiset__pk}/versions/{versions__version}/assets/ now include "blob" and "zarr" keys. Are these what should be used to determine whether an asset is a Zarr? If so, we would have a problem when accessing individual assets by ID, as the individual asset endpoints only return asset metadata, which does not include this information. @satra seems to imply that zarrness should be indicated in the metadata by setting the encodingFormat metadata field to some as-yet-undetermined value; I wouldn't trust this unless the server validates correctness of this field.
- The contentURL for a Zarr asset is of the form "zarr/9b107765-7011-4027-bead-81b5bf7aa028/"; how would one use this to download a Zarr?

@jwodder - can you please share a staging api url for a zarr asset so i can see the current metadata? indeed that should be harmonized with the schema across all assets.
for dandi:dandi-zarr-checksum to show up, it would need to be added to dandischema, and as with dandi-etag, i would say the implementation for generating this checksum should also be there. can someone send a PR with this?
regarding the encoding format filed an issue here: https://github.com/zarr-developers/zarr-specs/issues/123
@satra
```
$ curl -fsSL https://api-staging.dandiarchive.org/api/dandisets/100956/versions/draft/assets/2126d0e6-7733-44ac-9ab0-f9ac197c0507/ | jq .
{
  "id": "dandiasset:2126d0e6-7733-44ac-9ab0-f9ac197c0507",
  "path": "example.zarr",
  "digest": {
    "dandi:dandi-zarr-checksum": "bf5da3a8c757e736ac99f9d0d86106f4"
  },
  "contentUrl": [
    "https://api.dandiarchive.org/api/assets/2126d0e6-7733-44ac-9ab0-f9ac197c0507/download/",
    "zarr/9b107765-7011-4027-bead-81b5bf7aa028/"
  ],
  "identifier": "2126d0e6-7733-44ac-9ab0-f9ac197c0507",
  "contentSize": 862
}
```
thank you @jwodder - indeed under contentUrl that prefix should be a full s3 path. let's use encodingFormat: "application/x-zarr" for now and we can adjust if the zarr folks say otherwise.
@dchiquito - any idea why dandiset size is 0 for that dandiset? are zarr assets not being included in the asset count?
filed an api issue here: https://github.com/dandi/dandi-api/issues/674
@satra So, for Zarr assets, currently the metadata set by the client should be the same as for a generic non-NWB file, but with an encodingFormat of "application/x-zarr", correct?
> currently the metadata set by the client should be the same as for a generic non-NWB file
that is correct at present.
however, since the zarr file is also a bids file, we should also extract the bids metadata from the filename. pinging @thechymera as this is related to the bids validation work and should be added to dandi cli for augmenting metadata extraction for bids datasets/assets.
@dchiquito

- Can you provide code for downloading a Zarr from Dandiarchive (assuming the lack of a proper S3 URL in contentUrl is fixed)?
- The /download/ URL listed in a Zarr asset's contentUrl currently returns a 404. Should that URL even be present in the metadata?
- Is the new "blob" key ever of any use/relevance to the client?

@jwodder - downloading may involve opening with the zarr library and then saving locally into the filename with a nested directory storage backend. the alternative would involve listing every object in the store with the uuid prefix, which i think would require an additional api endpoint to return paginated lists of all the objects.
> I see that the asset structures returned by /dandisets/{versions__dandiset__pk}/versions/{versions__version}/assets/ now include "blob" and "zarr" keys. Are these what should be used to determine whether an asset is a Zarr? If so, we would have a problem when accessing individual assets by ID, as the individual asset endpoints only return asset metadata, which does not include this information. @satra seems to imply that zarrness should be indicated in the metadata by setting the encodingFormat metadata field to some as-yet-undetermined value; I wouldn't trust this unless the server validates correctness of this field.
My plan was for those properties to be available to determine zarr-ness. The API currently isn't enforcing an encodingType appropriate for zarr files, but it should: https://github.com/dandi/dandi-api/issues/676
> Is the new "blob" key ever of any use/relevance to the client?
I don't think so.
> The non-API contentURL for a Zarr asset is of the form "zarr/9b107765-7011-4027-bead-81b5bf7aa028/"; how would one use this to download a Zarr?
Coming soon https://github.com/dandi/dandi-api/issues/677
> for dandi:dandi-zarr-checksum to show up, it would need to be added to dandischema, and as with dandi-etag, i would say the implementation for generating this checksum should also be there. can someone send a PR with this?
I added the new digest type in https://github.com/dandi/dandischema/pull/108. I will move the checksum calculation code next week.
> Can you provide code for downloading a Zarr from Dandiarchive (assuming the lack of a proper S3 URL in contentUrl is fixed)?
I will do this ASAP next week. It will involve the zarr python client.
@dchiquito

> the executive decision was made to cap zarr component files at 5GB

So is the client supposed to do file-size checking? Zarr component file sizes don't seem to be reported to the API prior to actually uploading.
Any S3 request to upload a file >5GB will fail, but I'm not sure of when or what the response code would be. If you want a better error message you could do file-size checking preemptively.
> Can you provide code for downloading a Zarr from Dandiarchive (assuming the lack of a proper S3 URL in contentUrl is fixed)?
@jwodder I cannot for the life of me figure out how to properly save a zarr group. This is what I have so far:
```python
import zarr

ZARR_ID = '79bb04a0-3f1b-46ae-8821-1071fc69ed6e'
DOWNLOAD_PATH = '/home/daniel/git/dandi-api/download.zarr'

def download_array(path: str, array: zarr.Array):
    zarr.save(f'{DOWNLOAD_PATH}{array.name}', array)

def download_group(path: str, group: zarr.Group):
    for k in group.array_keys():
        download_array(path, group[k])
    for k in group.group_keys():
        download_group(path, group[k])
    # TODO how to save the .zgroup information?

if __name__ == '__main__':
    z = zarr.open(f'https://api-staging.dandiarchive.org/api/zarr/{ZARR_ID}.zarr/')
    download_group(DOWNLOAD_PATH, z)
    print(z.tree())
```
It will recursively save all of the arrays in the group, which is all of the actual data, but the .zgroup data is lost so zarr can't load the information again.
> @jwodder - downloading may involve opening with the zarr library and then saving locally into the filename with a nested directory storage backend. the alternative would involve listing every object in the store with the uuid prefix, which i think would require an additional api endpoint to return paginated lists of all the objects.
Sounds like using the zarr library is indeed the easiest way forward, but my gut "resists" a little, since it is quite a heavy use of a library to just download content from the archive. Some concerns/questions I have:

- I really hope that we would find some z.save or alike and avoid crafting ad-hoc code, since otherwise every user interfacing with the API directly would need to redo that.
- is there a guarantee that such zarr.save would produce a byte-to-byte identical copy? or could there be some variability in file naming, some timestamps embedded in data, or data files getting recompressed? Would behavior be identical across zarr library version changes? If any of the above have an effect -- our checksum validation would fail, and we lose integrity validation.
- Please correct me if we are "all ok" in the following hypothetical use case: if the archive contains extra files which zarr.save doesn't bother downloading, the checksum would mismatch since locally we would not have those extra files.

So, altogether I feel that we might step into shady territory, which we can at large avoid if we simply extend the API to return a list of "subpaths" for zarr to be downloaded just as regular files, and then not worry about anything from above. WDYT?
edit 1: extra aspects which keep coming up
@yarikoptic
> So, altogether I feel that we might step into shady territory, which we can at large avoid if we simply extend the API to return a list of "subpaths" for zarr to be downloaded just as regular files, and then not worry about anything from above. WDYT?
Sounds like a good idea to me.
ok, re "listing" -- added extra item to @satra 's https://github.com/dandi/dandi-api/issues/674 but I guess could be an issue on its own.
apparently already there! see https://github.com/dandi/dandi-api/issues/674#issuecomment-1011383921 and I checked it in staging swagger -- seems to work nicely, and since we would know all needed paths/urls to download -- we could parallelize it across this swarm of files and otherwise it is just a matter of downloading from urls (Last-Modified from url would be good to set on the file to decide if to redownload)
@yarikoptic So when the client is asked to download a Zarr, should it, instead of downloading a monolithic asset, effectively treat the Zarr as a tree of independent blobs, each one of which is separately compared against whatever's on-disk at the final location? What exactly should pyout display when downloading a multifile Zarr?
> So when the client is asked to download a Zarr, should it, instead of downloading a monolithic asset, effectively treat the Zarr as a tree of independent blobs, each one of which is separately compared against whatever's on-disk at the final location?
correct
edit: with an overall checksum comparison across the tree... if possible to make it "on the fly" as we do with digest for an individual file, but traversing the same order as desired for computation of the zarr checksum, would be awesome!
> What exactly should pyout display when downloading a multifile Zarr?
I think for user reporting (pyout) we could consider zarr file (well -- directory) to be a single asset/file. So % progress would be based on "total" for that directory. Eventually we might want to return/expose some progress in # of files within zarr but I think there is no immediate need for that.
@yarikoptic @dchiquito What should the Zarr upload do if a Zarr contains an empty directory? There doesn't seem to be a way to inform the API of such a directory's existence, yet the directory is still used in calculating the checksum, and so there will be a checksum mismatch after uploading.
I don't think well-formed zarr files can contain empty directories. I would assume it's not a thing, but perhaps we should appeal to someone with more knowledge than I.
S3 has no concept of "directories", so they can't be empty. My vote is that the checksum calculator should just ignore empty directories and the empty directory will not be considered a part of the zarr.
We could assume that any directory with a .zarr extension is a zarr archive, or additionally check for the presence of a .zarray or .zgroup, but I wonder if that would be reliable enough?
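That heuristic could be sketched as follows (illustrative only; as noted above, whether checking for .zarray/.zgroup is reliable enough remains an open question, and the extension set comes from the earlier discussion):

```python
from pathlib import Path

# Supported zarr directory name extensions, per the discussion above.
ZARR_EXTENSIONS = {".zarr", ".ngff"}

def looks_like_zarr(path: Path) -> bool:
    """Heuristic: a directory with a supported extension that contains
    at least one .zarray or .zgroup file somewhere beneath it."""
    if path.suffix not in ZARR_EXTENSIONS or not path.is_dir():
        return False
    return any(path.rglob(".zarray")) or any(path.rglob(".zgroup"))
```

Under the "no empty groups allowed" rule above, a *.zarr directory with neither marker file would be rejected by this check rather than treated as an empty group.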