Closed: @yarikoptic closed this issue 2 years ago
@dchiquito Questions on the zarr upload API:

- The docs for POST /api/zarr/{zarr_id}/upload/ say "The number of files being uploaded must be less than some experimentally defined limit". What is the current/recommended limit?
- For POST /zarr/, what exactly should the "name" field in the request body be set to?
- The field name in the request body for POST /zarr/{zarr_id}/upload/ is "etag", not "md5". Which digest is supposed to be used? Similarly, the docs for POST /api/zarr/{zarr_id}/upload/ say, "Requires a list of file paths and ETags (md5 checksums)". E-tags and MD5 checksums are not the same thing.
- Unlike POST /uploads/initialize/, POST /zarr/{zarr_id}/upload/ does not return information on different parts of files. Does this mean that each file in a zarr directory is to be uploaded in a single request? What if a file exceeds S3's part size limit of 5 GB?
- The Swagger docs don't describe the request body format for DELETE /api/zarr/{zarr_id}/files/.
use the zarr python library to open the nested directory store (i think to start with we will only support this backend). it will check consistency. files will not necessarily have zarr extension. in fact ngff uses .ngff extension. also ngff validator not in place at the moment, so zarr is the closest. i've posted this issue for seeking additional clarity: https://github.com/zarr-developers/zarr-python/issues/912
@dchiquito I'm trying to upload a trivial test zarr with the following code:
```python
#!/usr/bin/env python3
import json
import os
from pathlib import Path
import sys

from dandi.dandiapi import DandiAPIClient, RESTFullAPIClient
from dandi.support.digests import get_dandietag
from dandi.utils import find_files

dandiset_id = sys.argv[1]
zarrdir = Path(sys.argv[2])
if zarrdir.suffix != ".zarr" or not zarrdir.is_dir():
    sys.exit(f"{zarrdir} is not a zarr directory")

with DandiAPIClient.for_dandi_instance(
    "dandi-staging", token=os.environ["DANDI_API_KEY"]
) as client:
    r = client.post("/zarr/", json={"name": zarrdir.name})
    zarr_id = r["zarr_id"]
    zfiles = list(map(Path, find_files(r".*", str(zarrdir))))
    upload_body = []
    for zf in zfiles:
        upload_body.append({
            "path": zf.relative_to(zarrdir).as_posix(),
            "etag": get_dandietag(zf).as_str(),
        })
    r = client.post(f"/zarr/{zarr_id}/upload/", json=upload_body)
    with RESTFullAPIClient("http://nil.nil") as storage:
        for upspec in r:
            with (zarrdir / upspec["path"]).open("rb") as fp:
                storage.put(upspec["upload_url"], data=fp, json_resp=False)
    r = client.post(f"/zarr/{zarr_id}/upload/complete/")
    #print(json.dumps(r, indent=4))
    d = client.get_dandiset(dandiset_id, "draft", lazy=False)
    r = client.post(
        f"{d.version_api_path}assets/",
        json={"metadata": {"path": zarrdir.name}, "zarr_id": zarr_id},
    )
    print(json.dumps(r, indent=4))
```
but I'm getting a 403 response when PUTting the files to S3:
```xml
<Error><Code>AccessDenied</Code><Message>Access Denied</Message><RequestId>Y16M1QN13AM6769S</RequestId><HostId>4uYw6UzqgIdTUIhqRXqW/PDmD6lXmIv53YYPHJFYv/u2rKi62895bV6jSCqrJCKF3qhJZJI1RCw=</HostId></Error>
```
@satra

> files will not necessarily have zarr extension

Then how do we tell whether something is a zarr directory or not?
> Then how do we tell whether something is a zarr directory or not?
open it as a NestedDirectoryStore in read mode with the zarr-python library. it should be able to give you information about groups and shapes, and metadata. we would essentially validate based on those and for the moment write our own requirements for the ngff files till the ome-zarr-py library implements those. but since this is generic zarr to start with, i would use the xarray or zarr python libraries to read the dataset.
@satra So we just try to open every single directory with the zarr library and see it if succeeds?
@satra:

> use the zarr python library to open the nested directory store (i think to start with we will only support this backend). it will check consistency. files will not necessarily have zarr extension. in fact ngff uses .ngff extension. also ngff validator not in place at the moment, so zarr is the closest. i've posted this issue for seeking additional clarity: zarr-developers/zarr-python#912
@jwodder

> So we just try to open every single directory with the zarr library and see if it succeeds?
Thus I think we should formalize "supported zarr directory name extensions", e.g. to a set of .zarr and .ngff. Note that in the dandi organize'd scheme (sub-*/{files}) we have only 1 level down, while in BIDS (sub-*/ses-*/{modality}/{files}/potentially-onemore) we have 4.

edit 1: I think it is ok to add zarr as a dependency.
> do formalize "supported zarr directory name extensions" e.g. to a set of .zarr and .ngff
i think for the moment this would be fine.
zarr will also require compression codecs to be added.
@yarikoptic @satra Note that zarr seems to ignore files with invalid/unexpected names, and a *.zarr directory containing only such files with no .zarray or .zgroup is treated as an empty group. How should directories like this be handled?
> a *.zarr directory containing only such files with no .zarray or .zgroup is treated as an empty group. How should directories like this be handled?
for the moment, let's say no empty groups allowed.
@yarikoptic Should --allow-any-path have any effect on the treatment of Zarr directories, particularly invalid ones? If not, should there be a way to force processing of an invalid Zarr directory?
FWIW, pretty much we need zarr_validate (as if to complement pynwb_validate) which would do all those checks. Then it would be interfaced in validate and upload.
> Should --allow-any-path have any effect on the treatment of Zarr directories, particularly invalid ones? If not, should there be a way to force processing of an invalid Zarr directory?
if per above we just add zarr_validate, and since we do allow uploading without validation -- a uniform way would be to "allow" users to upload invalid zarrs if they say so via --validation [skip|ignore]
@yarikoptic So, in the absence of or prior to validation, would we just treat any directory with a .zarr or .ngff extension as a Zarr directory?
Also, how should metadata be determined for Zarr assets? (cc @satra)
> So, in the absence of or prior to validation, would we just treat any directory with a .zarr or .ngff extension as a Zarr directory?
Yes
for the moment i would limit metadata extraction similar to the bids data, so based on names rather than internal metadata. in the future once we get better ngff metadata we will write additional extractors. i can help with the metadata extraction beyond basic metadata (size, encoding type, subject id). for dandi this would be part of bids datasets, so we will have additional info available for sticking into the asset metadata from participants.tsv and samples.tsv files.
@yarikoptic FYI: I'm implementing the metadata & validation for Zarr by giving NWB, Zarr, and generic files their own classes with metadata and validation methods; however, fscacher doesn't currently support caching of instance methods, so some caching is going to have to be disabled for now.
hm, do you see how fscacher could gain support for bound methods? if not, I wonder if we shouldn't just concentrate logic in @staticmethods of such classes which would be explicitly passed a path instead of an instance?
@yarikoptic We would have to add a variant of memoize_path() that gets the affected path from a given attribute of the decorated method's self parameter.
right -- sounds good! somehow it didn't occur to me ;)
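As a rough illustration of what such a variant might look like (a pure-stdlib sketch; `memoize_path_attr`, the attribute name, and the (mtime, size) fingerprinting are hypothetical stand-ins, not fscacher's actual API, which also persists the cache and handles many more edge cases):

```python
import os
from functools import wraps

def memoize_path_attr(attr):
    """Memoize a method, keyed on the path stored in ``getattr(self, attr)``.

    The cached value is invalidated whenever the file's (mtime, size)
    fingerprint changes -- a crude stand-in for fscacher's real logic.
    """
    def decorator(method):
        cache = {}

        @wraps(method)
        def wrapper(self, *args):
            path = getattr(self, attr)
            st = os.stat(path)
            # Key on the path, its fingerprint, and positional args
            # (kwargs omitted from the key for brevity).
            key = (path, st.st_mtime_ns, st.st_size, args)
            if key not in cache:
                cache[key] = method(self, *args)
            return cache[key]

        return wrapper

    return decorator

class Asset:
    def __init__(self, filepath):
        self.filepath = filepath
        self.calls = 0  # only to observe caching in this demo

    @memoize_path_attr("filepath")
    def get_metadata(self):
        self.calls += 1
        return {"size": os.path.getsize(self.filepath)}
```

Repeated calls on the same unchanged file then hit the cache; touching or rewriting the file produces a new fingerprint and recomputes.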
The docs for POST /api/zarr/{zarr_id}/upload/ say "The number of files being uploaded must be less than some experimentally defined limit". What is the current/recommended limit?
Currently there is no limit. Now that we have the code up on a real server I need to do some experimentation to determine that value. The limit is mostly there to enforce that no request takes longer than 30 seconds, since Heroku will forcibly cancel any requests that exceed that timeout.
Number 3 in "Requirements" states "The CLI uses some kind of tree hashing scheme to compute a checksum for the entire zarr archive," but the "Upload flow" section makes no mention of such a checksum. Does the client have to compute a checksum or not, and, if it does, where is it used?
The client would need to calculate the checksum to do integrity checking of the zarr upload. I wrote https://github.com/dandi/dandi-api/blob/master/dandiapi/api/zarr_checksums.py to handle all of that logic, and I put it in dandi-api for now for quicker iteration. It should be moved to a common location soon so that the CLI can use it as well.
The document says "The asset metadata will contain the information required to determine if an asset is a normal file blob or a zarr file." Exactly what information is that, and is it added by the client or the server?
That seems ambiguous to me at the moment. The server is not adding that metadata right now. @satra any preference on where/how asset metadata is updated?
For POST /zarr/, what exactly should the "name" field in the request body be set to?
I would assume that would be the name of the directory containing the zarr data, unless zarr metadata contains a more descriptive name.
".checksum files format" says "Each zarr file and directory in the archive has a path and a checksum (md5). For files, this is simply the ETag." So is the checksum for a file an md5 or a DANDI e-tag? If the latter, why are we using different digests for files on their own versus in directories? "Upload flow" states "The client calculates the MD5 of each file." and "The client sends the paths+MD5s to ...", but the field name in the request body for POST /zarr/{zarr_id}/upload/ is "etag", not "md5". Which digest is supposed to be used? Similarly, the docs for POST /api/zarr/{zarr_id}/upload/ say, "Requires a list of file paths and ETags (md5 checksums)". E-tags and MD5 checksums are not the same thing.
I would say we are using simple MD5, and that for these purposes MD5 and ETag are the same thing. Files in the zarr archive are limited to 5GB so that they can use the simple upload endpoint, and for files uploaded simply (as opposed to multipart), their ETag is the MD5 of the file. DANDI Etags are defined to be the ETags of multipart uploads, which are MD5s with a suffix relating to the number of parts.
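For concreteness, the difference can be sketched with hashlib alone (the multipart construction below follows S3's widely documented ETag behavior -- the MD5 of the concatenated binary part digests plus a part-count suffix -- and is illustrative, not dandi code):

```python
from hashlib import md5

data = b"x" * (10 * 1024 * 1024)  # a 10 MiB object

# Simple (single-request) upload: the S3 ETag is just the MD5 hex digest.
simple_etag = md5(data).hexdigest()

# Multipart upload with 5 MiB parts: the ETag is the MD5 of the
# concatenated *binary* MD5 digests of each part, plus "-<part count>".
part_size = 5 * 1024 * 1024
parts = [data[i:i + part_size] for i in range(0, len(data), part_size)]
multipart_etag = (
    md5(b"".join(md5(p).digest() for p in parts)).hexdigest()
    + f"-{len(parts)}"
)
```

Since zarr component files are capped at 5 GB and uploaded in one request, the plain `simple_etag` form is what the `"etag"` field expects here.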
Unlike POST /uploads/initialize/, POST /zarr/{zarr_id}/upload/ does not return information on different parts of files. Does this mean that each file in a zarr directory is to be uploaded in a single request? What if a file exceeds S3's part size limit of 5 GB?
Correct. Multipart upload requires substantially more engineering, so the executive decision was made to cap zarr component files at 5GB. Zarr is relatively easy to simply rechunk smaller if that is exceeded.
The Swagger docs don't describe the request body format for DELETE /api/zarr/{zarr_id}/files/.
Sorry, my bad. https://github.com/dandi/dandi-api/issues/648
How do you download a Zarr?
That is done directly from S3 using the zarr library. You should be able to pass the S3 path of the zarr directory and zarr will read that directly from S3.
> That seems ambiguous to me at the moment. The server is not adding that metadata right now. @satra any preference on where/how asset metadata is updated?
this should be done as is done currently by the CLI for other types of files. the exact details of encodingformat can be worked out in the relevant CLI PR.
@jwodder I confirm that I am encountering the same error with your script. I will look closer next week.
@jwodder I got the script working against dandi-api-local-docker-tests by changing the ETag from get_dandietag(zf).as_str() to md5(blob).hexdigest(). I'm still getting the 403 errors in staging though, so I am still investigating.
@jwodder I missed a spot with AWS permissions; the staging API has been updated. You need to specify X-Amz-ACL: bucket-owner-full-control as a request header for the upload.
My adaptation of your example looks like this:
```python
# (assumes the same imports as the script above, plus: from hashlib import md5)
with DandiAPIClient.for_dandi_instance(
    "dandi-staging", token=os.environ["DANDI_API_KEY"]
) as client:
    r = client.post("/zarr/", json={"name": zarrdir.name})
    zarr_id = r["zarr_id"]
    zfiles = list(map(Path, find_files(r".*", str(zarrdir))))
    upload_body = []
    for zf in zfiles:
        with open(zf, "rb") as f:
            blob = f.read()
        upload_body.append({
            "path": zf.relative_to(zarrdir).as_posix(),
            "etag": md5(blob).hexdigest(),  # Simple MD5 instead of dandi-etag
        })
    r = client.post(f"/zarr/{zarr_id}/upload/", json=upload_body)
    with RESTFullAPIClient("http://nil.nil") as storage:
        for upspec in r:
            with (zarrdir / upspec["path"]).open("rb") as fp:
                # The X-Amz-ACL header is required
                storage.put(
                    upspec["upload_url"],
                    data=fp,
                    json_resp=False,
                    headers={"X-Amz-ACL": "bucket-owner-full-control"},
                )
    r = client.post(f"/zarr/{zarr_id}/upload/complete/")
```
@dchiquito
> The client would need to calculate the checksum to do integrity checking of the zarr upload.
Could you elaborate on how this "integrity checking" fits into the upload flow? Exactly what does the client compare its computed value against?
@dchiquito could you please clarify above question of @jwodder ? In the example above you posted I do not see zarr "file" integrity checking anywhere. Is providing it optional???
My example is just John's snippet with some corrections, I did not include integrity checking in it.
After uploading the entire zarr archive, the CLI would query https://api.dandiarchive.org/api/zarr/{zarr_id}/ and compare the checksum value to the checksum calculated against the local files. The file contents were checked as part of the upload process, but this checksum check verifies that all of the files were uploaded into the correct places within the archive.
If it fails, I'm not sure what the best path toward recovery would be. Recursively traversing the archive and identifying where in the tree the checksums begin to deviate would be challenging and potentially a lot of information to present. Aborting and retrying the upload would be simple.
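The general shape of such a tree checksum can be sketched as follows (an illustrative simplification; the actual algorithm and its serialization are defined in zarr_checksums.py, so treat the aggregation details here as assumptions):

```python
import os
from hashlib import md5

def file_md5(path):
    """MD5 hex digest of a file, read in chunks."""
    h = md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def tree_checksum(dirpath):
    """Aggregate per-file MD5s into one digest over sorted
    (relative path, md5) pairs, so both the contents and the
    placement of every file affect the result."""
    entries = []
    for root, _dirs, files in os.walk(dirpath):
        for name in files:
            full = os.path.join(root, name)
            rel = os.path.relpath(full, dirpath).replace(os.sep, "/")
            entries.append((rel, file_md5(full)))
    h = md5()
    for rel, digest in sorted(entries):
        h.update(f"{rel}:{digest}\n".encode())
    return h.hexdigest()
```

Because the digest is deterministic for a given set of (path, content) pairs, the client and server can each compute it independently and compare, without transferring any data.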
the nested directory structure on a filesystem is just one way to store zarr data. and it was adopted to reduce the number of files at a given level by using the / separator vs the . separator which used to be the default. at the end of the day, zarr groups (i.e. n-dimensional arrays) are still simply a list of files. so each group should have a fixed depth and pattern.
@dchiquito Your modified code still isn't working for me; now the PUT fails with a 501 and the response:
```xml
<Error><Code>NotImplemented</Code><Message>A header you provided implies functionality that is not implemented</Message><Header>Transfer-Encoding</Header><RequestId>31Q78EZ8BQKMTNBK</RequestId><HostId>IZBDMAB3lKARg+2grxlKWqjOrv3m6wFHf6X3TdUpicJ+Fidrbh3vwHgFGrPBs65E5BprOM7iFcY=</HostId></Error>
```
EDIT: The problem goes away if I delete an empty file I added to the ZARR dir for testing purposes.
@dchiquito Observations after being able to upload a Zarr to staging:

- The asset structures returned by /dandisets/{versions__dandiset__pk}/versions/{versions__version}/assets/ now include "blob" and "zarr" keys. Are these what should be used to determine whether an asset is a Zarr? If so, we would have a problem when accessing individual assets by ID, as the individual asset endpoints only return asset metadata, which does not include this information. @satra seems to imply that zarrness should be indicated in the metadata by setting the encodingFormat metadata field to some as-yet-undetermined value; I wouldn't trust this unless the server validates correctness of this field.
- The contentURL for a Zarr asset is of the form "zarr/9b107765-7011-4027-bead-81b5bf7aa028/"; how would one use this to download a Zarr?

@jwodder - can you please share a staging api url for a zarr asset so i can see the current metadata? indeed that should be harmonized with the schema across all assets.
for dandi:dandi-zarr-checksum to show up, it would need to be added to dandischema, and as with dandi-etag, i would say the implementation for generating this checksum should also be there. can someone send a PR with this?
regarding the encoding format filed an issue here: https://github.com/zarr-developers/zarr-specs/issues/123
@satra
```
$ curl -fsSL https://api-staging.dandiarchive.org/api/dandisets/100956/versions/draft/assets/2126d0e6-7733-44ac-9ab0-f9ac197c0507/ | jq .
{
  "id": "dandiasset:2126d0e6-7733-44ac-9ab0-f9ac197c0507",
  "path": "example.zarr",
  "digest": {
    "dandi:dandi-zarr-checksum": "bf5da3a8c757e736ac99f9d0d86106f4"
  },
  "contentUrl": [
    "https://api.dandiarchive.org/api/assets/2126d0e6-7733-44ac-9ab0-f9ac197c0507/download/",
    "zarr/9b107765-7011-4027-bead-81b5bf7aa028/"
  ],
  "identifier": "2126d0e6-7733-44ac-9ab0-f9ac197c0507",
  "contentSize": 862
}
```
thank you @jwodder - indeed under contentUrl that prefix should be a full s3 path. let's use encodingFormat: "application/x-zarr" for now and we can adjust if the zarr folks say otherwise.
@dchiquito - any idea why dandiset size is 0 for that dandiset? are zarr assets not being included in the asset count?
filed an api issue here: https://github.com/dandi/dandi-api/issues/674
@satra So, for Zarr assets, currently the metadata set by the client should be the same as for a generic non-NWB file, but with an encodingFormat of "application/x-zarr", correct?
> currently the metadata set by the client should be the same as for a generic non-NWB file
that is correct at present.
however, since the zarr file is also a bids file, we should also extract the bids metadata from the filename. pinging @thechymera as this is related to the bids validation work and should be added to dandi cli for augmenting metadata extraction for bids datasets/assets.
@dchiquito

- Can you provide code for downloading a Zarr from Dandiarchive (assuming the lack of a proper S3 URL in contentUrl is fixed)?
- The /download/ URL listed in a Zarr asset's contentUrl currently returns a 404. Should that URL even be present in the metadata?
- Is the new "blob" key ever of any use/relevance to the client?

@jwodder - downloading may involve opening with the zarr library and then saving locally into the filename with a nested directory storage backend. the alternative would involve listing every object in the store with the uuid prefix, which i think would require an additional api endpoint to return paginated lists of all the objects.
> I see that the asset structures returned by /dandisets/{versions__dandiset__pk}/versions/{versions__version}/assets/ now include "blob" and "zarr" keys. Are these what should be used to determine whether an asset is a Zarr? If so, we would have a problem when accessing individual assets by ID, as the individual asset endpoints only return asset metadata, which does not include this information. @satra seems to imply that zarrness should be indicated in the metadata by setting the encodingFormat metadata field to some as-yet-undetermined value; I wouldn't trust this unless the server validates correctness of this field.
My plan was for those properties to be available to determine zarr-ness. The API currently isn't enforcing an encodingType appropriate for zarr files, but it should: https://github.com/dandi/dandi-api/issues/676
> Is the new "blob" key ever of any use/relevance to the client?
I don't think so.
> The non-API contentURL for a Zarr asset is of the form "zarr/9b107765-7011-4027-bead-81b5bf7aa028/"; how would one use this to download a Zarr?
Coming soon https://github.com/dandi/dandi-api/issues/677
> for dandi:dandi-zarr-checksum to show up, it would need to be added to dandischema, and as with dandi-etag, i would say the implementation for generating this checksum should also be there. can someone send a PR with this?
I added the new digest type in https://github.com/dandi/dandischema/pull/108. I will move the checksum calculation code next week.
> Can you provide code for downloading a Zarr from Dandiarchive (assuming the lack of a proper S3 URL in contentUrl is fixed)?
I will do this ASAP next week. It will involve the zarr python client.
@dchiquito

> the executive decision was made to cap zarr component files at 5GB

So is the client supposed to do file-size checking? Zarr component file sizes don't seem to be reported to the API prior to actually uploading.
Any S3 request to upload a file >5GB will fail, but I'm not sure of when or what the response code would be. If you want a better error message you could do file-size checking preemptively.
> Can you provide code for downloading a Zarr from Dandiarchive (assuming the lack of a proper S3 URL in contentUrl is fixed)?
@jwodder I cannot for the life of me figure out how to properly save a zarr group. This is what I have so far:
```python
import zarr

ZARR_ID = '79bb04a0-3f1b-46ae-8821-1071fc69ed6e'
DOWNLOAD_PATH = '/home/daniel/git/dandi-api/download.zarr'

def download_array(path: str, array: zarr.Array):
    zarr.save(f'{DOWNLOAD_PATH}{array.name}', array)

def download_group(path: str, group: zarr.Group):
    for k in group.array_keys():
        download_array(path, group[k])
    for k in group.group_keys():
        download_group(path, group[k])
    # TODO how to save the .zgroup information?

if __name__ == '__main__':
    z = zarr.open(f'https://api-staging.dandiarchive.org/api/zarr/{ZARR_ID}.zarr/')
    download_group(DOWNLOAD_PATH, z)
    print(z.tree())
```
It will recursively save all of the arrays in the group, which is all of the actual data, but the .zgroup data is lost so zarr can't load the information again.
> @jwodder - downloading may involve opening with the zarr library and then saving locally into the filename with a nested directory storage backend. the alternative would involve listing every object in the store with the uuid prefix, which i think would require an additional api endpoint to return paginated lists of all the objects.
Sounds like using the zarr library is indeed the easiest way forward, but my gut "resists" a little, since it is quite a heavy use of a library to just download content from the archive. Some concerns/questions I have:

- I really hope that we would find some z.save or alike and avoid crafting ad-hoc code, since otherwise every user interfacing with the API directly would need to redo that.
- is there a guarantee that such zarr.save would produce a byte-to-byte identical copy? or could there be some variability in file naming, some timestamps embedded in data, or data files getting recompressed? Would behavior be identical across zarr library version changes? If any of the above have an effect -- our checksum validation would fail, and we lose integrity validation.
- Please correct me if we are "all ok" in the following hypothetical use case: if the archive contains extra files which zarr.save doesn't bother downloading, the checksum would mismatch since locally we would not have those extra files.

So, altogether I feel that we might step into shady territory, which we can at large avoid if we simply extend the API to return a list of "subpaths" for zarr to be downloaded just as regular files, and then not worry about anything from above. WDYT?
edit 1: extra aspects which keep coming up
@yarikoptic
> So, altogether I feel that we might step into shady territory, which we can at large avoid if we simply extend the API to return a list of "subpaths" for zarr to be downloaded just as regular files, and then not worry about anything from above. WDYT?
Sounds like a good idea to me.
ok, re "listing" -- added extra item to @satra 's https://github.com/dandi/dandi-api/issues/674 but I guess could be an issue on its own.
apparently already there! see https://github.com/dandi/dandi-api/issues/674#issuecomment-1011383921 and I checked it in staging swagger -- seems to work nicely, and since we would know all needed paths/urls to download -- we could parallelize it across this swarm of files and otherwise it is just a matter of downloading from urls (Last-Modified from url would be good to set on the file to decide if to redownload)
@yarikoptic So when the client is asked to download a Zarr, should it, instead of downloading a monolithic asset, effectively treat the Zarr as a tree of independent blobs, each one of which is separately compared against whatever's on-disk at the final location? What exactly should pyout display when downloading a multifile Zarr?
> So when the client is asked to download a Zarr, should it, instead of downloading a monolithic asset, effectively treat the Zarr as a tree of independent blobs, each one of which is separately compared against whatever's on-disk at the final location?
correct
edit: with an overall checksum comparison across the tree... if possible to make it "on the fly" as we do with digest for an individual file, but traversing the same order as desired for computation of the zarr checksum, would be awesome!
> What exactly should pyout display when downloading a multifile Zarr?
I think for user reporting (pyout) we could consider zarr file (well -- directory) to be a single asset/file. So % progress would be based on "total" for that directory. Eventually we might want to return/expose some progress in # of files within zarr but I think there is no immediate need for that.
@yarikoptic @dchiquito What should the Zarr upload do if a Zarr contains an empty directory? There doesn't seem to be a way to inform the API of such a directory's existence, yet the directory is still used in calculating the checksum, and so there will be a checksum mismatch after uploading.
I don't think well-formed zarr files can contain empty directories. I would assume it's not a thing, but perhaps we should appeal to someone with more knowledge than I.
S3 has no concept of "directories", so they can't be empty. My vote is that the checksum calculator should just ignore empty directories and the empty directory will not be considered a part of the zarr.
We could assume that any directory with a .zarr extension is a zarr archive, or additionally check for the presence of a .zarray or .zgroup, but I wonder if that would be reliable enough?
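That heuristic could be sketched as follows (illustrative only; as noted above, whether checking for .zarray/.zgroup is reliable enough remains an open question, and the extension set comes from the earlier discussion):

```python
from pathlib import Path

# Supported zarr directory name extensions, per the discussion above.
ZARR_EXTENSIONS = {".zarr", ".ngff"}

def looks_like_zarr(path: Path) -> bool:
    """Heuristic: a directory with a supported extension that contains
    at least one .zarray or .zgroup file somewhere beneath it."""
    if path.suffix not in ZARR_EXTENSIONS or not path.is_dir():
        return False
    return any(path.rglob(".zarray")) or any(path.rglob(".zgroup"))
```

Under the "no empty groups allowed" rule above, a *.zarr directory with neither marker file would be rejected by this check rather than treated as an empty group.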