dandi / dandi-archive

DANDI API server and Web app
https://dandiarchive.org

Update zarr checksum file format #937

Closed dchiquito closed 2 years ago

dchiquito commented 2 years ago

The zarr .checksum file format has some deficiencies which should be addressed ASAP before too many zarrs are uploaded.

satra commented 2 years ago

@dchiquito and @yarikoptic - while we are updating this format, can we consider a digest of the form <md5>-<#files_in_tree>. this would make it somewhat analogous to our multipart uploader for normal objects.

dchiquito commented 2 years ago

@satra are you proposing that a file's digest is simply <md5> and a directory's digest is <md5>-<number_of_children>, where number_of_children is the number of files and directories that the directory contains?

I personally don't see what value that gives us. The multipart upload checksum form is <md5>-<number_of_parts>, as opposed to <md5> for normal uploads. This is useful because it indicates which checksumming algorithm was used on a file. For zarr checksums, there is only one checksumming algorithm for files, and one for directories.
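For reference, the multipart form referred to here is the S3-style ETag: the MD5 of the concatenated binary MD5 digests of the parts, suffixed with the part count. A minimal sketch of that computation (the 64 MiB part size is illustrative, not the dandi-etag chunking policy):

```python
import hashlib

def etag_for_data(data: bytes, part_size: int = 64 * 1024 * 1024) -> str:
    """Sketch of an S3-style ETag; part_size is an illustrative assumption."""
    if len(data) <= part_size:
        # Single-part upload: the ETag is just the plain MD5.
        return hashlib.md5(data).hexdigest()
    # Multipart upload: MD5 the concatenated binary MD5 digests of the parts,
    # then append the part count.
    digests = [
        hashlib.md5(data[i : i + part_size]).digest()
        for i in range(0, len(data), part_size)
    ]
    return f"{hashlib.md5(b''.join(digests)).hexdigest()}-{len(digests)}"
```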

satra commented 2 years ago

a directory's digest is <md5>-<number_of_children>, where number_of_children is the number of files and directories that the directory contains?

a directory digest is the md5 that we are computing, which is simply an md5 of the json, but the -<part> would be the total number of leaves in the entire tree of that node.

what value that gives us.

it gives us a quick representation of the set of components that make up the eventual checksum. just like multipart count simply tells us the set of parts that went into computing the etag.
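To make the proposal concrete: a directory's digest would be the MD5 of a JSON listing of its children's digests, suffixed with the total number of file leaves under that node. A rough sketch of that shape, assuming a sorted name-to-digest JSON listing (the actual .checksum layout is what this issue is revising):

```python
import hashlib
import json

def file_digest(data: bytes) -> tuple[str, int]:
    # A file contributes its plain MD5 and counts as a single leaf.
    return hashlib.md5(data).hexdigest(), 1

def directory_digest(children: dict[str, tuple[str, int]]) -> tuple[str, int]:
    # The directory MD5 is taken over a canonical JSON listing of child
    # digests; the suffix is the total number of file leaves in the subtree.
    listing = json.dumps(
        {name: digest for name, (digest, _) in sorted(children.items())},
        separators=(",", ":"),
    )
    leaves = sum(count for _, count in children.values())
    return f"{hashlib.md5(listing.encode()).hexdigest()}-{leaves}", leaves
```

Each directory's leaf count is the sum of its children's counts, so the suffix can be carried bottom-up alongside the digests.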

satra commented 2 years ago

btw, the checksumming algorithm used is whatever we define the schema property to be and fix the algorithm to be in dandischema. the checksum itself doesn't tell us anything about the algorithm at this point. (we did not go to multihash for this). and we do anticipate that future versions of the archive may have other algorithms that are used (e.g., based on blake or others).

dchiquito commented 2 years ago

the -<part> would be the total number of leaves in the entire tree of that node.

So a zarr containing 1,000,000 files would have a checksum <md5>-1000000? I suppose that might be useful. I will amend https://github.com/dandi/dandischema/pull/120

yarikoptic commented 2 years ago

greedy me, echoing the "size" in the original description: maybe it should be <md5>-{#files}--{size}? some rationale: for dandi-etag, since we decide how to chunk multipart uploads, we can deduce the size "order" merely from seeing -#parts, but #files alone for a zarr gives us no idea of the size. anecdotal support: git-annex includes the file size in its content key, which is {backend-type-e.g.MD5}-s{size}--{digest}[{.extension-if-backendwithE}]. We found having that -s{size} handy on a number of occasions, e.g. not needing to make any remote calls to inquire about the size of the thing -- speed ups! So far we have not encountered evil malicious actors knowingly adding a wrong size ;)

dchiquito commented 2 years ago

Once https://github.com/dandi/dandi-archive/issues/925 is addressed, the API would include the checksum and size whenever a zarr file/directory is queried. I don't see a lot of value in also embedding that data into the checksum, but I have no moral objection.

@satra any objection to <md5>-<number_of_files>--<size>?

satra commented 2 years ago

any objection to <md5>-<number_of_files>--<size>?

no moral objection
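For illustration, parsing the agreed <md5>-<number_of_files>--<size> form could look like the following; the regex and helper name are hypothetical, not dandi-archive code:

```python
import re

# Hypothetical validator for the agreed format: 32 hex chars, file count, size.
ZARR_CHECKSUM_RE = re.compile(r"([0-9a-f]{32})-([0-9]+)--([0-9]+)")

def parse_zarr_checksum(value: str) -> tuple[str, int, int]:
    """Split a zarr checksum into (md5, number_of_files, size_in_bytes)."""
    m = ZARR_CHECKSUM_RE.fullmatch(value)
    if m is None:
        raise ValueError(f"not a valid zarr checksum: {value!r}")
    md5, n_files, size = m.groups()
    return md5, int(n_files), int(size)
```

With the file count and size embedded in the string, a client can report both without extra API calls, which is the speed-up described above.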

jwodder commented 2 years ago

@dchiquito Should this issue have been closed? The /api/zarr/{zarr_id}.zarr/{path} responses still don't include file size, and the new checksum algorithm doesn't seem to have been adopted by dandi-archive yet.