Closed: dchiquito closed this issue 2 years ago.
@dchiquito and @yarikoptic - while we are updating this format, can we consider a digest of the form `<md5>-<#files_in_tree>`? this would make it somewhat analogous to our multipart uploader for normal objects.
@satra are you proposing that a file's digest is simply `<md5>` and a directory's digest is `<md5>-<number_of_children>`, where `number_of_children` is the number of files and directories that the directory contains?

I personally don't see what value that gives us. The multipart upload checksum form is `<md5>-<number_of_parts>`, as opposed to `<md5>` for normal uploads. This is useful because it indicates which checksumming algorithm was used on a file. For zarr checksums, there is only one checksumming algorithm for files, and one for directories.
> a directory's digest is `<md5>-<number_of_children>`, where `number_of_children` is the number of files and directories that the directory contains?

a directory digest is the md5 that we are computing, which is simply an md5 of the JSON, but the `-<part>` would be the total number of leaves in the entire tree of that node.
> what value that gives us.

it gives us a quick representation of the set of components that make up the eventual checksum, just like the multipart count tells us the set of parts that went into computing the etag.

btw, the checksumming algorithm used is whatever we define the schema property (and fix the algorithm) to be in dandischema. the checksum itself doesn't tell us anything about the algorithm at this point (we did not go to multihash for this), and we do anticipate that future versions of the archive may use other algorithms (e.g., based on blake or others).
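As a minimal sketch of the idea described above (not the dandischema implementation): a directory digest could be the md5 of a JSON listing of its children's digests, suffixed with the total number of leaf files in the subtree, analogous to the `-<number_of_parts>` suffix of a multipart ETag. The function name and JSON layout here are illustrative assumptions.

```python
import hashlib
import json

def directory_digest(children: dict[str, tuple[str, int]]) -> str:
    """Hypothetical directory digest: md5 of a JSON listing of the
    children's digests, suffixed with the total number of leaf files
    in the subtree (analogous to the -<number_of_parts> ETag suffix).

    `children` maps child name -> (child digest, leaf file count under it);
    a plain file's digest stays a bare md5 with no suffix.
    """
    listing = json.dumps(
        {name: digest for name, (digest, _) in sorted(children.items())},
        separators=(",", ":"),
    )
    md5 = hashlib.md5(listing.encode()).hexdigest()
    leaves = sum(count for _, count in children.values())
    return f"{md5}-{leaves}"

# A directory holding two files plus a sub-directory that itself holds 3 files:
print(directory_digest({
    "a.dat": ("0" * 32, 1),
    "b.dat": ("1" * 32, 1),
    "sub":   ("f" * 32 + "-3", 3),
}))  # -> "<md5-of-json>-5"
```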
> the `-<part>` would be the total number of leaves in the entire tree of that node.

So a zarr containing 1,000,000 files would have a checksum `<md5>-1000000`? I suppose that might be useful. I will amend https://github.com/dandi/dandischema/pull/120
greedy me, echoing the "size" in the original description: maybe it should be `<md5>-{#files}--{size}`? some kind of rationale would be that for `dandi-etag`, since we decide how to chunk the multipart upload, we can deduce the size "order" merely from seeing `-#parts`, but `#files` for zarr alone gives us no idea of the size. anecdotal support: git-annex has the file size as part of its key for the content, so it is `{backend-type-e.g.MD5}-s{size}--{digest}[{.extension-if-backendwithE}]`. We found having that `-s{size}` handy on a number of occasions, e.g. not needing to do any remote calls to inquire about the size of the thing -- speed ups! So far we have not encountered evil malicious actors adding a knowingly wrong size ;)
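To illustrate the "no remote calls" benefit, here is a small sketch of parsing a combined digest of the `<md5>-{#files}--{size}` form proposed above on the client side; the regex and function name are hypothetical, not part of dandi-archive or dandischema.

```python
import re

# Hypothetical combined digest of the form "<md5>-<number_of_files>--<size>"
ZARR_DIGEST_RE = re.compile(r"^(?P<md5>[0-9a-f]{32})-(?P<files>\d+)--(?P<size>\d+)$")

def parse_zarr_digest(digest: str) -> tuple[str, int, int]:
    """Split a combined zarr digest into (md5, file_count, total_size)."""
    m = ZARR_DIGEST_RE.match(digest)
    if m is None:
        raise ValueError(f"not a zarr directory digest: {digest!r}")
    return m["md5"], int(m["files"]), int(m["size"])

md5, n_files, size = parse_zarr_digest(
    "6ec0bd7f2ff2a52d63cdb6c7aa74f3b1-1000000--123456789012"
)
print(n_files, size)  # file count and size known locally, no API round-trip needed
```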
Once https://github.com/dandi/dandi-archive/issues/925 is addressed, the API would include the checksum and size whenever a zarr file/directory is queried. I don't see a lot of value in also embedding that data into the checksum, but I have no moral objection.

@satra any objection to `<md5>-<number_of_files>--<size>`?
> any objection to `<md5>-<number_of_files>--<size>`?

no moral objection
@dchiquito Should this issue have been closed? The `/api/zarr/{zarr_id}.zarr/{path}` responses still don't include file size, and the new checksum algorithm doesn't seem to have been adopted by dandi-archive yet.
The zarr `.checksum` file format has some deficiencies which should be addressed ASAP before too many zarrs are uploaded:

- Use a `name` field instead of a `path`. This should resolve https://github.com/dandi/dandi-archive/issues/931.
- The `.checksum` file is used to populate the `/zarr/{zarr_id}.zarr/` view, so the best way to include file size in that view would be to include size in the `.checksum` files. We can also include directory size for free by summing the sizes of their constituent files/directories. This should resolve https://github.com/dandi/dandi-archive/issues/925.
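A minimal sketch of what such `.checksum` entries could look like, assuming hypothetical `name`, `digest`, and `size` fields (not the final dandi-archive format), with a directory's size computed by summing its children's sizes:

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class ChecksumEntry:
    # Hypothetical .checksum entry: a bare name (not a full path),
    # plus the digest and size needed by the /zarr/{zarr_id}.zarr/ view.
    name: str
    digest: str
    size: int

def directory_entry(name: str, digest: str, children: list[ChecksumEntry]) -> ChecksumEntry:
    """A directory's size falls out for free as the sum of its children's sizes."""
    return ChecksumEntry(name=name, digest=digest, size=sum(c.size for c in children))

files = [
    ChecksumEntry("0.0", "0" * 32, 1024),
    ChecksumEntry("0.1", "1" * 32, 2048),
]
print(directory_entry("0", "f" * 32 + "-2", files).size)  # 3072
```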