dandi / zarr_checksum

Algorithms for calculating a zarr checksum against local or cloud storage
https://pypi.org/project/zarr-checksum/
Apache License 2.0

Generalize checksum algorithm #56

Open jjnesbitt opened 7 months ago

jjnesbitt commented 7 months ago

Currently, md5 is hard-coded as the checksum algorithm, but we should allow users to supply their own algorithm if they choose.

yarikoptic commented 7 months ago

Is there a demand or use case to target here? FWIW, md5 was chosen since it is the one AWS uses to compute the ETag, so we then

jjnesbitt commented 6 months ago

I'm not proposing to change the default behavior. I think that should stay as md5 to match S3's implementation (that was the initial reason for choosing md5). However, this conversation in the zarr-python repo highlighted someone's need for this tool with a different hashing algorithm.

Since this seems like it would be a common use case, and in that thread we got a pseudo-endorsement from one of the zarr-python contributors to use this package (since this functionality doesn't currently exist in zarr), I think it would be worthwhile to generalize the algorithm in a backwards compatible way.
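As a rough illustration of what "backwards compatible" could mean here, the sketch below takes the algorithm name as an optional argument that defaults to md5, so existing callers would see no change. It is not code from zarr_checksum: the function name `digest_bytes` and its `algorithm` parameter are hypothetical, and only the standard-library `hashlib` calls are real.

```python
import hashlib


def digest_bytes(data: bytes, algorithm: str = "md5") -> str:
    """Return the hex digest of `data`.

    Hypothetical helper, not part of zarr_checksum. `algorithm` is any
    name accepted by hashlib.new(); it defaults to md5 so callers that
    never pass it keep getting the same digests as before.
    """
    if algorithm not in hashlib.algorithms_available:
        raise ValueError(f"unsupported hash algorithm: {algorithm!r}")
    hasher = hashlib.new(algorithm)
    hasher.update(data)
    return hasher.hexdigest()


# Default call: same result as using hashlib.md5 directly.
default_digest = digest_bytes(b"chunk contents")

# Opt-in call with a different algorithm.
sha256_digest = digest_bytes(b"chunk contents", algorithm="sha256")
```

Keeping the new parameter optional and last is what would let the default stay aligned with S3's md5-based behavior while still letting a caller opt in to something else.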

This will probably have to wait until higher-priority things have been addressed in DANDI, although I might take some of my own time to poke around at this, since it interests me.