Closed ctengel closed 1 year ago
Does HTTP HEAD ever give a hash? what method is that?
Note that Digest has been obsoleted but is planned to be replaced (see RFC https://datatracker.ietf.org/doc/html/draft-ietf-httpbis-digest-headers#name-establish-the-hash-algorith )... regardless, it seems sha-256 and sha-512 are preferred.
The idea here is to be able to HEAD an HTTP resource, get its hash that way (via Digest or Content-Digest), dip into the objectindex object table, and fill in the file table with the given URL linked to said object. No need to download.
Digest. Retry if mismatch.
For reasons I cannot explain, sha512 seems faster on x86 than sha256 (but it seems to be faster on 64 bit generally). Therefore maybe sha512 is the way to go.
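Going back to the HEAD idea, a rough sketch of the flow might look like the following. The table and column names (`object.hash`, `file.url`, `file.object_id`) and both function names are assumptions for illustration, not the real objectindex schema. The digest is expected as raw bytes already extracted from the header.

```python
import sqlite3
import urllib.request

# Hypothetical schema, only for this sketch.
DEMO_SCHEMA = """
CREATE TABLE object (id INTEGER PRIMARY KEY, hash BLOB UNIQUE NOT NULL);
CREATE TABLE file (url TEXT NOT NULL,
                   object_id INTEGER NOT NULL REFERENCES object(id));
"""


def head_headers(url: str) -> dict[str, str]:
    """Send an HTTP HEAD request and return the headers with lower-cased names."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=10) as resp:
        return {k.lower(): v for k, v in resp.headers.items()}


def link_url_to_object(conn: sqlite3.Connection, url: str, digest: bytes) -> bool:
    """Link url to an already-known object with this digest.

    Returns False when no such object exists, in which case the caller
    would still have to download and hash the content.
    """
    row = conn.execute("SELECT id FROM object WHERE hash = ?", (digest,)).fetchone()
    if row is None:
        return False
    conn.execute("INSERT INTO file (url, object_id) VALUES (?, ?)", (url, row[0]))
    return True
```

Whether the server sends a digest header at all is up to the server, so a caller would need a download fallback when `head_headers` returns neither Digest nor Content-Digest.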
Linux 6.3.5 x86_64
version: 3.0.9
The 'numbers' are in 1000s of bytes per second processed.
type         16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
md5         19009.13k    55787.73k   142004.66k   200917.36k   234368.17k   202787.74k
sha1        20177.02k    74188.78k   174359.80k   266299.10k   320909.23k   312474.30k
sha256      16472.64k    45346.53k    97307.57k   130726.66k   146801.74k   146568.31k
sha512      13587.98k    54229.13k   104519.06k   141782.28k   192102.40k   214531.44k
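A comparable measurement can be sketched with Python's hashlib, to check the same ordering from inside an application rather than from a command-line benchmark. This is an assumption-laden approximation: `throughput_kb_per_s` is a made-up name, each iteration hashes one block from scratch, and Python call overhead will dominate at small block sizes, so absolute numbers will not match the table.

```python
import hashlib
import time


def throughput_kb_per_s(algo: str, block_size: int, seconds: float = 0.1) -> float:
    """Hash block_size-byte blocks for about `seconds`; return 1000s of bytes/s."""
    data = bytes(block_size)
    processed = 0
    start = time.perf_counter()
    deadline = start + seconds
    while True:
        # One complete digest per block (init, update, finalize).
        hashlib.new(algo, data).digest()
        processed += block_size
        if time.perf_counter() >= deadline:
            break
    elapsed = time.perf_counter() - start
    return processed / elapsed / 1000.0


if __name__ == "__main__":
    sizes = (16, 64, 256, 1024, 8192, 16384)
    print("type    " + "".join(f"{str(s) + ' bytes':>13}" for s in sizes))
    for algo in ("md5", "sha1", "sha256", "sha512"):
        cells = "".join(f"{throughput_kb_per_s(algo, s):>12.2f}k" for s in sizes)
        print(f"{algo:<8}{cells}")
```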
I think for the HTTP reasons alone that's probably the way to go.
So let's plan DB to have 64 byte hash value (to fit sha512), verify that it's faster on raspberrypi/arm as well, and then go with that.
The above was on Fedora 37 (f37).
similar results on rpi 3B+ w/ Linux 5.15.74 aarch64 (Debian 11ish Raspbian)
type         16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
md5         29351.07k    72628.29k   136599.55k   175508.46k   190387.54k   191785.64k
sha1        24761.49k    62742.53k   119246.08k   151959.55k   166666.24k   164309.67k
sha256      14442.52k    32866.65k    58433.45k    75202.56k    80606.55k    80161.45k
sha512      11283.50k    44313.92k    73455.19k   104479.06k   120225.79k   121454.59k
sha512 is faster than sha256 here as well
however, go to a 32 bit OS and sha512 is slower
type         16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
md5         21189.23k    65327.80k   134511.73k   182284.97k   204038.14k   206421.96k
sha1        14268.48k    37437.88k    67024.30k    84963.92k    92004.35k    92591.45k
sha256      10608.10k    28414.04k    52243.71k    65930.53k    71453.97k    71658.15k
sha512       3546.53k    14081.11k    20494.85k    28111.03k    31548.79k    31957.02k
Raspbian 9 Linux 4.19.66 armv7l
So sha512 is the winner here, since it is faster than sha256 on both ARM and Intel 64 bit
Now to update DB etc...
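For sizing the hash column, the raw digest lengths can be read straight from hashlib. This is a quick check, not part of objectindex itself, and the `DIGEST_SIZES` name is only for illustration.

```python
import hashlib

# Raw (binary) digest length in bytes for each candidate algorithm.
DIGEST_SIZES = {
    name: hashlib.new(name).digest_size
    for name in ("md5", "sha1", "sha256", "sha512")
}

for name, size in DIGEST_SIZES.items():
    # Hex encoding doubles the length if the column stores text instead of bytes.
    print(f"{name}: {size} bytes raw, {size * 2} hex characters")
```

A 64 byte binary field fits all four, so the choice of algorithm can still change later without another schema change.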
OK - field has been increased to 64 bytes to support sha512. However, additional questions were brought up. HTTP as a whole is moving to sha256 or sha512. Of those, sha512 is quicker on 64 bit systems, sha256 on 32 bit (at least on arm32)
So sha512 seemed like the logical choice. But what about s3/minio? Does it have any builtin checksum?
Again like with HTTP case need to understand use case...
if yes then maybe there is a reason to match, if no then no
yes, checksums can be provided as headers on PUT - any of the 3 we are discussing here plus crc https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html and https://docs.aws.amazon.com/AmazonS3/latest/userguide/checking-object-integrity.html
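As a sketch of what the PUT side would need: the checksum header carries the base64 of the raw digest, not the hex string. The helper name `s3_sha256_checksum_header` is made up here, and the exact set of headers a given client or minio version expects should be checked against the docs linked above.

```python
import base64
import hashlib


def s3_sha256_checksum_header(body: bytes) -> dict[str, str]:
    """Build the x-amz-checksum-sha256 header for an object body.

    Hypothetical helper; the value is base64 of the 32 byte raw digest.
    """
    raw = hashlib.sha256(body).digest()
    return {"x-amz-checksum-sha256": base64.b64encode(raw).decode("ascii")}


example_headers = s3_sha256_checksum_header(b"hello world")
```

The same raw digest is what would be stored in the DB hash field, so one hashing pass covers both uses.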
MD5 is also "ETag" SOME of the time (no encryption or multipart)... and it's pretty much everywhere- https://docs.aws.amazon.com/AmazonS3/latest/API/RESTCommonResponseHeaders.html
response seems to always have ETag, but only the others if uploaded with it
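A small sketch of treating ETag as an MD5 only when that is plausible. The name `etag_md5_candidate` is an assumption for illustration. It can rule out multipart ETags, which carry a `-N` suffix, but it cannot tell from the format alone whether an encrypted object's ETag is a real MD5, so the result is a candidate, not a guarantee.

```python
import re
from typing import Optional

_HEX32 = re.compile(r"[0-9a-fA-F]{32}")


def etag_md5_candidate(etag: str) -> Optional[bytes]:
    """Return 16 raw bytes if the ETag looks like a plain MD5, else None."""
    value = etag.strip().strip('"')  # ETags are usually quoted
    if _HEX32.fullmatch(value) is None:
        return None  # e.g. a multipart ETag ending in "-2"
    return bytes.fromhex(value)


plain = etag_md5_candidate('"d41d8cd98f00b204e9800998ecf8427e"')
multipart = etag_md5_candidate('"d41d8cd98f00b204e9800998ecf8427e-2"')
```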
idk leaning toward md5 or sha256 now...
crc32 is not at all an option since needs to be unique - chances of collision are high
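To put a number on that, here is the standard birthday approximation, p ≈ 1 - exp(-n(n-1) / 2^(bits+1)), as a quick sketch. It assumes hash outputs are uniformly distributed, and the function name is made up for illustration.

```python
import math


def collision_probability(n_objects: int, hash_bits: int) -> float:
    """Approximate chance of at least one collision among n_objects hashes."""
    exponent = -n_objects * (n_objects - 1) / 2 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)


crc32_at_77k = collision_probability(77_163, 32)        # about even odds
sha256_at_1e9 = collision_probability(10**9, 256)       # effectively zero
```

With a 32 bit checksum, roughly 77 thousand objects is already enough for about a 50% chance of some collision, which is why crc32 cannot serve as a unique key.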
sha256 is the only one commonly encouraged by both HTTP and S3
to answer above question -
sha256 may be the way to go
see also:
sha256 is the way to go
Will consider sha512, sha1, or md5 if performance becomes an issue on sha256
ThinkPad T480
RPi 3B+
So obviously the x86 is faster, and sha256 is always slowest. But on the ARM MD5 is the fastest, while on the x86 SHA1 is the fastest - interesting!