ctengel / objectindex

Index your objects
GNU General Public License v3.0

Decide on hash algorithm #24

Closed ctengel closed 1 year ago

ctengel commented 2 years ago

ThinkPad T480

Linux 5.16 x86_64
OpenSSL 1.1
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
md5             144408.52k   332710.21k   585290.75k   727432.82k   774433.45k   786842.97k
sha1            161048.25k   380297.82k   756203.95k  1010608.55k  1107449.17k  1121147.80k
sha256           90375.32k   203177.00k   377314.80k   465151.32k   499690.08k   507407.02k

RPi 3B+

Linux 5.15 aarch64
OpenSSL 1.1
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
md5              24452.08k    60395.92k   114887.40k   147979.96k   157034.77k   160276.06k
sha1             20013.15k    53154.16k   104001.06k   133692.21k   145674.52k   143513.77k
sha256           12993.65k    30759.65k    56388.21k    70519.94k    77019.32k    77481.01k

So obviously the x86 is faster, and sha256 is always the slowest. But on the ARM, MD5 is faster than SHA1, while on the x86, SHA1 is faster than MD5 - interesting!
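The tables above appear to be `openssl speed` output. A rough cross-check from Python could look like the sketch below. The function names `throughput` and `compare` are made up for illustration, and `hashlib` numbers will not match OpenSSL's own benchmark exactly, since interpreter overhead dominates at small block sizes.

```python
import hashlib
import time

def throughput(algorithm, block_size, seconds=0.2):
    """Return approximate bytes hashed per second for one algorithm."""
    block = b"\0" * block_size
    hashed = 0
    start = time.perf_counter()
    deadline = start + seconds
    while time.perf_counter() < deadline:
        # One fresh digest per block, in the spirit of `openssl speed`.
        hashlib.new(algorithm, block).digest()
        hashed += block_size
    elapsed = time.perf_counter() - start
    return hashed / elapsed

def compare(algorithms=("md5", "sha1", "sha256", "sha512"), block_size=16384):
    """Return {algorithm: bytes per second} for a single block size."""
    return {name: throughput(name, block_size) for name in algorithms}

if __name__ == "__main__":
    for name, rate in compare().items():
        print(f"{name:8s} {rate / 1000:12.2f}k")
```

Only the relative ordering on a given machine is meaningful here, not the absolute figures.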

ctengel commented 2 years ago

https://www.nayuki.io/page/fast-sha1-hash-implementation-in-x86-assembly

ctengel commented 1 year ago

Does HTTP HEAD ever give a hash? what method is that?

ctengel commented 1 year ago

https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Digest

ctengel commented 1 year ago

Note that Digest has been obsoleted but is planned to be replaced (see RFC https://datatracker.ietf.org/doc/html/draft-ietf-httpbis-digest-headers#name-establish-the-hash-algorith )... regardless, it seems sha-256 and sha-512 are preferred

The idea here is to be able to HEAD an HTTP resource, get its hash that way (via Digest or Content-Digest), dip objectindex object table, and -
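A minimal sketch of the HEAD-and-read-the-hash step, assuming the server sends either the older `Digest: sha-256=<base64>` form or the newer `Content-Digest: sha-256=:<base64>:` form. The helper names (`parse_digest_headers`, `head_digests`, `body_matches`) are hypothetical and not part of objectindex, and the parser is deliberately simplified rather than a full structured-field parser.

```python
import base64
import hashlib
import urllib.request

# Header algorithm tokens mapped to hashlib names (only the two preferred ones).
HASHLIB_NAMES = {"sha-256": "sha256", "sha-512": "sha512"}

def parse_digest_headers(headers):
    """Return {algorithm token: raw digest bytes} from digest headers.

    `headers` is any mapping with .get(); Content-Digest wins over Digest.
    """
    found = {}
    for name in ("Content-Digest", "Digest"):
        value = headers.get(name)
        if not value:
            continue
        for item in value.split(","):
            # Split on the first "=" only, so base64 padding survives.
            algo, sep, encoded = item.strip().partition("=")
            if not sep:
                continue
            # The newer form wraps the base64 value in colons.
            encoded = encoded.strip().strip(":")
            try:
                found.setdefault(algo.lower(), base64.b64decode(encoded))
            except ValueError:
                continue  # skip values that are not valid base64
    return found

def head_digests(url):
    """Issue a HEAD request and return whatever digests the server offers."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return parse_digest_headers(response.headers)

def body_matches(body, digests):
    """Compare local bytes to an advertised digest; None if none is usable."""
    for token, expected in digests.items():
        name = HASHLIB_NAMES.get(token)
        if name is not None:
            return hashlib.new(name, body).digest() == expected
    return None
```

Many servers send neither header, so an empty result from `head_digests` should be expected and handled by hashing the downloaded body locally.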

For reasons I cannot explain, sha512 seems faster than sha256 on x86 (and it seems to be faster on 64-bit generally). Therefore maybe sha512 is the way to go.

Linux 6.3.5 x86_64
version: 3.0.9
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
md5              19009.13k    55787.73k   142004.66k   200917.36k   234368.17k   202787.74k
sha1             20177.02k    74188.78k   174359.80k   266299.10k   320909.23k   312474.30k
sha256           16472.64k    45346.53k    97307.57k   130726.66k   146801.74k   146568.31k
sha512           13587.98k    54229.13k   104519.06k   141782.28k   192102.40k   214531.44k

I think for the HTTP reasons alone that's probably the way to go.

So let's plan for the DB to have a 64-byte hash value (to fit sha512), verify that sha512 is faster on raspberrypi/arm as well, and then go with that.
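A quick check of why 64 bytes is enough, assuming digests are stored as raw bytes. Hex encoding would need twice the space. The function names are illustrative only.

```python
import hashlib

def digest_sizes(algorithms=("md5", "sha1", "sha256", "sha512")):
    """Return the raw digest length in bytes for each algorithm."""
    return {name: hashlib.new(name).digest_size for name in algorithms}

def fits(field_bytes=64):
    """True if every candidate digest fits in a field of this many bytes."""
    return all(size <= field_bytes for size in digest_sizes().values())

if __name__ == "__main__":
    print(digest_sizes())
    print(fits(64))
```

A field sized for sha512 therefore also holds any of the shorter candidates, which keeps the algorithm choice open.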

ctengel commented 1 year ago

The above was on f37 (Fedora 37).

similar results on rpi 3B+ w/ Linux 5.15.74 aarch64 (Debian 11ish Raspbian)

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
md5              29351.07k    72628.29k   136599.55k   175508.46k   190387.54k   191785.64k
sha1             24761.49k    62742.53k   119246.08k   151959.55k   166666.24k   164309.67k
sha256           14442.52k    32866.65k    58433.45k    75202.56k    80606.55k    80161.45k
sha512           11283.50k    44313.92k    73455.19k   104479.06k   120225.79k   121454.59k

sha512 faster than sha256

However, go to a 32-bit OS and sha512 is slower than sha256:

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
md5              21189.23k    65327.80k   134511.73k   182284.97k   204038.14k   206421.96k
sha1             14268.48k    37437.88k    67024.30k    84963.92k    92004.35k    92591.45k
sha256           10608.10k    28414.04k    52243.71k    65930.53k    71453.97k    71658.15k
sha512            3546.53k    14081.11k    20494.85k    28111.03k    31548.79k    31957.02k

Raspbian 9 Linux 4.19.66 armv7l

So sha512 is the winner here, since it is faster than sha256 on both ARM and Intel 64-bit
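To put numbers on that comparison, the sketch below divides sha512 throughput by sha256 throughput using the 16384-byte column of the three most recent tables above. The platform labels and the name `sha512_speedup` are just for illustration.

```python
# 16384-byte column (in 1000s of bytes per second), copied from the tables above.
RESULTS_16K = {
    "x86_64 (Linux 6.3.5)": {"sha256": 146568.31, "sha512": 214531.44},
    "aarch64 (RPi 3B+)": {"sha256": 80161.45, "sha512": 121454.59},
    "armv7l (Raspbian 9)": {"sha256": 71658.15, "sha512": 31957.02},
}

def sha512_speedup(results=RESULTS_16K):
    """Return sha512/sha256 throughput ratio per platform (>1 favors sha512)."""
    return {
        platform: round(rates["sha512"] / rates["sha256"], 2)
        for platform, rates in results.items()
    }

if __name__ == "__main__":
    for platform, ratio in sha512_speedup().items():
        print(f"{platform:24s} {ratio:.2f}x")
```

On these figures sha512 is roughly one and a half times as fast as sha256 on both 64-bit systems and under half as fast on the 32-bit one.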

Now to update DB etc...

ctengel commented 1 year ago

OK - the field has been increased to 64 bytes to support sha512. However, additional questions were brought up. HTTP as a whole is moving to sha256 or sha512. Of those, sha512 is quicker on 64-bit systems and sha256 on 32-bit (at least on arm32)

So sha512 seemed like the logical choice. But what about s3/minio? Does it have any built-in checksum?

ctengel commented 1 year ago

Again, as with the HTTP case, need to understand the use case...

if yes then maybe there is a reason to match, if no then no

ctengel commented 1 year ago

yes, checksums can be provided as headers on PUT - any of the 3 we are discussing here plus crc https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html and https://docs.aws.amazon.com/AmazonS3/latest/userguide/checking-object-integrity.html

MD5 is also "ETag" SOME of the time (no encryption or multipart)... and it's pretty much everywhere- https://docs.aws.amazon.com/AmazonS3/latest/API/RESTCommonResponseHeaders.html

response seems to always have ETag, but only the others if uploaded with it
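Since the ETag is only sometimes an MD5, a client has to decide when it is worth comparing. The sketch below uses a shape test: a plain 32-hex-digit ETag might be an MD5, while a multipart ETag carries a `-<part count>` suffix and is not. The function names are hypothetical, and a matching shape is not proof, because with some encryption modes the ETag is 32 hex digits but not the MD5 of the object.

```python
import hashlib
import re

_PLAIN_MD5 = re.compile(r"[0-9a-f]{32}")

def _normalize(etag):
    """Strip whitespace and surrounding quotes, lowercase the hex."""
    return etag.strip().strip('"').lower()

def etag_could_be_md5(etag):
    """True if the ETag has the shape of a plain MD5 hex digest."""
    return _PLAIN_MD5.fullmatch(_normalize(etag)) is not None

def etag_matches_body(etag, body):
    """Compare ETag to MD5 of body; None when the ETag is not MD5-shaped."""
    if not etag_could_be_md5(etag):
        return None
    return _normalize(etag) == hashlib.md5(body).hexdigest()
```

A `False` result for an MD5-shaped ETag is therefore ambiguous: the object may differ, or the ETag may simply not be an MD5.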

idk leaning toward md5 or sha256 now...

ctengel commented 1 year ago

crc32 is not at all an option since the hash needs to be unique per object - with only 32 bits the chances of collision are high
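The collision risk can be estimated with the usual birthday approximation, assuming hash outputs are uniformly distributed (an idealization). The function name is illustrative.

```python
import math

def collision_probability(n_objects, hash_bits):
    """Approximate chance that at least two of n objects share a hash value."""
    pairs = n_objects * (n_objects - 1) // 2
    # 1 - exp(-pairs / 2**bits), written with expm1 to stay accurate near zero.
    return -math.expm1(-pairs / 2 ** hash_bits)

if __name__ == "__main__":
    print(collision_probability(77_163, 32))       # 32-bit checksum such as crc32
    print(collision_probability(1_000_000_000, 256))  # 256-bit hash such as sha256
```

Under that assumption a 32-bit value reaches about even odds of a collision at roughly 77,000 objects, while a 256-bit hash stays negligibly small even at a billion objects.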

ctengel commented 1 year ago

sha256 is the only one that is common to both (HTTP and S3) and encouraged by both

to answer above question -

sha256 may be the way to go

see also:

ctengel commented 1 year ago

sha256 is the way to go

ctengel commented 1 year ago

Will consider sha512, sha1, or md5 if performance becomes an issue on sha256
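One way to keep that option open is to make the algorithm a parameter with sha256 as the default. The sketch below is an assumption about how hashing could be wrapped, not objectindex's actual code, and `hash_stream` is a made-up name.

```python
import hashlib
import io

def hash_stream(stream, algorithm="sha256", chunk_size=1 << 20):
    """Hash a binary file-like object in chunks and return the raw digest."""
    digest = hashlib.new(algorithm)
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)
    return digest.digest()

# Usage: hash an in-memory object; a file opened with open(path, "rb") works too.
example = hash_stream(io.BytesIO(b"abc")).hex()
```

Reading in chunks keeps memory use flat for large objects, and switching to sha512, sha1, or md5 later is a one-argument change as long as the stored field is wide enough.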