SNSystems / llvm-project-prepo

Fork of LLVM with modifications to support a program repository
26 stars 0 forks source link

Replace the MD5 message-digest algorithm #68

Open paulhuggett opened 4 years ago

paulhuggett commented 4 years ago

We used MD5 in the prepo hash-generation code simply “because it was there”: LLVM had a pre-existing implementation which meant that we could get started quickly. However, MD5 is “ cryptographically broken and unsuitable for further use”. In addition, more modern alternatives offer improved performance: BLAKE3 claims to be roughly 8 times faster.

paulhuggett commented 4 years ago

I’ve been looking at replacing the hashing algorithm used by rld-gen. The primary goal is to improve the performance of the tool but I also need to ensure that the resulting index trees are well balanced.

I’m evaluating three hashing algorithms: MD5 (currently being used because there was an existing implementation in the LLVM source tree!), XXH3, and BLAKE3.

The first statistics are for performance. The command below was run 5 times for each of the 3 hashing algorithms.

rm db.db
time rld-gen \
    --data-fibonacci \
    --external=50000 \
    --linkonce=50000 \
    --modules=500 \
    --output-directory=. \
    --repo=db.db

This produces fragment and compilation indexes with 25,050,000 and 500 entries respectively. Results are for the mean of the 5 runs.

image

The other factor is that of the tree layout. The pstore-index-stats tool is able to provide us with an idea of the tree layout (which depends on the hash function producing good quality output).

The first data is for the tree branching factor:

image

The second is for the mean leaf depth:

image

paulhuggett commented 4 years ago

I’m evaluating three hashing algorithms: MD5 (currently being used because there was an existing implementation in the LLVM source tree!), XXH3, and BLAKE3.

Note that there many other published algorithms some of which may turn out to be better choices. For the moment, I’m sticking with just these three.