kirillt opened this issue 10 months ago
These functions could be the fastest, too:
Double-check with https://rurban.github.io/smhasher/
Here are some hash functions that are worth benchmarking:
Note that "FarmHash" is not portable and is too machine-specific, as it operates differently on 64-bit and 32-bit systems.
From the preliminary results (very limited data) on these benchmarks, I could rank the functions like this:
@tareknaser how could we approach realistic benchmarking? Something like building an index of a folder with 10K files from 100KB to 2MB each? Does it make sense to generate such a test dataset? I'll also run the benches against my own real data (using the branches you've created).
We could also have collision ratios extracted from such a benchmark.
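For reference, a minimal sketch of such a dataset generator, assuming the rand crate; the file count and size range come from the numbers above, everything else (paths, file names) is hypothetical:

```rust
use rand::Rng;
use std::{fs, io::Write, path::Path};

/// Generate `count` files with random contents between `min` and `max` bytes.
/// Hypothetical helper for producing a synthetic benchmark dataset.
fn generate_dataset(root: &Path, count: usize, min: usize, max: usize) -> std::io::Result<()> {
    fs::create_dir_all(root)?;
    let mut rng = rand::thread_rng();
    for i in 0..count {
        let size = rng.gen_range(min..=max);
        let mut bytes = vec![0u8; size];
        rng.fill(&mut bytes[..]);
        fs::File::create(root.join(format!("file_{i:05}.bin")))?.write_all(&bytes)?;
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    // 10K files from 100 KB to 2 MB each, as suggested above.
    generate_dataset(Path::new("/tmp/ark_bench_dataset"), 10_000, 100 * 1024, 2 * 1024 * 1024)
}
```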
Benchmarks using this data:

```
$ ncdu
--- /mnt/data/test_all --------------------------------------------------------------------
4.8 GiB [#############] /test2
4.5 GiB [############ ] /test

$ find test | rg -v .ark | wc -l
12276

$ find test2 | rg -v .ark | wc -l
28
```
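The results below come from a criterion benchmark. As a rough sketch (not arklib's actual bench code; walk_and_hash is a placeholder and the dataset path is taken from the output above), such an index_build bench could look like this:

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Placeholder standing in for arklib's real index construction over a directory.
fn walk_and_hash(_root: &str) -> usize {
    0
}

fn index_build_benchmark(c: &mut Criterion) {
    let root = "/mnt/data/test_all/";
    c.bench_function(&format!("index_build/{root}"), |b| {
        b.iter(|| black_box(walk_and_hash(black_box(root))))
    });
}

criterion_group!(benches, index_build_benchmark);
criterion_main!(benches);
```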
main branch:

```
Benchmarking index_build/index_build//mnt/data/test_all/
time:   [5.2732 s 5.2812 s 5.2897 s]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

Collisions: 0
```
hash_gxhash branch:

```
Benchmarking index_build/index_build//mnt/data/test_all/
time:   [4.4865 s 4.4969 s 4.5073 s]

Collisions: 0
```
tareknaser/betterhash branch (Blake3):

```
Benchmarking index_build/index_build//mnt/data/test_all/
time:   [11.234 s 11.644 s 12.108 s]
Found 13 outliers among 100 measurements (13.00%)
  2 (2.00%) high mild
  11 (11.00%) high severe
```
I'm a bit confused about these 2 points:
- gxhash is just 15% faster. Is more time spent on something else, like putting entries into the tree structure, or reading from disk, especially when it's not an SSD?
- blake3 is 120% slower, as expected.

Looks like we need profiles, too.

Update: wyhash: 5.35 sec, t1ha0: 5.1 sec
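One way to reason about the 15% vs 120% gap is a back-of-envelope model where indexing time = shared overhead (disk I/O, tree construction) + hashing time. The numbers below are just the midpoints from the runs above; the assumption that overhead is identical across branches is exactly what profiling would need to confirm:

```rust
fn main() {
    // Midpoint timings from the runs above, in seconds.
    let crc32 = 5.28;   // main branch
    let gxhash = 4.50;  // hash_gxhash branch
    let blake3 = 11.64; // tareknaser/betterhash branch

    // If the non-hashing overhead were the same on every branch, subtracting two
    // totals would cancel it and leave only the difference in hashing time.
    println!("crc32 vs gxhash hashing delta:  {:.2} s", crc32 - gxhash); // ~0.8 s
    println!("blake3 vs gxhash hashing delta: {:.2} s", blake3 - gxhash); // ~7.1 s

    // A small delta between crc32 and gxhash would mean hashing is a minor share of
    // the total for fast hashes (overhead dominates, so a faster hash buys little),
    // while blake3 is slow enough that its hashing cost dominates its own run.
}
```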
Looking at the benchmark results, here's what I think:

For gxhash:
- crc32 uses a smaller hash (u32) compared to gxhash (u64). This difference can slow down lookups in ResourceIndex.
- Bigger hashes like u64 usually have fewer collisions, which is good. But getting zero collisions with gxhash seems strange, let's double-check that.

For blake3:
- It's slower than gxhash, as expected. Benchmarks online might not be helpful here since we do more than just hashing.
- Using its larger hash ([u8; 32]) for lookups adds extra work.

We should think about the trade-off between fewer collisions and faster lookups when choosing hash sizes.

We can also consider other cryptographic hash functions and see how they compare to blake3.
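To illustrate the lookup side of the hash-size trade-off mentioned above, here is a small sketch; the types are simplified stand-ins, not arklib's actual ResourceIndex:

```rust
use std::collections::HashMap;

// Simplified stand-ins for the id types discussed above.
type Crc32Id = u32;       // 4 bytes (main branch)
type Blake3Id = [u8; 32]; // 32 bytes

fn main() {
    let mut small_keys: HashMap<Crc32Id, &str> = HashMap::new();
    let mut wide_keys: HashMap<Blake3Id, &str> = HashMap::new();

    small_keys.insert(0xDEAD_BEEF, "path/to/file");
    wide_keys.insert([0u8; 32], "path/to/file");

    // Both maps expose the same API, but every lookup in the wide-key map has to
    // hash and compare 8x more key bytes, on top of Blake3 being slower to compute.
    // The upside is that 256-bit ids make collisions practically impossible.
    assert!(small_keys.contains_key(&0xDEAD_BEEF));
    assert!(wide_keys.contains_key(&[0u8; 32]));
}
```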
> how could we approach realistic benchmarking?
Our current benchmarks index the tests/ directory, which is a good starting point but isn't ideal for performance evaluations. In CI, we try to provide quick feedback on potential issues rather than comprehensive performance assessments because of resource limitations.
Profiling can help identify performance bottlenecks within arklib. Benchmarks tell us "how fast", profiling tells us "why".
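For example, criterion can be pointed at a profiler so that the existing benches also produce flamegraphs. This is a sketch assuming the pprof crate with its criterion and flamegraph features enabled; the bench body is a placeholder:

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use pprof::criterion::{Output, PProfProfiler};

fn index_build_benchmark(c: &mut Criterion) {
    c.bench_function("index_build", |b| {
        b.iter(|| {
            // Placeholder for the real index construction call.
        })
    });
}

criterion_group! {
    name = benches;
    // Sample at 100 Hz and emit a flamegraph under target/criterion/.
    config = Criterion::default().with_profiler(PProfProfiler::new(100, Output::Flamegraph(None)));
    targets = index_build_benchmark
}
criterion_main!(benches);
```

Running with something like `cargo bench -- --profile-time 30` should then keep the profiler attached for 30 seconds per benchmark instead of doing the usual statistical run.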
However, if we opt for large datasets for benchmarks, we can store them separately from the project code on GitHub. Then, use tools like wget or reqwest to fetch datasets during benchmark execution.
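A minimal sketch of that fetch step, assuming reqwest's blocking client; the dataset URL is a placeholder:

```rust
use std::{fs, io::Write};

// Download a dataset archive before running the benches. The URL is hypothetical;
// the real dataset would live outside the repository, as discussed above.
fn fetch_dataset(url: &str, dest: &str) -> Result<(), Box<dyn std::error::Error>> {
    let bytes = reqwest::blocking::get(url)?.bytes()?;
    fs::File::create(dest)?.write_all(&bytes)?;
    Ok(())
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    fetch_dataset("https://example.com/ark-bench-dataset.tar.gz", "bench-dataset.tar.gz")
}
```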
@tareknaser
> crc32 uses a smaller hash (u32) compared to gxhash (u64). This difference can slow down lookups in ResourceIndex

Do you think this could be the reason on a 64-bit architecture?
> Bigger hashes like u64 usually have fewer collisions, which is good. But getting zero collisions with gxhash seems strange, let's double-check that.

Totally. But CRC-32 has zero collisions, too. If I remember correctly, for the test_all/test folder I had something like at least 1 collision per 1,000 resources. I would expect around 10 collisions total for the main branch. It's probably worth checking the indexing code and also fixing this command (at the moment it doesn't seem to print the number of collisions).
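As a rough sketch of how the collision count could be derived while (or after) building the index, independent of the concrete id type (this is not arklib's actual code):

```rust
use std::collections::HashMap;
use std::hash::Hash;

// Count how many extra resources ended up sharing an id with another resource.
// Generic over the id type so it works for u32 (CRC-32), u64 (gxhash), etc.
fn count_collisions<Id: Eq + Hash>(entries: &[(Id, String)]) -> usize {
    let mut paths_per_id: HashMap<&Id, usize> = HashMap::new();
    for (id, _path) in entries {
        *paths_per_id.entry(id).or_insert(0) += 1;
    }
    paths_per_id.values().filter(|&&n| n > 1).map(|&n| n - 1).sum()
}

fn main() {
    let entries = vec![
        (1u32, "a.txt".to_string()),
        (2u32, "b.txt".to_string()),
        (1u32, "c.txt".to_string()), // collides with a.txt
    ];
    println!("Collisions: {}", count_collisions(&entries)); // prints 1
}
```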
> Benchmarks online might not be helpful here since we do more than just hashing. Using its larger hash ([u8; 32]) for lookups adds extra work.

Agree. But the strange part here is that the difference in hashing performance between CRC-32 and Blake-3 is the same as the difference in indexing performance...
> We should think about the trade-off between fewer collisions and faster lookups when choosing hash sizes.
Agree.
> We can also consider other cryptographic hash functions and see how they compare to blake3.
Do you think there are faster cryptographic functions? We could try, but let's first implement this for easier experimentation:
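(The concrete proposal referenced here isn't captured in this thread. Purely as an illustration of the general idea, a pluggable id abstraction could look roughly like this; the trait and type names are hypothetical, and crc32fast is just one possible backend.)

```rust
use std::hash::Hash;

// Hypothetical trait: anything usable as a resource id just has to be computable
// from raw bytes and usable as a map key. Trying a new hash function for an
// experiment then means adding one more impl instead of touching the index code.
pub trait ResourceIdSketch: Eq + Hash + Clone {
    fn from_bytes(bytes: &[u8]) -> Self;
}

#[derive(PartialEq, Eq, Hash, Clone, Debug)]
pub struct Crc32Id(u32);

impl ResourceIdSketch for Crc32Id {
    fn from_bytes(bytes: &[u8]) -> Self {
        Crc32Id(crc32fast::hash(bytes))
    }
}

// An index generic over the id type could then be benchmarked with different ids:
// fn build_index<Id: ResourceIdSketch>(root: &Path) -> HashMap<Id, PathBuf> { ... }

fn main() {
    let id = Crc32Id::from_bytes(b"hello");
    assert_eq!(id, Crc32Id::from_bytes(b"hello"));
}
```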
> Our current benchmarks index the tests/ directory, which is a good starting point but isn't ideal for performance evaluations. In CI, we try to provide quick feedback on potential issues rather than comprehensive performance assessments because of resource limitations.

That's all correct. I just think we also need some heavy automated indexing test which would be run only manually. We could also collect some huge dataset and store it with some cloud storage provider, but generating random data might be quicker for an arbitrary person than downloading the dataset from the cloud. What do you think?
> Do you think this could be the reason on a 64-bit architecture?

I think it contributes, yes. But maybe not the sole reason.
> but generating random data might be quicker for an arbitrary person than downloading the dataset from the cloud. What do you think?

I think that generating random data would be more straightforward to implement and maintain. But using the same dataset is better for benchmarking as it provides a more accurate representation of real-world data.
Generating random data would be the better choice for now. We can always switch to a fixed dataset later.
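If we do go with generated data, seeding the generator would at least make the dataset byte-identical across machines and runs, which recovers some of the reproducibility benefit of a shared fixed dataset. A tiny sketch, assuming the rand crate; the seed value is arbitrary:

```rust
use rand::{rngs::StdRng, Rng, SeedableRng};

fn main() {
    // A fixed seed makes every run and every machine produce the same sequence,
    // so the generated benchmark files stay identical between runs.
    let mut rng = StdRng::seed_from_u64(42);
    let first: u64 = rng.gen();
    println!("deterministic first sample: {first}");
}
```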
Apparently, aHash is not supposed to be used in our case:

> AHash does not have a fixed standard for its output
We are interested in overall performance for index construction, i.e. both pure hashing speed and the number of collisions are important. Note that there are at least 2 kinds of t1ha functions: t1ha0 (the fastest) and t1ha1 (portable). We should also measure t1ha2 and t1ha3 since they should have fewer collisions.

https://github.com/PositiveTechnologies/t1ha