ARK-Builders / arklib

Core of the programs in ARK family
MIT License
1 stars 10 forks source link

Benchmark faster hash functions #76

Open kirillt opened 5 months ago

kirillt commented 5 months ago

We are interested in overall performance for index construction, i.e. both pure hashing speed and collisions amount are important. Note that there are at least 2 kinds of t1ha function: t1ha0 (the fastest) and t1ha1 (portable).

We should also measure t1ha2 and t1ha3 since they should have less collisions.

https://github.com/PositiveTechnologies/t1ha

kirillt commented 5 months ago

These functions could be the fastest, too:

Double-check with https://rurban.github.io/smhasher/

tareknaser commented 5 months ago

Here are some hash functions that are worth benchmarking:

tareknaser commented 5 months ago

Note that "FarmHash" is not portable and is too machine-specific, as it operates differently on 64-bit and 32-bit systems.

kirillt commented 5 months ago

From the preliminary results (very limited data) on these benchmarks:

I could rank the functions like this:

  1. gxhash
  2. wyhash
  3. t1ha0
  4. farmhash
  5. highway
  6. xxh3
  7. crc32

@tareknaser how could we approach realistic benchmarking? Something like building index of a folder with 10K files from 100K to 2MB each? Does it make sense generate such a test dataset? I'll also run the benches against my own real data (using the branches you've created).

We also could have collision ratios extracted from such a benchmark.

kirillt commented 5 months ago

Benchmarks using this data:

$ ncdu
--- /mnt/data/test_all --------------------------------------------------------------------
    4.8 GiB [#############] /test2
    4.5 GiB [############ ] /test

$ find test | rg -v .ark | wc -l
12276

$ find test2 | rg -v .ark | wc -l
28

main branch:

Benchmarking index_build/index_build//mnt/data/test_all/: Collecting 100 samples in estimatindex_build/index_build//mnt/data/test_all/
                        time:   [5.2732 s 5.2812 s 5.2897 s]

Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

Collisions: 0

hash_gxhash branch:

Gnuplot not found, using plotters backend
Benchmarking index_build/index_build//mnt/data/test_all/: Collecting 100 samples in estimatindex_build/index_build//mnt/data/test_all/
                        time:   [4.4865 s 4.4969 s 4.5073 s]

Collisions: 0

tareknaser/betterhash branch (Blake3)

Benchmarking index_build/index_build//mnt/data/test_all/: Warming up for 3.0000 s
index_build/index_build//mnt/data/test_all/
                        time:   [11.234 s 11.644 s 12.108 s]

Found 13 outliers among 100 measurements (13.00%)
  2 (2.00%) high mild
  11 (11.00%) high severe

I'm a bit confused about these 2 points:

Looks that we need profiles, too:


Up: wyhash: 5.35 sec t1ha0: 5.1 sec

tareknaser commented 5 months ago

Looking at the benchmark results, here's what I think:

For gxhash:

For blake3:

We should think about the trade-off between fewer collisions and faster lookups when choosing hash sizes. We can also consider other cryptographic hash functions and see how they compare to blake3.

tareknaser commented 5 months ago

how could we approach realistic benchmarking?

Our current benchmarks index the tests/ directory, which is a good starting point but isn't ideal for performance evaluations. In CI, we try to provide quick feedback on potential issues rather than comprehensive performance assessments because of resource limitations.

Profiling can help identify performance bottlenecks within arklib. Benchmarks tell us "how fast", profiling tells us "why".

However, if we opt for large datasets for benchmarks, we can store them separately from the project code on GitHub. Then, use tools like wget or reqwest to fetch datasets during benchmark execution.

kirillt commented 5 months ago

@tareknaser

crc32 uses a smaller hash (u32) compared to gxhash (u64). This difference can slow down lookups in ResourceIndex

Do you think this could be the reason on 64-bit architecture?

Bigger hashes like u64 usually have fewer collisions, which is good. But getting zero collisions with gxhash seems strange, let's double-check that.

Totally. But CRC-32 has zero collisions, too. If I remember correctly, for test_all/test folder I had something like at least 1 collision per 1,000 resources. I would expect around 10 collisions total for main branch. Probably it's worth to check indexing code and also fix this command (at the moment it seems to not print the collisions number).

Benchmarks online might not be helpful here since we do more than just hashing. Using its larger hash ([u8; 32]) for lookups adds extra work.

Agree. But the strange part here is that difference in hashing performance between CRC-32 and Blake-3 is the same as the difference in indexing performance...

We should think about the trade-off between fewer collisions and faster lookups when choosing hash sizes.

Agree.

We can also consider other cryptographic hash functions and see how they compare to blake3.

Do you think there are faster cryptographic functions? We could try, but let's first implement this for easier experimentation:

Our current benchmarks index the tests/ directory, which is a good starting point but isn't ideal for performance evaluations. In CI, we try to provide quick feedback on potential issues rather than comprehensive performance assessments because of resource limitations.

That's all correct. I just think we need also some heavy automated indexing test which will be run only manually. We could also collect some huge dataset and store it in some cloud storage provider, but generating random data might be quicker for arbitrary person that download the dataset from the cloud. What do you think?

tareknaser commented 5 months ago

Do you think this could be the reason on 64-bit architecture?

I think it contributes, yes. But maybe not the sole reason.

but generating random data might be quicker for arbitrary person that download the dataset from the cloud. What do you think?

I think that generating random data would be more straightforward to implement and maintain. But using the same dataset is better for benchmarking as it provides a more accurate representation of real-world data.

Generating random data would be the better choice for now. we can always switch to a fixed dataset later.

tareknaser commented 5 months ago

Apparently, aHash is not supposed to be used in our case AHash does not have a fixed standard for its output