arthurlm / one-brc-rs

The One Billion Row Challenge in rust
MIT License
0 stars 0 forks source link
1brc rust

1️⃣🐝🏎️ The One Billion Row Challenge in Rust 🦀

My own implementation of the 1BRC but in Rust.

Test result on my local machine

Test are run under WSL2 (so using Hyper-V).

Hardware:

Before running anything. There is a huge performance difference when data are in memory cache or not. So for every result, I will separate first run and real benchmark.

NOTE: I am cleaning memory cache using following command as root:

sync; echo 1 > /proc/sys/vm/drop_caches

Few utilities to deal with this can be found under the utils directory.

Java implementation

I do not have GraalVm, so I will only run the best pure Java implementation.

# First run:
❯ \time -f 'Elapsed=%E' ./calculate_average_yourwass.sh > /dev/null
Elapsed=0:03.14

# After few runs (so cache is now warm):
❯ \time -f 'Elapsed=%E' ./calculate_average_yourwass.sh > /dev/null
Elapsed=0:00.93

# With a RAM disk performances are constant but a little bit slower than pure memory cache.
❯ \time -f 'Elapsed=%E' ./calculate_average_yourwass.sh > /dev/null
Elapsed=0:01.07

My implementation results

# First run:
\time -f 'Elapsed=%E' target/release/one-brc-rs  > /dev/null
Inside main total duration: 2.781221427s
Elapsed=0:02.96

# After few runs (so cache is now warm):
\time -f 'Elapsed=%E' target/release/one-brc-rs  > /dev/null
Inside main total duration: 705.814095ms
Elapsed=0:00.81

# With a RAM disk performances are constant but a little bit slower than pure memory cache.
❯ \time -f 'Elapsed=%E' target/release/one-brc-rs > /dev/null
Inside main total duration: 892.661661ms
Elapsed=0:00.97

Inspiration and ideas

What works:

  1. I start from mtb0x1 implementation
  2. I have optimize a lot the line parsing using GodBolt compiler explorer (so is is only 46 assembly instruction now 😎).
  3. I compute fxhash::hash64(...) only once and use raw hashbrown::HashTable.
  4. I use fixed point number computation only at the last moment.
  5. I have update thread stack size to have more CPU cache for the data.
  6. I do not munmap the data a let the OS releasing the memory.

What does not work:

Ideas: