1️⃣🐝🏎️ The One Billion Row Challenge in Rust 🦀

My own implementation of the 1BRC but in Rust.

Test result on my local machine

Test are run under WSL2 (so using Hyper-V).

Hardware:

CPU: AMD Ryzen 9 7950X 16-Core Processor (with 32 logical processors)
RAM: 32G
Disk: NVME

Before running anything. There is a huge performance difference when data are in memory cache or not. So for every result, I will separate first run and real benchmark.

NOTE: I am cleaning memory cache using following command as root:

sync; echo 1 > /proc/sys/vm/drop_caches

Few utilities to deal with this can be found under the utils directory.

Java implementation

I do not have GraalVm, so I will only run the best pure Java implementation.

# First run:
❯ \time -f 'Elapsed=%E' ./calculate_average_yourwass.sh > /dev/null
Elapsed=0:03.14

# After few runs (so cache is now warm):
❯ \time -f 'Elapsed=%E' ./calculate_average_yourwass.sh > /dev/null
Elapsed=0:00.93

# With a RAM disk performances are constant but a little bit slower than pure memory cache.
❯ \time -f 'Elapsed=%E' ./calculate_average_yourwass.sh > /dev/null
Elapsed=0:01.07

My implementation results

# First run:
\time -f 'Elapsed=%E' target/release/one-brc-rs  > /dev/null
Inside main total duration: 2.781221427s
Elapsed=0:02.96

# After few runs (so cache is now warm):
\time -f 'Elapsed=%E' target/release/one-brc-rs  > /dev/null
Inside main total duration: 705.814095ms
Elapsed=0:00.81

# With a RAM disk performances are constant but a little bit slower than pure memory cache.
❯ \time -f 'Elapsed=%E' target/release/one-brc-rs > /dev/null
Inside main total duration: 892.661661ms
Elapsed=0:00.97

Inspiration and ideas

What works:

I start from mtb0x1 implementation
I have optimize a lot the line parsing using GodBolt compiler explorer (so is is only 46 assembly instruction now 😎).
I compute fxhash::hash64(...) only once and use raw hashbrown::HashTable.
I use fixed point number computation only at the last moment.
I have update thread stack size to have more CPU cache for the data.
I do not munmap the data a let the OS releasing the memory.

What does not work:

Changing global allocator: using strace we can check no allocation are done after file is memory mapped. Tests have been done with mimalloc and jemalloc.
Manual line split: rayon crate does the job pretty well.
Updating mmap flags.
Manually setting core_affinity: it looks like the OS does a pretty good job.

Ideas:

Using custom HashMap?
I have created a branch with dedicated avx512 byte buffer usage. It did not work but there is some good ideas to keep IMHO.

arthurlm / one-brc-rs

readme

1️⃣🐝🏎️ The One Billion Row Challenge in Rust 🦀

Test result on my local machine

Java implementation

My implementation results

Inspiration and ideas