clockworklabs / SpacetimeDB

Multiplayer at the speed of light
https://spacetimedb.com
Other
4.12k stars 100 forks source link

Use a custom `FixedBitSet` + optimize `Page::iter_fixed_len` #1160

Closed Centril closed 2 weeks ago

Centril commented 3 weeks ago

Description of Changes

Benchmarks as compared to HEAD~1 (branched master) on i7-7700K, 64GB RAM.

iter_read_fixed_len 8_byte/iter_read_fixed_len 8_byte/0.1
                        time:   [1.2453 µs 1.2476 µs 1.2504 µs]
                        thrpt:  [4.9933 GiB/s 5.0044 GiB/s 5.0138 GiB/s]
                 change:
                        time:   [-90.640% -90.594% -90.557%] (p = 0.00 < 0.05)
                        thrpt:  [+958.94% +963.11% +968.42%]
                        Performance has improved.

iter_read_fixed_len 8_byte/iter_read_fixed_len 8_byte/0.25
                        time:   [3.1358 µs 3.1572 µs 3.1847 µs]
                        thrpt:  [4.8825 GiB/s 4.9251 GiB/s 4.9586 GiB/s]
                 change:
                        time:   [-84.648% -84.560% -84.426%] (p = 0.00 < 0.05)
                        thrpt:  [+542.11% +547.68% +551.37%]
                        Performance has improved.

iter_read_fixed_len 8_byte/iter_read_fixed_len 8_byte/0.5
                        time:   [5.4504 µs 5.4618 µs 5.4775 µs]
                        thrpt:  [5.6068 GiB/s 5.6230 GiB/s 5.6347 GiB/s]
                 change:
                        time:   [-84.835% -84.755% -84.652%] (p = 0.00 < 0.05)
                        thrpt:  [+551.56% +555.96% +559.43%]
                        Performance has improved.

iter_read_fixed_len 8_byte/iter_read_fixed_len 8_byte/0.75
                        time:   [7.6559 µs 7.6630 µs 7.6772 µs]
                        thrpt:  [5.9500 GiB/s 5.9610 GiB/s 5.9666 GiB/s]
                 change:
                        time:   [-71.593% -71.571% -71.540%] (p = 0.00 < 0.05)
                        thrpt:  [+251.37% +251.75% +252.03%]
                        Performance has improved.

iter_read_fixed_len 8_byte/iter_read_fixed_len 8_byte/0.9
                        time:   [9.0964 µs 9.1057 µs 9.1182 µs]
                        thrpt:  [6.0172 GiB/s 6.0255 GiB/s 6.0316 GiB/s]
                 change:
                        time:   [-61.702% -61.475% -61.299%] (p = 0.00 < 0.05)
                        thrpt:  [+158.39% +159.57% +161.11%]
                        Performance has improved.

iter_read_fixed_len 8_byte/iter_read_fixed_len 8_byte/1
                        time:   [10.002 µs 10.007 µs 10.012 µs]
                        thrpt:  [6.0903 GiB/s 6.0932 GiB/s 6.0960 GiB/s]
                 change:
                        time:   [-54.012% -53.966% -53.916%] (p = 0.00 < 0.05)
                        thrpt:  [+116.99% +117.23% +117.45%]
                        Performance has improved.

iter_read_fixed_len 32_byte/iter_read_fixed_len 32_byte/0.1
                        time:   [261.06 ns 261.42 ns 261.80 ns]
                        thrpt:  [24.816 GiB/s 24.853 GiB/s 24.887 GiB/s]
                 change:
                        time:   [-90.917% -90.870% -90.838%] (p = 0.00 < 0.05)
                        thrpt:  [+991.51% +995.34% +1001.0%]
                        Performance has improved.

iter_read_fixed_len 32_byte/iter_read_fixed_len 32_byte/0.25
                        time:   [757.95 ns 758.98 ns 760.35 ns]
                        thrpt:  [20.813 GiB/s 20.850 GiB/s 20.879 GiB/s]
                 change:
                        time:   [-74.311% -74.261% -74.208%] (p = 0.00 < 0.05)
                        thrpt:  [+287.72% +288.51% +289.27%]
                        Performance has improved.

iter_read_fixed_len 32_byte/iter_read_fixed_len 32_byte/0.5
                        time:   [1.3360 µs 1.3364 µs 1.3368 µs]
                        thrpt:  [22.628 GiB/s 22.636 GiB/s 22.642 GiB/s]
                 change:
                        time:   [-61.588% -61.510% -61.409%] (p = 0.00 < 0.05)
                        thrpt:  [+159.13% +159.81% +160.33%]
                        Performance has improved.

iter_read_fixed_len 32_byte/iter_read_fixed_len 32_byte/0.75
                        time:   [1.9166 µs 1.9174 µs 1.9183 µs]
                        thrpt:  [23.800 GiB/s 23.812 GiB/s 23.822 GiB/s]
                 change:
                        time:   [-50.428% -50.281% -50.131%] (p = 0.00 < 0.05)
                        thrpt:  [+100.53% +101.13% +101.73%]
                        Performance has improved.

iter_read_fixed_len 32_byte/iter_read_fixed_len 32_byte/0.9
                        time:   [2.2676 µs 2.2698 µs 2.2735 µs]
                        thrpt:  [24.133 GiB/s 24.173 GiB/s 24.196 GiB/s]
                 change:
                        time:   [-46.345% -46.272% -46.199%] (p = 0.00 < 0.05)
                        thrpt:  [+85.870% +86.123% +86.375%]
                        Performance has improved.

iter_read_fixed_len 32_byte/iter_read_fixed_len 32_byte/1
                        time:   [2.4979 µs 2.4998 µs 2.5028 µs]
                        thrpt:  [24.363 GiB/s 24.392 GiB/s 24.410 GiB/s]
                 change:
                        time:   [-39.172% -39.109% -39.052%] (p = 0.00 < 0.05)
                        thrpt:  [+64.074% +64.227% +64.397%]
                        Performance has improved.

iter_read_fixed_len 256_byte/iter_read_fixed_len 256_byte/0.1
                        time:   [31.214 ns 31.237 ns 31.264 ns]
                        thrpt:  [175.40 GiB/s 175.55 GiB/s 175.68 GiB/s]
                 change:
                        time:   [-89.765% -89.756% -89.747%] (p = 0.00 < 0.05)
                        thrpt:  [+875.28% +876.16% +877.00%]
                        Performance has improved.

iter_read_fixed_len 256_byte/iter_read_fixed_len 256_byte/0.25
                        time:   [67.296 ns 67.340 ns 67.424 ns]
                        thrpt:  [194.49 GiB/s 194.73 GiB/s 194.85 GiB/s]
                 change:
                        time:   [-79.814% -79.798% -79.777%] (p = 0.00 < 0.05)
                        thrpt:  [+394.50% +395.00% +395.38%]
                        Performance has improved.

iter_read_fixed_len 256_byte/iter_read_fixed_len 256_byte/0.5
                        time:   [163.99 ns 164.26 ns 164.56 ns]
                        thrpt:  [188.35 GiB/s 188.69 GiB/s 189.00 GiB/s]
                 change:
                        time:   [-58.921% -58.858% -58.791%] (p = 0.00 < 0.05)
                        thrpt:  [+142.66% +143.06% +143.44%]
                        Performance has improved.
iter_read_fixed_len 256_byte/iter_read_fixed_len 256_byte/0.75
                        time:   [247.31 ns 247.47 ns 247.78 ns]
                        thrpt:  [189.56 GiB/s 189.80 GiB/s 189.92 GiB/s]
                 change:
                        time:   [-46.170% -46.134% -46.080%] (p = 0.00 < 0.05)
                        thrpt:  [+85.460% +85.645% +85.770%]
                        Performance has improved.

iter_read_fixed_len 256_byte/iter_read_fixed_len 256_byte/0.9
                        time:   [293.82 ns 294.04 ns 294.30 ns]
                        thrpt:  [187.95 GiB/s 188.11 GiB/s 188.25 GiB/s]
                 change:
                        time:   [-40.039% -39.889% -39.644%] (p = 0.00 < 0.05)
                        thrpt:  [+65.685% +66.358% +66.776%]
                        Performance has improved.

iter_read_fixed_len 256_byte/iter_read_fixed_len 256_byte/1
                        time:   [318.79 ns 319.17 ns 319.92 ns]
                        thrpt:  [190.04 GiB/s 190.49 GiB/s 190.71 GiB/s]
                 change:
                        time:   [-38.455% -38.353% -38.268%] (p = 0.00 < 0.05)
                        thrpt:  [+61.991% +62.213% +62.482%]
                        Performance has improved.

API and ABI breaking changes

None

Expected complexity level and risk

2, contained changes, but some unsafe included.

Testing

Proptests are included and they have been run in miri

Centril commented 2 weeks ago

Why did you choose to make FixedBitSet generic over B: BitBlock, rather than just using u64 directly?

Initially, I started off with a type alias type BitBlock = u32;, but eventually I felt that B being generic helped enforce "unit correctness" and I also heard that James might want a custom bitset for other places.

Did you run any experiments with different block sizes?

Initially, I tried u32 just because that was the previous block size for BitVec. However, changing to u64 did give some slight improvements in benchmarks. They were not huge, but they were notable. I don't recall how significant they were however.

Do you foresee us wanting to choose a different block size?

Might be that James's use case wants u32 to be more dense, but u64 might also be all we want.

Shubham8287 commented 2 weeks ago

Just went through code, It was fun to read FixedBitSet code. I really appreciate how well the code is written. Thanks! :raised_hands: