Closed LunarLambda closed 3 months ago
Here are my results:
$ cargo bench grapheme -- --save-baseline master
Finished bench [optimized + debuginfo] target(s) in 0.01s
Running src/bench.rs (/home/andrew/code/rust/bstr/target/release/deps/bstr-858b8ba21e5e6585)
WARNING: HTML report generation will become a non-default optional feature in Criterion.rs 0.4.0.
This feature is being moved to cargo-criterion (https://github.com/bheisler/cargo-criterion) and will be optional in a future version of Criterion.rs. To silence this warning, either switch to cargo-criterion or enable the 'html_reports' feature in your Cargo.toml.
Gnuplot not found, using plotters backend
bstr/graphemes/en-small-ascii
time: [7.4085 µs 7.4276 µs 7.4468 µs]
thrpt: [128.06 MiB/s 128.40 MiB/s 128.73 MiB/s]
bstr/graphemes/en-tiny-ascii
time: [213.56 ns 214.09 ns 214.66 ns]
thrpt: [177.71 MiB/s 178.18 MiB/s 178.62 MiB/s]
Found 7 outliers among 100 measurements (7.00%)
6 (6.00%) high mild
1 (1.00%) high severe
bstr/graphemes/ru-small-utf8
time: [16.082 µs 16.105 µs 16.129 µs]
thrpt: [59.127 MiB/s 59.215 MiB/s 59.301 MiB/s]
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild
bstr/graphemes/ru-tiny-utf8
time: [642.86 ns 644.87 ns 646.70 ns]
thrpt: [58.987 MiB/s 59.155 MiB/s 59.339 MiB/s]
bstr/graphemes/zh-small-utf8
time: [14.548 µs 14.601 µs 14.655 µs]
thrpt: [60.128 MiB/s 60.350 MiB/s 60.570 MiB/s]
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
bstr/graphemes/zh-tiny-utf8
time: [625.13 ns 627.68 ns 630.19 ns]
thrpt: [60.533 MiB/s 60.775 MiB/s 61.023 MiB/s]
unicode-segmentation/graphemes/en-small-ascii
time: [4.4705 µs 4.4811 µs 4.4922 µs]
thrpt: [212.29 MiB/s 212.82 MiB/s 213.33 MiB/s]
Found 4 outliers among 100 measurements (4.00%)
1 (1.00%) low mild
3 (3.00%) high mild
unicode-segmentation/graphemes/en-tiny-ascii
time: [181.43 ns 182.09 ns 182.76 ns]
thrpt: [208.72 MiB/s 209.50 MiB/s 210.25 MiB/s]
Found 3 outliers among 100 measurements (3.00%)
3 (3.00%) high mild
unicode-segmentation/graphemes/ru-small-utf8
time: [3.3936 µs 3.4015 µs 3.4096 µs]
thrpt: [279.70 MiB/s 280.37 MiB/s 281.02 MiB/s]
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) low mild
1 (1.00%) high mild
unicode-segmentation/graphemes/ru-tiny-utf8
time: [113.33 ns 113.89 ns 114.45 ns]
thrpt: [333.31 MiB/s 334.93 MiB/s 336.59 MiB/s]
Found 8 outliers among 100 measurements (8.00%)
3 (3.00%) high mild
5 (5.00%) high severe
unicode-segmentation/graphemes/zh-small-utf8
time: [2.0096 µs 2.0141 µs 2.0189 µs]
thrpt: [436.46 MiB/s 437.51 MiB/s 438.50 MiB/s]
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) high mild
1 (1.00%) high severe
unicode-segmentation/graphemes/zh-tiny-utf8
time: [78.885 ns 79.149 ns 79.432 ns]
thrpt: [480.25 MiB/s 481.96 MiB/s 483.58 MiB/s]
Found 6 outliers among 100 measurements (6.00%)
2 (2.00%) low mild
3 (3.00%) high mild
1 (1.00%) high severe
And in a much easier to read format:
$ critcmp master -g '[^/]+/(.*)'
group master/bstr/ master/unicode-segmentation/
----- ------------ ----------------------------
graphemes/en-small-ascii 1.65 7.4±0.08µs 128.4 MB/sec 1.00 4.5±0.06µs 212.3 MB/sec
graphemes/en-tiny-ascii 1.17 213.9±2.37ns 178.4 MB/sec 1.00 183.1±3.50ns 208.3 MB/sec
graphemes/ru-small-utf8 4.71 16.1±0.11µs 59.3 MB/sec 1.00 3.4±0.04µs 279.7 MB/sec
graphemes/ru-tiny-utf8 5.59 642.4±6.58ns 59.4 MB/sec 1.00 115.0±5.01ns 331.8 MB/sec
graphemes/zh-small-utf8 7.23 14.6±0.29µs 60.5 MB/sec 1.00 2.0±0.03µs 436.9 MB/sec
graphemes/zh-tiny-utf8 7.88 625.9±11.72ns 60.9 MB/sec 1.00 79.4±2.09ns 480.3 MB/sec
So this crate's implementation is universally slower, but it's not 5-7x slower. It's 1.2x-8x slower. Which means it depends on the specific workload.
I personally don't have any plans to work on this in the immediate future, but other folks are welcome to work on it. If you think it will take significant changes to fix, we should discuss first. But there are perhaps some easy wins. Not sure.
The benchmark also panicked partway through:
Benchmarking bstr/for_byte_line/ascii: Warming up for 3.0000 sthread 'main' panicked at bench/src/bench.rs:257:13: assertion failed: count > 0
This is fixed on master
. I'm not sure how it ever worked? Weird.
Other benchmarks also tended to trail ~3-4x behind
std
, except forto_str
andto_str_lossy_valid
.
... and except for at least several find
benchmarks where bstr
is 30x-50x faster.
[andrew@duff bench]$ critcmp master -g '[^/]+/(.*)' -f '/find/'
group master/bstr/ master/std/
----- ------------ -----------
find/pathological/repeated-huge 1.00 7.3±0.01µs 63.4 GB/sec 39.33 288.9±0.63µs 1650.6 MB/sec
find/pathological/repeated-small 1.00 45.0±0.31ns 20.7 GB/sec 13.23 595.9±1.65ns 1602.0 MB/sec
find/rare/en-huge-ascii 1.00 7.7±0.01µs 61.9 GB/sec 26.14 201.5±1.58µs 2.4 GB/sec
find/rare/en-small-ascii 1.00 68.3±1.46ns 13.6 GB/sec 1.57 107.4±1.48ns 8.7 GB/sec
find/verycommon1/en-huge-ascii 1.00 343.2±2.01µs 1422.6 MB/sec 2.06 706.6±5.53µs 691.0 MB/sec
find/verycommon1/en-small-ascii 1.40 719.3±10.13ns 1325.9 MB/sec 1.00 514.1±8.27ns 1855.0 MB/sec
find/verycommon1/en-tiny-ascii 1.28 32.6±0.24ns 1168.6 MB/sec 1.00 25.5±0.39ns 1497.0 MB/sec
find/verycommon2/en-huge-ascii 1.00 7.8±1.15µs 61.0 GB/sec 49.53 387.1±13.35µs 1261.4 MB/sec
find/verycommon2/en-small-ascii 1.00 37.5±1.96ns 24.8 GB/sec 6.91 259.3±5.15ns 3.6 GB/sec
find/verycommon2/en-tiny-ascii 1.12 21.9±0.05ns 1744.3 MB/sec 1.00 19.6±0.40ns 1948.5 MB/sec
Otherwise I'm not really sure what to do with this issue. It seems more like a general complaint "hey some things aren't as fast as std" instead of something actionable. So I'm going to close it. And note that some benchmarks are likely impossible for bstr
to do better on. For example, chars()
. In std
, chars()
is defined on a &str
which is known to be valid UTF-8. But in bstr
, the string is not known to be valid UTF-8, so there is necessarily more work that needs to be done. Does this mean bstr
is the fastest possible given those constraints? No, certainly not.
Came over from https://github.com/unicode-rs/unicode-segmentation/issues/46 looking for something that can do single pass lossy UTF-8 decode and grapheme iteration. Thank you for your efforts, of course.
Unfortunately it's quite significantly behind unicode-segmentation in performance, despite the benchmarks using only valid UTF-8 inputs.
Benchmarks were done on a Debian 12 Linux machine with an Intel i5-7500T CPU and 16 GB of RAM, using stable rustc 1.75.0 graphemes_bench.txt
Other benchmarks also tended to trail ~3-4x behind
std
, except forto_str
andto_str_lossy_valid
.The benchmark also panicked partway through: