Grapheme segmentation is 1.2x-8x slower than `unicode-segmentation` in benchmarks

LunarLambda commented 3 months ago

Came over from https://github.com/unicode-rs/unicode-segmentation/issues/46 looking for something that can do single pass lossy UTF-8 decode and grapheme iteration. Thank you for your efforts, of course.

Unfortunately it's quite significantly behind unicode-segmentation in performance, despite the benchmarks using only valid UTF-8 inputs.

Benchmarks were done on a Debian 12 Linux machine with an Intel i5-7500T CPU and 16 GB of RAM, using stable rustc 1.75.0 graphemes_bench.txt

Other benchmarks also tended to trail ~3-4x behind std, except for to_str and to_str_lossy_valid.

The benchmark also panicked partway through:

Benchmarking bstr/for_byte_line/ascii: Warming up for 3.0000 sthread 'main' panicked at bench/src/bench.rs:257:13: assertion failed: count > 0

BurntSushi commented 3 months ago

Here are my results:

$ cargo bench grapheme -- --save-baseline master
    Finished bench [optimized + debuginfo] target(s) in 0.01s
     Running src/bench.rs (/home/andrew/code/rust/bstr/target/release/deps/bstr-858b8ba21e5e6585)
WARNING: HTML report generation will become a non-default optional feature in Criterion.rs 0.4.0.
This feature is being moved to cargo-criterion (https://github.com/bheisler/cargo-criterion) and will be optional in a future version of Criterion.rs. To silence this warning, either switch to cargo-criterion or enable the 'html_reports' feature in your Cargo.toml.

Gnuplot not found, using plotters backend
bstr/graphemes/en-small-ascii
                        time:   [7.4085 µs 7.4276 µs 7.4468 µs]
                        thrpt:  [128.06 MiB/s 128.40 MiB/s 128.73 MiB/s]

bstr/graphemes/en-tiny-ascii
                        time:   [213.56 ns 214.09 ns 214.66 ns]
                        thrpt:  [177.71 MiB/s 178.18 MiB/s 178.62 MiB/s]
Found 7 outliers among 100 measurements (7.00%)
  6 (6.00%) high mild
  1 (1.00%) high severe

bstr/graphemes/ru-small-utf8
                        time:   [16.082 µs 16.105 µs 16.129 µs]
                        thrpt:  [59.127 MiB/s 59.215 MiB/s 59.301 MiB/s]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

bstr/graphemes/ru-tiny-utf8
                        time:   [642.86 ns 644.87 ns 646.70 ns]
                        thrpt:  [58.987 MiB/s 59.155 MiB/s 59.339 MiB/s]

bstr/graphemes/zh-small-utf8
                        time:   [14.548 µs 14.601 µs 14.655 µs]
                        thrpt:  [60.128 MiB/s 60.350 MiB/s 60.570 MiB/s]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

bstr/graphemes/zh-tiny-utf8
                        time:   [625.13 ns 627.68 ns 630.19 ns]
                        thrpt:  [60.533 MiB/s 60.775 MiB/s 61.023 MiB/s]

unicode-segmentation/graphemes/en-small-ascii
                        time:   [4.4705 µs 4.4811 µs 4.4922 µs]
                        thrpt:  [212.29 MiB/s 212.82 MiB/s 213.33 MiB/s]
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild

unicode-segmentation/graphemes/en-tiny-ascii
                        time:   [181.43 ns 182.09 ns 182.76 ns]
                        thrpt:  [208.72 MiB/s 209.50 MiB/s 210.25 MiB/s]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

unicode-segmentation/graphemes/ru-small-utf8
                        time:   [3.3936 µs 3.4015 µs 3.4096 µs]
                        thrpt:  [279.70 MiB/s 280.37 MiB/s 281.02 MiB/s]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild

unicode-segmentation/graphemes/ru-tiny-utf8
                        time:   [113.33 ns 113.89 ns 114.45 ns]
                        thrpt:  [333.31 MiB/s 334.93 MiB/s 336.59 MiB/s]
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe

unicode-segmentation/graphemes/zh-small-utf8
                        time:   [2.0096 µs 2.0141 µs 2.0189 µs]
                        thrpt:  [436.46 MiB/s 437.51 MiB/s 438.50 MiB/s]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

unicode-segmentation/graphemes/zh-tiny-utf8
                        time:   [78.885 ns 79.149 ns 79.432 ns]
                        thrpt:  [480.25 MiB/s 481.96 MiB/s 483.58 MiB/s]
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe

And in a much easier to read format:

$ critcmp master -g '[^/]+/(.*)'
group                       master/bstr/                           master/unicode-segmentation/
-----                       ------------                           ----------------------------
graphemes/en-small-ascii    1.65      7.4±0.08µs   128.4 MB/sec    1.00      4.5±0.06µs   212.3 MB/sec
graphemes/en-tiny-ascii     1.17    213.9±2.37ns   178.4 MB/sec    1.00    183.1±3.50ns   208.3 MB/sec
graphemes/ru-small-utf8     4.71     16.1±0.11µs    59.3 MB/sec    1.00      3.4±0.04µs   279.7 MB/sec
graphemes/ru-tiny-utf8      5.59    642.4±6.58ns    59.4 MB/sec    1.00    115.0±5.01ns   331.8 MB/sec
graphemes/zh-small-utf8     7.23     14.6±0.29µs    60.5 MB/sec    1.00      2.0±0.03µs   436.9 MB/sec
graphemes/zh-tiny-utf8      7.88   625.9±11.72ns    60.9 MB/sec    1.00     79.4±2.09ns   480.3 MB/sec

So this crate's implementation is universally slower, but it's not 5-7x slower. It's 1.2x-8x slower. Which means it depends on the specific workload.

I personally don't have any plans to work on this in the immediate future, but other folks are welcome to work on it. If you think it will take significant changes to fix, we should discuss first. But there are perhaps some easy wins. Not sure.

BurntSushi commented 3 months ago

The benchmark also panicked partway through:

Benchmarking bstr/for_byte_line/ascii: Warming up for 3.0000 sthread 'main' panicked at bench/src/bench.rs:257:13: assertion failed: count > 0

This is fixed on master. I'm not sure how it ever worked? Weird.

BurntSushi commented 3 months ago

Other benchmarks also tended to trail ~3-4x behind std, except for to_str and to_str_lossy_valid.

... and except for at least several find benchmarks where bstr is 30x-50x faster.

[andrew@duff bench]$ critcmp master -g '[^/]+/(.*)' -f '/find/'
group                               master/bstr/                           master/std/
-----                               ------------                           -----------
find/pathological/repeated-huge     1.00      7.3±0.01µs    63.4 GB/sec    39.33   288.9±0.63µs  1650.6 MB/sec
find/pathological/repeated-small    1.00     45.0±0.31ns    20.7 GB/sec    13.23   595.9±1.65ns  1602.0 MB/sec
find/rare/en-huge-ascii             1.00      7.7±0.01µs    61.9 GB/sec    26.14   201.5±1.58µs     2.4 GB/sec
find/rare/en-small-ascii            1.00     68.3±1.46ns    13.6 GB/sec    1.57    107.4±1.48ns     8.7 GB/sec
find/verycommon1/en-huge-ascii      1.00    343.2±2.01µs  1422.6 MB/sec    2.06    706.6±5.53µs   691.0 MB/sec
find/verycommon1/en-small-ascii     1.40   719.3±10.13ns  1325.9 MB/sec    1.00    514.1±8.27ns  1855.0 MB/sec
find/verycommon1/en-tiny-ascii      1.28     32.6±0.24ns  1168.6 MB/sec    1.00     25.5±0.39ns  1497.0 MB/sec
find/verycommon2/en-huge-ascii      1.00      7.8±1.15µs    61.0 GB/sec    49.53  387.1±13.35µs  1261.4 MB/sec
find/verycommon2/en-small-ascii     1.00     37.5±1.96ns    24.8 GB/sec    6.91    259.3±5.15ns     3.6 GB/sec
find/verycommon2/en-tiny-ascii      1.12     21.9±0.05ns  1744.3 MB/sec    1.00     19.6±0.40ns  1948.5 MB/sec

Otherwise I'm not really sure what to do with this issue. It seems more like a general complaint "hey some things aren't as fast as std" instead of something actionable. So I'm going to close it. And note that some benchmarks are likely impossible for bstr to do better on. For example, chars(). In std, chars() is defined on a &str which is known to be valid UTF-8. But in bstr, the string is not known to be valid UTF-8, so there is necessarily more work that needs to be done. Does this mean bstr is the fastest possible given those constraints? No, certainly not.

BurntSushi / bstr

Grapheme segmentation is 1.2x-8x slower than `unicode-segmentation` in benchmarks #179