denoland / deno_core

The core engine at the heart of Deno

op2: Investigate `#[string]` performance for LARGE_1000000 #44

Open mmastrac opened 1 year ago

mmastrac commented 1 year ago

op2 strings are faster in every case other than LARGE_1000000 (1,000,000 ASCII characters). We need to investigate why.

test baseline                               ... bench:         878 ns/iter (+/- 96)
test bench_op_option_u32                    ... bench:      49,690 ns/iter (+/- 22,588)
test bench_op_string                        ... bench:      18,925 ns/iter (+/- 2,256)
test bench_op_string_large_1000             ... bench:     297,396 ns/iter (+/- 41,330)
test bench_op_string_large_1000000          ... bench:   2,622,869 ns/iter (+/- 298,615)
test bench_op_string_large_utf8_1000        ... bench:   3,946,605 ns/iter (+/- 403,230)
test bench_op_string_large_utf8_1000000     ... bench:  38,985,146 ns/iter (+/- 2,266,213)
test bench_op_string_old                    ... bench:      19,870 ns/iter (+/- 2,354)
test bench_op_string_old_large_1000         ... bench:     246,036 ns/iter (+/- 40,192)
test bench_op_string_old_large_1000000      ... bench:   1,082,275 ns/iter (+/- 104,487)
test bench_op_string_old_large_utf8_1000    ... bench:   5,485,882 ns/iter (+/- 489,366)
test bench_op_string_old_large_utf8_1000000 ... bench:  51,652,968 ns/iter (+/- 3,158,678)
test bench_op_string_option_u32             ... bench:      82,449 ns/iter (+/- 10,669)
test bench_op_u32                           ... bench:       4,508 ns/iter (+/- 575)
test bench_op_void                          ... bench:       5,054 ns/iter (+/- 419)
littledivy commented 1 year ago

@mmastrac It seems this has been fixed in main?

test bench_op_string_large_utf8_1000000     ... bench:  15,772,187 ns/iter (+/- 462,927)
...
test bench_op_string_old_large_utf8_1000000 ... bench:  20,796,803 ns/iter (+/- 354,834)
mmastrac commented 1 year ago

The UTF-8 one is faster with op2, but for some reason the ASCII one is not. I think the benchmark has improved on main but is still slower (~50% slower, I think?).

Trimmed recent benchmark:

test bench_op_string_large_1000000          ... bench:     790,843 ns/iter (+/- 28,126)
test bench_op_string_old_large_1000000      ... bench:     471,671 ns/iter (+/- 70,252)

I wonder if we're just falling off some SIMD/autovectorization fast path?
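To make the hypothesis concrete: a minimal pure-Rust sketch (illustrative only, not deno_core's actual code) of the kind of ASCII fast path in question. When the input is known to be pure ASCII, building a Rust `String` is a single bulk copy that the compiler can vectorize; falling off that path means per-byte transcoding, which is much slower at 1,000,000 characters.

```rust
// Illustrative sketch of an "ASCII fast path" for string conversion.
// This is NOT deno_core's implementation; it only models the idea that
// a bulk copy vs. per-byte transcoding differ by a large constant factor.

/// Convert a Latin-1 byte buffer to a Rust String, taking a bulk-copy
/// fast path when the content is pure ASCII.
fn to_string_fast(bytes: &[u8]) -> String {
    if bytes.is_ascii() {
        // Fast path: ASCII is valid UTF-8, so this is just a memcpy-like
        // copy with no transcoding (easily autovectorized).
        unsafe { String::from_utf8_unchecked(bytes.to_vec()) }
    } else {
        // Slow path: per-byte Latin-1 -> UTF-8 transcoding.
        bytes.iter().map(|&b| b as char).collect()
    }
}

fn main() {
    let ascii = vec![b'a'; 1_000_000];
    assert_eq!(to_string_fast(&ascii).len(), 1_000_000);

    // 0xE9 is 'é' in Latin-1; it expands to 2 bytes in UTF-8.
    let latin1 = vec![0xE9u8; 4];
    let s = to_string_fast(&latin1);
    assert_eq!(s, "éééé");
    assert_eq!(s.len(), 8);
}
```

If the benchmarked code ends up on the per-byte branch (or defeats autovectorization in the bulk branch), that alone could explain a ~2x gap on large ASCII inputs.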

littledivy commented 1 year ago

Ah ok, here's the profile for each one:

bench_op_string_large_1000000 - https://share.firefox.dev/44fuKRX
bench_op_string_old_large_1000000 - https://share.firefox.dev/44WKIBf

It seems the fast call path is not taken in either case. The other difference is that the old implementation uses `WriteUtf8` whereas we use `WriteOneByte`.
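For readers unfamiliar with the two V8 extraction APIs: `WriteOneByte` copies a string's one-byte (Latin-1) backing store verbatim, while `WriteUtf8` transcodes to UTF-8 during the copy. A small pure-Rust model of that cost difference (hypothetical sketch, not V8 code):

```rust
// Sketch modeling the difference between V8's WriteOneByte (verbatim
// copy of the one-byte/Latin-1 representation) and WriteUtf8 (transcode
// to UTF-8 while copying). Hypothetical model, not V8's implementation.

/// Models WriteOneByte: a single bulk copy of a one-byte string.
fn write_one_byte(src: &[u8], dst: &mut Vec<u8>) {
    dst.clear();
    dst.extend_from_slice(src); // memcpy-like
}

/// Models WriteUtf8: transcode Latin-1 to UTF-8 byte by byte.
fn write_utf8(src: &[u8], dst: &mut Vec<u8>) {
    dst.clear();
    for &b in src {
        if b < 0x80 {
            dst.push(b);
        } else {
            // Non-ASCII Latin-1 code points take two UTF-8 bytes.
            dst.push(0xC0 | (b >> 6));
            dst.push(0x80 | (b & 0x3F));
        }
    }
}

fn main() {
    let ascii = vec![b'x'; 16];
    let (mut a, mut b) = (Vec::new(), Vec::new());
    write_one_byte(&ascii, &mut a);
    write_utf8(&ascii, &mut b);
    // For pure-ASCII content the two outputs are byte-identical, but the
    // one-byte path is a single bulk copy with no per-byte branching.
    assert_eq!(a, b);
}
```

Under this model, for a 1,000,000-character ASCII string both APIs produce the same bytes, so any slowdown would come from which copy loop each call takes inside V8 rather than from the result itself.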