denoland / deno_core

The core engine at the heart of Deno

op2: Investigate `#[string]` performance for LARGE_1000000 #44

Open mmastrac opened 1 year ago

mmastrac commented 1 year ago

op2 strings are faster in every case other than LARGE_1000000 (1,000,000 ASCII characters). We need to investigate why.

test baseline                               ... bench:         878 ns/iter (+/- 96)
test bench_op_option_u32                    ... bench:      49,690 ns/iter (+/- 22,588)
test bench_op_string                        ... bench:      18,925 ns/iter (+/- 2,256)
test bench_op_string_large_1000             ... bench:     297,396 ns/iter (+/- 41,330)
test bench_op_string_large_1000000          ... bench:   2,622,869 ns/iter (+/- 298,615)
test bench_op_string_large_utf8_1000        ... bench:   3,946,605 ns/iter (+/- 403,230)
test bench_op_string_large_utf8_1000000     ... bench:  38,985,146 ns/iter (+/- 2,266,213)
test bench_op_string_old                    ... bench:      19,870 ns/iter (+/- 2,354)
test bench_op_string_old_large_1000         ... bench:     246,036 ns/iter (+/- 40,192)
test bench_op_string_old_large_1000000      ... bench:   1,082,275 ns/iter (+/- 104,487)
test bench_op_string_old_large_utf8_1000    ... bench:   5,485,882 ns/iter (+/- 489,366)
test bench_op_string_old_large_utf8_1000000 ... bench:  51,652,968 ns/iter (+/- 3,158,678)
test bench_op_string_option_u32             ... bench:      82,449 ns/iter (+/- 10,669)
test bench_op_u32                           ... bench:       4,508 ns/iter (+/- 575)
test bench_op_void                          ... bench:       5,054 ns/iter (+/- 419)
littledivy commented 1 year ago

@mmastrac It seems this has been fixed in main?

test bench_op_string_large_utf8_1000000     ... bench:  15,772,187 ns/iter (+/- 462,927)
...
test bench_op_string_old_large_utf8_1000000 ... bench:  20,796,803 ns/iter (+/- 354,834)
mmastrac commented 1 year ago

The UTF-8 one is faster with op2, but for some reason the ASCII one is not. I think the benchmark has improved on main but is still slower (~50% slower, I think?).

Trimmed recent benchmark:

test bench_op_string_large_1000000          ... bench:     790,843 ns/iter (+/- 28,126)
test bench_op_string_old_large_1000000      ... bench:     471,671 ns/iter (+/- 70,252)

I wonder if we're just falling off some SIMD/autovectorization fast path?
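To make the hypothesis concrete: a minimal pure-Rust sketch (illustrative only, not deno_core's actual code) of the kind of ASCII fast path in question. When the input is known to be pure ASCII, building a Rust `String` is a single bulk copy that the compiler can vectorize; falling off that path means per-byte transcoding, which is much slower at 1,000,000 characters.

```rust
// Illustrative sketch of an "ASCII fast path" for string conversion.
// This is NOT deno_core's implementation; it only models the idea that
// a bulk copy vs. per-byte transcoding differ by a large constant factor.

/// Convert a Latin-1 byte buffer to a Rust String, taking a bulk-copy
/// fast path when the content is pure ASCII.
fn to_string_fast(bytes: &[u8]) -> String {
    if bytes.is_ascii() {
        // Fast path: ASCII is valid UTF-8, so this is just a memcpy-like
        // copy with no transcoding (easily autovectorized).
        unsafe { String::from_utf8_unchecked(bytes.to_vec()) }
    } else {
        // Slow path: per-byte Latin-1 -> UTF-8 transcoding.
        bytes.iter().map(|&b| b as char).collect()
    }
}

fn main() {
    let ascii = vec![b'a'; 1_000_000];
    assert_eq!(to_string_fast(&ascii).len(), 1_000_000);

    // 0xE9 is 'é' in Latin-1; it expands to 2 bytes in UTF-8.
    let latin1 = vec![0xE9u8; 4];
    let s = to_string_fast(&latin1);
    assert_eq!(s, "éééé");
    assert_eq!(s.len(), 8);
}
```

If the benchmarked code ends up on the per-byte branch (or defeats autovectorization in the bulk branch), that alone could explain a ~2x gap on large ASCII inputs.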

littledivy commented 1 year ago

Ah ok, here's the profile for each one:

bench_op_string_large_1000000 - https://share.firefox.dev/44fuKRX
bench_op_string_old_large_1000000 - https://share.firefox.dev/44WKIBf

It seems the fast call path is not taken in either case. The other difference is that the old implementation uses `WriteUtf8` whereas we use `WriteOneByte`.
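For readers unfamiliar with the two V8 extraction APIs: `WriteOneByte` copies a string's one-byte (Latin-1) backing store verbatim, while `WriteUtf8` transcodes to UTF-8 during the copy. A small pure-Rust model of that cost difference (hypothetical sketch, not V8 code):

```rust
// Sketch modeling the difference between V8's WriteOneByte (verbatim
// copy of the one-byte/Latin-1 representation) and WriteUtf8 (transcode
// to UTF-8 while copying). Hypothetical model, not V8's implementation.

/// Models WriteOneByte: a single bulk copy of a one-byte string.
fn write_one_byte(src: &[u8], dst: &mut Vec<u8>) {
    dst.clear();
    dst.extend_from_slice(src); // memcpy-like
}

/// Models WriteUtf8: transcode Latin-1 to UTF-8 byte by byte.
fn write_utf8(src: &[u8], dst: &mut Vec<u8>) {
    dst.clear();
    for &b in src {
        if b < 0x80 {
            dst.push(b);
        } else {
            // Non-ASCII Latin-1 code points take two UTF-8 bytes.
            dst.push(0xC0 | (b >> 6));
            dst.push(0x80 | (b & 0x3F));
        }
    }
}

fn main() {
    let ascii = vec![b'x'; 16];
    let (mut a, mut b) = (Vec::new(), Vec::new());
    write_one_byte(&ascii, &mut a);
    write_utf8(&ascii, &mut b);
    // For pure-ASCII content the two outputs are byte-identical, but the
    // one-byte path is a single bulk copy with no per-byte branching.
    assert_eq!(a, b);
}
```

Under this model, for a 1,000,000-character ASCII string both APIs produce the same bytes, so any slowdown would come from which copy loop each call takes inside V8 rather than from the result itself.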