clockworklabs / SpacetimeDB

Multiplayer at the speed of light
https://spacetimedb.com
Other
4.41k stars 110 forks source link

Optimise C# in a ridiculous way #1924

Closed RReverser closed 3 weeks ago

RReverser commented 3 weeks ago

Description of Changes

Clickbait title, but there is no other way to say this. It must be the weirdest way I optimised anything yet.

I was suspicious that a lot of benchmarks, in particular those that deal with iteration - which we have for C# as of recently - take about the same amount of time. After digging in for a bit, I narrowed it down to merely presence of this destructor. Looks like the way destructors are implemented for the Wasm target somehow deoptimises the execution to the point where simply removing it speeds up benchmarks by 15x or more.

We don't really need it anyway, it was more of a nice-to-have in case user does something weird, since regular iterator usage like foreach always calls Dispose() correctly.

Further investigation and an upstream issue to .NET coming up, but meanwhile here are benchmark differences before/after:

Benchmark Speedup C# before C# rate before C# after C# rate after
special/db_game/circles/load=10 14.58 3.3±0.24ms ? ?/sec 224.1±21.57µs ? ?/sec
special/db_game/circles/load=100 14.93 3.3±0.20ms ? ?/sec 223.4±16.61µs ? ?/sec
stdb_module/mem/filter/string/index/load=2048/count=256 15.79 3.1±0.11ms 321 Elem/sec 197.0±2.97µs 5.0 KElem/sec
stdb_module/mem/filter/u64/index/load=2048/count=256 19.15 3.2±0.11ms 315 Elem/sec 165.3±2.86µs 5.9 KElem/sec
stdb_module/mem/insert_bulk/u32_u64_u64/btree_each_column/load=2048/count=256 2.00 3.2±0.11ms 314 Elem/sec 1589.3±45.07µs 629 Elem/sec
stdb_module/mem/insert_bulk/u32_u64_u64/unique_0/load=2048/count=256 16.71 3.2±0.10ms 312 Elem/sec 191.7±2.98µs 5.1 KElem/sec
stdb_module/mem/iterate/u32_u64_str/unique_0/count=256 31.72 3.1±0.11ms 325 Elem/sec 97.0±2.47µs 10.1 KElem/sec
stdb_module/mem/iterate/u32_u64_u64/unique_0/count=256 34.63 3.4±0.11ms 296 Elem/sec 97.5±2.40µs 10.0 KElem/sec
stdb_module/mem/update_bulk/u32_u64_str/unique_0/load=2048/count=256 14.31 3.1±0.12ms 318 Elem/sec 219.2±7.24µs 4.5 KElem/sec
stdb_module/mem/update_bulk/u32_u64_u64/unique_0/load=2048/count=256 15.02 3.2±0.06ms 315 Elem/sec 210.9±2.46µs 4.6 KElem/sec

API and ABI breaking changes

If this is an API or ABI breaking change, please apply the corresponding GitHub label.

Expected complexity level and risk

How complicated do you think these changes are? Grade on a scale from 1 to 5, where 1 is a trivial change, and 5 is a deep-reaching and complex change.

This complexity rating applies not only to the complexity apparent in the diff, but also to its interactions with existing and future code.

If you answered more than a 2, explain what is complex about the PR, and what other components it interacts with in potentially concerning ways.

Testing

Describe any testing you've done, and any testing you'd like your reviewers to do, so that you're confident that all the changes work as expected!

RReverser commented 3 weeks ago

How did you even find this?

It started from digging into why iterators specifically are so slow, making some optimisations, being weirded out by not noticing any improvement, and eventually ended with a "scientific guess" that if it all takes roughly constant amount of time, GC must be involved somehow. And voila, indeed it was.

And to be clear, is it removing the finalizer, or adding the sealed which has this effect?

It's removing the finalizer.

sealed was added after the fact, because after removing the finalizer I also removed GC.SuppressFinalize(this); - which is no longer necessary - and then linter would warn about subclasses potentially adding their own finalizers. So I made the class sealed just to tell the linter it's okay and there will be no subclasses.