Robbepop opened this issue 3 months ago
Do you have a link to the module with the largest regression? That'd probably be the easiest to start with in terms of evaluating validation performance on that module and seeing what sticks out in perf
Do you mean a Wasm module?
From what I saw, `tiny_keccak` is a good contender, but there were others.
To benchmark the `tiny_keccak` translation test cases in Wasmi, use:
cargo bench --bench benches translate/tiny_keccak/
The `tiny_keccak` benchmark can be found here: https://github.com/wasmi-labs/rust-benchmarks/tree/main/cases/tiny_keccak
I just found one reason why it regressed: I benchmarked without setting the `no-hash-maps` feature. Generally, `BTreeMap` is faster than `HashMap` for Wasmi's uses, and the old `wasmparser-nostd` fork always used `BTreeMap`s. With this feature set, local benchmarks still indicate regressions, but they are less severe, ranging between 10-25% instead of 15-35%. The unchecked (no validation) case now even shows slightly improved performance, which strongly indicates to me that the regressions we see come from Wasm validation.
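For reference, the benchmark invocation with the feature enabled would look roughly like this (assuming `no-hash-maps` is exposed as a Cargo feature of the crate being benchmarked):
cargo bench --bench benches --features no-hash-maps translate/tiny_keccak/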
Ok I've been poking at this a bit and so far what I've found is that #701, the first commit after 0.100.0, led to a precipitous drop in performance. That was known at the time however. There are other smaller regressions here and there but they're looking to be more difficult to pin down.
Profiling the code itself does not show any unexpected hot spots. That means the slowdown is just sort of spread out rather than concentrated in one place.
Doing various tweaks seems to help here and there, but nothing is a major improvement. For example, outlining string-based errors into their own `#[cold]` function sometimes helps and sometimes doesn't.
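As a rough illustration of that tweak (a sketch, not the actual wasmparser code), the idea is to keep string formatting out of the hot path by moving error construction into a cold, never-inlined helper:

```rust
// Sketch: the hot validation path only branches; building the error
// message happens in an outlined, cold helper.
#[cold]
#[inline(never)]
fn local_index_error(index: u32) -> String {
    format!("unknown local {index}: local index out of bounds")
}

fn check_local_index(num_locals: u32, index: u32) -> Result<(), String> {
    if index < num_locals {
        Ok(())
    } else {
        Err(local_index_error(index))
    }
}
```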
One improvement that has shown promise is skipping initialization checking for locals entirely. This check was first added in the function-references proposal, and later proposals like gc rely on it as well. It yields roughly a 5% gain but requires completely removing the check, which means the only way to achieve it would be to add compile-time features to wasmparser.
Overall my guess is that the wasm language has just gotten more complicated over time, which has slowed down the validator. I don't know of a great way to handle this other than to add a bunch of Cargo compile-time features for each wasm proposal. That is not something I think would be easy to design, but I think it would provide the wins necessary here.
Thank you for the investigation @alexcrichton !
If your assessment is correct, I have to admit it is a bit worrisome to me that the Wasm spec seems to be developing in a direction that may not be suitable for all of its original users and stakeholders.
I remember that you said that `function-references` was going to incur a performance overhead, and in hindsight I was somewhat optimistic that this penalty would surely be dealt with over time.
My assumption of how Wasm features are intended to work was that they should (ideally) not significantly influence performance when they are unused. So only if I actually used the `function-references` proposal in a significant way in my Wasm blob would I "suffer" the performance penalties associated with its additional complexity. Though I totally understand that it is probably already hard enough to keep up with all the Wasm developments.
Conditional compilation wouldn't solve the problem from my point of view, since eventually all Wasm runtimes (including Wasmi) want to adopt all stabilized features anyway. The Wasm standard is not modularized at the moment and thus does not allow runtimes to be picky.
I just ran `flamegraph` on the `translate/tiny_keccak/checked/lazy-translation` test case:
`main`: [flamegraph]
PR: [flamegraph]
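(For reproducibility: such profiles can be generated with cargo-flamegraph; the exact invocation below is an assumption about the Wasmi bench setup, and criterion may additionally need a `--profile-time` argument.)
cargo flamegraph --bench benches -- translate/tiny_keccak/checked/lazy-translation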
From what it looks like, I can confirm your assessment that multiple parts of the `wasmparser` validation pipeline seem to have regressed in performance. However, the especially notable parts (from the looks of it) are:
- `Validator::type_section`
- `ModuleParser::process_end`
- `VisitOperator::visit_local_get`
- `VisitOperator::visit_end`
> One improvement that has shown promise is skipping initialization checking for locals entirely.
I forgot to ask last time: What exactly does this mean? What is initialization checking of locals and where can I find the code?
Looking just at `translate_operators` we can also see some significant shifts:
`main`: [flamegraph]
PR: [flamegraph]
In my last performance assessment I made a mistake by not also enabling `no-hash-maps` for the `main` branch when conducting benchmarks between `main` and the PR, which made the results look better for the PR than they actually are. The corrected results look even worse:
We see a performance regression between 15-45%. Still, `unchecked` workloads without Wasm validation are not regressed.
I would be hesitant to generalize from wasm-tools's implementation to spec development as a whole or to other engines; we could simply be getting bitten by our own implementation decisions. This could be wasmparser-specific and not affect other implementations, I'm not sure. When I've benchmarked historically, for example, I believe that v8's validator (via `node`) has consistently been faster than wasmparser's. Basically I don't think wasmparser was ever at the peak of validation performance.
I also saw similar hot spots as you locally, but I was unable to figure out how best to optimize them and local attempts weren't bearing much fruit.
The initialization checking looks like this. That's an array lookup plus a boolean test of what's in the array. My guess is that everything is a well-predicted branch and in-cache since accesses typically hit the same locals. The roughly 5% perf win I got was from removing those checks entirely. That's not spec-compliant so we can't land it, but it was the best I could do.
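In code terms, the check being described has roughly this shape (a simplified sketch, not the actual wasmparser source):

```rust
// Sketch: per-local "is initialized" flags consulted on every local.get.
// In the real validator only non-defaultable locals can ever start uninitialized.
fn check_local_init(inits: &[bool], local_index: usize) -> Result<(), &'static str> {
    // An array lookup plus a boolean test.
    match inits.get(local_index).copied() {
        Some(true) => Ok(()),
        Some(false) => Err("uninitialized non-defaultable local"),
        None => Err("local index out of bounds"),
    }
}
```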
You are right that this might very well be just an implementation detail that can be fixed.
Thanks a lot for the link to the `local.get` checks, I will have a look at the codebase to see whether it is possible to apply further optimizations without removing those checks. E.g. I could imagine that we could use in-place bitfields for up to 32 (or 64) locals (the common case) instead of a `Vec<bool>`, etc. Maybe there are more things that could be improved. I will experiment with this a bit and come back.
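To make that idea concrete, a minimal sketch of such a small-size-optimized bitset could look like this (a hypothetical type, not existing wasmparser code):

```rust
// Sketch: inline bitset for up to 64 locals, falling back to a heap
// allocation only for functions with more locals (the uncommon case).
enum InitMask {
    Small(u64),
    Large(Vec<u64>),
}

impl InitMask {
    fn new(num_locals: usize) -> Self {
        if num_locals <= 64 {
            InitMask::Small(0)
        } else {
            InitMask::Large(vec![0; (num_locals + 63) / 64])
        }
    }

    fn set(&mut self, index: usize) {
        match self {
            InitMask::Small(bits) => *bits |= 1u64 << index,
            InitMask::Large(words) => words[index / 64] |= 1u64 << (index % 64),
        }
    }

    fn get(&self, index: usize) -> bool {
        match self {
            InitMask::Small(bits) => (*bits >> index) & 1 != 0,
            InitMask::Large(words) => (words[index / 64] >> (index % 64)) & 1 != 0,
        }
    }
}
```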
If it helps, here is the implementation in SpiderMonkey [1]. We do an upfront check for whether the local index is less than the index of the first non-nullable local and skip any boolean checking after that. Then we consult a bitmap. For setting a local, we also skip any consultation of the bitmap in the case where the local has already been set [2].
So for applications that are not using any GC types, the only cost on local.get/set is the upfront check against the first non-nullable local index (which will always fail).
[1] https://searchfox.org/mozilla-central/rev/dd421ae14997e3bebb9c08634633a4a3e3edeffc/js/src/wasm/WasmOpIter.h#347-352 [2] https://searchfox.org/mozilla-central/rev/dd421ae14997e3bebb9c08634633a4a3e3edeffc/js/src/wasm/WasmOpIter.h#2283-2285
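A hedged Rust sketch of that scheme (names and layout are illustrative, not SpiderMonkey's actual code):

```rust
// Sketch: locals below `first_non_nullable` never need initialization
// tracking; only locals at or above it consult a bitmap. When a function
// has no non-nullable locals, `first_non_nullable == num_locals`, so the
// fast path is the only cost on local.get/set.
struct LocalInitTracking {
    first_non_nullable: u32,
    init_bitmap: Vec<u64>,
}

impl LocalInitTracking {
    fn is_initialized(&self, index: u32) -> bool {
        if index < self.first_non_nullable {
            // Fast path: defaultable locals are always initialized.
            return true;
        }
        let bit = index - self.first_non_nullable;
        (self.init_bitmap[(bit / 64) as usize] >> (bit % 64)) & 1 != 0
    }

    fn mark_initialized(&mut self, index: u32) {
        if index >= self.first_non_nullable {
            let bit = index - self.first_non_nullable;
            self.init_bitmap[(bit / 64) as usize] |= 1u64 << (bit % 64);
        }
    }
}
```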
Thanks so much for sharing this information! That's invaluable. :)
Indeed thanks @eqrion! I suspect that both SpiderMonkey (and possibly v8 too) would be good sources of inspiration for algorithmic changes to the validator in wasmparser.
Around locals specifically, another thing that wasmparser does is that it doesn't maintain a `Vec<ValType>` for locals; instead it has a scheme where the first N locals are stored in a `Vec<ValType>` but the remaining ones are stored in a `Vec<(u32, ValType)>` for a more compressed representation, like wasm's own encoding. This representation is probably too-clever-by-half and probably isn't really benefiting all that much. Simplifying that to just a `Vec<ValType>` would probably help clean up the code a bit and remove a branch here and there.
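A rough sketch of what that two-tier lookup looks like (illustrative field names and semantics, not the actual wasmparser code):

```rust
#[derive(Copy, Clone)]
enum ValType { I32, I64, F32, F64, V128, Ref }

// Sketch: the first locals are stored flat, the remainder as
// (exclusive cumulative end index, type) runs mirroring the compressed
// wasm encoding of the locals declaration.
struct Locals {
    flat: Vec<ValType>,
    grouped: Vec<(u32, ValType)>,
    num_locals: u32,
}

impl Locals {
    fn get(&self, index: u32) -> Option<ValType> {
        if let Some(ty) = self.flat.get(index as usize) {
            return Some(*ty); // common case: one bounds check, one load
        }
        if index >= self.num_locals {
            return None;
        }
        // Uncommon case: binary search the compressed runs.
        let run = self.grouped.partition_point(|&(end, _)| end <= index);
        self.grouped.get(run).map(|&(_, ty)| ty)
    }
}
```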
Another possibility: there is currently zero `unsafe` code in the validator, which, while a great property, is not something I personally consider set in stone forever. For example, using the idea from SpiderMonkey for checking local initialization, the first index of a non-nullable local could be set to the number of locals in a function if they're all nullable. That way, after that "bounds check", the access into the actual types array could be done in an unchecked fashion, leaving just a single condition in the validation of `local.get`. This could also very well be "too clever by half", and introducing `unsafe` code is definitely something we'd want to do carefully.
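Purely as an illustration of that idea (a sketch, not something wasmparser does today):

```rust
#[derive(Copy, Clone)]
enum ValType { I32, I64, F32, F64, V128, Ref }

/// Sketch only: when every local is nullable/defaultable,
/// `first_non_nullable` equals `types.len()`, so the initialization
/// fast-path check doubles as the bounds check and the type lookup
/// can be unchecked, giving a single branch on the local.get path.
fn local_get(types: &[ValType], first_non_nullable: usize, index: usize) -> Option<ValType> {
    debug_assert!(first_non_nullable <= types.len());
    if index < first_non_nullable {
        // SAFETY: `first_non_nullable <= types.len()`, hence `index < types.len()`.
        Some(unsafe { *types.get_unchecked(index) })
    } else {
        // Slow path: out-of-bounds index, or a local whose initialization
        // still has to be tracked; fall back to fully checked logic.
        types.get(index).copied()
    }
}
```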
I updated the Wasmi PR to upgrade `wasmparser` to `v0.218.0`, improved and extended the translation benchmarks, and reran them to get a better understanding of where the performance regressions are located and could be fixed.
Despite the performance regressions I am considering merging the PR to finally make use of the newer `wasmparser` versions, but I am keen on fixing the performance regressions, especially for the important `lazy` modes.
Link: https://github.com/wasmi-labs/wasmi/pull/1141#issuecomment-2395462647
FWIW I'm poking around in https://github.com/bytecodealliance/wasm-tools/pull/1845 about slimming down wasmparser optionally at compile-time and one thing I was thinking of was adding a GC-like feature which might help recover some performance here. I'm not 100% sure it'll work though nor if it's worth it, but figured it's worth poking around at least.
Oh, thanks a lot for the information! I like #1845 a lot; even if it won't improve performance, reducing compile times is already a huge gain. :)
I opened https://github.com/bytecodealliance/wasm-tools/pull/1870 for minor gains in the `code` section validation.
However, as the Wasmi benchmarks suggest, most of the performance regressions are in the Wasm validation parts outside the `code` section. I have to dig deeper to see which parts of the Wasm validator are the main offenders. @alexcrichton any clues?
@alexcrichton I reran `flamegraph` on current Wasmi `main` compared to the PR that updates `wasmparser`, in order to find the most regressed part of `wasmparser` since `v0.100.0`.
It is the `wasmparser::Validator::type_section` call that regressed by over 130%, thus requiring more than double the time compared to `v0.100.0` (the `wasmparser-nostd` fork).
From what I can see, the following sub-parts take the most time in the benchmarks:
- `TypeList::intern_canonical_rec_group` (2.48%)
- `<RecGroup as FromReader>::from_reader` (2.15%)
- `<RecGroup as Ord>::cmp` (1.06%)
- `InternRecGroup::check_subtype` (0.88%)
All of these seem to be related to the `gc` Wasm proposal.
In comparison, the sub-parts that take the most time for `wasmparser v0.100.0` are:
- `<FuncType as FromReader>::from_reader` (1.24%)
- `Module::add_type` (0.82%)
It seems that the (computationally expensive) rec-group canonicalization and internment is always performed, even for type sections that do not have recursive definitions - which is the overwhelmingly common case, especially for non-GC Wasm. If this is true, I think we could optimize this by only doing the canonicalization when necessary.
I.e. use an efficient way to register types that also detects (potentially) recursive references within them, and only fall back to the expensive rec-group canonicalization and internment when needed. If a detection of potentially recursive references cannot be implemented easily, we could also simply use the `WasmFeatures::GC` (or `WasmFeatures::GC_TYPES`) flag.
This way users only pay for what they use.
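To sketch what such a detection might look like (the types and the helper field here are illustrative stand-ins, not wasmparser's actual API):

```rust
// Toy model: a type definition that records which other type indices it
// refers to (function signatures, struct/array field types, supertypes, ...).
struct SubType {
    referenced_type_indices: Vec<u32>,
}

/// A rec group can only be recursive if it has more than one member or if
/// a member refers back into the group itself (i.e. to an index at or past
/// `group_start`). Singleton groups without such references behave like
/// plain, pre-GC type definitions and could skip the expensive
/// canonicalization/internment path.
fn needs_canonicalization(group_start: u32, group: &[SubType]) -> bool {
    group.len() > 1
        || group.iter().any(|ty| {
            ty.referenced_type_indices
                .iter()
                .any(|&idx| idx >= group_start)
        })
}
```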
Thanks for measuring this @Robbepop! That intuitively makes sense to me and we historically have spent most of our time looking at optimizing the operator validator rather than the general module validator.
Would you be able to extract an example module or two that showcases this slowdown? For example a module that only has a type section should be sufficient to showcase the slowdown and additionally track improvements.
(sorry still recovering from jetlag so responding to the rest of your comment now too)
> I think we could optimize this by only doing this canonicalization when necessary
Makes sense to me! I'd prefer to lean on "detect a recursive type" if possible and only use `WasmFeatures` as a last resort as well, but I agree that the ultimate decision maker there will be benchmarks one way or another.
> Would you be able to extract an example module or two that showcases this slowdown? For example a module that only has a type section should be sufficient to showcase the slowdown and additionally track improvements.
There already is a `lots-of-types.wasm` benchmark test case in the `wasmparser` crate. I took a look at it and it consists of tons of empty `(func)` types, which probably does not put much of a burden on the type section validation. I would rename this test case to `lots-of-empty-fn-types.wasm` and add another one with lots of unique function types, named `lots-of-unique-fn-types.wasm`.
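One way such an input could be generated (a sketch; it emits WAT text that would still have to be converted to `.wasm`, e.g. with `wat2wasm`, and the exact shape of the benchmark case is of course up for discussion):

```rust
// Sketch: generate a module with `count` structurally distinct function
// types by encoding the index into the parameter list (base-4 digits over
// the four numeric value types), so every type is unique.
fn generate_unique_fn_types(count: usize) -> String {
    let params = ["i32", "i64", "f32", "f64"];
    let mut wat = String::from("(module\n");
    for i in 0..count {
        let mut ty = String::from("  (type (func");
        let mut n = i;
        loop {
            ty.push_str(&format!(" (param {})", params[n % params.len()]));
            n /= params.len();
            if n == 0 {
                break;
            }
        }
        ty.push_str("))\n");
        wat.push_str(&ty);
    }
    wat.push_str(")\n");
    wat
}

fn main() {
    // Write the output to a .wat file and convert it with wat2wasm.
    println!("{}", generate_unique_fn_types(10_000));
}
```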
> Makes sense to me! I'd prefer to lean on "detect a recursive type" if possible and only use WasmFeatures as a last resort as well, but I agree that the ultimate decision maker there will be benchmarks one way or another.
I too think it is better not to use `WasmFeatures` if possible. Another idea I had was to make internment and canonicalization a lazy process and only ever do whatever is computationally expensive when it is actually needed for a particular type. In contrast to the `rec`-type detection idea, this kind of turns the eager validation into a lazy one. Not sure if it is possible, but I think this would be an even cleaner design.
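A toy sketch of what "lazy canonicalization" could mean in code (all names here are illustrative placeholders, not wasmparser types):

```rust
use std::cell::OnceCell;

// Toy model: keep the raw decoded type around and defer the expensive
// canonicalization until something actually asks for the canonical form
// (e.g. a subtype check or a call_indirect type check).
struct RawType(Vec<u8>);
struct CanonicalType(u64);

struct LazyType {
    raw: RawType,
    canonical: OnceCell<CanonicalType>,
}

impl LazyType {
    fn canonical(&self) -> &CanonicalType {
        // Computed at most once, and only for types that are actually used.
        self.canonical.get_or_init(|| canonicalize(&self.raw))
    }
}

fn canonicalize(raw: &RawType) -> CanonicalType {
    // Placeholder for the expensive rec-group canonicalization/internment.
    CanonicalType(raw.0.iter().map(|&b| b as u64).sum())
}
```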
All that sounds reasonable to me yeah. I haven't dug into why this is so much slower than before myself so I can't speak too much as to whether it seems like the best route forward, but I'm happy to defer to you as well.
So far I have only dug into the `<RecGroup as FromReader>::from_reader` regressions. I tried to lightly modify the code, e.g. with annotations such as `#[inline]` and `#[cold]`, extracting functions, and putting common cases in front, but without any major performance wins, while the code became much uglier, so I decided not to open a PR.
Due to technical reasons I had to use the (by now) ancient `wasmparser 0.100.2` in Wasmi and was using a fork called `wasmparser-nostd`. The fork had no significant impact on performance.
Recently I updated Wasmi to use the newest `wasmparser v0.214.0` and unfortunately saw massive performance regressions, especially for Wasm validation. Many Wasmi benchmarks regressed by 15-35% and some even by 45% compared to the Wasmi that uses the ancient `wasmparser v0.100.2`.
The Wasmi upgrade PR is here: https://github.com/wasmi-labs/wasmi/pull/1141
Maybe I made some mistakes when upgrading which explain the performance regressions.
Since Wasmi is very focused on speedy translation times, those performance regressions are unfortunately a showstopper for using the new `wasmparser` versions.