Open plusvic opened 3 days ago
cc @orlp @Voultapher (authors of the new sort implementation)
Fascinating, thanks for tracking this down!
At least at a surface level, I'm not super-inclined to rewrite part of the register allocator based on a performance bug in `std`; that feels like something that should be fixed in `std`. Sorting of already almost-sorted data, or of two concatenated sorted subsequences, should be fast (and hopefully will return to being fast in the future, at which point we'd want to throw away whatever ad-hoc, less-optimized custom workaround we write). I'd encourage you to report the bug upstream, though, if not already, and I'd be interested to hear what they think.
Hi, thanks @plusvic for raising the issue. Looking at the std library documentation:
> It is typically faster than stable sorting, except in a few special cases, e.g., when the slice is partially sorted.

If your input is nearly sorted, or all you want to do is merging, `slice::sort` is the much better fit, as explained in the documentation; it's a hybrid merge and quicksort, so it can efficiently handle nearly sorted sub-slices.
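To make the suggested pattern concrete, here is a minimal sketch of "append a sorted batch, then re-sort with the stable sort". The `Entry` struct is a hypothetical stand-in for regalloc2's range entries, not the actual type:

```rust
// Hypothetical stand-in for regalloc2's range entries.
#[derive(Debug, Clone, PartialEq)]
struct Entry {
    from: u32,
}

// Append an already-sorted batch to an already-sorted list, then re-sort.
// The stable sort (driftsort) detects the two pre-sorted runs and merges
// them instead of degrading to random-pattern performance.
fn append_and_sort(dst: &mut Vec<Entry>, src: &[Entry]) {
    dst.extend_from_slice(src);
    dst.sort_by_key(|e| e.from);
}

fn main() {
    let mut dst: Vec<Entry> = (0u32..8).step_by(2).map(|i| Entry { from: i }).collect();
    let src: Vec<Entry> = (1u32..8).step_by(2).map(|i| Entry { from: i }).collect();
    append_and_sort(&mut dst, &src);
    let keys: Vec<u32> = dst.iter().map(|e| e.from).collect();
    println!("{:?}", keys); // [0, 1, 2, 3, 4, 5, 6, 7]
}
```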
To expand a little more, @orlp and I are aware of the append + sort use case and did include it in our design and benchmarking, under the pattern name `random_s95` (95% sorted + 5% unsorted), in both the unstable sort design document and the stable sort design document.
The previous implementation of `slice::sort_unstable` did a partial insertion sort of up to 5 elements. So if the appended slice is at most 5 elements, you get a pretty fast run-time, which is what I suspect happened here. However, this approach has a couple of issues: it creates a rather arbitrary performance cliff. If you append 6 instead of 5 elements, you suddenly get random-pattern performance, plus you've wasted 5 near-full scans of the input, as seen when looking at the number of comparisons performed:
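One way to observe this kind of behavior is to count comparator invocations directly. This is an illustrative sketch only (not the benchmark harness from the design documents); absolute counts depend on the std version and input size:

```rust
use std::cell::Cell;

// Count comparator invocations to observe how a sort handles a
// "sorted prefix + small unsorted tail" input.
fn count_comparisons(mut v: Vec<u32>) -> (Vec<u32>, usize) {
    let calls = Cell::new(0usize);
    v.sort_unstable_by(|a, b| {
        calls.set(calls.get() + 1);
        a.cmp(b)
    });
    (v, calls.get())
}

fn main() {
    let mut input: Vec<u32> = (0..1000).collect(); // sorted prefix
    input.extend([3, 1, 4, 1, 5, 9]);              // small unsorted tail
    let (sorted, n) = count_comparisons(input);
    assert!(sorted.windows(2).all(|w| w[0] <= w[1]));
    println!("comparisons: {}", n); // count varies by std version
}
```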
The robust and generic fix is run detection + lazy merging, exactly what driftsort does, which is the current implementation of `slice::sort`.
For questions about methodology, design choices, and evaluation, please go read the design documents.
Thanks for the info @Voultapher. I definitely respect that you and @orlp spent a ton of time on this update. It's unfortunate that it has edge-cases where performance regresses; I've been in your position where one is trying to make a cross-ecosystem update so I definitely understand the frustration of seeing these regressions.
Unfortunately I don't have any cycles to spend on this right now; it was an unexpected and unplanned performance regression (from my point of view), and I have no way of reproducing it (the above Wasm is a "schema" but not an actual file to test). @plusvic since you were the reporter, would you be willing to take your regressing use-cases and try the fix suggested above (`sort` instead of `sort_unstable`) with a local patch of regalloc2? I'd be very happy to review a PR that makes this update if you find that it helps, and then we can cut releases and propagate it through to Cranelift and Wasmtime.
Thanks!
I can confirm that by using `sort_by_key` instead of `sort_unstable_by_key`, performance improves a lot in rustc 1.81. However, rustc 1.80 keeps being faster. These are the results I got:
Rust 1.81, `sort_unstable_by_key`:

```
[2024-12-02T08:46:47Z INFO yara_x::compiler] WASM module build time: 29.76301438s
```

Rust 1.81, `sort_by_key`:

```
[2024-12-02T08:41:44Z INFO yara_x::compiler] WASM module build time: 6.737162719s
```

Rust 1.80, `sort_unstable_by_key`:

```
[2024-12-02T08:51:17Z INFO yara_x::compiler] WASM module build time: 2.684071474s
```

Rust 1.80, `sort_by_key`:

```
[2024-12-02T08:54:38Z INFO yara_x::compiler] WASM module build time: 4.614773787s
```
In summary, for this particular case `sort_by_key` is faster than `sort_unstable_by_key` in rustc 1.81 by a factor of 4-5x. However, in rustc 1.80 it's the other way around: `sort_by_key` is slower. And rustc 1.80 still beats rustc 1.81.
I also decided to try a different solution, so I replaced this code:
```rust
self.bundles[to].ranges.extend_from_slice(&from_list[..]);
self.bundles[to]
    .ranges
    .sort_by_key(|entry| entry.range.from);
```
With this...
```rust
for entry in from_list.into_iter() {
    let pos = self.bundles[to]
        .ranges
        .binary_search_by_key(&entry.range.from, |entry| entry.range.from)
        .unwrap_or_else(|pos| pos);
    self.bundles[to].ranges.insert(pos, entry)
}
```
This ended up being 2x faster than the existing implementation, and performance doesn't vary with the rustc version.
[2024-12-02T09:30:42Z INFO yara_x::compiler] WASM module build time: 1.088564278s
Of course this is a very specific case, in which we have a very long sorted slice and we are adding only a handful of new items to it. I don't know if this is the most common case; probably not.
As my knowledge about the `regalloc2` crate is close to zero, my opinion should be taken with a grain of salt, but I would say that the most robust solution here is implementing the mergesort-like merge mentioned in the comment. It may be slower than the current "append and re-sort" solution in particular cases, but in the general case it's guaranteed to depend linearly on the total number of items, something the current method can't guarantee.
If it turns out that in most cases the `merge_bundles` function is merging a relatively short slice into a larger one, then the "binary search and insert" method I proposed above is also a good alternative.
@plusvic Would it be possible to log and post the distribution of `(left.len(), right.len())`?
I'm afraid that your proposed alternative implementation is not a correct merge, unless there are other specific properties of the ranges I'm not aware of. Consider the following input:
```
1, 3, 5, 7
2, 4, 6, 8
```
That said, I do think that if this is so performance-critical, a proper merge algorithm instead of concatenating + sorting is the play.
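A minimal sketch of such a linear-time, two-pointer merge, assuming plain `u32` keys rather than regalloc2's actual range entries:

```rust
// Merge two sorted vectors in O(a.len() + b.len()), as an alternative
// to concatenating and re-sorting. Assumes both inputs are sorted.
fn merge_sorted(a: Vec<u32>, b: Vec<u32>) -> Vec<u32> {
    let mut out = Vec::with_capacity(a.len() + b.len());
    let (mut i, mut j) = (0, 0);
    while i < a.len() && j < b.len() {
        if a[i] <= b[j] {
            out.push(a[i]);
            i += 1;
        } else {
            out.push(b[j]);
            j += 1;
        }
    }
    // One of these slices is empty; append whatever remains.
    out.extend_from_slice(&a[i..]);
    out.extend_from_slice(&b[j..]);
    out
}

fn main() {
    let merged = merge_sorted(vec![1, 3, 5, 7], vec![2, 4, 6, 8]);
    println!("{:?}", merged); // [1, 2, 3, 4, 5, 6, 7, 8]
}
```

Taking `a[i]` on ties (`<=`) keeps the merge stable with respect to the first input.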
@plusvic I find it surprising that Rust 1.80 `sort_by_key` is ~1.45x faster than the Rust 1.81 version of it. The relevant code for such a pattern stayed almost the same: run detection stayed very similar, and merge only saw small changes; none of them should be able to account for such a big difference. Could you please share a repro repository that contains everything, including documentation on how to run it to observe the same changes you got? I'd like to dig into it.

In the meantime, feel free to try out just using the merge function from the stdlib link.
This is what I get while logging the length of both slices. The first one is `self.bundles[to].ranges`, the second one is `from_list`.
```
1,1
1,1
1,1
1,1
1,2
1,1
1,2
1,5
1,1
2,1
3,6
3,1
4,1
5,1
1,1
2,1
6,3
1,1
2,1
3,1
4,1
5,1
... a lot of lines removed here, all of them were N, 1 where N goes from 6 to 49998.
49999,1
50000,1
9,1
9,1
10,1
11,1
12,1
13,1
14,1
```
While compiling less extreme code I got this:
```
1,1
1,1
1,2
1,1
1,1
1,1
1,2
1,1
1,2
1,1
1,2
1,1
1,2
2,2
1,4
1,5
1,1
1,1
1,1
1,6
2,2
1,4
1,5
1,6
7,2
1,9
1,1
7,10
1,17
1,1
1,18
1,19
5,2
1,7
1,8
1,9
10,2
1,12
1,1
1,13
1,1
2,1
3,1
20,14
4,1
5,1
6,1
7,1
2,1
3,1
4,1
5,1
6,1
7,1
1,1
2,1
3,1
34,1
35,1
36,1
8,1
37,1
38,1
39,1
8,1
2,1
3,1
4,1
5,1
6,1
7,1
2,1
3,1
4,1
5,1
6,1
7,1
1,1
2,1
3,1
4,1
40,1
41,1
42,1
8,1
43,1
44,1
45,1
8,1
2,1
1,1
2,1
3,1
4,1
3,1
1,1
2,1
3,1
4,1
4,1
1,1
2,1
3,1
4,1
5,1
6,1
7,1
9,7
2,1
3,1
1,1
2,1
3,1
4,1
1,1
2,1
3,1
5,1
1,1
2,1
3,1
4,1
5,1
2,1
3,1
1,1
2,1
3,1
4,1
5,1
4,1
1,1
2,1
3,1
5,1
1,1
2,1
3,1
46,1
47,1
48,1
```
> I'm afraid that your proposed alternative implementation is not a correct merge, unless there are other specific properties of the ranges I'm not aware of. Consider the following input:
>
> ```
> 1, 3, 5, 7
> 2, 4, 6, 8
> ```
My proposed alternative is not really a merge (at least not an efficient one), but the result should be correct. It simply iterates over the second vector looking for the index in the first vector where each item should be inserted, and inserts it there (which implies displacing items in the first list to make room for the new item). The key to this implementation is that, when an item is not found in the first list, `binary_search_by_key` returns an error containing the position where that item would be located if it existed. Whether or not the item already exists, you always get a position where you can insert it so that the list remains sorted. This is an example with the input you suggested:
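A sketch of that behavior on plain integers (using `binary_search` on `u32` values instead of `binary_search_by_key` on range entries):

```rust
// Binary-search-and-insert: for each new item, binary_search returns
// Ok(pos) if the item exists or Err(pos) with the insertion point that
// keeps the list sorted; either way, inserting at pos is correct.
fn insert_all(dst: &mut Vec<u32>, src: Vec<u32>) {
    for item in src {
        let pos = dst.binary_search(&item).unwrap_or_else(|pos| pos);
        dst.insert(pos, item); // O(n) shift per insert, so O(n*m) overall
    }
}

fn main() {
    let mut dst = vec![1, 3, 5, 7];
    insert_all(&mut dst, vec![2, 4, 6, 8]);
    println!("{:?}", dst); // [1, 2, 3, 4, 5, 6, 7, 8]
}
```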
@Voultapher, you can find the project at https://github.com/VirusTotal/yara-x

I recommend compiling it with the `logging` feature: `cargo build --release --features=logging`. You'll get a command-line tool at `./target/release/yr`. Run the tool as:

```
RUST_LOG=info ./target/release/yr scan big_rule.yar README.md
```

This runs the `scan` command with two files. The second one (README.md) is not relevant in this case; you can use any file (the tool expects some file, so I use my own README.md). The `big_rule.yar` file is the important one: it contains what you need to produce the WASM code that triggers this edge case. I'm attaching the `big_rule.yar` file in a ZIP.
Let me know if you need anything else.
> My proposed alternative is not really a merge (at least not an efficient merge), but the result should be correct. It simply iterates the second vector looking for the index in the first vector where each item should be inserted, and does it (which implies displacing items in the first list to make room for the new item).
@plusvic Yes, that's a really inefficient merge :) I missed that you iterated over each item.
rustc 1.81 completely replaced the implementation of `slice::sort_unstable_by_key`. These changes are mentioned in the release notes, and more information can be found in the corresponding PR: https://github.com/rust-lang/rust/pull/124032/.

The new sorting algorithm was supposed to be faster in the general case, but in some specific cases it looks like it is way slower, and this has impacted the `wasmtime` crate via `regalloc2`. I've experienced a 10x slowdown when compiling some WASM modules using `wasmtime` with rustc 1.81 (more details can be found in: https://github.com/VirusTotal/yara-x/issues/209).

I tracked down the issue to the `merge_bundles` function, which uses `sort_unstable_by_key`. According to my investigation, `sort_unstable_by_key` is being called with slices that are mostly sorted. Apparently, two already-sorted slices are being concatenated and then sorted again. This is confirmed by the following comment in the code of `merge_bundles`:

This pattern, which worked well with the previous implementation of `sort_unstable_by_key`, seems to be pathologically bad for the new one. The assumption that concatenating the two lists and sorting them again is faster than merging the two sorted lists likely no longer holds true.

If you are curious about which WASM code is triggering this issue, it is an implementation of the "switch" statement in WASM using multiple nested blocks and a `br_table` instruction that jumps out of the block corresponding to the input value. The overall schema is shown below.