bytecodealliance / regalloc2

A new register allocator
Apache License 2.0

Performance regression with rustc 1.81 #203

Open plusvic opened 3 days ago

plusvic commented 3 days ago

rustc 1.81 completely replaced the implementation of slice::sort_unstable_by_key. These changes are mentioned in the release notes, and more information can be found in the corresponding PR: https://github.com/rust-lang/rust/pull/124032/.

The new sorting algorithm was supposed to be faster in the general case, but in some specific cases it looks like it is way slower, and this has impacted the wasmtime crate via regalloc2. I've experienced a 10x slowdown when compiling some WASM modules using wasmtime with rustc 1.81 (more details can be found in: https://github.com/VirusTotal/yara-x/issues/209)

I tracked the issue down to the merge_bundles function, which uses sort_unstable_by_key. According to my investigation, sort_unstable_by_key is being called with slices that are mostly sorted: two already-sorted slices are being concatenated and then sorted again. This is confirmed by the following comment in the code of merge_bundles:

// Two non-empty lists of LiveRanges: concatenate and
// sort. This is faster than a mergesort-like merge into a new
// list, empirically.

This pattern, which worked well with the previous implementation of sort_unstable_by_key, seems to be pathologically bad for the new one. The assumption that concatenating the two lists and sorting them again is faster than merging the two sorted lists likely no longer holds true.
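The pattern in question can be sketched as follows (the `u32` element type and function name are illustrative, not regalloc2's actual types, where the sort key is `entry.range.from`):

```rust
// Illustrative sketch of the "concatenate and re-sort" pattern described
// above. Both `dst` and `src` are individually sorted on entry; the whole
// concatenation is then re-sorted with the unstable sort.
fn merge_by_resort(dst: &mut Vec<u32>, src: &[u32]) {
    dst.extend_from_slice(src);
    dst.sort_unstable();
}

fn main() {
    let mut dst = vec![1, 3, 5, 7];
    merge_by_resort(&mut dst, &[2, 4, 6, 8]);
    assert_eq!(dst, vec![1, 2, 3, 4, 5, 6, 7, 8]);
}
```

Whether this is fast depends entirely on how well the sort implementation exploits the pre-existing sorted runs, which is the crux of this issue.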

If you are curious about which WASM code is triggering this issue, it is an implementation of the "switch" statement in WASM using multiple nested blocks and a br_table instruction that jumps out of the block that corresponds to the input value. The overall schema is shown below.

(func $switch_example (param $x i32)
  (block $default  ;; Default branch
    (block $case2   ;; Case 2
      (block $case1 ;; Case 1
        (block $case0 ;; Case 0
          ;; Use br_table to jump based on $x
          (local.get $x)    ;; Push x onto the stack
          (br_table $case0 $case1 $case2 $default) ;; Jump to the right case
        )
        ;; Code for Case 0
        ;; Add your case 0 logic here
        (br $default) ;; Continue to the end
      )
      ;; Code for Case 1
      ;; Add your case 1 logic here
      (br $default) ;; Continue to the end
    )
    ;; Code for Case 2
    ;; Add your case 2 logic here
    (br $default) ;; Continue to the end
  )
  ;; Default case code
)
Amanieu commented 1 day ago

cc @orlp @Voultapher (authors of the new sort implementation)

cfallin commented 1 day ago

Fascinating, thanks for tracking this down!

At least at a surface level, I'm not super-inclined to rewrite part of the register allocator based on a performance bug in std; that feels like something that should be fixed in std. Sorting of already almost-sorted data, or of two concatenated sorted subsequences, should be fast (and hopefully will return to being fast in the future, at which point we'd want to throw away whatever ad-hoc, less-optimized custom workaround we write). I'd encourage you to report the bug upstream, though, if you haven't already, and I'd be interested to hear what they think.

Voultapher commented 1 day ago

Hi, thanks @plusvic for raising the issue. Looking at the std library documentation:

It is typically faster than stable sorting, except in a few special cases, e.g., when the slice is partially sorted.

If your input is nearly sorted, or all you want to do is merge, slice::sort is the much better fit, as explained in the documentation: it's a hybrid merge and quicksort, so it can efficiently handle nearly sorted sub-slices.

Voultapher commented 1 day ago

To expand a little more, @orlp and I are aware of the append + sort use case and did include it in our design and benchmarking, under the pattern name random_s95 (95% sorted + 5% unsorted); see the unstable sort design document and the stable sort design document.

The previous implementation of slice::sort_unstable did a partial insertion sort of up to 5 elements, so if the appended slice is at most 5 elements you get a pretty fast run-time; I suspect that is what happened here. However, this approach has a couple of issues: it creates a rather arbitrary performance cliff. If you append 6 instead of 5 elements you suddenly get random-pattern performance, plus you wasted 5 near-full scans of the input, as seen when looking at the number of comparisons performed:

[image: chart comparing the number of comparisons performed]

The robust and generic fix is to run detection + lazy merging, exactly what driftsort does, which is the current implementation of slice::sort.


For questions about methodology, design choices, and evaluation, please read the design documents.

cfallin commented 16 hours ago

Thanks for the info @Voultapher. I definitely respect that you and @orlp spent a ton of time on this update. It's unfortunate that it has edge-cases where performance regresses; I've been in your position where one is trying to make a cross-ecosystem update so I definitely understand the frustration of seeing these regressions.

Unfortunately I don't have any cycles to spend on this right now; it was an unexpected and unplanned performance regression (from my point of view), and I have no way of reproducing it (the above Wasm is a "schema" but not an actual file to test). @plusvic, since you were the reporter, would you be willing to take your regressing use-cases and try the fix suggested above (sort_by_key instead of sort_unstable_by_key) with a local patch of regalloc2? I'd be very happy to review a PR that makes this update if you find that it helps, and then we can cut releases and propagate it through to Cranelift and Wasmtime.

Thanks!

plusvic commented 5 hours ago

I can confirm that by using sort_by_key instead of sort_unstable_by_key, performance improves a lot in rustc 1.81. However, rustc 1.80 keeps being faster. These are the results I got:

Rust 1.81 sort_unstable_by_key
[2024-12-02T08:46:47Z INFO  yara_x::compiler] WASM module build time: 29.76301438s

Rust 1.81  sort_by_key
[2024-12-02T08:41:44Z INFO  yara_x::compiler] WASM module build time: 6.737162719s

Rust 1.80 sort_unstable_by_key
[2024-12-02T08:51:17Z INFO  yara_x::compiler] WASM module build time: 2.684071474s

Rust 1.80 sort_by_key
[2024-12-02T08:54:38Z INFO  yara_x::compiler] WASM module build time: 4.614773787s

In summary, for this particular case sort_by_key is faster than sort_unstable_by_key in rustc 1.81 by a factor of 4-5x. However, in rustc 1.80 it is the other way around: sort_by_key is slower. rustc 1.80 still beats rustc 1.81.

I also decided to give a try to a different solution, so I replaced this code:

self.bundles[to].ranges.extend_from_slice(&from_list[..]);
self.bundles[to]
    .ranges
    .sort_by_key(|entry| entry.range.from);

With this...

for entry in from_list.into_iter() {
    let pos = self.bundles[to]
        .ranges
        .binary_search_by_key(&entry.range.from, |entry| entry.range.from)
        .unwrap_or_else(|pos| pos);
    self.bundles[to].ranges.insert(pos, entry)
}

This ended up being 2x faster than the existing implementation, and performance doesn't vary with the rustc version.

[2024-12-02T09:30:42Z INFO  yara_x::compiler] WASM module build time: 1.088564278s

Of course this is a very specific case, in which we have a very long sorted slice and we are adding only a handful of new items to it. I don't know if this is the most common case; probably not.

As my knowledge of the regalloc2 crate is close to zero, my opinion should be taken with a grain of salt, but I would say that the most robust solution here is implementing the mergesort-like merge that is mentioned in the comment. It may be slower than the current "append and re-sort" solution in particular cases, but in the general case it's guaranteed to depend linearly on the total number of items, something the current method can't guarantee.

If it turns out that in most cases the merge_bundles function is trying to merge a relatively short slice into a larger one, then the "binary search and insert" method I proposed above is also a good alternative.
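The mergesort-like merge mentioned above could look roughly like this (a minimal sketch, with the element type simplified to `u32` rather than regalloc2's actual range entries, which would compare on `entry.range.from`):

```rust
// Hedged sketch of a classic two-pointer merge: merge the sorted slice
// `from_list` into the sorted vector `ranges` in O(n + m) time, at the
// cost of one temporary allocation.
fn merge_sorted(ranges: &mut Vec<u32>, from_list: &[u32]) {
    let mut merged = Vec::with_capacity(ranges.len() + from_list.len());
    let (mut i, mut j) = (0, 0);
    // Repeatedly take the smaller head element from either input.
    while i < ranges.len() && j < from_list.len() {
        if ranges[i] <= from_list[j] {
            merged.push(ranges[i]);
            i += 1;
        } else {
            merged.push(from_list[j]);
            j += 1;
        }
    }
    // At most one of these extends is non-empty.
    merged.extend_from_slice(&ranges[i..]);
    merged.extend_from_slice(&from_list[j..]);
    *ranges = merged;
}

fn main() {
    let mut ranges = vec![1, 3, 5, 7];
    merge_sorted(&mut ranges, &[2, 4, 6, 8]);
    assert_eq!(ranges, vec![1, 2, 3, 4, 5, 6, 7, 8]);
}
```

Unlike the binary-search-and-insert approach, this does the same amount of work regardless of how the two inputs interleave.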

orlp commented 4 hours ago

@plusvic Would it be possible to log and post the distribution of (left.len(), right.len())?

I'm afraid that your proposed alternative implementation is not a correct merge, unless there are other specific properties of the ranges I'm not aware of. Consider the following input:

1, 3, 5, 7             2, 4, 6, 8

That said, I do think if this is so performance critical a proper merge algorithm instead of concatenating + sorting is the play.

Voultapher commented 3 hours ago

@plusvic I find it surprising that Rust 1.80 sort_by_key is ~1.45x faster than the Rust 1.81 version of it. The relevant code for such a pattern stayed almost the same: run detection stayed very similar and merging only saw small changes, and none of them should account for such a big difference. Could you please share a repro repository that contains everything, including documentation on how to run it to observe the same results you got? I'd like to dig into it.

In the meantime, feel free to try just using the merge function from the stdlib (link).

plusvic commented 3 hours ago

This is what I get while logging the length of both slices. The first one is self.bundles[to].ranges, the second one is from_list.

1,1
1,1
1,1
1,1
1,2
1,1
1,2
1,5
1,1
2,1
3,6
3,1
4,1
5,1
1,1
2,1
6,3
1,1
2,1
3,1
4,1
5,1

... a lot of lines removed here, all of them were N, 1 where N goes from 6 to 49998.

49999,1
50000,1
9,1
9,1
10,1
11,1
12,1
13,1
14,1

While compiling less extreme code I got this:

1,1
1,1
1,2
1,1
1,1
1,1
1,2
1,1
1,2
1,1
1,2
1,1
1,2
2,2
1,4
1,5
1,1
1,1
1,1
1,6
2,2
1,4
1,5
1,6
7,2
1,9
1,1
7,10
1,17
1,1
1,18
1,19
5,2
1,7
1,8
1,9
10,2
1,12
1,1
1,13
1,1
2,1
3,1
20,14
4,1
5,1
6,1
7,1
2,1
3,1
4,1
5,1
6,1
7,1
1,1
2,1
3,1
34,1
35,1
36,1
8,1
37,1
38,1
39,1
8,1
2,1
3,1
4,1
5,1
6,1
7,1
2,1
3,1
4,1
5,1
6,1
7,1
1,1
2,1
3,1
4,1
40,1
41,1
42,1
8,1
43,1
44,1
45,1
8,1
2,1
1,1
2,1
3,1
4,1
3,1
1,1
2,1
3,1
4,1
4,1
1,1
2,1
3,1
4,1
5,1
6,1
7,1
9,7
2,1
3,1
1,1
2,1
3,1
4,1
1,1
2,1
3,1
5,1
1,1
2,1
3,1
4,1
5,1
2,1
3,1
1,1
2,1
3,1
4,1
5,1
4,1
1,1
2,1
3,1
5,1
1,1
2,1
3,1
46,1
47,1
48,1

I'm afraid that your proposed alternative implementation is not a correct merge, unless there are other specific properties of the ranges I'm not aware of. Consider the following input:

1, 3, 5, 7 2, 4, 6, 8

My proposed alternative is not really a merge (at least not an efficient one), but the result should be correct. It simply iterates over the second vector, looks for the index in the first vector where each item should be inserted, and inserts it there (which implies displacing items in the first list to make room for the new item). The key to this implementation is that, when an item is not found in the first list, binary_search_by_key returns an error containing the position where that item would be located if it existed. Whether or not the item existed, you always get a position where you can insert it and the list remains sorted. This is an example with the input you suggested:

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=e3af50bcb2d8e8d1205d72e05ccbedd3
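For reference, a self-contained sketch of what that playground link demonstrates (the element type is simplified to plain integers and plain binary_search, rather than the binary_search_by_key call used in the actual patch):

```rust
// Insert every element of `second` into the sorted vector `first`,
// keeping `first` sorted. binary_search returns Err(pos) with the
// insertion point when the item is not present, so unwrap_or_else
// yields a valid index in both the found and not-found cases.
fn insert_all_sorted(first: &mut Vec<u32>, second: &[u32]) {
    for &item in second {
        let pos = first.binary_search(&item).unwrap_or_else(|pos| pos);
        first.insert(pos, item);
    }
}

fn main() {
    // The fully-interleaved input orlp suggested above.
    let mut first = vec![1, 3, 5, 7];
    insert_all_sorted(&mut first, &[2, 4, 6, 8]);
    assert_eq!(first, vec![1, 2, 3, 4, 5, 6, 7, 8]);
}
```

Note that each insert shifts the tail of `first`, so this is fast only when `second` is small relative to `first`, which matches the length distributions logged above.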

plusvic commented 2 hours ago

@Voultapher, you can find the project at https://github.com/VirusTotal/yara-x

I recommend compiling it with the logging feature: cargo build --release --features=logging. You'll get a command-line tool at ./target/release/yr. Run the tool as:

RUST_LOG=info ./target/release/yr scan big_rule.yar README.md

This runs the scan command with two files, the second one (README.md) is not relevant in this case, you can use any file (but the tool expects some file, so I use my own README.md file). The big_rule.yar file is the important one, it contains what you need to produce the WASM code that triggers this edge case. I'm attaching the big_rule.yar file in a ZIP.

big_rule.yar.zip

Let me know if you need anything else.

orlp commented 2 hours ago

My proposed alternative is not really a merge (at least not an efficient merge), but the result should be correct. It simply iterates the second vector looking for the index in the first vector where each item should be inserted, and does it (which implies displacing items in the first list to make room for the new item).

@plusvic Yes, that's a really inefficient merge :) I missed that you iterated over each item.