Changing an immediate operand from 0 to 1 makes suspicion performance decreasing

hungryzzz commented 1 month ago

Test Cases

Steps to Reproduce

Hi, I run the attached two cases(good.wasm&bad.wasm) in Wasmtime and WasmEdge(AOT), and collect their execution time respectively (measured by time tool).

# command to collect execution time of wasmtime
wasmtime compile bad.wasm -o bad.cwasm
time wasmtime run --allow-precompiled bad.cwasm

# command to collect execution time of wasmedge
wasmedgec bad.wasm bad-wasmedge-aot.wasm
time wasmedge bad-wasmedge-aot.wasm

Expected Results & Actual Results

For good.wasm, the execution time in different runtimes are as follows:

Wasmtime: 1.42s
WasmEdge: 1.05s

For bad.wasm, the execution time in different runtimes are as follows:

Wasmtime: 3.29s
WasmEdge: 0.91s

The difference between the attached two cases is as follow, i.e., changing one of the operand of i32.and from 0 to 1. The difference and bring 1.8s performance decreasing on Wasmtime but has no negative effect on WasmEdge.

➜  cases diff good.wat bad.wat
35c35
<             i32.const 0
---
>             i32.const 1

;; part of good.wat
  if
    i32.const 1
    local.set 2
    i32.const 1
    local.get 0
    i32.const 0 ;; here
    i32.and
    i32.add
    local.set 6
    br 1

;; part of bad.wat
if 
  i32.const 1
  local.set 2
  i32.const 1
  local.get 0
  i32.const 1 ;; here
  i32.and
  i32.add
  local.set 6
  br 1

I check the machine code generated by Wasmtime and WasmEdge respectively, the instruction numbers of func2(the difference is in it) are as follows:

WasmEdge
- good.wasm: 89
- bad.wasm: 92
- diff: 3
Wasmtime
- good.wasm: 132
- bad.wasm: 157
- diff: 25

I think maybe the difference affect the data flow analysis during optimization so maybe some optimization decisions are different , but I really confuse why the instructions number of generated machine code could change so much, and I think maybe the extra instructions in bad.wasm make it slower than the good one. I also use Perf tool to profile the hotspots in Wasmtime but I cannot find something useful.

Versions and Environment

Wasmtime version or commit: 7f7064c74
Operating system: Linux ringzzz-OptiPlex-Micro-Plus-7010 6.5.0-18-generic
Architecture: Intel(R) Core(TM) i5-13500

cfallin commented 1 month ago

Hi @hungryzzz -- while we appreciate these reports, they seem to be generated by some infrastructure and "mass-produced", so to speak, so to help us out, it may be good to do a little bit more work. In particular: for each performance-delta report (including this one), would you be able to reduce the report to assembly, show us the difference in generated assembly, and write a microbenchmark in assembly that isolates the performance effect and shows it to be valid (as I mentioned in a previous comment on a very similar issue)?

hungryzzz commented 1 month ago

Hi @cfallin, I write a microbenchmark in assembly based on the generated machine code of good.wasm and bad.wasm. And I also provide a comparison report show the differences of the machine code. The microbenchmark and report are attached in reports.zip. Thank you!

# The execution times of `asm_good` and `asm_bad` are as follows:
➜  binary-asm time ./asm_good
./asm_good  1.52s user 0.00s system 99% cpu 1.524 total
➜  binary-asm time ./asm_bad
./asm_bad  3.00s user 0.00s system 99% cpu 3.004 total

cfallin commented 1 month ago

Thanks @hungryzzz. Unfortunately there's still a nontrivial diff, I guess I had hoped for a little more investigation to point to exactly what the microarchitectural difference was; take the two inner loops, transform one into the other one step at a time, see where the cliff happens, that sort of thing. To be more direct: most folks here (at least those working on Cranelift full-time) don't have time to dig into a performance anomaly of several hundred lines of assembly with unclear prioritization (i.e., found via fuzzing rather than via real-use-case report) so for this effort to be successful you'll have to do a little more of the investigation yourself and point to what we can change. Sorry and thanks!

iximeow commented 1 month ago

@hungryzzz another point to keep in mind, the diff represented by report.html understates the true difference between asm_good and asm_bad: in addition to the additional instructions in asm_bad, many of the instructions that are mov in both are actually mov %reg, offset(%rsp) in asm_bad, where they were register-register movs on the right. i skimmed asm_good.c and asm_bad.c and the additional saves and loads from the stack seem to be the vast majority of the difference between these two.

incidentally, these run in about the same time on a Ryzen 9 3950X and i'm downright shocked at the IPC it managed. here's asm_bad.c with almost 5.5 IPC (!):

[11:44:28] # iximeow:~/reports> sudo perf stat -e cycles -e instructions -e L1-dcache-loads \
  -e L1-dcache-misses -e stalled-cycles-frontend -e stalled-cycles-backend \
  ./bad

 Performance counter stats for './bad':

    17,220,141,631      cycles                                                        (83.31%)
    94,551,730,488      instructions              #    5.49  insn per cycle         
                                                  #    0.05  stalled cycles per insn  (83.35%)
    12,890,324,509      L1-dcache-loads                                               (83.36%)
           270,501      L1-dcache-misses          #    0.00% of all L1-dcache hits    (83.36%)
           756,707      stalled-cycles-frontend   #    0.00% frontend cycles idle     (83.35%)
     4,303,679,448      stalled-cycles-backend    #   24.99% backend cycles idle      (83.27%)

       3.869444889 seconds time elapsed

       3.867886000 seconds user
       0.000000000 seconds sys

vs asm_good.c which runs half as many instructions in almost the same total time:

[11:45:00] # iximeow:~/reports> sudo perf stat -e cycles -e instructions -e L1-dcache-loads -e L1-dcache-misses \
  -e stalled-cycles-frontend -e stalled-cycles-backend \
  ./good

 Performance counter stats for './good':

    17,212,385,161      cycles                                                        (83.29%)
    47,256,102,329      instructions              #    2.75  insn per cycle         
                                                  #    0.00  stalled cycles per insn  (83.29%)
         3,791,613      L1-dcache-loads                                               (83.30%)
           134,600      L1-dcache-misses          #    3.55% of all L1-dcache hits    (83.40%)
           645,951      stalled-cycles-frontend   #    0.00% frontend cycles idle     (83.39%)
         8,859,964      stalled-cycles-backend    #    0.05% backend cycles idle      (83.32%)

       3.975132228 seconds time elapsed

       3.969715000 seconds user
       0.003997000 seconds sys

i suspect an iterative transformation from one loop to the other probably won't be super illuminating: something like "movs to and from the stack are slow on some machines, and as you add more it gets more slow". but @hungryzzz another approach that would certainly help here is if you can reduce the input wasm program down - if the program were 30 instructions and switching the constant from 0 to 1 still produces significantly different codegen, that's much easier to investigate.

hungryzzz commented 1 month ago

Hi, when I try to delete part of the code in the above bad case, I find many small deletion cannot replay this bug, so I think I cannot reduce the bad case to a shorter one.

But I find another interesting thing during the reduction. I comment out the following part of the code as this branch must be triggered. However, execution times are quite different after this deletion.

bad.wasm: 3.33s
after deletion (called good.wasm in the attached file): 1.80s

And still I find the generated machine codes are quite difference which are attached. So I guess maybe some analysis during compilation are wrong, resulting the suboptimal code generation. But actually I don't know how to inspect further for this. If you need more investigation, I am willing to do that under your suggestions.

block
    ;; i32.const 1
    ;; if 
      local.get 6
      local.set 5
      ;; br 1
    ;; end
end

new.zip

cfallin commented 1 month ago

Ah, if making that assignment from local 6 to local 5 unconditional causes the performance delta to disappear, likely what is happening here is that you're seeing the lack of constant-branch folding in Cranelift. That's not a "wrong analysis", that's just an optimization we haven't implemented yet (and it's somewhat nontrivial to do in our egraph mid-end, though we have ideas). It's entirely possible that that could cascade into other optimizations (GVN, ...?) applying or not applying, causing more register pressure and the spills you're seeing.

hungryzzz commented 1 month ago

So the only way to validate the guessing is to implement the constant-branch folding in Cranelift, right? But how to explain the performance difference shown at first?

cfallin commented 1 month ago

The direct cause of the perf delta seems to be the additional spills, as @iximeow pointed out.

hungryzzz commented 1 month ago

I got it, thank you! But I still cannot understand the two cases shown at first would have such difference machine code...so I continue to do some changes in the bad.wasm and I get another strange case.

This time I change a value which will be set to the global variable [1] from -65537 to 65536, and their executiom times are also quite different.

bad.wasm: 3.26s
after change (called good.wasm in the attached file): 1.80s

➜  cases diff bad.wat good.wat
124c124
<             i32.const -65537
---
>             i32.const 65536

;; part of bad.wat
i32.const -65537
global.set 1

;; part of good.wat
i32.const 65536
global.set 1

I have some confusions and findings about these cases:

Originally, I thought maybe the changed control flows bring additional spills in bad.wasm. However, the global variable[1] will not be used in program (I don't find global.get 1), so I think they should have nearly the same generated code.
I try to change -65537 to another negtive numbers, such as -1, -2, and the execution times are the same as the bad one. However, once I change it to some positive numbers, such as 65536, 100, and the executiom times are good. So I also try to change another number in the program to some negtive number, and the execution time is bad.
I run these two cases in wasmer which also uses Cranelift as their compiler (they use old version one, 0.91.1), the executime times of wasmer before and after the changes are the same.
- bad.wasm(wasmer): 2.70s
- after change(wasmer): 2.69s

Therefore, I still think there is something wrong during the compilation of bad.wasm. But still, I don't know how to do further investigation. The attached files are the new good.wasm and the report.html.

new.zip

hungryzzz commented 1 month ago

@cfallin Hi, I go though the generated machine codes and I think I know something about the performance difference.

I find that if the value which will be set to global variable[1] is a negtive number, then the compiler will use two instructions to represent the global.set, i.e., first moving $0xfffeffff to a register %r10d then moving the register to the specified memory address. And if the value is a positive number, then the compiler will use only one instruction to do it, i.e., directly moving the $0x10000 to the specified address. Because the first one needs one more register, additional spills are needed. By the way, I check the machine code generated by WasmEdge, it use like movl $0xfffeffff,0xa0(%rdi) to execute the bad.wasm.

# part of the machine code of bad.wasm
104 |  mov $0xfffeffff,%r10d
105 |  mov %r10d,0xa0(%rdi)

# part of the machine code of good.wasm
95 |  movl $0x10000,0xa0(%rdi)

After the above all investigation, I guess the root cause of all the performance differences is the different register allocation algorithm used by Wasmtime and others. I think maybe slightly changes on the bad.wasm will cause more requirements for register, then resutling in the additional spills to the stack.

cfallin commented 1 month ago

Ah, interesting, there may be something odd about the way we lower stores-of-immediates in the x64 backend that precludes this case -- @abrown do you think you'd be willing to take a look at this? (If true it should be reproducible with a small CLIF test that stores a constant 0x10000 or 0xfffeffff to some given address)

abrown commented 1 month ago

Sure, I can put this on the TODO list but no promises about when I get to it!

hungryzzz commented 4 weeks ago

@alexcrichton Hi, I pulled your PR and found it can fix the above case, thank you! But could I ask why positive and negative numbers are stored in different ways in memory before this PR? And how do you fix it? (I tried to understand these when I went through the PR but failed...) Thanks again!

cfallin commented 4 weeks ago

@hungryzzz the issue was a missed case in the instruction lowering. x86-64 has a store instruction (a form of mov) that can store arbitrary 32-bit values to the destination; however, we weren't matching signed-negative 32-bit values in our existing lowering rules, so we fell back to the slower path of "put the value in a register then store it". Alex's PR fixed the rule match (left-hand side) to catch that case as well.

hungryzzz commented 4 weeks ago

Got it! Thanks a lot!

hungryzzz commented 4 weeks ago

Hi, I am also wondering why only the one register requirement leads to so many additional spills? If requiring more registers, would it get worse? Thank you very much! (I really hope that my questions would not spend too much of your time...)

cfallin commented 4 weeks ago

@hungryzzz we're happy to help with questions like these, but the Bytecode Alliance Zulip is probably easier for such discussions (that are really about how compilers work rather than resolving this particular issue).

With Alex's fix in #8842 are we ok to close this issue or do you see any more performance deltas?

hungryzzz commented 3 weeks ago

The performance anomaly has been fixed, thanks a lot!

bytecodealliance / wasmtime