bitdefender / bddisasm

bddisasm is a fast, lightweight, x86/x64 instruction decoder. The project also features a fast, basic, x86/x64 instruction emulator, designed specifically to detect shellcode-like behavior.
Apache License 2.0

Optimizations #96

Closed. turol closed this 2 weeks ago

turol commented 1 month ago

A few cleanups, a benchmark script and some optimizations.

Disassembling a ~60MB binary on my Intel i7-12700H laptop:

    N           Min           Max        Median           Avg        Stddev
x  20         1.111         1.136         1.118        1.1192  0.0059701009
+  20         1.093         1.115           1.1       1.10075  0.0058478516
Difference at 95.0% confidence
    -0.01845 +/- 0.00378221
    -1.6485% +/- 0.337939%
    (Student's t, pooled s = 0.00590929)
turol commented 1 month ago

Triggers UBSAN failures. You should probably have that in your CI...

ianichitei commented 1 month ago

Yes, the GitHub CI does not run any tests. It is on my TODO list, but I haven't had the time to add them yet. Is UBSAN/ASAN triggered only with these changes?

I experimented with this and I'm not sure I'm running it right, or that I'm interpreting the results the right way.

For example, comparing a disasmtool compiled from 8b4cc482b96d9dbb539b5b69d35753bfbab77743 with the one from this PR, I get:

$ ./benchmark.sh ./old ./new in.bin 10
x before.txt
+ after.txt
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|                       x              +                                                            +                                                                                           |
|x       x              x      x       +   x          x                      +                  x   *   +  x       +                             +               +                             +|
|         |________________________________M_____A________|_____________________________|_______________M__A________________________________________________|                                   |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  10         1.406         1.434         1.417        1.4186   0.010297788
+  10         1.416         1.456         1.433        1.4339   0.012931014
Difference at 95.0% confidence
        0.0153 +/- 0.0109827
        1.07853% +/- 0.774195%
        (Student's t, pooled s = 0.0116888)

Doesn't this mean that the new version is actually slower?

Compiled for release with gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, running on Ubuntu 22.04.4 LTS (via WSL2 on Windows 11 23H2), with a 13th Gen Intel(R) Core(TM) i7-1370P CPU.

turol commented 1 month ago

UBSAN was triggered by these changes; the previous code is clean. The branchless sign extension is not OK when the input is already 64-bit.
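
For illustration, the failure mode looks roughly like this (a minimal sketch of a mask-based branchless sign extension, not the exact PR code):

#include <stdint.h>

/* Illustrative branchless sign extension: extend the low `sz` bytes of `val`
 * to 64 bits by OR-ing in the high-order bits whenever the sign bit is set. */
static inline uint64_t sign_ex(uint32_t sz, uint64_t val)
{
    uint64_t msb  = (val >> (sz * 8 - 1)) & 1;  /* shift count out of range when sz == 0 */
    uint64_t high = ~0ULL << (sz * 8);          /* shift by 64 bits when sz == 8 */
    return val | (high * msb);                  /* branchless select: multiply by 0 or 1 */
}

Both out-of-range shifts are undefined behavior in C, which is exactly what UBSAN flags.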

That output means that it is indeed slower. I was using GCC 13 on Debian testing running on real hardware. It could also still be a measurement artifact. I was using 20 samples.

turol commented 1 month ago

What is the size of your in.bin?

Do you get the expected null result if old and new are identical binaries?

On an AMD 5950x with Gentoo, GCC 13, and the same ~60 MB binary, with 100 samples.

Two identical copies of the old binary:

    N           Min           Max        Median           Avg        Stddev
x 100         1.421         1.547         1.465       1.46747   0.020762464
+ 100         1.422         1.506         1.462       1.46093   0.020405117
Difference at 95.0% confidence
        -0.00654 +/- 0.00570575
        -0.445665% +/- 0.388815%
        (Student's t, pooled s = 0.0205846)

Old vs. new code:

    N           Min           Max        Median           Avg        Stddev
x 100         1.427         1.574         1.493       1.48831   0.023063678
+ 100         1.386         1.524         1.424       1.42498   0.021953109
Difference at 95.0% confidence
        -0.06333 +/- 0.00624091
        -4.25516% +/- 0.419328%
        (Student's t, pooled s = 0.0225152)

The performance characteristics of modern systems are just completely f***** up. Maybe I should configure it for 99% confidence and always run more samples, but that takes forever.

ianichitei commented 1 month ago

I used a 58M file I obtained by concatenating a bunch of shellcodes together.

If I use the same binary for both, I get timings that are much closer:

$ ./benchmark.sh ./old ./old in.bin 10
    N           Min           Max        Median           Avg        Stddev
x  10         1.456         1.518         1.502        1.4941   0.021855332
+  10         1.452         1.527         1.502        1.4937    0.02150478
No difference proven at 95.0% confidence

But this is just a small sample. These kinds of changes are small enough that, if we want to measure them accurately, we probably need to build some micro-benchmarks.
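
A decode-only micro-benchmark could look roughly like the sketch below (a rough outline; it assumes the NdDecodeEx/INSTRUX decoding interface and omits buffer loading, error handling, and the alternating-run harness):

#include <stdio.h>
#include <time.h>
#include "bddisasm.h"

/* Time nothing but the decode loop: linear-sweep the buffer and measure only
 * the NdDecodeEx calls, keeping file I/O and process startup out of the
 * measured region. Adjust the decode call if the actual API differs. */
static double decode_seconds(const ND_UINT8 *buf, ND_SIZET size)
{
    INSTRUX ix;
    ND_SIZET offset = 0;
    clock_t begin = clock();
    while (offset < size)
    {
        NDSTATUS status = NdDecodeEx(&ix, buf + offset, size - offset,
                                     ND_CODE_64, ND_DATA_64);
        /* Advance by the instruction length on success, or by one byte on
         * failure, the way a linear disassembly sweep does. */
        offset += ND_SUCCESS(status) ? ix.Length : 1;
    }
    return (double)(clock() - begin) / CLOCKS_PER_SEC;
}

Alternating such a loop between the old and new builds, and feeding the per-run numbers to ministat, would keep the rest of the methodology the same.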

turol commented 1 month ago

I managed to fix the UBSAN failures. It turns out that sometimes ND_SIGN_EX is called with sz == 0 ...

turol commented 4 weeks ago

What do I need to do to get this merged?

vlutas commented 3 weeks ago

Hello,

We'll go through the PR soon, and we will let you know if changes are needed.

Thanks for the contribution!

vlutas commented 3 weeks ago

First of all, looking at your benchmark, I see that you used the nv option to skip the textual disassembly, which is good. disasmtool also includes an option to run an internal benchmark, which measures only how long it takes to decode the instructions - check out the iv option.

Secondly, the way you run the benchmark, the measurement will be skewed by many different factors, such as file I/O, interrupts that might be generated, etc., because you measure the runtime of the entire executable. It is much better to micro-benchmark the exact event you want to measure - in this case, only the instruction decoding step (this is what the iv option does).

Finally, I measured the old code base (the current one) and the new code base (your modifications), and the results show that your changes make decoding slightly slower (4.7% slower for regular code, 5.2% slower for random data). The test methodology involved decoding two files, each 1 GB in size: one containing regular x64 code, and one containing random data:

Regular code, 1G file:
    Old disasm: Disassembled 304551208 instructions in 31194ms, 9763134.1925 instructions/second, 247.169054 clocks/instruction, average ilen 3.525704 bytes
    New disasm: Disassembled 304551208 instructions in 32545ms, 9357849.3778 instructions/second, 258.988099 clocks/instruction, average ilen 3.525704 bytes
Random bytes, 1G file:
    Old disasm: Disassembled 415555063 instructions in 45722ms, 9088733.2794 instructions/second, 267.193588 clocks/instruction, average ilen 2.822585 bytes
    New disasm: Disassembled 415555063 instructions in 47967ms, 8663353.2012 instructions/second, 281.134721 clocks/instruction, average ilen 2.822585 bytes

Of course, being ~5% slower is not necessarily bad by itself, but the point is that I fail to see any consistent decoding performance improvement with the current changes.

turol commented 3 weeks ago

I/O effects should have been eliminated by the pre-runs before the actual benchmark loop. Interrupt effects can't be eliminated without radically changing the system configuration. If the OS wants to punt our program from the CPU to do something else, it will. We can only minimize the effect by running more samples.

I replaced the use of time with the instructions/second figure from the -iv output, so now a higher value is better. However, I still get the same result.

You appear to be using clock() to measure time without dividing by CLOCKS_PER_SEC (https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/clock?view=msvc-170), so some of my numbers differ from yours by a factor of 1000.
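
For reference, the conversion I mean is the following (a minimal sketch; names are illustrative):

#include <stdio.h>
#include <time.h>

int main(void)
{
    clock_t begin = clock();
    /* ... the work being measured ... */
    clock_t end = clock();
    /* clock() returns ticks, not milliseconds; dividing by CLOCKS_PER_SEC
     * gives seconds. CLOCKS_PER_SEC is 1000000 on POSIX but 1000 with MSVC,
     * so printing raw ticks as "ms" gives values that differ by orders of
     * magnitude between platforms. */
    printf("%.3f s\n", (double)(end - begin) / CLOCKS_PER_SEC);
    return 0;
}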

These numbers are from my Intel i7-12700H laptop, but I get broadly similar results on an AMD 5950x.

For a real executable (.text section only, 32MB):

Disassembled 19791808 instructions in 1156758ms, 17109.7222 instructions/second, 141.391124 clocks/instruction, average ilen 3.393740 bytes
Disassembled 19791808 instructions in 1094370ms, 18085.1156 instructions/second, 132.392333 clocks/instruction, average ilen 3.393740 bytes

    N           Min           Max        Median           Avg        Stddev
x  30     16955.914     17147.389     17066.273     17059.275     50.281244
+  30     17805.978     18094.127      18026.59     18007.061     63.651212
Difference at 95.0% confidence
    947.786 +/- 29.6487
    5.55584% +/- 0.173798%
    (Student's t, pooled s = 57.3571)

And 32MB of randomness:

Disassembled 12987011 instructions in 856135ms, 15169.3495 instructions/second, 161.957771 clocks/instruction, average ilen 2.822512 bytes
Disassembled 12987011 instructions in 809213ms, 16048.9401 instructions/second, 152.054722 clocks/instruction, average ilen 2.822512 bytes

    N           Min           Max        Median           Avg        Stddev
x  30     14934.328      15169.35     15072.905      15062.49     58.431712
+  30      15013.99      16048.94     15972.651      15932.42     181.52074
Difference at 95.0% confidence
    869.93 +/- 69.7011
    5.77547% +/- 0.462746%
    (Student's t, pooled s = 134.841)

There is something very, very strange going on in your setup. Are you also using WSL2? Can you try on a real, non-virtualized setup?

Are you only running each benchmark once? The reason I use multiple runs is to:

  1. Get some proper statistics for ministat
  2. Alternate between the two programs to reduce the effect of clock speed ramp-up or thermal throttling
  3. Reduce the effect of virtual memory layout
vlutas commented 3 weeks ago

There is nothing strange about my setup - it's bare-metal Windows. What is different, though, is that you're running the tests on Linux (WSL), while I'm running them on Windows. Ergo, it probably boils down to how the compiler decides to generate & arrange the code (GCC vs. MSVC), as you're reporting an improvement, whereas I see a performance downgrade. We may both be right... :) However, like I said in my previous comment, I currently fail to see a consistent performance improvement with these changes. I will investigate how it behaves in WSL as well, and I will go through the code changes to see what should be kept and what should be modified in order to get a consistent performance improvement. Thank you again for your contribution!

turol commented 3 weeks ago

Updated. New numbers:

Disassembled 19791808 instructions in 1112486ms, 17790.6131 instructions/second, 135.348606 clocks/instruction, average ilen 3.393740 bytes
Disassembled 19791808 instructions in 1068668ms, 18520.0717 instructions/second, 128.996045 clocks/instruction, average ilen 3.393740 bytes

    N           Min           Max        Median           Avg        Stddev
x  30     17568.559     17824.131     17771.013     17764.312     49.819621
+  30     18105.614     18600.397      18529.66     18490.219      103.5881
Difference at 95.0% confidence
    725.907 +/- 42.0142
    4.08632% +/- 0.236509%
    (Student's t, pooled s = 81.2788)
turol commented 3 weeks ago

And I'm running Linux on real hardware, no virtualization. @ianichitei was using WSL2, which I suspected might be screwing up the timings.

turol commented 3 weeks ago

ND_MSB needs a cast to ND_UINT64 or UBSAN fails at https://github.com/bitdefender/bddisasm/blob/8b4cc482b96d9dbb539b5b69d35753bfbab77743/bddisasm/bdx86_decoder.c#L1419

because an ND_UINT32 is shifted by 63 bits.

But with the cast there's a warning at https://github.com/bitdefender/bddisasm/blob/8b4cc482b96d9dbb539b5b69d35753bfbab77743/bdshemu/bdshemu_x86.c#L2860

What would be the best way to fix this?

vlutas commented 3 weeks ago

> ND_MSB needs a cast to ND_UINT64 or UBSAN fails at
>
> https://github.com/bitdefender/bddisasm/blob/8b4cc482b96d9dbb539b5b69d35753bfbab77743/bddisasm/bdx86_decoder.c#L1419
>
> because an ND_UINT32 is shifted by 63 bits.
>
> But with the cast there's a warning at
>
> https://github.com/bitdefender/bddisasm/blob/8b4cc482b96d9dbb539b5b69d35753bfbab77743/bdshemu/bdshemu_x86.c#L2860
>
> What would be the best way to fix this?

Hello! The problem there is that the displacement sign extension takes place even if there is no displacement (the DispLength is 0). This is the case here and here. With the new macro, you should make sure to skip sign-extending the displacement if there is none, since a 0-length displacement results in undefined behavior:

Operand->Info.Memory.Disp = Instrux->HasDisp ? ND_SIGN_EX(Instrux->DispLength, Instrux->Displacement) : 0;

ND_MSB can therefore remain simple, with no need for type casts:

#define ND_MSB(sz, x) ((((x)) >> (((sz) * 8) - 1)) & 1)

This also ensures we won't break anyone's code.
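
In a self-contained form, the guard looks roughly like this (stand-in macro names and a toy main, not the actual bddisasm definitions):

#include <stdint.h>
#include <stdio.h>

/* Stand-ins that mirror the shape of the macros discussed above. */
#define MSB(sz, x)      ((((x)) >> (((sz) * 8) - 1)) & 1)
#define SIGN_EX(sz, x)  ((sz) < 8 && MSB(sz, x) ? ((x) | (~0ULL << ((sz) * 8))) : (x))

int main(void)
{
    uint8_t  disp_length  = 1;      /* a 1-byte displacement of 0x80 ...      */
    uint64_t displacement = 0x80;   /* ... sign-extends to 0xFFFFFFFFFFFFFF80 */
    /* The guard: skip the sign extension entirely when no displacement is
     * present, so the macros are never invoked with a length of 0. */
    uint64_t disp = disp_length ? SIGN_EX(disp_length, displacement) : 0;
    printf("0x%016llX\n", (unsigned long long)disp);
    return 0;
}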

turol commented 3 weeks ago

Thanks, that did it.