Closed: turol closed this pull request 2 weeks ago
Triggers UBSAN failures. You should probably have that in your CI...
Yes, the GitHub CI does not run any tests. It is on my TODO list, but I haven't had the time to add it yet. Is UBSAN/ASAN triggered only with these changes?
I experimented with this and I'm not sure I'm running it right, or that I'm interpreting the results the right way.
For example, comparing a `disasmtool` compiled from 8b4cc482b96d9dbb539b5b69d35753bfbab77743 with the one from this PR, I get:
```
$ ./benchmark.sh ./old ./new in.bin 10
x before.txt
+ after.txt
(ministat distribution histogram omitted)
    N        Min        Max     Median        Avg       Stddev
x  10      1.406      1.434      1.417     1.4186  0.010297788
+  10      1.416      1.456      1.433     1.4339  0.012931014
Difference at 95.0% confidence
        0.0153 +/- 0.0109827
        1.07853% +/- 0.774195%
        (Student's t, pooled s = 0.0116888)
```
Doesn't this mean that the new version is actually slower?
Compiled for release with gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, running on Ubuntu 22.04.4 LTS (via WSL2 on Windows 11 23H2), with a 13th Gen Intel(R) Core(TM) i7-1370P CPU.
UBSAN was triggered with these changes, the previous code is clean. The branchless sign extension is not OK when the input is already 64-bit.
That output means that it is indeed slower. I was using GCC 13 on Debian testing running on real hardware. It could also still be a measurement artifact. I was using 20 samples.
What is the size of your `in.bin`?
Do you get the expected null result if `old` and `new` are identical binaries?
On an AMD 5950x with Gentoo, GCC 13 and the same ~60 MB binary with 100 samples.

Two identical copies of the old binary:

```
    N        Min        Max     Median        Avg       Stddev
x 100      1.421      1.547      1.465    1.46747  0.020762464
+ 100      1.422      1.506      1.462    1.46093  0.020405117
Difference at 95.0% confidence
        -0.00654 +/- 0.00570575
        -0.445665% +/- 0.388815%
        (Student's t, pooled s = 0.0205846)
```

Old vs. new code:

```
    N        Min        Max     Median        Avg       Stddev
x 100      1.427      1.574      1.493    1.48831  0.023063678
+ 100      1.386      1.524      1.424    1.42498  0.021953109
Difference at 95.0% confidence
        -0.06333 +/- 0.00624091
        -4.25516% +/- 0.419328%
        (Student's t, pooled s = 0.0225152)
```
The performance characteristics of modern systems are just completely f***** up. Maybe I should configure it for 99% confidence and always take more samples, but that takes forever to run.
I used a 58M file I obtained by concatenating a bunch of shellcodes together.
If I use the same binary, I get timings that are much closer:
```
$ ./benchmark.sh ./old ./old in.bin 10
    N        Min        Max     Median        Avg       Stddev
x  10      1.456      1.518      1.502     1.4941  0.021855332
+  10      1.452      1.527      1.502     1.4937   0.02150478
No difference proven at 95.0% confidence
```
But this is just a small sample. These kinds of changes are small enough that if we want to measure them accurately, we probably need to build some micro-benchmarks.
I managed to fix UBSAN. It turns out that sometimes `ND_SIGN_EX` is called with `sz == 0`...
What do I need to do to get this merged?
Hello,
We'll go through the PR soon, and we will let you know if changes are needed.
Thanks for the contribution!
First of all, looking at your benchmark, I see that you used the `nv` option to skip the textual disassembly, which is good. `disasmtool` also includes an option to run an internal benchmark, which measures how long it takes to decode the instructions - check out the `iv` option.
Secondly, the way you run the benchmark, the measurement will be skewed by a lot of different factors, such as file I/O, interrupts that might be generated, etc. - this is because you measure the runtime of the entire executable. It is much better to micro-benchmark the exact event you want to measure - in this case, only the instruction decoding step (this is what the `iv` option does).
Finally, I measured the old code base (the current one) and the new code base (your modifications), and it shows that your changes make the decoding slightly slower (4.7% slower for regular code, 5.2% slower for random data). Test methodology involved decoding 2 files, each 1G in size: one containing regular x64 code, and one containing random data:
Regular code, 1G file:

```
Old disasm: Disassembled 304551208 instructions in 31194ms, 9763134.1925 instructions/second, 247.169054 clocks/instruction, average ilen 3.525704 bytes
New disasm: Disassembled 304551208 instructions in 32545ms, 9357849.3778 instructions/second, 258.988099 clocks/instruction, average ilen 3.525704 bytes
```

Random bytes, 1G file:

```
Old disasm: Disassembled 415555063 instructions in 45722ms, 9088733.2794 instructions/second, 267.193588 clocks/instruction, average ilen 2.822585 bytes
New disasm: Disassembled 415555063 instructions in 47967ms, 8663353.2012 instructions/second, 281.134721 clocks/instruction, average ilen 2.822585 bytes
```
Of course, being ~5% slower is not necessarily bad by itself, but the point is that I fail to see any consistent decoding performance improvement with the current changes.
I/O effects should have been eliminated by the pre-runs before the actual benchmark loop. Interrupt effects can't be eliminated without radically changing the system configuration. If the OS wants to punt our program from the CPU to do something else it will. We can only minimize the effect by running more samples.
I replaced the use of `time` with the instructions/second figure from the `-iv` output, so now a higher value is better. However, I still get the same result.
You appear to be using `clock()` for measuring time without dividing by `CLOCKS_PER_SEC`:
https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/clock?view=msvc-170
So some of my numbers differ from yours by a factor of 1000.
These numbers are from my Intel i7-12700H laptop, but I get broadly similar results on an AMD 5950x.
For a real executable (.text section only, 32MB):
```
Disassembled 19791808 instructions in 1156758ms, 17109.7222 instructions/second, 141.391124 clocks/instruction, average ilen 3.393740 bytes
Disassembled 19791808 instructions in 1094370ms, 18085.1156 instructions/second, 132.392333 clocks/instruction, average ilen 3.393740 bytes
    N        Min        Max     Median        Avg     Stddev
x  30  16955.914  17147.389  17066.273  17059.275  50.281244
+  30  17805.978  18094.127   18026.59  18007.061  63.651212
Difference at 95.0% confidence
        947.786 +/- 29.6487
        5.55584% +/- 0.173798%
        (Student's t, pooled s = 57.3571)
```
And 32MB of randomness:
```
Disassembled 12987011 instructions in 856135ms, 15169.3495 instructions/second, 161.957771 clocks/instruction, average ilen 2.822512 bytes
Disassembled 12987011 instructions in 809213ms, 16048.9401 instructions/second, 152.054722 clocks/instruction, average ilen 2.822512 bytes
    N        Min        Max     Median        Avg     Stddev
x  30  14934.328   15169.35  15072.905   15062.49  58.431712
+  30   15013.99   16048.94  15972.651   15932.42  181.52074
Difference at 95.0% confidence
        869.93 +/- 69.7011
        5.77547% +/- 0.462746%
        (Student's t, pooled s = 134.841)
```
There is something very, very strange going on in your setup. Are you also using WSL2? Can you try on a real, non-virtualized setup?
Are you only running each benchmark once? The reason I use multiple runs is to…
There is nothing strange about my setup - it's bare metal Windows. What is different though is that you're running the tests on Linux (WSL), while I'm running them on Windows. Ergo, it probably boils down to how the compiler decides to generate & arrange the code (GCC vs. MSVC), as you're reporting improvement, whereas I have a downgrade of performance. We may both be right... :) However, like I said in my previous comment, I currently fail to see a consistent performance improvement with these changes. I will investigate how it behaves in WSL as well, and I will go through the code changes, to see what should be kept, and what should be modified in order to get a consistent performance improvement. Thank you again for your contribution!
Updated. New numbers:
```
Disassembled 19791808 instructions in 1112486ms, 17790.6131 instructions/second, 135.348606 clocks/instruction, average ilen 3.393740 bytes
Disassembled 19791808 instructions in 1068668ms, 18520.0717 instructions/second, 128.996045 clocks/instruction, average ilen 3.393740 bytes
    N        Min        Max     Median        Avg     Stddev
x  30  17568.559  17824.131  17771.013  17764.312  49.819621
+  30  18105.614  18600.397   18529.66  18490.219  103.5881
Difference at 95.0% confidence
        725.907 +/- 42.0142
        4.08632% +/- 0.236509%
        (Student's t, pooled s = 81.2788)
```
And I'm running Linux on real hardware, no virtualization. @ianichitei was using WSL2, which I suspected might be screwing up the timings.
`ND_MSB` needs a cast to `ND_UINT64`, or ubsan fails at
https://github.com/bitdefender/bddisasm/blob/8b4cc482b96d9dbb539b5b69d35753bfbab77743/bddisasm/bdx86_decoder.c#L1419
because an `ND_UINT32` is shifted by 63 bits.
But with the cast there's a warning at
https://github.com/bitdefender/bddisasm/blob/8b4cc482b96d9dbb539b5b69d35753bfbab77743/bdshemu/bdshemu_x86.c#L2860
What would be the best way to fix this?
Hello! The problem there is that the displacement sign extension takes place even if there is no displacement (the `DispLength` is 0). This is the case here and here. With the new macro, you should make sure to skip sign-extending the displacement if there is none, as doing so will result in undefined behavior due to the 0-length displacement: `Operand->Info.Memory.Disp = Instrux->HasDisp ? ND_SIGN_EX(Instrux->DispLength, Instrux->Displacement) : 0;`.
The `ND_MSB` macro can therefore remain simple: `#define ND_MSB(sz, x) ((((x)) >> (((sz) * 8) - 1)) & 1)`, with no need for type casts. This also ensures we won't break anyone's code.
Thanks, that did it.
A few cleanups, a benchmark script and some optimizations.
Disassembling a ~60MB binary on my Intel i7-12700H laptop: