Closed e13fter closed 2 years ago
The code looks a bit strange even in objdump, why are there two CALL $+5
instructions? These should not usually be present on x86_64. My only guess is that those instructions are modified by relocations and objdump shows you unmodified version which is meaningless, you can check this by passing -r
additionally to objdump (if I understood objdump's man page right).
I don't know why there are these instructions, looks strange. When disassembling with objdump -r
, they are still there, but seems like objdump
hints that the real entry point is the instruction next to this call $+5
:
0000000000000000 <write_info>:
0: e8 00 00 00 00 call 5 <write_info+0x5>
1: R_X86_64_PC32 __fentry__-0x4
5: 55 push rbp
6: 48 89 e5 mov rbp,rsp
9: 41 55 push r13
b: 41 54 push r12
d: 53 push rbx
e: 49 89 f4 mov r12,rsi
11: 48 89 d3 mov rbx,rdx
14: be c0 00 40 02 mov esi,0x24000c0
19: ba 34 05 00 00 mov edx,0x534
1e: 48 81 ec 08 01 00 00 sub rsp,0x108
25: 48 8b 3d 00 00 00 00 mov rdi,QWORD PTR [rip+0x0] # 2c <write_info+0x2c>
28: R_X86_64_PC32 kmalloc_caches+0x54
2c: 65 48 8b 04 25 28 00 mov rax,QWORD PTR gs:0x28
33: 00 00
35: 48 89 45 e0 mov QWORD PTR [rbp-0x20],rax
39: 31 c0 xor eax,eax
3b: e8 00 00 00 00 call 40 <write_info+0x40>
3c: R_X86_64_PC32 kmem_cache_alloc_trace-0x4
40: 48 81 fb ff 00 00 00 cmp rbx,0xff
Actually, the old RUB SysSec's kAFL works on my hardware and I've retrieved decoder logs from working RUB SysSec's kAFL and from not working IntelLabs' kAFL.
IntelLabs' one (kafl_vuln_test.ko
is located at 0xffffffffc02b2000-0xffffffffc02b6000):
PSB
MODE
PIP 3896b000
VMCS
PSBEND
MODE
PGE ffffffffc02b2000
PGD ffffffffc02b2000
PSB
MODE
MODE
PIP 3896b000
VMCS
PSBEND
PGE ffffffffc02b2000
PGD ffffffffc02b2000
disasm(ffffffffc02b2000,ffffffffc02b2000) TNT: 0
PGE ffffffffc02b2040
PGD ffffffffc02b2040
disasm(ffffffffc02b2040,ffffffffc02b2040) TNT: 1
PGE ffffffffc02b2061
PGD ffffffffc02b2061
disasm(ffffffffc02b2061,ffffffffc02b2061) TNT: 9
PGE ffffffffc02b2081
PGD ffffffff8128d195
disasm(ffffffffc02b2081,ffffffff8128d195) TNT: 10
(0) ffffffffc02b2091 (Not Taken)
(3) ffffffffc02b20a4
PSB
MODE
PIP 3896b000
VMCS
PSBEND
PSB
MODE
PIP 3896b000
VMCS
PSBEND
TNT 9 (PGE -2128031339)
RUB SysSec's one (kafl_vuln_test.ko
at 0xffffffffc02cb000-0xffffffffc02cf000):
MODE
PIP 166a3000
VMCS
MODE
TIP.PGE ffffffffc02cb000
TIP.PGD 0 (0)
TIP.PGE ffffffffc02cb040
(0) ffffffffc02cb047 (Not Taken)
(1) ffffffffc02cb05c
TIP.PGD 0 (0)
TIP.PGE ffffffffc02cb061
(0) ffffffffc02cb064 (Not Taken)
(0) ffffffffc02cb073 (Taken)
(0) ffffffffc02cb0e7 (Taken)
(0) ffffffffc02cb12c (Not Taken)
(0) ffffffffc02cb139 (Not Taken)
(0) ffffffffc02cb146 (Not Taken)
(0) ffffffffc02cb153 (Not Taken)
(0) ffffffffc02cb160 (Taken)
TIP.PGD 0 (0)
TIP.PGE ffffffffc02cb081
(0) ffffffffc02cb091 (Not Taken)
(3) ffffffffc02cb0a4
TIP.PGD ffffffff8128d195 (0)
The suspictious thing here is that in IntelLabs' log the PT tracing is enabled (TIP.PGE) and at the same moment disabled (TIP.PGD) at 0xffffffffc02cb061 while in RUB SysSec's one it is enabled at 0xffffffffc02cb061 and disabled later and there are branch instruction log between these events.
Does it indicate that no branch instructions written by PT and the bitmap is empty because no branch transitions written?
I retrieved traces with this command for the IntelLabs' kAFL:
kAFL-Fuzzer/kafl_debug.py -vm_dir snapshot -agent targets/linux_x86_64/bin/fuzzer/kafl_vuln_test -mem 1024 -input in/1 -work_dir fuzz -ip0 0xffffffffc02b2000-0xffffffffc02b6000 -v --debug -action trace -trace
While for the RUB SysSec's kAFL they were dumped on the storage when fuzzing:
python2 kafl_fuzz.py ../snapshot/ram.qcow2 ../snapshot agents/linux_x86_64/fuzzer/kafl_vuln_test 512 ../in ../fuzz -ip0 0xffffffffc02cb000-0xffffffffc02cf000 -v --Purge
I attach the assembler code of the kafl_vuln_test.ko
module retrieved by objdump.
kafl_vuln_test.txt
but seems like objdump hints that the real entry point is the instruction next to this
call $+5
1: R_X86_64_PC32 __fentry__-0x4
relocation info means that linker will take __fentry__
function address relatively to address 1
, subtract 4 and place this 32-bit value at address 1, so the first instruction after relocations is actually call __fentry__
. Similarly, second call $+5
becomes call kmem_cache_alloc_trace
.
in IntelLabs' log the PT tracing is enabled (TIP.PGE) and at the same moment disabled (TIP.PGD) at 0xffffffffc02cb061
It really looks strange.
As a matter of fact, I planned for some time to take a look at kAFL again, try to fix bugs. Do you mind sharing your setup so I can too test it? Compiled driver, modifications that you made (agent, ..?), etc.
Remembered again about __fentry__
missing in Capstone output: it is actually correct, __fentry__
calls are patched out on module loading (more info here). Which gives that Capstone output is overall correct.
@vient, sorry for being stupid about the relocations. Didn't dig in them. Thank you for the __fentry__
remark.
If you need my agent hexdump modification, let me know, I don't have it right now. For debugging purposes I only uncommented some defines, I attach appropriate patches here.
Also, don't forget to create the /tmp/traces
folder, it isn't created automatically. It will contained raw traces and decoded ones.
Commenting out a macro at qemu-5.0.0/pt/disassembler.c:25 like this seems to help a lot:
#define limit_check(prev, next, limit) true //(!((limit >= prev) & (limit <= next)))
It uncovers bugs with crash detection (panics are caught but discarded as previously seen) but at least full regular coverage is obtained.
It seems that empty PGD packets are pretty common (as can be seen here), and mentioned check makes kAFL to stop branch processing when next branch is outside of [prev, limit) range. When PGD is empty, limit becomes equal to prev so this check can not be passed at all.
On the other hand, I can't understand from documentation how empty PGD packets are possible in this case (jumping out of IP range). Quote from docs:
... [PGD] will include the IP at which the tracing ends, unless ContextEn= 0 or TraceEn=0 at the conclusion of the instruction or event that cleared PacketEn.
ContextEn depends on CPL and CR3 value, which do not change in our case. TraceEn is controlled by us and is surely enabled.
Not sure how to resolve this situation properly, @il-steffen what do you think?
Lurking a bit more, there is an errata 014 with a promising title "Intel PT TIP.PGD May Not Have Target IP Payload", which spans 6-9th generations at least.
Besides, even if we had an IP value in PGD, this value points to COFI target, not source. What's the point in interpreting it as a range's end? It seems that we can drop this limit_check entirely. My line of reasoning goes like this: we went in range on PGE, all TNTs came in while we were in range, then we went out of range with PGD, which gives that all TNTs should be processable without range checks (if our disassembler is good enough).
Edit: logic looks unchanged in libxdc, except they have an additional check for this exact case, limit == start
. Usage of the whole limit_check is a bit confusing to me though so I can't derive its usefulness.
panics are caught but discarded as previously seen
It seems to be caused by race between kAFL and QEMU. QEMU immediately informs kAFL about panics, after that kAFL immediately proceeds to creation of a local bitmap copy without waiting for QEMU to fill shared memory one first. As a result, QEMU produces correct bitmap in shared memory but it is too late, kAFL have already copied empty bitmap which is then used to deduce that such crash has already occured.
As a temporary hack time.sleep(1)
may be added after line https://github.com/IntelLabs/kAFL/blob/a7a8ae3e5ccd044ac185c94e11bc152f8c0a231c/kAFL-Fuzzer/common/qemu.py#L637. As a more robust solution I propose an addition to QEMU protocol, an opcode BITMAP_FILLED
. Maybe there already is something like this in the protocol? I still have some troubles understanding it, to be honest.
On the race condition, does it help to add synchronization_disable_pt() in qemu/pt/hypercall.c before hypercall_snd_char(KAFL_PROTO_CRASH) ?
Regarding the disassembly/decode, there have been only two significant changes since the original kAFL. One is from kAFL to Redqueen (which this IntelLabs/kAFL is based on), and the other is from this patch: https://github.com/IntelLabs/kAFL/blob/master/patches/qemu/v5.0.0/0012-read-guest-memory-on-PT-decode-when-more-likely-paged.patch
I suggest to try without the patch 0012. I'm not sure if it actually improved anything, but I rarely use the Linux VM target.
Edit: logic looks unchanged in libxdc, except they have an additional check for this exact case,
limit == start
. Usage of the whole limit_check is a bit confusing to me though so I can't derive its usefulness.
Nice catch! Merging the function from libxdc seems to fix coverage feedback for the linux target. Can you confirm?
Also, it seems that adding synchronization_disable_pt() to the crash/kasan hypercall handlers improves the ability of the fuzzer to detect those. Can someone confirm? There is already a corresponding command KAFL_PROTO_FINALIZE in pt/interface.c which was apparently intended to do the same thing on timeout events.
PS: I also had to remove the very first call to limit_check(), it seems to be missing in libxdc as well.
Merging the function from libxdc seems to fix coverage feedback for the linux target. Can you confirm?
How exactly did you merge this function?
adding synchronization_disable_pt()
I'll try it with crash handler, don't have kASAN kernel at the moment to test with.
On the race condition, does it help to add synchronization_disable_pt() in qemu/pt/hypercall.c before hypercall_snd_char(KAFL_PROTO_CRASH) ?
Yes, looks like it fixed the problem with crashes.
Yes, this patch seems to work. I still wonder what range_check
is exactly doing. Is it some kind of sanity check? I've turned on DEBUG_TRACE_RETURN
and check_return
is never called so range_check
is always passed.
On another note, I've hit yet another bug in disassembler.c, this time a lookup_area
buffer overflow. Note that it's size is max_addr - min_addr
and in map_put/map_get we use self->max_addr - addr
as index. What happens when addr is equal to self->min_addr (apparently can occur on Linux) is an access to element past array end, which, coincidentally, is not zero and we get SIGSEGV on line https://github.com/IntelLabs/kAFL/blob/82f93f5acac7d1dfa0510a4b649abd321bd0270f/patches/qemu/v5.0.0/0012-read-guest-memory-on-PT-decode-when-more-likely-paged.patch#L441
My fix was to use addr - self->min_addr
index instead.
The range of addr
seems to be limited by out_of_bounds()
and could also be as large as max_addr
- would this not again cause an OOB with your fix? How about incrementing the allocated area instead?
You'll have to ask @schumilo about the deeper purpose of limit_check().
could also be as large as max_addr
It's not clear from the docs if the IP written to IA32_RTIT_ADDRn_B is included in range or not. In any case, adding one more element won't hurt. I assumed that this address is not included.
Uploaded a testing
branch with these fixes, plus a patch for hang on snapshot creation and hypercalls getting lost on snapshot load. Let me know how it works.
Thanks, @vient! I confirmed the code with your changes solved the problem. Thanks, @il-steffen, too! Your fixes work too.
I have another problem now. When providing only one initial testcase, it doesn't pass quick validation and the fuzzer hangs waiting for other seeds. When there are second seed, it always passes the validation and the fuzzing starts. It doesn't depend on the testcase itself, the behaviour is always the same. Providing the -funky option let the fuzzing begin.
When debugging with the -funky option, I noticed that only the very first out of 8 validation tries fails the validation. I see the FUP packet between the PGE and PGD packets, that causes PGD handler to execute the "unlikely" block and call the disasm
function twice.
https://github.com/IntelLabs/kAFL/blob/a7a8ae3e5ccd044ac185c94e11bc152f8c0a231c/patches/qemu/v5.0.0/0001-Reworked-patch-from-https-github.com-RUB-SysSec-redq.patch#L2101
It results in filling the bitmap twice also. @il-steffen, is it right behaviour?
As a temporal fix, I removed the disasm
call from the "unlikely" block and it allowed the validation pass.
I've just checked this on the newest KVM-PT with the very first qemu-3.1.0 version of QEMU-PT. And it turned out there is no such problem there. One initial seed is validated correctly and the fuzzing starts. Also there is no FUP packet in traces.
Hi @e13fter. Using latest KVM 5.11.4 and Qemu 5.0 patches I can't conclusively tell if this improves the situation. I do see less 'funky' events in the debug.log but it also seems that we don't manage to explore the full kafl_vuln_test.ko sample anymore.
If that 3.1 version works, can you try to narrow down the place where the bug was introduced? kafl_debug.py may also be useful to capture a few consecutive traces.
Hmm, reverting this does not seem to help here.
Can you check with kafl_debug.py -action noise
if this really fixes the initial trace for you? The tool allows you to run an individual input multiple times and compare the traces. You should also see some general 0.1% noise over time depending on the target.
My targets tend to be non-deterministic so the -funky
option was a better workaround for me.
All of this is also fixed in Nyx but I cannot say when RUB will publish it..
Sorry, I was wrong, that is why I deleted the comment.
It seems that the fix with limit_check
causes the validation failure. Without the fix it doesn't gather coverage, with it - the problem with validation appears.
I'm trying to work on it now.
And by the way, commit https://github.com/IntelLabs/kAFL/commit/fff597770f7286c521a7aac6744a21841f425572 seems to introduce the initial problems with coverage. Removing it wipes all problems even without our further patches.
I confirm that without patch 0014 check_return
is not triggered in disassembler.c.
What's more strange is that I made my own version of this 0014 patch by following docs and it does not trigger check_return
too. Maybe 0014 is itself buggy?
diff --git a/pt/decoder.c b/pt/decoder.c
index 03d3b687..2773fede 100644
--- a/pt/decoder.c
+++ b/pt/decoder.c
@@ -300,36 +300,41 @@ static inline void decoder_handle_fup(decoder_state_machine_t *self, uint64_t ad
}
}
-static inline uint64_t get_ip_val(uint8_t **pp, uint8_t *end, uint8_t len, uint64_t *last_ip){
+static inline uint64_t get_ip_val(uint8_t **pp, uint8_t *end, uint8_t type, uint64_t *last_ip){
uint8_t *p = *pp;
uint64_t v = *last_ip;
- uint8_t i;
- uint8_t shift = 0;
- switch(len){
+ switch(type){
case 0:
- v = 0;
- break;
+ return 0;
case 1:
+ v &= ~0xFFFFull;
+ v |= *(uint16_t*)p;
+ *pp += 2;
+ break;
case 2:
+ v &= ~0xFFFFFFFFull;
+ v |= *(uint32_t*)p;
+ *pp += 4;
+ break;
case 3:
- if (unlikely(!LEFT(len))) {
- *last_ip = 0;
- v = 0;
- break;
- }
- for (i = 0; i < len; i++, shift += 16, p += 2) {
- uint64_t b = *(uint16_t *)p;
- v = (v & ~(0xffffULL << shift)) | (b << shift);
- }
- v = ((int64_t)(v << (64 - 48))) >> (64 - 48); /* sign extension */
- *pp = p;
- *last_ip = v;
+ v = *(uint64_t*)p & 0xFFFFFFFFFFFFull;
+ *pp += 6;
+ v = (int64_t)(v << 16) >> 16; // sign extension
break;
- default:
- v = 0;
+ case 4:
+ v &= ~0xFFFFFFFFFFFFull;
+ v |= *(uint64_t*)p & 0xFFFFFFFFFFFFull;
+ *pp += 6;
break;
+ case 6:
+ v = *(uint64_t*)p;
+ *pp += 8;
+ break;
+ default:
+ assert(false);
}
+ *last_ip = v;
return v;
}
@@ -551,6 +556,7 @@ static inline void pip_handler(decoder_t* self, uint8_t** p){
#endif
break;
case PT_PKT_PSB_BYTE1:
+ self->last_tip_tmp = 0;
p += PT_PKT_PSB_LEN;
WRITE_SAMPLE_DECODED_DETAILED("PSB\n");
#ifdef DECODER_LOG
It seems that the fix with
limit_check
causes the validation failure. Without the fix it doesn't gather coverage, with it - the problem with validation appears.
These are probably independent issues. The problem with stability/noise of the coverage feedback already from the beginning. In particular I noticed it with the very first execution after boot, as you mentioned above. It is one of the reasons I added the -funky
mode long ago.
I've confirmed that @vient's patch removes the bug, introduced by 0014. Thanks!)
Ok, so for the slow ones among us can someone explain the problem ones more? I realized you write specifically about seed import. Indeed I see one of the slaves often stuck at import stage now and adding -funky
seems to resolve the issue. Is that what you mean?
Reverting 0014 and/or applying vient's patch still doesn't seem to help with that. Are you testing on TGL? One or multiple threads?
If I understand right, initial problem was partial test coverage caused by combination of PGD(0), latest get_ip_val and limit_check, not necessarily on seed import stage. For this problem we have 2.5 solutions: first one involves porting latest limit checks from libxdc, other 1.5 are either reverting 0014 patch or, additionally, applying my patch (regarding this, I'm waiting for answer to my libxdc issue so I can test the patch since previous get_ip_val implementations do not look specification conforming to me).
While looking for solutions @e13fter met another problem: first traces for initial seed differed too much so the seed was reported as funky. While on my computer first traces are different too, they are not so much different, and make equal bitmaps. I wasn't able to reproduce this issue and it seems that it is somehow resolved now.
Thanks. So in your comment above you mean disabling 0014 does not trigger the limit_check, thus not exposing the bug? I suppose we need both fixes since there is probably a reason for the extended limit_check() and the TGL fix is needed on TGL. Lets see what feedback you get, and I will check these on a TGL box when I get one.
I think we still did not manage to get to the root of this issue: what is exact meaning of those range checks? Without that understanding I can't reason about correctness of our patches.
Right now I can only say that all get_ip versions are incorrect in the IP compression type 0 part: this type means that no IP is provided, it is unknown. In current version last_ip is returned instead which triggers the check. In my version, 0 is returned in this case and, while 0 still being a valid IP value, it seems to be improbable to actually occur in any trace. By chance, 0 does not trigger the check but it is pure luck, we need to revise those checks properly or introduce a magic IP_NONE value(?).
Not sure if any of this is relevant as I don't have the time to fully investigate right now. But it should be noted that TNT packets can contain TNT bits "from the future" e.g. a TNT packet can contain TNT bits from after a FUP/TIP. I think some of the logic that are causing issues here is intended to deal with cases where we have to carefully disassemble/follow TNTs only until we reach the address where the FUP moved the control flow somewhere else (which can be in the middle of a basic block in the case of a interrupt). I also vaguely remember that we ran into issues where IP values couldn't be distinguished from potentially valid addresses (either 0 or fffffffff...) in intel PT. I think back then we simply ignored the issue.
I did not know about "future" TNTs, tnanks. However, isn't range check incomplete in this case? For example, there are backwards jumps sometimes. Isn't is possible to observe following sequence, ordered by linear addresses, where flow goes from last TNT to PGD: PGE - TNT - PGD - TNT? If I understand right, in this case last TNT won't pass range check but it actually is on the execution path of PGE - TNT - TNT - PGD.
It seems that we can stop decoding when next COFI in chain has address equal to PGD address. Since we can have omitted IP in PGDs, we can check against full trace range in this case? It becomes more complicated with several trace ranges but I don't know Intel PT well enough to reason about that without experiments. How is it resolved in official libipt?
Truth be told: I don't know, and in this case I have wondered about this myself. Generally it's reasonably save to assume that there is plenty of bugs in the decoder that didn't come up with our chips (or less likely, given the amount of usage that we got out of this, targets).
I think in libxdc we are actually checking that the current basic block does not contain the limit https://github.com/nyx-fuzz/libxdc/blob/master/src/disassembler.c#L503.
When launching test linux kernel module fuzzing, kAFL cannot detect crashes and seems to not update bitmaps after the first testcase. I slighly modified agent for it to output payload through hprintf and see that there are no "fixed" hit values in payload. Even values which are to increase coverage, as I know from the source code, are treated as all others. When I provide an "almost" crashing input, like "KERNELoFL" after some random mutations it leads to the crash, of course, and the fuzzer prints this:
If I provide second test case, the fuzzer takes it, runs and considers it to produce no new coverage and finally goes on with dumb mutations without any feedback.
I enabled verbose QEMU kAFL component logging via commenting this line in qemu-5.0.0/pt/debug.h
#define PT_DEBUG_DISABLE
and saw assembly code produced by capstone of qemu in the log.When disassembling the test module with objdump we see a little different listing.
The first machine instruction is disassembled not correctly by capstone in qemu. Also at the 0x3b and other
call
offsets capstone calculate wrong destination addresses of thecalls
. Could it affect the trace to bitmap conversion?My setup: Host OS: Ubuntu 20.04 (also tried Debian 10.2) Guest OS: Ubuntu 16.04 (
ro nokaslr nopti oops=panic mitigations=off
) CPU: Intel Core i7-8550U (Kaby Lake R)