IntelLabs / kAFL

A fuzzer for full VM kernel/driver targets
https://intellabs.github.io/kAFL/
MIT License
658 stars 91 forks source link

No bitmap updates in linux kernel fuzzing #32

Closed e13fter closed 2 years ago

e13fter commented 3 years ago

When launching test linux kernel module fuzzing, kAFL cannot detect crashes and seems to not update bitmaps after the first testcase. I slighly modified agent for it to output payload through hprintf and see that there are no "fixed" hit values in payload. Even values which are to increase coverage, as I know from the source code, are treated as all others. When I provide an "almost" crashing input, like "KERNELoFL" after some random mutations it leads to the crash, of course, and the fuzzer prints this:

[HPRINTF]   00000000: 4b 45 52 4e 45 4c 5b 46 4c 0a d3 d3 20 e4 e4 fa "KERNEL[FL... ..."
[HPRINTF]   00000000: 4b 45 52 4e 45 4c 5c 46 4c 0a d3 d3 20 e4 e4 fa "KERNEL\FL... ..."
[HPRINTF]   00000000: 4b 45 52 4e 45 4c 42 46 4c 0a d3 d3 20 e4 e4 fa "KERNELBFL... ..."
[HPRINTF]   00000000: 4b 45 52 4e 45 4c 5d 46 4c 0a d3 d3 20 e4 e4 fa "KERNEL]FL... ..."
[HPRINTF]   00000000: 4b 45 52 4e 45 4c 41 46 4c 0a d3 d3 20 e4 e4 fa "KERNELAFL... ..."
Crashing input found (crash), but not new (discarding) 0

If I provide second test case, the fuzzer takes it, runs and considers it to produce no new coverage and finally goes on with dumb mutations without any feedback.

[HPRINTF]   00000000: 4b 45 52 4e 45 4c ff ff ff 7f d3 d3 20 e4 e4 fa "KERNEL...... ..."
Importing payload from /home/user/Documents/fuzz/tools/kafl/fuzz/imports/1
[HPRINTF]   00000000: 53 45 52 31 32 33 0a ff ff 7f d3 d3 20 e4 e4 fa "SER123...... ..."
Imported payload produced no new coverage, skipping..
[HPRINTF]   00000000: 00 20 52 31 32 33 0a ff ff 7f d3 d3 20 e4 e4 fa ". R123...... ..."

I enabled verbose QEMU kAFL component logging via commenting this line in qemu-5.0.0/pt/debug.h #define PT_DEBUG_DISABLE and saw assembly code produced by capstone of qemu in the log.

[QEMU-PT] Diasm: Analyse ASM: ffffffffc02b2000 (4096), max_addr=ffffffffc02b2300
[QEMU-PT] Diasm: Loop: ffffffffc02b2000:        nop     dword ptr [rax + rax], last_nop=0
[QEMU-PT] Diasm: Loop: ffffffffc02b2005:        push    rbp, last_nop=1
[QEMU-PT] Diasm: Loop: ffffffffc02b2006:        mov     rbp, rsp, last_nop=1
[QEMU-PT] Diasm: Loop: ffffffffc02b2009:        push    r13, last_nop=1
[QEMU-PT] Diasm: Loop: ffffffffc02b200b:        push    r12, last_nop=1
[QEMU-PT] Diasm: Loop: ffffffffc02b200d:        push    rbx, last_nop=1
[QEMU-PT] Diasm: Loop: ffffffffc02b200e:        mov     r12, rsi, last_nop=1
[QEMU-PT] Diasm: Loop: ffffffffc02b2011:        mov     rbx, rdx, last_nop=1
[QEMU-PT] Diasm: Loop: ffffffffc02b2014:        mov     esi, 0x24000c0, last_nop=1
[QEMU-PT] Diasm: Loop: ffffffffc02b2019:        mov     edx, 0x534, last_nop=1
[QEMU-PT] Diasm: Loop: ffffffffc02b201e:        sub     rsp, 0x108, last_nop=1
[QEMU-PT] Diasm: Loop: ffffffffc02b2025:        mov     rdi, qword ptr [rip - 0x3e0d4b54], last_nop=1
[QEMU-PT] Diasm: Loop: ffffffffc02b202c:        mov     rax, qword ptr gs:[0x28], last_nop=1
[QEMU-PT] Diasm: Loop: ffffffffc02b2035:        mov     qword ptr [rbp - 0x20], rax, last_nop=1
[QEMU-PT] Diasm: Loop: ffffffffc02b2039:        xor     eax, eax, last_nop=1
[QEMU-PT] Diasm: Loop: ffffffffc02b203b:        call    0xffffffff811fb980, last_nop=1
[QEMU-PT] Redq.: insert hook call ffffffffc02b203b
[QEMU-PT] Diasm: Loop: ffffffffc02b2040:        cmp     rbx, 0xff, last_nop=0
[QEMU-PT] Diasm: Loop: ffffffffc02b2047:        ja      0xffffffffc02b218c, last_nop=1
[QEMU-PT] Diasm: Loop: ffffffffc02b204d:        lea     rdi, qword ptr [rbp - 0x120], last_nop=0
[QEMU-PT] Redq.: got boring index
[QEMU-PT] Diasm: Loop: ffffffffc02b2054:        mov     edx, ebx, last_nop=1
[QEMU-PT] Diasm: Loop: ffffffffc02b2056:        mov     rsi, r12, last_nop=1

When disassembling the test module with objdump we see a little different listing.

# objdump -d -M intel kafl_vuln_test.ko

kafl_vuln_test.ko:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <write_info>:
   0:   e8 00 00 00 00          call   5 <write_info+0x5>
   5:   55                      push   rbp
   6:   48 89 e5                mov    rbp,rsp
   9:   41 55                   push   r13
   b:   41 54                   push   r12
   d:   53                      push   rbx
   e:   49 89 f4                mov    r12,rsi
  11:   48 89 d3                mov    rbx,rdx
  14:   be c0 00 40 02          mov    esi,0x24000c0
  19:   ba 34 05 00 00          mov    edx,0x534
  1e:   48 81 ec 08 01 00 00    sub    rsp,0x108
  25:   48 8b 3d 00 00 00 00    mov    rdi,QWORD PTR [rip+0x0]        # 2c <write_info+0x2c>
  2c:   65 48 8b 04 25 28 00    mov    rax,QWORD PTR gs:0x28
  33:   00 00 
  35:   48 89 45 e0             mov    QWORD PTR [rbp-0x20],rax
  39:   31 c0                   xor    eax,eax
  3b:   e8 00 00 00 00          call   40 <write_info+0x40>
  40:   48 81 fb ff 00 00 00    cmp    rbx,0xff
  47:   0f 87 3f 01 00 00       ja     18c <write_info+0x18c>

The first machine instruction is disassembled not correctly by capstone in qemu. Also at the 0x3b and other call offsets capstone calculate wrong destination addresses of the calls. Could it affect the trace to bitmap conversion?

My setup: Host OS: Ubuntu 20.04 (also tried Debian 10.2) Guest OS: Ubuntu 16.04 (ro nokaslr nopti oops=panic mitigations=off) CPU: Intel Core i7-8550U (Kaby Lake R)

vient commented 3 years ago

The code looks a bit strange even in objdump, why are there two CALL $+5 instructions? These should not usually be present on x86_64. My only guess is that those instructions are modified by relocations and objdump shows you unmodified version which is meaningless, you can check this by passing -r additionally to objdump (if I understood objdump's man page right).

e13fter commented 3 years ago

I don't know why there are these instructions, looks strange. When disassembling with objdump -r, they are still there, but seems like objdump hints that the real entry point is the instruction next to this call $+5:

0000000000000000 <write_info>:
   0:   e8 00 00 00 00          call   5 <write_info+0x5>
            1: R_X86_64_PC32    __fentry__-0x4
   5:   55                      push   rbp
   6:   48 89 e5                mov    rbp,rsp
   9:   41 55                   push   r13
   b:   41 54                   push   r12
   d:   53                      push   rbx
   e:   49 89 f4                mov    r12,rsi
  11:   48 89 d3                mov    rbx,rdx
  14:   be c0 00 40 02          mov    esi,0x24000c0
  19:   ba 34 05 00 00          mov    edx,0x534
  1e:   48 81 ec 08 01 00 00    sub    rsp,0x108
  25:   48 8b 3d 00 00 00 00    mov    rdi,QWORD PTR [rip+0x0]        # 2c <write_info+0x2c>
            28: R_X86_64_PC32   kmalloc_caches+0x54
  2c:   65 48 8b 04 25 28 00    mov    rax,QWORD PTR gs:0x28
  33:   00 00 
  35:   48 89 45 e0             mov    QWORD PTR [rbp-0x20],rax
  39:   31 c0                   xor    eax,eax
  3b:   e8 00 00 00 00          call   40 <write_info+0x40>
            3c: R_X86_64_PC32   kmem_cache_alloc_trace-0x4
  40:   48 81 fb ff 00 00 00    cmp    rbx,0xff
e13fter commented 3 years ago

Actually, the old RUB SysSec's kAFL works on my hardware and I've retrieved decoder logs from working RUB SysSec's kAFL and from not working IntelLabs' kAFL.

IntelLabs' one (kafl_vuln_test.ko is located at 0xffffffffc02b2000-0xffffffffc02b6000):

PSB
MODE
PIP 3896b000
VMCS
PSBEND
MODE
PGE     ffffffffc02b2000
PGD     ffffffffc02b2000
PSB
MODE
MODE
PIP 3896b000
VMCS
PSBEND
PGE     ffffffffc02b2000
PGD     ffffffffc02b2000

disasm(ffffffffc02b2000,ffffffffc02b2000)   TNT: 0
PGE     ffffffffc02b2040
PGD     ffffffffc02b2040

disasm(ffffffffc02b2040,ffffffffc02b2040)   TNT: 1
PGE     ffffffffc02b2061
PGD     ffffffffc02b2061

disasm(ffffffffc02b2061,ffffffffc02b2061)   TNT: 9
PGE     ffffffffc02b2081
PGD     ffffffff8128d195

disasm(ffffffffc02b2081,ffffffff8128d195)   TNT: 10
(0) ffffffffc02b2091    (Not Taken)
(3) ffffffffc02b20a4
PSB
MODE
PIP 3896b000
VMCS
PSBEND
PSB
MODE
PIP 3896b000
VMCS
PSBEND
    TNT 9 (PGE -2128031339)

RUB SysSec's one (kafl_vuln_test.ko at 0xffffffffc02cb000-0xffffffffc02cf000):

MODE
PIP 166a3000
VMCS
MODE
TIP.PGE ffffffffc02cb000
TIP.PGD 0 (0)
TIP.PGE ffffffffc02cb040
(0) ffffffffc02cb047    (Not Taken)
(1) ffffffffc02cb05c
TIP.PGD 0 (0)
TIP.PGE ffffffffc02cb061
(0) ffffffffc02cb064    (Not Taken)
(0) ffffffffc02cb073    (Taken)
(0) ffffffffc02cb0e7    (Taken)
(0) ffffffffc02cb12c    (Not Taken)
(0) ffffffffc02cb139    (Not Taken)
(0) ffffffffc02cb146    (Not Taken)
(0) ffffffffc02cb153    (Not Taken)
(0) ffffffffc02cb160    (Taken)
TIP.PGD 0 (0)
TIP.PGE ffffffffc02cb081
(0) ffffffffc02cb091    (Not Taken)
(3) ffffffffc02cb0a4
TIP.PGD ffffffff8128d195 (0)

The suspictious thing here is that in IntelLabs' log the PT tracing is enabled (TIP.PGE) and at the same moment disabled (TIP.PGD) at 0xffffffffc02cb061 while in RUB SysSec's one it is enabled at 0xffffffffc02cb061 and disabled later and there are branch instruction log between these events.

Does it indicate that no branch instructions written by PT and the bitmap is empty because no branch transitions written?

I retrieved traces with this command for the IntelLabs' kAFL:

kAFL-Fuzzer/kafl_debug.py -vm_dir snapshot -agent targets/linux_x86_64/bin/fuzzer/kafl_vuln_test -mem 1024 -input in/1 -work_dir fuzz -ip0 0xffffffffc02b2000-0xffffffffc02b6000 -v --debug -action trace -trace

While for the RUB SysSec's kAFL they were dumped on the storage when fuzzing:

python2 kafl_fuzz.py ../snapshot/ram.qcow2 ../snapshot agents/linux_x86_64/fuzzer/kafl_vuln_test 512 ../in ../fuzz -ip0 0xffffffffc02cb000-0xffffffffc02cf000 -v --Purge

I attach the assembler code of the kafl_vuln_test.ko module retrieved by objdump. kafl_vuln_test.txt

vient commented 3 years ago

but seems like objdump hints that the real entry point is the instruction next to this call $+5

1: R_X86_64_PC32 __fentry__-0x4 relocation info means that linker will take __fentry__ function address relatively to address 1, subtract 4 and place this 32-bit value at address 1, so the first instruction after relocations is actually call __fentry__. Similarly, second call $+5 becomes call kmem_cache_alloc_trace.

in IntelLabs' log the PT tracing is enabled (TIP.PGE) and at the same moment disabled (TIP.PGD) at 0xffffffffc02cb061

It really looks strange.


As a matter of fact, I planned for some time to take a look at kAFL again, try to fix bugs. Do you mind sharing your setup so I can too test it? Compiled driver, modifications that you made (agent, ..?), etc.

vient commented 3 years ago

Remembered again about __fentry__ missing in Capstone output: it is actually correct, __fentry__ calls are patched out on module loading (more info here). Which gives that Capstone output is overall correct.

e13fter commented 3 years ago

@vient, sorry for being stupid about the relocations. Didn't dig in them. Thank you for the __fentry__ remark.

If you need my agent hexdump modification, let me know, I don't have it right now. For debugging purposes I only uncommented some defines, I attach appropriate patches here.

Also, don't forget to create the /tmp/traces folder, it isn't created automatically. It will contained raw traces and decoded ones.

debug.zip

vient commented 3 years ago

Commenting out a macro at qemu-5.0.0/pt/disassembler.c:25 like this seems to help a lot:

#define limit_check(prev, next, limit) true //(!((limit >= prev) & (limit <= next)))

It uncovers bugs with crash detection (panics are caught but discarded as previously seen) but at least full regular coverage is obtained.

It seems that empty PGD packets are pretty common (as can be seen here), and mentioned check makes kAFL to stop branch processing when next branch is outside of [prev, limit) range. When PGD is empty, limit becomes equal to prev so this check can not be passed at all.

On the other hand, I can't understand from documentation how empty PGD packets are possible in this case (jumping out of IP range). Quote from docs:

... [PGD] will include the IP at which the tracing ends, unless ContextEn= 0 or TraceEn=0 at the conclusion of the instruction or event that cleared PacketEn.

ContextEn depends on CPL and CR3 value, which do not change in our case. TraceEn is controlled by us and is surely enabled.

Not sure how to resolve this situation properly, @il-steffen what do you think?

vient commented 3 years ago

Lurking a bit more, there is an errata 014 with a promising title "Intel PT TIP.PGD May Not Have Target IP Payload", which spans 6-9th generations at least.

Besides, even if we had an IP value in PGD, this value points to COFI target, not source. What's the point in interpreting it as a range's end? It seems that we can drop this limit_check entirely. My line of reasoning goes like this: we went in range on PGE, all TNTs came in while we were in range, then we went out of range with PGD, which gives that all TNTs should be processable without range checks (if our disassembler is good enough).

Edit: logic looks unchanged in libxdc, except they have an additional check for this exact case, limit == start. Usage of the whole limit_check is a bit confusing to me though so I can't derive its usefulness.

vient commented 3 years ago

panics are caught but discarded as previously seen

It seems to be caused by race between kAFL and QEMU. QEMU immediately informs kAFL about panics, after that kAFL immediately proceeds to creation of a local bitmap copy without waiting for QEMU to fill shared memory one first. As a result, QEMU produces correct bitmap in shared memory but it is too late, kAFL have already copied empty bitmap which is then used to deduce that such crash has already occured.

As a temporary hack time.sleep(1) may be added after line https://github.com/IntelLabs/kAFL/blob/a7a8ae3e5ccd044ac185c94e11bc152f8c0a231c/kAFL-Fuzzer/common/qemu.py#L637. As a more robust solution I propose an addition to QEMU protocol, an opcode BITMAP_FILLED. Maybe there already is something like this in the protocol? I still have some troubles understanding it, to be honest.

il-steffen commented 3 years ago

On the race condition, does it help to add synchronization_disable_pt() in qemu/pt/hypercall.c before hypercall_snd_char(KAFL_PROTO_CRASH) ?

il-steffen commented 3 years ago

Regarding the disassembly/decode, there have been only two significant changes since the original kAFL. One is from kAFL to Redqueen (which this IntelLabs/kAFL is based on), and the other is from this patch: https://github.com/IntelLabs/kAFL/blob/master/patches/qemu/v5.0.0/0012-read-guest-memory-on-PT-decode-when-more-likely-paged.patch

I suggest to try without the patch 0012. I'm not sure if it actually improved anything, but I rarely use the Linux VM target.

il-steffen commented 3 years ago

Edit: logic looks unchanged in libxdc, except they have an additional check for this exact case, limit == start. Usage of the whole limit_check is a bit confusing to me though so I can't derive its usefulness.

Nice catch! Merging the function from libxdc seems to fix coverage feedback for the linux target. Can you confirm?

Also, it seems that adding synchronization_disable_pt() to the crash/kasan hypercall handlers improves the ability of the fuzzer to detect those. Can someone confirm? There is already a corresponding command KAFL_PROTO_FINALIZE in pt/interface.c which was apparently intended to do the same thing on timeout events.

PS: I also had to remove the very first call to limit_check(), it seems to be missing in libxdc as well.

vient commented 3 years ago

Merging the function from libxdc seems to fix coverage feedback for the linux target. Can you confirm?

How exactly did you merge this function?

adding synchronization_disable_pt()

I'll try it with crash handler, don't have kASAN kernel at the moment to test with.

vient commented 3 years ago

On the race condition, does it help to add synchronization_disable_pt() in qemu/pt/hypercall.c before hypercall_snd_char(KAFL_PROTO_CRASH) ?

Yes, looks like it fixed the problem with crashes.

il-steffen commented 3 years ago

limit_check.patch.gz

vient commented 3 years ago

Yes, this patch seems to work. I still wonder what range_check is exactly doing. Is it some kind of sanity check? I've turned on DEBUG_TRACE_RETURN and check_return is never called so range_check is always passed.

On another note, I've hit yet another bug in disassembler.c, this time a lookup_area buffer overflow. Note that it's size is max_addr - min_addr and in map_put/map_get we use self->max_addr - addr as index. What happens when addr is equal to self->min_addr (apparently can occur on Linux) is an access to element past array end, which, coincidentally, is not zero and we get SIGSEGV on line https://github.com/IntelLabs/kAFL/blob/82f93f5acac7d1dfa0510a4b649abd321bd0270f/patches/qemu/v5.0.0/0012-read-guest-memory-on-PT-decode-when-more-likely-paged.patch#L441

My fix was to use addr - self->min_addr index instead.

il-steffen commented 3 years ago

The range of addr seems to be limited by out_of_bounds() and could also be as large as max_addr - would this not again cause an OOB with your fix? How about incrementing the allocated area instead?

You'll have to ask @schumilo about the deeper purpose of limit_check().

vient commented 3 years ago

could also be as large as max_addr

It's not clear from the docs if the IP written to IA32_RTIT_ADDRn_B is included in range or not. In any case, adding one more element won't hurt. I assumed that this address is not included.

il-steffen commented 3 years ago

Uploaded a testing branch with these fixes, plus a patch for hang on snapshot creation and hypercalls getting lost on snapshot load. Let me know how it works.

e13fter commented 3 years ago

Thanks, @vient! I confirmed the code with your changes solved the problem. Thanks, @il-steffen, too! Your fixes work too.

e13fter commented 3 years ago

I have another problem now. When providing only one initial testcase, it doesn't pass quick validation and the fuzzer hangs waiting for other seeds. When there are second seed, it always passes the validation and the fuzzing starts. It doesn't depend on the testcase itself, the behaviour is always the same. Providing the -funky option let the fuzzing begin. When debugging with the -funky option, I noticed that only the very first out of 8 validation tries fails the validation. I see the FUP packet between the PGE and PGD packets, that causes PGD handler to execute the "unlikely" block and call the disasm function twice. https://github.com/IntelLabs/kAFL/blob/a7a8ae3e5ccd044ac185c94e11bc152f8c0a231c/patches/qemu/v5.0.0/0001-Reworked-patch-from-https-github.com-RUB-SysSec-redq.patch#L2101 It results in filling the bitmap twice also. @il-steffen, is it right behaviour? As a temporal fix, I removed the disasm call from the "unlikely" block and it allowed the validation pass.

e13fter commented 3 years ago

I've just checked this on the newest KVM-PT with the very first qemu-3.1.0 version of QEMU-PT. And it turned out there is no such problem there. One initial seed is validated correctly and the fuzzing starts. Also there is no FUP packet in traces.

il-steffen commented 3 years ago

Hi @e13fter. Using latest KVM 5.11.4 and Qemu 5.0 patches I can't conclusively tell if this improves the situation. I do see less 'funky' events in the debug.log but it also seems that we don't manage to explore the full kafl_vuln_test.ko sample anymore.

If that 3.1 version works, can you try to narrow down the place where the bug was introduced? kafl_debug.py may also be useful to capture a few consecutive traces.

il-steffen commented 3 years ago

Hmm, reverting this does not seem to help here.

Can you check with kafl_debug.py -action noise if this really fixes the initial trace for you? The tool allows you to run an individual input multiple times and compare the traces. You should also see some general 0.1% noise over time depending on the target.

My targets tend to be non-deterministic so the -funky option was a better workaround for me. All of this is also fixed in Nyx but I cannot say when RUB will publish it..

e13fter commented 3 years ago

Sorry, I was wrong, that is why I deleted the comment. It seems that the fix with limit_check causes the validation failure. Without the fix it doesn't gather coverage, with it - the problem with validation appears. I'm trying to work on it now. And by the way, commit https://github.com/IntelLabs/kAFL/commit/fff597770f7286c521a7aac6744a21841f425572 seems to introduce the initial problems with coverage. Removing it wipes all problems even without our further patches.

vient commented 3 years ago

I confirm that without patch 0014 check_return is not triggered in disassembler.c.

What's more strange is that I made my own version of this 0014 patch by following docs and it does not trigger check_return too. Maybe 0014 is itself buggy?

diff --git a/pt/decoder.c b/pt/decoder.c
index 03d3b687..2773fede 100644
--- a/pt/decoder.c
+++ b/pt/decoder.c
@@ -300,36 +300,41 @@ static inline void decoder_handle_fup(decoder_state_machine_t *self, uint64_t ad
    }
 }

-static inline uint64_t get_ip_val(uint8_t **pp, uint8_t *end, uint8_t len, uint64_t *last_ip){
+static inline uint64_t get_ip_val(uint8_t **pp, uint8_t *end, uint8_t type, uint64_t *last_ip){
    uint8_t *p = *pp;
    uint64_t v = *last_ip;
-   uint8_t i;
-   uint8_t shift = 0;

-   switch(len){
+   switch(type){
        case 0:
-           v = 0;
-           break;
+           return 0;
        case 1:
+           v &= ~0xFFFFull;
+           v |= *(uint16_t*)p;
+           *pp += 2;
+           break;
        case 2:
+           v &= ~0xFFFFFFFFull;
+           v |= *(uint32_t*)p;
+           *pp += 4;
+           break;
        case 3:
-           if (unlikely(!LEFT(len))) {
-               *last_ip = 0;
-               v = 0;
-               break;
-           }
-           for (i = 0; i < len; i++, shift += 16, p += 2) {
-               uint64_t b = *(uint16_t *)p;
-               v = (v & ~(0xffffULL << shift)) | (b << shift);
-           }
-           v = ((int64_t)(v << (64 - 48))) >> (64 - 48); /* sign extension */
-           *pp = p;
-           *last_ip = v;
+           v = *(uint64_t*)p & 0xFFFFFFFFFFFFull;
+           *pp += 6;
+           v = (int64_t)(v << 16) >> 16;   // sign extension
            break;
-       default:
-           v = 0;
+       case 4:
+           v &= ~0xFFFFFFFFFFFFull;
+           v |= *(uint64_t*)p & 0xFFFFFFFFFFFFull;
+           *pp += 6;
            break;
+       case 6:
+           v = *(uint64_t*)p;
+           *pp += 8;
+           break;
+       default:
+           assert(false);
    }
+   *last_ip = v;
    return v;
 }

@@ -551,6 +556,7 @@ static inline void pip_handler(decoder_t* self, uint8_t** p){
                            #endif
                            break;
                        case PT_PKT_PSB_BYTE1:
+                           self->last_tip_tmp = 0;
                            p += PT_PKT_PSB_LEN;
                            WRITE_SAMPLE_DECODED_DETAILED("PSB\n");
                            #ifdef DECODER_LOG
il-steffen commented 3 years ago

It seems that the fix with limit_check causes the validation failure. Without the fix it doesn't gather coverage, with it - the problem with validation appears.

These are probably independent issues. The problem with stability/noise of the coverage feedback already from the beginning. In particular I noticed it with the very first execution after boot, as you mentioned above. It is one of the reasons I added the -funky mode long ago.

e13fter commented 3 years ago

I've confirmed that @vient's patch removes the bug, introduced by 0014. Thanks!)

il-steffen commented 3 years ago

Ok, so for the slow ones among us can someone explain the problem ones more? I realized you write specifically about seed import. Indeed I see one of the slaves often stuck at import stage now and adding -funky seems to resolve the issue. Is that what you mean?

Reverting 0014 and/or applying vient's patch still doesn't seem to help with that. Are you testing on TGL? One or multiple threads?

vient commented 3 years ago

If I understand right, initial problem was partial test coverage caused by combination of PGD(0), latest get_ip_val and limit_check, not necessarily on seed import stage. For this problem we have 2.5 solutions: first one involves porting latest limit checks from libxdc, other 1.5 are either reverting 0014 patch or, additionally, applying my patch (regarding this, I'm waiting for answer to my libxdc issue so I can test the patch since previous get_ip_val implementations do not look specification conforming to me).

While looking for solutions @e13fter met another problem: first traces for initial seed differed too much so the seed was reported as funky. While on my computer first traces are different too, they are not so much different, and make equal bitmaps. I wasn't able to reproduce this issue and it seems that it is somehow resolved now.

il-steffen commented 3 years ago

Thanks. So in your comment above you mean disabling 0014 does not trigger the limit_check, thus not exposing the bug? I suppose we need both fixes since there is probably a reason for the extended limit_check() and the TGL fix is needed on TGL. Lets see what feedback you get, and I will check these on a TGL box when I get one.

vient commented 3 years ago

I think we still did not manage to get to the root of this issue: what is exact meaning of those range checks? Without that understanding I can't reason about correctness of our patches.

Right now I can only say that all get_ip versions are incorrect in the IP compression type 0 part: this type means that no IP is provided, it is unknown. In current version last_ip is returned instead which triggers the check. In my version, 0 is returned in this case and, while 0 still being a valid IP value, it seems to be improbable to actually occur in any trace. By chance, 0 does not trigger the check but it is pure luck, we need to revise those checks properly or introduce a magic IP_NONE value(?).

eqv commented 3 years ago

Not sure if any of this is relevant as I don't have the time to fully investigate right now. But it should be noted that TNT packets can contain TNT bits "from the future" e.g. a TNT packet can contain TNT bits from after a FUP/TIP. I think some of the logic that are causing issues here is intended to deal with cases where we have to carefully disassemble/follow TNTs only until we reach the address where the FUP moved the control flow somewhere else (which can be in the middle of a basic block in the case of a interrupt). I also vaguely remember that we ran into issues where IP values couldn't be distinguished from potentially valid addresses (either 0 or fffffffff...) in intel PT. I think back then we simply ignored the issue.

vient commented 3 years ago

I did not know about "future" TNTs, tnanks. However, isn't range check incomplete in this case? For example, there are backwards jumps sometimes. Isn't is possible to observe following sequence, ordered by linear addresses, where flow goes from last TNT to PGD: PGE - TNT - PGD - TNT? If I understand right, in this case last TNT won't pass range check but it actually is on the execution path of PGE - TNT - TNT - PGD.

It seems that we can stop decoding when next COFI in chain has address equal to PGD address. Since we can have omitted IP in PGDs, we can check against full trace range in this case? It becomes more complicated with several trace ranges but I don't know Intel PT well enough to reason about that without experiments. How is it resolved in official libipt?

eqv commented 3 years ago

Truth be told: I don't know, and in this case I have wondered about this myself. Generally it's reasonably save to assume that there is plenty of bugs in the decoder that didn't come up with our chips (or less likely, given the amount of usage that we got out of this, targets).

eqv commented 3 years ago

I think in libxdc we are actually checking that the current basic block does not contain the limit https://github.com/nyx-fuzz/libxdc/blob/master/src/disassembler.c#L503.