DynamoRIO / dynamorio

Dynamic Instrumentation Tool Platform
Other
2.6k stars 554 forks source link

CRASH (2.0.0 test cases) #331

Open derekbruening opened 9 years ago

derekbruening commented 9 years ago

From qin.zhao@gmail.com on August 04, 2010 17:24:30

What steps will reproduce the problem? 1. Build DynamoRIO for amd64 release build with test enabled in Windows

  1. nmake test What is the expected output? What do you see instead? Many test cases failed. For release build, 58% tests passed, 29 tests failed out of 69 2:code_api|common.decode-bad 3:code_api|common.decode 4:code_api|common.decode-stress 8:code_api|common.protect-dstack 9:code_api|common.segfault 19:code_api|win32.protect-datasec 21:code_api|win32.reload 23:code_api|win32.reload-race 32:code_api|security-win32.apc-shellcode 33:code_api|security-win32.sd_tester 38:code_api|security-win32.TestNTFlush 41:code_api|security-common.jmp_from_trace 42:code_api|security-common.retexisting 43:code_api|security-common.ret_noncall_trace 44:code_api|security-common.retnonexisting 46:code_api|security-common.TestAllocWE 47:code_api|security-common.TestMemProtChg 50:code_api|client.call-retarget 51:code_api|client.cleancall 53:code_api|client.syscall 54:code_api|client.events 55:code_api|client.cbr3 56:code_api|client.cbr4 57:code_api|client.modules 61:code_api|client.thread 66:code_api,opt_memory|client.events 67:code_api,thread_private|client.events 68:code_api,disable_traces|client.events 69:code_api,thread_private,disable_traces|client.events

By checking the output at Testing/Temporary/LastTesting.log many experienced CRASH in DynamoRIO:

3/69 Testing: code_api|common.decode 3/69 Test: code_api|common.decode Command: "E:/Repository/dynamorio-read-only/build/bin/drrun.exe" "-s" "60" "-quiet" "-exit0" "-ops" "-stderr_mask 0xC -msgbox_mask 0 -dumpcore_mask 0xfd -staged -code_api " "-use_dll" "E:\Repository\dynamorio-read-only\build\lib\dynamorio.dll" "E:/Repository/dynamorio-read-only/build/suite/tests/bin/common.decode.exe" Directory: E:/Repository/dynamorio-read-only/build/suite/tests "code_api|common.decode" start time: Aug 04 16:04 Eastern Daylight Time Output:

Testing nops

Done with nops

Testing SSE3

Bad instruction, instance 1

Testing 3D-Now

<Application E:\Repository\dynamorio-read-only\build\suite\tests\bin\common.decode.exe (4808). Unrecoverable Error at PC 0x000000007109ac6e. Program aborted. 0x00000000c0000005 0x0000000000000000 0x000000007109ac6e 0x000000007109ac6e 0x0000000000000000 0x0000000000000000 Delta: 0x0000000000000000 Registers: eax=0x0000000019ca5560 ebx=0x0000000019cb5690 ecx=0x0000000019c05c00 edx=0x0000000019cf3300 esi=0x0000000019cb6170 edi=0x0000000019cf3300 esp=0x0000000019ca5498 ebp=0x0000000019c05c00 r8 =0x0000000019cb5690 r9 =0x0000000019cb6170 r10 =0x0000000000000001 r11 =0x0000000019ca5638 r12 =0x0000000001000000 r13 =0x0000000000000001 r14 =0x0000000000000001 r15 =0x0000000000000000 eflags=0x0000000000010202 version 2.0.0, build 18 -code_api -probe_api -dumpcore_mask 253 -stderr_mask 12 -max_elide_jmp 0 -max_elide_call 0 -no_inline_ignored_syscalls -staged -no_native_exec -no_indcall2direct -pad_jmps_mark_no_trace 0x0000000019c05c00 0x0000000000000005>

4/69 Testing: code_api|common.decode-stress 4/69 Test: code_api|common.decode-stress Command: "E:/Repository/dynamorio-read-only/build/bin/drrun.exe" "-s" "60" "-quiet" "-exit0" "-ops" "-stderr_mask 0xC -msgbox_mask 0 -dumpcore_mask 0xfd -staged -stress_recreate_state -code_api " "-use_dll" "E:\Repository\dynamorio-read-only\build\lib\dynamorio.dll" "E:/Repository/dynamorio-read-only/build/suite/tests/bin/common.decode.exe" Directory: E:/Repository/dynamorio-read-only/build/suite/tests "code_api|common.decode-stress" start time: Aug 04 16:04 Eastern Daylight Time Output:

Testing nops

Done with nops

Testing SSE3

Bad instruction, instance 1

Testing 3D-Now

<Application E:\Repository\dynamorio-read-only\build\suite\tests\bin\common.decode.exe (3532). Unrecoverable Error at PC 0x000000007109ac6e. Program aborted. 0x00000000c0000005 0x0000000000000000 0x000000007109ac6e 0x000000007109ac6e 0x0000000000000000 0x0000000000000000 Delta: 0x0000000000000000 Registers: eax=0x0000000016a55560 ebx=0x0000000016a65690 ecx=0x00000000169b5c00 edx=0x0000000016aa3300 esi=0x0000000016a66170 edi=0x0000000016aa3300 esp=0x0000000016a55498 ebp=0x00000000169b5c00 r8 =0x0000000016a65690 r9 =0x0000000016a66170 r10 =0x0000000000000001 r11 =0x0000000016a55638 r12 =0x0000000001000000 r13 =0x0000000000000001 r14 =0x0000000000000001 r15 =0x0000000000000000 eflags=0x0000000000010202 version 2.0.0, build 18 -code_api -probe_api -dumpcore_mask 253 -stderr_mask 12 -max_elide_jmp 0 -max_elide_call 0 -no_inline_ignored_syscalls -staged -no_native_exec -no_indcall2direct -pad_jmps_mark_no_trace -str 0x00000000169b5c00 0x0000000000000005>

Original issue: http://code.google.com/p/dynamorio/issues/detail?id=331

derekbruening commented 9 years ago

From qin.zhao@gmail.com on August 04, 2010 14:43:48

The exception point (0x7109ac6e)is at the beginning of mangle_return, which is called from mangle.

dynamorio!mangle_return: 7109ac50 488bc4 mov rax,rsp 7109ac53 48895808 mov qword ptr [rax+8],rbx 7109ac57 48896810 mov qword ptr [rax+10h],rbp 7109ac5b 48897018 mov qword ptr [rax+18h],rsi 7109ac5f 57 push rdi 7109ac60 4881ecc0000000 sub rsp,0C0h 7109ac67 833dae16070000 cmp dword ptr [dynamorio!dynamo_options+0x31c (7110c31c)],0 7109ac6e 0f2970e8 movaps xmmword ptr [rax-18h],xmm6 7109ac72 498be9 mov rbp, r9 7109ac75 498bd8 mov rbx, r8 7109ac78 488bfa mov rdi,rdx 7109ac7b 488bf1 mov rsi,rcx 7109ac7e 751a jne dynamorio!mangle_return+0x4a (000000007109ac9a) 7109ac80 0fbaa424f000000018 bt dword ptr [rsp+0F0h],18h 7109ac89 720f jb dynamorio!mangle_return+0x4a (000000007109ac9a)

derekbruening commented 9 years ago

From qin.zhao@gmail.com on August 05, 2010 09:14:14

It looks more like a compiler bug. The instruction at 0x7109ac6e causes an access violation exception for misaligned access. The movaps requires accessing memory 16-byte aligned, the rax value was 0x19615560, which made [rax-18x] 0x19615548, NOT 16-byte aligned. So it is a compiler bug because it did not enforce the alignment.

There are several places have similar potential problem: instrlist_clone: 0x7105002B mangle_indirect_call: 0x7109AC01

In fact that instruction is not very useful, is there any way to avoid generating such code?

derekbruening commented 9 years ago

From qin.zhao@gmail.com on August 05, 2010 09:20:42

http://en.wikipedia.org/wiki/X86_calling_conventions#Microsoft_x64_calling_convention This explains why a Microsoft compiler generate the code for storing XMM6.

derekbruening commented 9 years ago

From derek.br...@gmail.com on August 05, 2010 09:52:52

no, this looks like a bug in DynamoRIO's stack setup. both sysv and ms x64 conventions have rsp at an 8 offset at func entry due to retaddr, so rax will be assumed to be 8-offs by the compiler, and rax-18h will be assumed to be 16-byte aligned. the call to dispatch must not be properly aligned or something.

derekbruening commented 9 years ago

From qin.zhao@gmail.com on August 05, 2010 10:34:49

I see. You mean the at any point right before function call in x64, the rsp is always 16 aligned?

The call stack of mangle_return is mangle_return mangle+0x37c mangle_bb_ilist+0xb2 build_bb_ilist+0x10e1 recreate_fragment_ilist+0x18e recreate_app_state_internal+0x34e recreate_app_state+0x14 intercept_exception+0x523 0x18cc1405

Looks like it is handling another exception.

derekbruening commented 9 years ago

From derek.br...@gmail.com on August 05, 2010 11:00:21

right. and in all but leaf functions or epilogues or prologues the rsp register is supposed to be 16-aligned on Windows for SEH purposes.

when I did the 64-bit port I ensured that exception interception keeps the stack 16-aligned, except there is a corner case where if already on dstack and at a spot where rsp was not 16-aligned. I didn't bother to add a dynamic test-and-align, under the assumption that none of our code would use aligned instrs, which here turns out to be false. is this exception while already on dstack? what exactly is it interrupting?

derekbruening commented 9 years ago

From qin.zhao@gmail.com on August 05, 2010 11:39:10

I use the decode-bad as an example: The instruction at 0x18a8c1f2: mov rax, dr0 causes an exception. Then the control transfer to ntdll:KiUserExceptionDispatcher at 0x77ef31c0. %rsp = 0x9bf960. The first instruction there is changed by DR to: jmp 0xc3002e.

The code at 0xc3002e: jmp [0xc30026], which points to 0x189f128d. 189f128d 6548890c2540000000 mov qword ptr gs:[40h],rcx 189f1296 65488b0c2548160000 mov rcx,qword ptr gs:[1648h] 189f129f e34e jrcxz 189f12ef 189f12a1 488b8910010000 mov rcx,qword ptr [rcx+110h] 189f12a8 483be1 cmp rsp,rcx 189f12ab 0f8d1b000000 jge 189f12cc 189f12b1 488d8900b0ffff lea rcx,[rcx-5000h] 189f12b8 483be1 cmp rsp,rcx 189f12bb 0f8c0b000000 jl 189f12cc 189f12c1 54 push rsp 189f12c2 6802000000 push 2 189f12c7 e99c000000 jmp 189f1368 189f12cc 65488b0c2548160000 mov rcx,qword ptr gs:[1648h] 189f12d5 48896118 mov qword ptr [rcx+18h],rsp 189f12d9 488ba110010000 mov rsp,qword ptr [rcx+110h] 189f12e0 488b4918 mov rcx,qword ptr [rcx+18h] 18915c18=00000000009bf960 189f12e4 51 push rcx 189f12e5 6801000000 push 1 189f12ea e979000000 jmp 189f1368

Later at 189f12d9, rsp is updated to 0x189b6000, the after two push, it jumps to 189f1368, and rsp is 189b5ff0:

189f1368 48b9940c000000000000 mov rcx,0C94h 189f1372 6548870c2540000000 xchg rcx,qword ptr gs:[40h] 189f137b 6800000000 push 0 189f1380 9c pushfq 189f1381 488d6424a0 lea rsp,[rsp-60h] 189f1386 f30f7f0424 movdqu xmmword ptr [rsp],xmm0 189f138b f30f7f4c2410 movdqu xmmword ptr [rsp+10h],xmm1 189f1391 f30f7f542420 movdqu xmmword ptr [rsp+20h],xmm2 189f1397 f30f7f5c2430 movdqu xmmword ptr [rsp+30h],xmm3 189f139d f30f7f642440 movdqu xmmword ptr [rsp+40h],xmm4 189f13a3 f30f7f6c2450 movdqu xmmword ptr [rsp+50h],xmm5 189f13a9 4157 push r15 189f13ab 4156 push r14 189f13ad 4155 push r13 189f13af 4154 push r12 189f13b1 4153 push r11 189f13b3 4152 push r10 189f13b5 4151 push r9 189f13b7 4150 push r8 189f13b9 50 push rax 189f13ba 51 push rcx 189f13bb 52 push rdx 189f13bc 53 push rbx 189f13bd 54 push rsp 189f13be 55 push rbp 189f13bf 56 push rsi 189f13c0 57 push rdi 189f13c1 6800000000 push 0 189f13c6 9d popfq 189f13c7 488b8424f8000000 mov rax,qword ptr [rsp+0F8h] 189f13cf 4889442418 mov qword ptr [rsp+18h],rax 189f13d4 65a16800000000000000 mov eax,dword ptr gs:[0000000000000068h] 189f13de 50 push rax 189f13df 48b8a2149f1800000000 mov rax,189F14A2h 189f13e9 50 push rax 189f13ea 48b80000000000000000 mov rax,0 189f13f4 50 push rax 189f13f5 488bc4 mov rax,rsp 189f13f8 488d6424e0 lea rsp,[rsp-20h] 189f13fd 488bc8 mov rcx,rax 189f1400 e8aba06c58 call dynamorio!intercept_exception (710bb4b0)

Clearly, right before call to intercept_exception, the rsp is changed from 189b5ff0 to 189b5ec8, which is not 16-byte aligned. And later the exception happens in mangle_return. I think the last 3 push makes the rsp not 16-byte aligned. Can we just change the last lea to lea rsp, [rsp-28h]?

derekbruening commented 9 years ago

From derek.br...@gmail.com on August 05, 2010 12:27:53

this code does a custom layout of app_state_at_intercept_t (which is why it's not using the existing clean call routines, which help ensure alignment) so be careful w/ any changes. a pointer to the struct is passed in, and the win64 4-arg padding is not a problem, so I believe that padding the 8 bytes at the same point would work: the alternative is to pad prior to the struct. be sure to adjust the post-call side as well, and make it clear in comments at least that these 8 bytes are not part of the 32-byte shadow space.

derekbruening commented 9 years ago

From qin.zhao@gmail.com on August 05, 2010 12:39:33

I just checked the debug build, it is not 16-btye aligned either. But mangle_return does not have the movaps instruction, so it does not report error.

I am wondering if it is a common problem or just happens for intercept_exception. Because if I change code in emit_intercept_code, it will not only affect intercept_exception, but also other code like intercept_callback_start, intercept_apc, ...

If it is a common problem, why did we not know before? In Windows, all different kind of events should happen very frequently. If it is specific for intercept_exception, it is not good to change at emit_intercept_code.

derekbruening commented 9 years ago

From derek.br...@gmail.com on August 05, 2010 12:53:50

emit_intercept_code emits the code for intercepting apcs, callbacks, exceptions, etc., so it is the right place to change

derekbruening commented 9 years ago

From qin.zhao@gmail.com on August 05, 2010 13:04:29

I think it is more safe to shift 8 byte before the app_state_at_intercept_t by adding "lea rsp, [rsp-8h]" before "189f137b: push 0", and then change "lea rsp, [rsp+8h]" to "lea rsp, [rsp + 16h]" later after popfq.

derekbruening commented 9 years ago

From qin.zhao@gmail.com on August 05, 2010 13:08:58

It seems the PC slot is not used, "189f137b: push 0", shall we just replace it with "lea rsp, [rsp-16h]"

derekbruening commented 9 years ago

From qin.zhao@gmail.com on August 06, 2010 07:34:41

replace "push 0" won't work, because the intercept_exception will access the stack beyond the app_state_at_intecept.

Also, another reason causes many test cases fail in release build is the compiler produce optimized code. For example, the 51/68 cleancall test case, the test code should have a "nop, nop, call" sequence to identify the position. However, the compiler optimized away the nop operation and causes the failure.

derekbruening commented 9 years ago

From qin.zhao@gmail.com on August 10, 2010 10:25:22

Another failure is caused by file path. In events.dll.c:307 module_load_event_perm, it does not expect server-style path, but in release build it sees \Device\HarddiskVolume2\WINDOWS\system32\shell32.dll \Device\HarddiskVolume2\WINDOWS\system32\msvcrt.dll \Device\HarddiskVolume2\WINDOWS\system32\gdi32.dll \Device\HarddiskVolume2\WINDOWS\system32\user32.dll ... While in debug build it sees C:\WINDOWS\system32\shell32.dll C:\WINDOWS\system32\msvcrt.dll C:\WINDOWS\system32\gdi32.dll C:\WINDOWS\system32\user32.dll C:\WINDOWS\system32\advapi32.dll C:\WINDOWS\system32\rpcrt4.dll C:\WINDOWS\system32\secur32.dll C:\WINDOWS\system32\shlwapi.dll C:\WINDOWS\system32\imm32.dll C:\WINDOWS\system32\lpk.dll C:\WINDOWS\system32\usp10.dll

So we should remove the check if it is a release build

derekbruening commented 9 years ago

From qin.zhao@gmail.com on August 10, 2010 11:32:13

For test case 41 - code_api|security-common.jmp_from_trace (Failed) It seems the release build fail to register the exception handler for the executable, and lose the control, and the native code executed directly at its own place.