Closed by icedevml 4 years ago
Info from @chivay: the built-in kernel has no Xen Dom0 support. On Debian Buster with a Linux 5.6 generic kernel (https://kernel.ubuntu.com/~kernel-ppa/mainline/?C=N;O=D), a Windows 10 guest boots properly.
By following this tutorial I've managed to get the Windows VM up and running: https://cloud.google.com/compute/docs/instances/enable-nested-virtualization-vm-instances
Used instance image: Debian 10 with custom kernel package (`CONFIG_XEN_DOM0=y` is required)
Used Windows ISO: Windows 10 1909 (MD5: `d1f08aea37586702f6fbe2fe3ea8c3fd`)
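Both prerequisites can be sanity-checked up front. A minimal sketch (the file paths in the usage comment are assumptions, adjust them to your setup):

```python
import hashlib

def has_xen_dom0(config_path):
    """Return True if the kernel config file enables CONFIG_XEN_DOM0."""
    with open(config_path) as f:
        return any(line.strip() == "CONFIG_XEN_DOM0=y" for line in f)

def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical usage:
# has_xen_dom0("/boot/config-5.6.0-050600-generic")
# md5sum("Win10_1909.iso") == "d1f08aea37586702f6fbe2fe3ea8c3fd"
```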
Nested hypervisors other than KVM on GCP are not supported, so YMMV.
The process failed during `draksetup postinstall`, at the `vmi-win-offsets` step. It timed out after 30 s. Rerunning it with a 60 s timeout resulted in `died with <Signals.SIGKILL: 9>`, which seems to be the result of a bug (?): after running for about 25 seconds, memory usage quickly climbs to 100% (dom0 had 4 GB).
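The two failure modes (timeout vs. OOM kill) can be told apart from the exit status of the child process; the sketch below shows how a wrapper could classify them (the function name is mine, not part of drakrun's API):

```python
import signal
import subprocess

def classify_run(cmd, timeout):
    """Run cmd; report whether it finished, timed out, or died on a signal."""
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return "timeout"
    if proc.returncode < 0:
        # A negative return code means the process was killed by a signal,
        # e.g. -9 when the OOM killer sends SIGKILL.
        return f"died with {signal.Signals(-proc.returncode)!r}"
    return "ok"
```

This reproduces exactly the `died with <Signals.SIGKILL: 9>` message seen above when the child is OOM-killed.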
Additionally, the kernel seems a little unhappy about all this (I understand the OOM and the stack trace, but the bad RIP seems weird):
```
[13626.972400] CPU: 0 PID: 880 Comm: redis-server Not tainted 5.6.0-050600-generic #202003292333
[13626.972401] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
[13626.972401] Call Trace:
[13626.972413]  dump_stack+0x6d/0x9a
[13626.972418]  dump_header+0x4f/0x1eb
[13626.972420]  oom_kill_process.cold+0xb/0x10
[13626.972422]  out_of_memory.part.0+0x1df/0x430
[13626.972424]  out_of_memory+0x6d/0xd0
[13626.972427]  __alloc_pages_slowpath+0xd24/0xe40
[13626.972430]  __alloc_pages_nodemask+0x2c6/0x300
[13626.972433]  alloc_pages_current+0x87/0xe0
[13626.972434]  __page_cache_alloc+0x72/0x90
[13626.972436]  pagecache_get_page+0xbf/0x300
[13626.972437]  filemap_fault+0x69a/0xa40
[13626.972439]  ? filemap_map_pages+0x24c/0x380
[13626.972442]  ext4_filemap_fault+0x32/0x46
[13626.972445]  __do_fault+0x3c/0x130
[13626.972451]  do_fault+0x24b/0x640
[13626.972453]  __handle_mm_fault+0x5b5/0x850
[13626.972454]  handle_mm_fault+0xca/0x200
[13626.972457]  do_user_addr_fault+0x1f9/0x450
[13626.972459]  do_page_fault+0x6a/0x140
[13626.972462]  page_fault+0x34/0x40
[13626.972465] RIP: e033:0x7fb227e7c7ef
[13626.972474] Code: Bad RIP value.
[13626.972474] RSP: e02b:00007fffb437cbe0 EFLAGS: 00010293
[13626.972476] RAX: 0000000000000000 RBX: 0000000000000005 RCX: 00007fb227e7c7ef
[13626.972476] RDX: 0000000000002790 RSI: 00007fb2277360c0 RDI: 0000000000000005
[13626.972477] RBP: 00007fb2277360c0 R08: 0000000000000000 R09: 00000000000032b4
[13626.972478] R10: 0000000000000064 R11: 0000000000000293 R12: 0000000000002790
[13626.972478] R13: 0000000000000064 R14: 0000000000000002 R15: 0000000000000000
```
I've reproduced this behavior outside GCP; it's just an issue of LibVMI not detecting something in the kernel. We need to debug further to find out what exactly.
In general, it is possible to clone the DRAKVUF sources, perform the LibVMI installation as described at https://drakvuf.com/ (plus enable debug mode when compiling LibVMI) and then just run `make install`: the locally compiled LibVMI overrides the packaged one, and we get more debug logs.
There must certainly be some problem with how `vmi-win-offsets` works. I've found the KPGD manually and I'm able to run `drakvuf` against this Windows 10 using:

```
drakvuf -d vm-0 -k 0x1aa002 -r /var/lib/drakrun/profiles/kernel.json
```
`vmi-win-offsets` also sometimes succeeds on its own, but not always. When it doesn't, it enters a long loop that scans the whole physical memory. I bet a "fail fast" approach would be better for our case.
After commenting out the code paths leading to a full memory scan in LibVMI, the success rate of `vmi-win-offsets` increases significantly. With such a modification, it's also possible to retry when it doesn't find the kernel on the first attempt.
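With the scan paths disabled, a failed run returns quickly, so retrying becomes cheap. A minimal sketch of the retry idea (the command line in the usage comment and the attempt count are assumptions):

```python
import subprocess

def run_with_retry(cmd, attempts=3, timeout=30):
    """Run a fast-failing command up to `attempts` times.

    Returns stdout of the first successful run, or None if every
    attempt fails or times out.
    """
    for _ in range(attempts):
        try:
            proc = subprocess.run(cmd, capture_output=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            continue
        if proc.returncode == 0:
            return proc.stdout
    return None

# Hypothetical usage (exact vmi-win-offsets flags may differ):
# out = run_with_retry(["vmi-win-offsets", "--name", "vm-0",
#                       "--json-kernel", "/var/lib/drakrun/profiles/kernel.json"])
```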
We need to study how `vmi-win-offsets` works and what the options to improve it are. Two quick ideas:

- Interrupting `vmi-win-offsets` (as we do now) simply leaves the VM paused, because there is no corresponding `xc_domain_unpause` call, so the pause counter stays >0. Maybe we should implement a graceful signal handler that unpauses the VM before the user-induced termination of the program?
- A `--fail-fast` switch for `vmi-win-offsets` would also be very handy.

The patch I used to disable the full-memory-scan paths:

```diff
diff --git a/libvmi/os/windows/core.c b/libvmi/os/windows/core.c
index 62ab1bd..654385d 100644
--- a/libvmi/os/windows/core.c
+++ b/libvmi/os/windows/core.c
@@ -842,8 +842,8 @@ init_from_json_profile_real(vmi_instance_t vmi, reg_t kpcr_register_to_use)
     if ( VMI_SUCCESS == kpcr_find1(vmi, windows, kpcr_reg) ) {}
     else if ( VMI_SUCCESS == kpcr_find2(vmi, windows) ) {}
-    else if ( VMI_SUCCESS == kpcr_find3(vmi, windows) ) {}
-    else if ( VMI_SUCCESS == kpcr_find4(vmi, windows, kpcr_reg) ) {}
+    //else if ( VMI_SUCCESS == kpcr_find3(vmi, windows) ) {}
+    //else if ( VMI_SUCCESS == kpcr_find4(vmi, windows, kpcr_reg) ) {}
     else goto done;

     if ( VMI_FAILURE == vmi_translate_kv2p(vmi, windows->ntoskrnl_va, &windows->ntoskrnl) || !windows->ntoskrnl ) {
diff --git a/libvmi/os/windows/kdbg.c b/libvmi/os/windows/kdbg.c
index e0d9a8f..12ed03b 100644
--- a/libvmi/os/windows/kdbg.c
+++ b/libvmi/os/windows/kdbg.c
@@ -693,6 +693,8 @@ status_t find_kdbg_address(
     dbprint(VMI_DEBUG_MISC, "**Trying find_kdbg_address\n");

+    return VMI_FAILURE;
+
     status_t ret = VMI_FAILURE;
     *kdbg_pa = 0;
     addr_t paddr = 0;
@@ -1194,6 +1196,7 @@ init_from_kdbg(
     // so lets try our kdbg search method
 find_kdbg:
     dbprint(VMI_DEBUG_MISC, "**Attempting KdDebuggerDataBlock search methods\n");
+    goto exit;
     if (VMI_SUCCESS == find_kdbg_address_instant(vmi, &kdbg_pa, &kernbase_pa, &kernbase_va)) {
         goto found;
```
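The graceful-unpause idea could be prototyped without touching the C code, e.g. by a wrapper script that unpauses the domain via Xen's `xl unpause` when interrupted. A sketch (the domain name and the choice of shelling out to `xl` are assumptions):

```python
import signal
import subprocess
import sys

DOMAIN = "vm-0"  # domain name from this setup; adjust as needed

def unpause_and_exit(signum, frame):
    # vmi-win-offsets leaves the Xen pause counter >0 when interrupted,
    # so undo the pause before terminating the wrapper.
    subprocess.run(["xl", "unpause", DOMAIN])
    sys.exit(1)

# Install the handler for the usual user-induced terminations.
for sig in (signal.SIGINT, signal.SIGTERM):
    signal.signal(sig, unpause_and_exit)
```

A proper fix would of course call `xc_domain_unpause` from the tool's own signal handler; the wrapper only mitigates the symptom.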
After a long and tedious debugging session, it seems that GCP won't be able to run drakvuf. LibVMI errors out with:

```
VMI_ERROR: xen_start_single_step error: no system support for event type
Failed to register singlestep for vCPU 0
VMI_ERROR: xc_altp2m_switch_to_view returned rc: -1
```
Attaching full debug log: log.txt
Check whether drakvuf-sandbox is feasible on GCP on at least one of the supported systems. Info needed: whether some custom hacks/adjustments are required, and whether we could document them or implement some improvements dedicated to GCP.