KVM-VMI / kvm-vmi

KVM-based Virtual Machine Introspection
https://kvm-vmi.github.io/kvm-vmi/master/

Missing SYSRETs #13

Open shlomopongratz opened 7 years ago

shlomopongratz commented 7 years ago

Hi,

I was counting the number of SYSCALLs and SYSRETs, and I see that from time to time a SYSRET is missing, so I assume the depth of the stack increases. Is it possible that the kernel module sometimes fails to capture a SYSRET, or that it is ignored for some reason?

Wenzel commented 7 years ago

Hi @shlomopongratz,

How did you make the counts?

Yes, if an exit syscall is missing, the corresponding stack will keep growing on the next syscall entering the kernel. https://github.com/KVM-VMI/nitro/blob/master/nitro/backend.py#L49
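
To illustrate the behavior: this is only a simplified sketch of the idea, not the actual backend.py code, and names such as syscall_stack and on_syscall_event are made up.

# Simplified sketch, not the real Nitro backend: one stack of pending
# syscalls per process; a missed SYSRET leaves its entry on the stack.
from collections import defaultdict

syscall_stack = defaultdict(list)

def on_syscall_event(process_key, direction, syscall_name):
    if direction == "enter":
        # pushed here, expected to be popped by the matching exit
        syscall_stack[process_key].append(syscall_name)
    elif syscall_stack[process_key]:
        syscall_stack[process_key].pop()
    # if an exit is never reported, the stack depth keeps growing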

Is it possible that the kernel module sometimes fails to capture a SYSRET, or that it is ignored for some reason?

I don't see why such a thing would happen: the kernel is configured to report every possible event, and waits for userland to confirm that the event has been processed before continuing execution.

Let's investigate this issue; I want to make sure the kernel code is behaving as it should.

shlomopongratz commented 7 years ago

I just added counters in the if statement "if event.direction == SyscallDirection.exit:" and in the corresponding "else:" branch, and printed the balance.
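
For reference, a rough sketch of that counting; only SyscallDirection comes from Nitro (the import path is assumed), the other names here are illustrative.

# Rough sketch of the counting described above; `counters` is illustrative.
from nitro.event import SyscallDirection  # import path assumed; adjust to the repo

counters = {"enter": 0, "exit": 0}

def count_direction(event):
    if event.direction == SyscallDirection.exit:
        counters["exit"] += 1
    else:
        counters["enter"] += 1
    # the balance should stay near zero if every SYSCALL gets its SYSRET
    print(counters["enter"] - counters["exit"])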

shlomopongratz commented 7 years ago

Hi, from reading the code in arch/x86/kvm/emulate.c::{em_syscall, em_sysret} I learned that in em_syscall the event is marked present at the beginning of the routine, regardless of whether the routine returns before the end, whereas in em_sysret it is done at the end, and the routine may return before the event can be marked present. So either we are missing a SYSRET or we have an extra SYSCALL. S.P.

Wenzel commented 7 years ago

I monitored the enter and exit direction of the syscalls in backend.py:process_event, and here are my results after some time:

{
    "exit": 132515,
    "enter": 132629
}

So we have more enters than exits; I think you are right @shlomopongratz!

If we take a look at emulate.c:em_sysret:

static int em_sysret(struct x86_emulate_ctxt *ctxt)
{
    const struct x86_emulate_ops *ops = ctxt->ops;
    struct desc_struct cs, ss;
    u64 msr_data, rcx;
    u16 cs_sel, ss_sel;
    u64 efer = 0;
    struct kvm_vcpu *vcpu = container_of(ctxt, struct kvm_vcpu, arch.emulate_ctxt);
    /* syscall is not available in real mode */
    if (ctxt->mode == X86EMUL_MODE_REAL ||
        ctxt->mode == X86EMUL_MODE_VM86)
        return emulate_ud(ctxt);
    if (!(em_syscall_is_enabled(ctxt)))
        return emulate_ud(ctxt);

    if(ctxt->ops->cpl(ctxt) != 0){
        return emulate_gp(ctxt,0);
    }

    //check if RCX is in canonical form
    rcx = reg_read(ctxt, VCPU_REGS_RCX);
    if(( (rcx & 0xFFFF800000000000) != 0xFFFF800000000000) &&
      ( (rcx & 0x00007FFFFFFFFFFF) != rcx)){
        return emulate_gp(ctxt,0);
    }
    ops->get_msr(ctxt, MSR_EFER, &efer);
    setup_syscalls_segments(ctxt, &cs, &ss);
    if (!(efer & EFER_SCE) && !nitro_is_trap_set(vcpu->kvm, NITRO_TRAP_SYSCALL))
        return emulate_ud(ctxt);
    ops->get_msr(ctxt, MSR_STAR, &msr_data);
    msr_data >>= 48;

    //setup code segment, atleast what is left to do.  
    //setup_syscalls_segments does most of the work for us
    if (ctxt->mode == X86EMUL_MODE_PROT64){ //if longmode
      cs_sel = (u16)((msr_data + 0x10) | 0x3);
      cs.l = 1;
      cs.d = 0;
    }
    else{
      cs_sel = (u16)(msr_data | 0x3);
      cs.l = 0;
      cs.d = 1;
    }
    cs.dpl = 0x3;

    //setup stack segment, atleast what is left to do.  
    //setup_syscalls_segments does most of the work for us
    ss_sel = (u16)((msr_data + 0x8) | 0x3);
    ss.dpl = 0x3;

    ops->set_segment(ctxt, cs_sel, &cs, 0, VCPU_SREG_CS);
    ops->set_segment(ctxt, ss_sel, &ss, 0, VCPU_SREG_SS);

    ctxt->eflags = (reg_read(ctxt, VCPU_REGS_R11) & 0x3c7fd7) | 0x2;
    ctxt->_eip = reg_read(ctxt, VCPU_REGS_RCX);

    if(nitro_is_trap_set(vcpu->kvm, NITRO_TRAP_SYSCALL)){
        vcpu->nitro.event.present = true;
        vcpu->nitro.event.type = SYSCALL;
        vcpu->nitro.event.direction = EXIT;
        kvm_arch_vcpu_ioctl_get_regs(vcpu, &(vcpu->nitro.event.regs));
        kvm_arch_vcpu_ioctl_get_sregs(vcpu, &(vcpu->nitro.event.sregs));
    }
    return X86EMUL_CONTINUE;
}

The emulation has five places where it can fail and return before the event is marked as present.

Compared to em_syscall:

static int em_syscall(struct x86_emulate_ctxt *ctxt)
{
    const struct x86_emulate_ops *ops = ctxt->ops;
    struct desc_struct cs, ss;
    u64 msr_data;
    u16 cs_sel, ss_sel;
    u64 efer = 0;
    struct kvm_vcpu *vcpu = container_of(ctxt, struct kvm_vcpu, arch.emulate_ctxt);
    if(nitro_is_trap_set(vcpu->kvm, NITRO_TRAP_SYSCALL)){
        vcpu->nitro.event.present = true;
        vcpu->nitro.event.type = SYSCALL;
        vcpu->nitro.event.direction = ENTER;
        kvm_arch_vcpu_ioctl_get_regs(vcpu, &(vcpu->nitro.event.regs));
        kvm_arch_vcpu_ioctl_get_sregs(vcpu, &(vcpu->nitro.event.sregs));
    }
    /* syscall is not available in real mode */
    if (ctxt->mode == X86EMUL_MODE_REAL ||
        ctxt->mode == X86EMUL_MODE_VM86)
        return emulate_ud(ctxt);
    if (!(em_syscall_is_enabled(ctxt)))
        return emulate_ud(ctxt);
    ops->get_msr(ctxt, MSR_EFER, &efer);
    setup_syscalls_segments(ctxt, &cs, &ss);
    if (!(efer & EFER_SCE) && !nitro_is_trap_set(vcpu->kvm, NITRO_TRAP_SYSCALL))
        return emulate_ud(ctxt);

    ops->get_msr(ctxt, MSR_STAR, &msr_data);
    msr_data >>= 32;
    cs_sel = (u16)(msr_data & 0xfffc);
    ss_sel = (u16)(msr_data + 8);
    if (efer & EFER_LMA) {
        cs.d = 0;
        cs.l = 1;
    }
    ops->set_segment(ctxt, cs_sel, &cs, 0, VCPU_SREG_CS);
    ops->set_segment(ctxt, ss_sel, &ss, 0, VCPU_SREG_SS);
    *reg_write(ctxt, VCPU_REGS_RCX) = ctxt->_eip;
    if (efer & EFER_LMA) {
#ifdef CONFIG_X86_64
        *reg_write(ctxt, VCPU_REGS_R11) = ctxt->eflags;
        ops->get_msr(ctxt,
                 ctxt->mode == X86EMUL_MODE_PROT64 ?
                 MSR_LSTAR : MSR_CSTAR, &msr_data);
        ctxt->_eip = msr_data;
        ops->get_msr(ctxt, MSR_SYSCALL_MASK, &msr_data);
        ctxt->eflags &= ~msr_data;
        ctxt->eflags |= X86_EFLAGS_FIXED;
#endif
    } else {
        /* legacy mode */
        ops->get_msr(ctxt, MSR_STAR, &msr_data);
        ctxt->_eip = (u32)msr_data;
        ctxt->eflags &= ~(X86_EFLAGS_VM | X86_EFLAGS_IF);
    }

    return X86EMUL_CONTINUE;
}

I think the fix here is to move the code that configures the event from the beginning to the end of em_syscall (and em_sysenter too).

I will try this solution and post the results!

Wenzel commented 7 years ago

Actually it's the reverse: we should move the code that configures the event in em_sysret and em_sysexit to the beginning of the function, so that we don't miss any syscall.

shlomopongratz commented 7 years ago

But if you put the code at the beginning of sysret, then it runs before set_segment is called. Maybe you should just mark the event present at the end of syscall (but capture the registers at the beginning, before they are changed). That way we record only syscalls whose emulation succeeded.

Wenzel commented 7 years ago

So, if we want to monitor all syscall attempts (good or bad), we have to place our event at the beginning of the function, and if we want only the successful ones, place it at the end like you said.

I implemented your solution on a new branch. here is the commit: https://github.com/KVM-VMI/kvm/commit/d5fe8f53b448235d2dd25b9d5e037cd7c87d0390

The results are better, but we are still missing some exits!

{
    "exit": 64267,
    "enter": 64314
}

shlomopongratz commented 7 years ago

So it seems that for some successful syscalls there are failed sysrets. I just wonder whether we can trust the values stored in the registers if we record at the beginning of sysret. Maybe we should record all events and add a completion flag that will be set only at the end, so we know whether the register values can be trusted.
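
To make that concrete, here is a rough userland-side sketch; the `complete` flag and the helper functions are purely hypothetical, nothing like this exists in Nitro today.

# Hypothetical sketch: assumes the kernel captured the registers early and
# sets a `complete` flag only when the emulation ran to the end.
def handle_trusted(regs, sregs):
    pass  # hypothetical: full processing with trusted register values

def handle_untrusted(event):
    pass  # hypothetical: count the event but skip register-dependent logic

def process_event(event):
    if getattr(event, "complete", False):
        # emulation finished: the register snapshot can be trusted
        handle_trusted(event.regs, event.sregs)
    else:
        # emulation bailed out early: keep the count, distrust the registers
        handle_untrusted(event)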

shlomopongratz commented 7 years ago

The same trust concern applies to syscall as well.

Wenzel commented 7 years ago

If I place the event at the beginning:

{
    "enter": 64289,
    "exit": 64198
}

Almost the same results

shlomopongratz commented 7 years ago

This is odd. I wonder whether em_syscall and em_sysret are called the same number of times. The OS must return to user mode before issuing a new system call, so maybe the emulation is sometimes not invoked at all; perhaps we are bailing out in x86_decode_insn or even before that.

shlomopongratz commented 7 years ago

Hi, I added printk to em_syscall and em_sysret and found out that em_syscall is called more times than em_sysret. This imbalance can cause two issues: one is syscall stack overflow, and the other is wrong matching between calls and returns; suppose one wants to process OUT syscall parameters only on return, then he/she may process the wrong parameters.

I think maybe some of the syscalls are not really syscalls. I used the value in the GS register, which holds the pointer to the Thread Information Block on 64-bit guests (FS on 32-bit guests), and tried to read it with vmi_read_addr_va, and sometimes it failed, which I think may imply that it wasn't a real syscall.

S.P.

shlomopongratz commented 7 years ago

Hi,

I just found out that all the missing SYSRETs belong to calls to NtContinue! That is, I see a SYSCALL with a syscall number that matches NtContinue without a corresponding return. Can someone explain this to me?

S.P.

Wenzel commented 7 years ago

Hi,

I think I found why.

After a bit of googling, I saw this comment: http://www.abfl.org.in/Aug-08/interrupts-return-quarry-iretq/

However, NtContinue uses iretq to return to usermode, not sysret.

We have to change our logic and stop systematically associating every syscall with a sysret.

shlomopongratz commented 7 years ago

Hi, funny, but the link took me to the web page of "GME mining equipment production base".

I think there is another issue regarding the association of syscalls and sysrets. Currently Nitro uses a stack to find the syscall number for a sysret. This doesn't take scheduling into account: suppose one thread calls some blocking API, e.g. NtWaitForSingleObject; the scheduler may then let another thread run, which also calls a blocking API, and then the first thread resumes execution and the first syscall returns. What we need is a map from the combination of process and thread to the system call number: MAP{CR3 x {fs|gs}} -> syscall.
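
A minimal sketch of that mapping (all names here are hypothetical, not the current Nitro API):

# Hypothetical sketch of MAP{CR3 x {fs|gs}} -> syscall: pending syscalls are
# keyed by (CR3, thread base) so that interleaved threads of the same process
# cannot be mismatched when the scheduler switches between them.
pending = {}

def on_enter(cr3, thread_base, syscall_num):
    pending[(cr3, thread_base)] = syscall_num

def on_exit(cr3, thread_base):
    # None means no matching SYSCALL was seen (e.g. a return via iretq)
    return pending.pop((cr3, thread_base), None)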

Best regards

S.P.