NRE crash on Dell Optiplex 9010

fbq commented 11 years ago

Hi all, I am trying to install NRE on my Dell Optiplex 9010 desktop, and meet a problem. I used the official branch and compiled the cross-compiler by myself, because I am using archlinux(the details about my platform will be given later). I compiled NRE with NRE_TARGET=x86_64 and NRE_BUILD=release. I used grub2 as my bootloader and gpt as partition table, and I generated grub.cfg using nre/tools/novaboot and nre/boot/linux. When I boot NRE, I got this:

NOVA Microhypervisor v5-c565aec (x86_64): Dec 12 2012 21:10:16 [gcc 4.7.2]

[ 0] CORE:0:0:0 6:3a:9:1 [12] Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz [ 5] CORE:0:1:1 6:3a:9:1 [12] Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz [ 7] CORE:0:3:1 6:3a:9:1 [12] Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz [ 1] CORE:0:1:0 6:3a:9:1 [12] Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz [ 3] CORE:0:3:0 6:3a:9:1 [12] Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz [ 6] CORE:0:2:1 6:3a:9:1 [12] Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz [ 2] CORE:0:2:0 6:3a:9:1 [12] Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz [ 4] CORE:0:0:1 6:3a:9:1 [12] Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz [ 0] Killed EC:0xffffffff82382100 SC:0xffffffff823d1740 V:0xe CS:0x2b EIP:0x7fff fffff00 CR2:0x7ffffffff000 ERR:0x14 (EXC)

I found the last line was printed by function handle_exc in kernel/nova/src/ec_exc.cpp. Then I changed to NRE_BUILD=debug, I got the same error plus output like follows: [ 1] VMCS:0x0179500 REV:0x10 EPT:1 URG:1 VNM:1 VPID:1

Since I am not quite familiar with NOVA and NRE, so I don't know which part introduce this. BTW, I have tried NUL on this computer and it worked well. Could you fix this or tell me the reasons and what to do? Thanks very much.

Some details about my platform:

Computer: Dell Optiplex 9010
CPU: Intel(R) Core(TM) i7-3770
RAM: 4G
OS: Archlinux x86_64
Compiler: GCC 4.7.2 (I think this only affects the hypervisor)

Thanks

Feng, Boqun

udosteinberg commented 11 years ago

[ 0] Killed EC:0xffffffff82382100 SC:0xffffffff823d1740 V:0xe CS:0x2b EIP:0x7ffffffff00 CR2:0x7ffffffff000 ERR:0x14 (EXC)

This says the microhypervisor has killed an execution context due to an unhandled page fault at address 0x7ffffffff000, which was caused by an instruction fetch while in user mode. So the problem lies somewhere in NRE. Can you figure out what code is (supposed to be) located at that address? Then @Nils-TUD can probably tell you what's going on.

Also, are you sure you copied the message correctly? EIP and CR2 should be equal and in your case the EIP seems to miss a trailing 0.

Nils-TUD commented 11 years ago

There isn't any code at that location or even near it. So, NRE seems to erroneously jump to that location for some reason. The address 0x7ffffffff000 is the end of the user-part of the address-space and the page behind the Hip. But I don't know if that helps..

If you wouldn't test it on real hardware, it should be possible to hunt the error down by stepping through the startup of the root task. But I guess in qemu or so, it works for you, right?

One thing that might help is logging the syscalls in NOVA. You can do that in kernel/nova/include/stdio.h by enabling TRACE_SYSCALL. This way, we can see what syscalls are done before, so that we can perhaps roughly say where this strange jumps happens.

Edit: I think what happens is that somewhere in NRE a pagefault occurs, then NRE's pagefault handler kicks in and prints the address, backtrace and so on. But we can't see that because during the startup-phase we might not yet be able to print to serial line. Afterwards I set EIP to 0x7ffffffff000 to let the kernel kill the Ec. I should try to further minimize the stuff that happens without being able to print to serial line... But as a quicker way of finding the error: You could comment out the lines 91,92 in nre/services/root/root.cc: new Pt(ec, ec->event_base() + CapSelSpace::EV_PAGEFAULT, portal_pagefault, Mtd(Mtd::RSP | Mtd::GPR_BSD | Mtd::RIP_LEN | Mtd::QUAL)); This way NRE doesn't intercept the pagefault and we should see the actual pagefault location when NOVA kills the Ec.

udosteinberg commented 11 years ago

I've just tried and I'm seeing the same behavior on a similar Ivy Bridge machine.

NOVA Microhypervisor v5-6259305 (x86_64): Nov 4 2012 17:30:58 [gcc 4.8.0]

[ 0] CORE:0:0:0 6:3a:9:1 [16] Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz [ 1] CORE:0:1:0 6:3a:9:1 [16] Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz [ 3] CORE:0:3:0 6:3a:9:1 [16] Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz [ 2] CORE:0:2:0 6:3a:9:1 [16] Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz [0] Welcome to NRE rev b0a808d-dirty built with gcc 4.6.3 [0] Hip checksum is valid [0] SEL: 67108864, EXC: 32, VMI: 256, GSI: 128 [0] CPU runs @ 3392 Mhz, bus @ 99 Mhz [0] Root: Pagefault for 0000:0002:0000:0020 @ 0000:0000:0001:348a on cpu 0, error=0x6 [0] Backtrace: [0] ffff:ffff:ffff:fffb [ 0] Killed EC:0xffffffff823ce880 SC:0xffffffff823d03c0 V:0xe CS:0x2b EIP:0x7ffffffff000 CR2:0x7ffffffff000 ERR:0x14 (EXC)

Nils-TUD commented 11 years ago

Ok, what code is at 0x1348a in your root-task?

udosteinberg commented 11 years ago

With a debug build the addresses obviously change a bit, so here goes...

[0] Welcome to NRE rev b0a808d built with gcc 4.6.3 [0] Hip checksum is valid [0] SEL: 67108864, EXC: 32, VMI: 256, GSI: 128 [0] CPU runs @ 3393 Mhz, bus @ 99 Mhz [0] Root: Pagefault for 0000:0002:0000:0020 @ 0000:0000:0000:d261 on cpu 0, error=0x6 [0] Backtrace: [0] ffff:ffff:ffff:fffb [ 0] Killed EC:0xffffffff823c3880 SC:0xffffffff823c53c0 V:0xe CS:0x2b EIP:0x7ffffffff000 CR2:0x7ffffffff000 ERR:0x14 (EXC)

000000000000d144 <PhysicalMemory::MemRegion::operator new(unsigned long)>:
    d144:       55                      push   %rbp
    d145:       48 89 e5                mov    %rsp,%rbp
    d148:       48 83 ec 40             sub    $0x40,%rsp
    d14c:       48 89 7d c8             mov    %rdi,-0x38(%rbp)
    d150:       48 b8 20 84 05 00 00    movabs $0x58420,%rax
    d157:       00 00 00 
    d15a:       0f b6 00                movzbl (%rax),%eax
    d15d:       83 f0 01                xor    $0x1,%eax
    d160:       84 c0                   test   %al,%al
    d162:       0f 84 8a 00 00 00       je     d1f2 <PhysicalMemory::MemRegion::operator new(unsigned long)+0xae>
    d168:       48 c7 45 f8 00 00 00    movq   $0x0,-0x8(%rbp)
    d16f:       00 
    d170:       eb 67                   jmp    d1d9 <PhysicalMemory::MemRegion::operator new(unsigned long)+0x95>
    d172:       48 b8 60 83 05 00 00    movabs $0x58360,%rax
    d179:       00 00 00 
    d17c:       48 8b 08                mov    (%rax),%rcx
    d17f:       48 be 80 83 05 00 00    movabs $0x58380,%rsi
    d186:       00 00 00 
    d189:       48 8b 55 f8             mov    -0x8(%rbp),%rdx
    d18d:       48 89 d0                mov    %rdx,%rax
    d190:       48 c1 e0 02             shl    $0x2,%rax
    d194:       48 01 d0                add    %rdx,%rax
    d197:       48 c1 e0 03             shl    $0x3,%rax
    d19b:       48 01 f0                add    %rsi,%rax
    d19e:       48 83 c0 20             add    $0x20,%rax
    d1a2:       48 89 08                mov    %rcx,(%rax)
    d1a5:       48 8b 55 f8             mov    -0x8(%rbp),%rdx
    d1a9:       48 89 d0                mov    %rdx,%rax
    d1ac:       48 c1 e0 02             shl    $0x2,%rax
    d1b0:       48 01 d0                add    %rdx,%rax
    d1b3:       48 c1 e0 03             shl    $0x3,%rax
    d1b7:       48 89 c2                mov    %rax,%rdx
    d1ba:       48 b8 80 83 05 00 00    movabs $0x58380,%rax
    d1c1:       00 00 00 
    d1c4:       48 01 c2                add    %rax,%rdx
    d1c7:       48 b8 60 83 05 00 00    movabs $0x58360,%rax
    d1ce:       00 00 00 
    d1d1:       48 89 10                mov    %rdx,(%rax)
    d1d4:       48 83 45 f8 01          addq   $0x1,-0x8(%rbp)
    d1d9:       48 83 7d f8 03          cmpq   $0x3,-0x8(%rbp)
    d1de:       0f 96 c0                setbe  %al
    d1e1:       84 c0                   test   %al,%al
    d1e3:       75 8d                   jne    d172 <PhysicalMemory::MemRegion::operator new(unsigned long)+0x2e>
    d1e5:       48 b8 20 84 05 00 00    movabs $0x58420,%rax
    d1ec:       00 00 00 
    d1ef:       c6 00 01                movb   $0x1,(%rax)
    d1f2:       48 b8 60 83 05 00 00    movabs $0x58360,%rax
    d1f9:       00 00 00 
    d1fc:       48 8b 00                mov    (%rax),%rax
    d1ff:       48 85 c0                test   %rax,%rax
    d202:       0f 85 84 00 00 00       jne    d28c <PhysicalMemory::MemRegion::operator new(unsigned long)+0x148>
    d208:       be 00 10 00 00          mov    $0x1000,%esi
    d20d:       48 bf 40 84 05 00 00    movabs $0x58440,%rdi
    d214:       00 00 00 
    d217:       48 b8 1e ec 00 00 00    movabs $0xec1e,%rax
    d21e:       00 00 00 
    d221:       ff d0                   callq  *%rax
    d223:       48 89 45 e0             mov    %rax,-0x20(%rbp)
    d227:       48 8b 45 e0             mov    -0x20(%rbp),%rax
    d22b:       48 89 c7                mov    %rax,%rdi
    d22e:       48 b8 e6 ef 00 00 00    movabs $0xefe6,%rax
    d235:       00 00 00 
    d238:       ff d0                   callq  *%rax
    d23a:       48 89 45 d8             mov    %rax,-0x28(%rbp)
    d23e:       48 8b 45 d8             mov    -0x28(%rbp),%rax
    d242:       48 89 45 f0             mov    %rax,-0x10(%rbp)
    d246:       48 c7 45 e8 00 00 00    movq   $0x0,-0x18(%rbp)
    d24d:       00 
    d24e:       eb 30                   jmp    d280 <PhysicalMemory::MemRegion::operator new(unsigned long)+0x13c>
    d250:       48 b8 60 83 05 00 00    movabs $0x58360,%rax
    d257:       00 00 00 
    d25a:       48 8b 10                mov    (%rax),%rdx
    d25d:       48 8b 45 f0             mov    -0x10(%rbp),%rax
    d261:       48 89 50 20             mov    %rdx,0x20(%rax)
    d265:       48 b8 60 83 05 00 00    movabs $0x58360,%rax
    d26c:       00 00 00 
    d26f:       48 8b 55 f0             mov    -0x10(%rbp),%rdx
    d273:       48 89 10                mov    %rdx,(%rax)
    d276:       48 83 45 f0 28          addq   $0x28,-0x10(%rbp)
    d27b:       48 83 45 e8 01          addq   $0x1,-0x18(%rbp)
    d280:       48 83 7d e8 65          cmpq   $0x65,-0x18(%rbp)
    d285:       0f 96 c0                setbe  %al
    ...

fbq commented 11 years ago

@udosteinberg, thanks for your explanation, and the EIP value of my output is 0x7ffffffff000, sorry for that.

fbq commented 11 years ago

@Nils-TUD, after I commented out two lines in nre/services/root/root.cc, I got this:

[ 0] Killed EC:0xffffffff82396100 SC:0xffffffff823c5740 V:0xe CS:0x2b EIP:0xd261 CR2:0x200000020 ERR:0x6 (PT not found)

udosteinberg commented 11 years ago

Yeah, that nicely matches up with the backtrace I got.

udosteinberg commented 11 years ago

Edit: I think what happens is that somewhere in NRE a pagefault occurs, then NRE's pagefault handler kicks in and prints the address, backtrace and so on. But we can't see that because during the startup-phase we might not yet be able to print to serial line. Afterwards I set EIP to 0x7ffffffff000 to let the kernel kill the Ec.

@Nils-TUD - how about setting the kill address to something more magical, like 0xdeaddead or 0xdeadc0de and likewise for the 64bit version. That makes it totally obvious what's going on.

Nils-TUD commented 11 years ago

Could you please try https://github.com/Nils-TUD/NRE/commit/b776b82ff4dcd78992c26bf48cd61a0ed70ae761 ? This should fix the problem.

@udosteinberg Well, I could, but I don't see a real advantage in it. As long as there is no error message, you still don't know what the actual problem is. Which should actually rarely happen, because the startup-phase isn't that long and complicated.

udosteinberg commented 11 years ago

@Nils-TUD - The patch fixes that problem for me, but it still doesn't boot successfully...

It hangs after...

[1] ACPI::IRQ_TO_GSI 1 -> 1 [1] ACPI::IRQ_TO_GSI 12 -> 12

... due to legacy USB emulation.

Nils-TUD commented 11 years ago

hm, it would be interesting to know where it hangs... But legacy USB emulation doesn't yet work on my test machine either. Although it doesn't hang but just doesn't give me keyboard-interrupts, IIRC. I have to take a closer look on that when I have some time.

udosteinberg commented 11 years ago

Legacy USB aside, I'm also seeing pcicfg die. Any idea what's going on?

[0] Starting child 'nre/pcicfg'... [0] Child[cmdline='nre/pcicfg' cpu=0 entry=0000:0003:0030:0098]: [0] Scs: [0] GSIs: Bitfield[0000000000000000 0000000000000000] [0] Ports: [0] Dataspaces: [0] 0000:0003:0030:0000 .. 0000:0003:0032:f000 (0x000000000002f000 bytes) or-x <- 0000:0002:002f:8000 [0] 0000:0003:0032:f000 .. 0000:0003:0034:3000 (0x0000000000014000 bytes) orw- <- 0000:0002:0032:7000 [0] 0000:0003:0034:4000 .. 0000:0003:0034:5000 (0x0000000000001000 bytes) ---- <- 0000:0000:0000:0000 [0] 0000:0003:0034:6000 .. 0000:0003:0034:8000 (0x0000000000002000 bytes) orw- <- 0000:0000:0063:a000 [0] 0000:0003:0034:9000 .. 0000:0003:0034:a000 (0x0000000000001000 bytes) or-- <- 0000:0002:0033:d000 [0] 0000:0003:0034:b000 .. 0000:0003:0034:c000 (0x0000000000001000 bytes) or-- <- 0000:0002:0033:e000 [0] [0] [1] ACPI::FIND_TABLE 'MCFG' #0 -> found [4] Invalid arguments: Received invalid translation

This seems to indicate a problem. And then it blows up...

[0] Child 'nre/pcicfg': Unresolvable pagefault for 0000:0000:0000:0000 @ 0000:0003:0030:626e on cpu 3, error=0x4 [0] Child 'nre/pcicfg': caused exception 14 @ 0000:0003:0030:626e on cpu 3 [0] Dataspaces: [0] 0000:0003:0030:0000 .. 0000:0003:0032:f000 (0x000000000002f000 bytes) or-x <- 0000:0002:002f:8000 [0] 0000:0003:0032:f000 .. 0000:0003:0034:3000 (0x0000000000014000 bytes) orw- <- 0000:0002:0032:7000 [0] 0000:0003:0034:4000 .. 0000:0003:0034:5000 (0x0000000000001000 bytes) ---- <- 0000:0000:0000:0000 [0] 0000:0003:0034:6000 .. 0000:0003:0034:8000 (0x0000000000002000 bytes) orw- <- 0000:0000:0063:a000 [0] 0000:0003:0034:9000 .. 0000:0003:0034:a000 (0x0000000000001000 bytes) or-- <- 0000:0002:0033:d000 [0] 0000:0003:0034:b000 .. 0000:0003:0034:c000 (0x0000000000001000 bytes) or-- <- 0000:0002:0033:e000 [0] 0000:0003:0034:d000 .. 0000:0003:0036:d000 (0x0000000000020000 bytes) orw- <- 0000:0002:0033:f000 [0] 0000:0003:0036:e000 .. 0000:0003:0037:0000 (0x0000000000002000 bytes) orw- <- 0000:0002:0035:f000 [0] 0000:0003:0037:1000 .. 0000:0003:0037:2000 (0x0000000000001000 bytes) -rw- <- 0000:0000:0000:0000 [0] 0000:0003:0037:4000 .. 0000:0003:0037:6000 (0x0000000000002000 bytes) orw- <- 0000:0002:0036:1000 [0] 0000:0003:0037:7000 .. 0000:0003:0037:8000 (0x0000000000001000 bytes) -rw- <- 0000:0000:0000:0000 [0] 0000:0003:0037:a000 .. 0000:0003:0037:c000 (0x0000000000002000 bytes) orw- <- 0000:0002:0036:3000 [0] 0000:0003:0037:d000 .. 0000:0003:0037:e000 (0x0000000000001000 bytes) -rw- <- 0000:0000:0000:0000 [0] 0000:0003:0038:0000 .. 0000:0003:0038:2000 (0x0000000000002000 bytes) orw- <- 0000:0002:0036:5000 [0] 0000:0003:0038:3000 .. 0000:0003:0038:4000 (0x0000000000001000 bytes) -rw- <- 0000:0000:0000:0000 [0] 0000:0003:0038:6000 .. 0000:0003:0038:8000 (0x0000000000002000 bytes) orw- <- 0000:0002:0036:7000 [0] 0000:0003:0038:9000 .. 0000:0003:0038:a000 (0x0000000000001000 bytes) -rw- <- 0000:0000:0000:0000 [0] 0000:0003:0038:c000 .. 0000:0003:0038:e000 (0x0000000000002000 bytes) orw- <- 0000:0002:0036:9000 [0] 0000:0003:0038:f000 .. 0000:0003:0039:0000 (0x0000000000001000 bytes) -rw- <- 0000:0000:0000:0000 [0] 0000:0003:0039:2000 .. 0000:0003:0039:4000 (0x0000000000002000 bytes) orw- <- 0000:0002:0036:b000 [0] 0000:0003:0039:5000 .. 0000:0003:0039:6000 (0x0000000000001000 bytes) -rw- <- 0000:0000:0000:0000 [0] 0000:0003:0039:8000 .. 0000:0003:0039:a000 (0x0000000000002000 bytes) orw- <- 0000:0002:0036:d000 [0] 0000:0003:0039:b000 .. 0000:0003:0039:c000 (0x0000000000001000 bytes) -rw- <- 0000:0000:0000:0000 [0] Unable to resolve fault; killing child [0] Backtrace: [0] 0000:0000:0003:7e2f [ 3] Killed EC:0xffffffff810c2880 SC:0xffffffff822044a0 V:0xe CS:0x2b EIP:0x7ffffffff000 CR2:0x7ffffffff000 ERR:0x14 (EXC) [7] Systemcall failed: Communication Abort(2) [7] Backtrace: [7] 0000:0003:0051:64e2 [7] 0000:0003:0050:1268 [7] 0000:0003:0050:149b [7] 0000:0003:0050:12ac [7] 0000:0003:0050:14fd [7] 0000:0003:0050:24fe [7] 0000:0003:0051:f5ee [7] 0000:0003:0051:e575 [7] 0000:0003:0051:eabc [7] 0000:0003:0050:044b [7] 0000:0003:0051:42a8 [7] 0000:0003:0051:0bb3 [7] 0000:0003:0050:0129 [0] Child 'nre/storage': Pd terminated with exit code 1 on cpu 3 [0] Backtrace: [0] 0000:0003:0051:7008 [0] 0000:0003:0051:735c [0] 0000:0003:0052:b541 [0] 0000:0003:0050:12ac [0] 0000:0003:0050:14fd [0] 0000:0003:0050:24fe [0] 0000:0003:0051:f5ee [0] 0000:0003:0051:e575 [0] 0000:0003:0051:eabc [0] 0000:0003:0050:044b [0] 0000:0003:0051:42a8 [0] 0000:0003:0051:0bb3 [0] 0000:0003:0050:0129 [ 3] Killed EC:0xffffffff8224b6a0 SC:0xffffffff822044a0 V:0xe CS:0x2b EIP:0x7ffffffff000 CR2:0x7ffffffff000 ERR:0x14 (EXC)

fbq commented 11 years ago

@Nils-TUD the branch of your tree fix the problem, but I still hang at the MUNICH start kernel stage, and for the offical branch, I tested both debug and release version, the result is only debug version fix it, and in release version, I got this:

[ 0] Killed EC:0xffffffff82382100 SC:0xffffffff823d2740 V:0xe CS:0x2b EIP:0x1f25 CR2:0x1f20 ERR:0x7 (PT not found) (I didn't comment 91,92 lines in root.cc)

Edit: Since I am living in UTC+8 time zone, so I have to go to sleep and I not able to test your modification, thanks a lot for them.

Edit: @Nils-TUD, your patch fix the initial problem, thanks :) And the other EC killed problem I said in this comment is because something wrong in my kernel/nova/Makefile, I found it and it doesn't happen again.

Nils-TUD commented 11 years ago

@udosteinberg From the log I would say that the construction of HostMMConfig in pcicfg fails, so that the object doesn't exist afterwards, which in turn causes the null-dereference error. The constructor seems to fail because of a wrong translation, but I'm not sure which translation and why. Could you enable EXCEPTIONS in include/Logging.h? Then you can see every exception that is thrown directly where that happens. I think it happens not in pcicfg but in root or so. In the log you'll see backtraces. You can add symbols, line numbers etc. by starting "./b trace=$app", where $app is e.g. "root", and simply pasting the backtrace in there. Btw, is it reproducible?

@fbq I'm not sure if I understood you correctly. The initial problem is fixed in the debug and release version, right? And you tried the official branch again? But this can't really work because you would still have the initial problem.

Apart from that: Are you using the unmodified boot/linux boot-script? In any case I need more information. Better post the whole log-file and I need to know to what code the addresses map for you.

fbq commented 11 years ago

@Nils-TUD I am using unmodified boot/linux boot-script, and generate grub2 menuentry with tools/novaboot, here is my nre grub2 menuentry: menuentry nre { set root='hd0,gpt2' multiboot /nova/bin/apps/hypervisor spinner serial module /nova/bin/apps/root bin/apps/root module /nova/bin/apps/acpi bin/apps/acpi provides=acpi module /nova/bin/apps/keyboard bin/apps/keyboard provides=keyboard scset1 module /nova/bin/apps/reboot bin/apps/reboot provides=reboot module /nova/bin/apps/pcicfg bin/apps/pcicfg provides=pcicfg module /nova/bin/apps/timer bin/apps/timer provides=timer module /nova/bin/apps/console bin/apps/console provides=console module /nova/bin/apps/sysinfo bin/apps/sysinfo module /nova/bin/apps/storage bin/apps/storage provides=storage module /nova/bin/apps/vancouver bin/apps/vancouver mods=following lastmod m:128 ncpu:1 PC_PS2 ahci:0xe0800000,14,0x30 drive:0,1,2 module /nova/bin/apps/guest_munich bin/apps/guest_munich module /nova/dist/imgs/bzImage-3.1.0-32 dist/imgs/bzImage-3.1.0-32 clocksource=tsc console=ttyS0 noapic module /nova/dist/imgs/initrd-js.lzma dist/imgs/initrd-js.lzma }

About the log file, I run NRE in a real machine, and I didn't find any log file. How can I enable it?

Nils-TUD commented 11 years ago

NRE prints to serial line. So you have to connect a serial cable to your machine to get the log.

udosteinberg commented 11 years ago

This is the debug output...

[0] Starting child 'nre/pcicfg'... [0] Child[cmdline='nre/pcicfg' cpu=0 entry=0000:0003:0030:0098]: [0] Scs: [0] GSIs: Bitfield[0000000000000000 0000000000000000] [0] Ports: [0] Dataspaces: [0] 0000:0003:0030:0000 .. 0000:0003:0032:f000 (0x000000000002f000 bytes) or-x <- 0000:0002:002f:8000 [0] 0000:0003:0032:f000 .. 0000:0003:0034:3000 (0x0000000000014000 bytes) orw- <- 0000:0002:0032:7000 [0] 0000:0003:0034:4000 .. 0000:0003:0034:5000 (0x0000000000001000 bytes) ---- <- 0000:0000:0000:0000 [0] 0000:0003:0034:6000 .. 0000:0003:0034:8000 (0x0000000000002000 bytes) orw- <- 0000:0000:0064:4000 [0] 0000:0003:0034:9000 .. 0000:0003:0034:a000 (0x0000000000001000 bytes) or-- <- 0000:0002:0033:d000 [0] 0000:0003:0034:b000 .. 0000:0003:0034:c000 (0x0000000000001000 bytes) or-- <- 0000:0002:0033:e000 [0] [0] [1] ACPI::FIND_TABLE 'MCFG' #0 -> found [0] Exception: Invalid arguments (18): Received invalid translation [0] Backtrace: [0] 0000:0000:0001:9522 [0] 0000:0000:0000:3686 [0] 0000:0000:0000:41f6 [0] 0000:0000:0002:67cd [0] 0000:0000:0002:7c22 [0] 0000:0000:0003:7ea3 [4] Exception: Failure (19): [4] Backtrace: [4] 0000:0003:0030:a4d2 [4] 0000:0003:0030:2210 [4] 0000:0003:0030:fd5c [4] 0000:0003:0030:2d9f [4] 0000:0003:0030:3ce2 [4] 0000:0003:0030:04f6 [4] 0000:0003:0030:65d3 [4] 0000:0003:0030:0129 [4] Invalid arguments: Received invalid translation

And the backtraces are decoded as follows:

 0x19522 nre::Exception::Exception(nre::ErrorCode, nre::String const&)(T) +   0x9e = 0x19484 in /libs/libstdc++/Exception.cc:25
 0x3686 nre::UtcbException::UtcbException(nre::ErrorCode, nre::String const&)(W) +   0x2a = 0x365c in /include/utcb/UtcbFrame.h:46
 0x41f6 nre::UtcbFrameRef::get_translated(unsigned int)(W) +  0x13c = 0x40ba in /include/utcb/UtcbFrame.h:393
 0x267cd nre::ChildManager::map(nre::UtcbFrameRef&, nre::Child*, nre::DataSpace::RequestType)(T) +   0x7f = 0x2674e in /libs/libstdc++/subsystem/ChildManager.cc:664
 0x27c22 nre::ChildManager::Portals::dataspace(unsigned int)(T) +   0xf4 = 0x27b2e in /libs/libstdc++/subsystem/ChildManager.cc:846
 0x37ea3                ec_landing_spot(T) +    0x3 = 0x37ea0 in ??:0

 0x30030a4d2 nre::Exception::Exception(nre::ErrorCode, nre::String const&)(T) +   0x9e = 0x30030a434 in /libs/libstdc++/Exception.cc:25
 0x300302210 nre::UtcbFrameRef::check_reply()(W) +   0x6a = 0x3003021a6 in /include/utcb/UtcbFrame.h:253
 0x30030fd5c         nre::DataSpace::join()(T) +  0x188 = 0x30030fbd4 in /libs/libstdc++/mem/DataSpace.cc:78
 0x300302d9f nre::DataSpace::DataSpace(unsigned int)(W) +   0x45 = 0x300302d5a in /include/mem/DataSpace.h:95
 0x300303ce2 nre::ACPISession::find_table(nre::String const&, unsigned int) const(W) +  0x162 = 0x300303b80 in /include/services/ACPI.h:91
 0x3003004f6   HostMMConfig::HostMMConfig()(T) +   0xe4 = 0x300300412 in /services/pcicfg/HostMMConfig.cc:29
 0x3003065d3                           main(T) +   0x5b = 0x300306578 in /services/pcicfg/pcicfg.cc:122
 0x300300129                         _start(T) +   0x91 = 0x300300098 in ??:0

This is totally reproducible.

Nils-TUD commented 11 years ago

Ok, thanks. hm...are you using a newer NOVA revision than the one that NRE uses by default? Because atm I can't imagine that this error is hardware dependent or similar (I don't see it) and since I'm still using the trick with the 2 Sm permission bits to delegate a dataspace with restricted permissions, the error would make perfectly sense. Because ACPI passes the dataspace for a ACPI table out with only read permissions, which would mean that both bits are zero. So, if your change to only set the defined bits for each Kernel object is present, the translation would of course fail since all permission bits are zero.

udosteinberg commented 11 years ago

Yes, I'm using my internal development version with some patches on top of it. And that is already doing the strict capability permission thing, so I guess that explains why NRE won't work with it.

TUD-OS / NRE

NRE crash on Dell Optiplex 9010 #26