brho / akaros

Akaros Operating System
http://akaros.cs.berkeley.edu/

assertion failed: page && pm_slot_check_refcnt(*page->pg_tree_slot) #42

Closed dvyukov closed 7 years ago

dvyukov commented 7 years ago

I am getting the following crashes. Is it a known issue? If not, and you don't see right away why it happens, I can try to create a reproducer. The checkout is at 6344ed04e307ba30df879d1d407b10a1b3236784.

bash-4.3$ Unhandled user trap in vcore context from VC 1
HW TRAP frame (partial) at 0xffffffffc82cbd20 on core 1
  rax  0x0000100000011743
  rbx  0x000030000005ced0
  rcx  0x0000000000000001
  rdx  0x0000100000011740
  rbp  0x000030000005ceb0
  rsi  0x0000100000008820
  rdi  0x0000100000008820
  r8   0x0000000000000000
  r9   0x0000000000000000
  r10  0x000030000005ced0
  r11  0x0000000000000200
  r12  0x0000000000000001
  r13  0x0000000000000001
  r14  0x0000000000409720
  r15  0x0000000000000000
  trap 0x0000000d General Protection
  gsbs 0x0000000000000000
  fsbs 0x0000000000000000
  err  0x--------00000000
  rip  0x00000000004005f0
  cs   0x------------0023
  flag 0x0000000000010286
  rsp  0x000030000005ce98
  ss   0x------------001b
err 0x0 (for PFs: User 4, Wr 2, Rd 1), aux 0x0000000000000000
Addr 0x00000000004005f0 is in syz-executor at offset 0x00000000000005f0
VM Regions for proc 540
NR:                                     Range:       Prot,      Flags,               File,                Off
00: (0x0000000000400000 - 0x00000000004b2000): 0x00000005, 0x00000001, 0xffff800101103840, 0x0000000000000000
01: (0x00000000004b2000 - 0x00000000004b3000): 0x00000005, 0x00000002, 0xffff800101103840, 0x00000000000b2000
02: (0x00000000006b3000 - 0x00000000006b6000): 0x00000003, 0x00000002, 0xffff800101103840, 0x00000000000b3000
03: (0x00000000006b6000 - 0x0000000000925000): 0x00000003, 0x00000002, 0x0000000000000000, 0x0000000000000000
04: (0x0000100000000000 - 0x0000100000024000): 0x00000007, 0x00000022, 0x0000000000000000, 0x0000000000000000
05: (0x0000300000000000 - 0x0000300000001000): 0x00000003, 0x00000002, 0xffff800101103840, 0x0000000000000000
06: (0x0000300000001000 - 0x0000300000005000): 0x00000003, 0x00000022, 0x0000000000000000, 0x0000000000000000
07: (0x0000300000005000 - 0x0000300000007000): 0x00000007, 0x00000022, 0x0000000000000000, 0x0000000000000000
08: (0x0000300000007000 - 0x0000300000031000): 0x00000003, 0x00000022, 0x0000000000000000, 0x0000000000000000
09: (0x0000300000031000 - 0x000030000005d000): 0x00000007, 0x00000022, 0x0000000000000000, 0x0000000000000000
10: (0x00007f7fff8ff000 - 0x00007f7fff9ff000): 0x00000003, 0x00000022, 0x0000000000000000, 0x0000000000000000

Backtrace of user context on Core 1:
    Offsets only matter for shared libraries
#01 Addr 0x00000000004005f0 is in syz-executor at offset 0x00000000000005f0
#02 Addr 0x0000000000410394 is in syz-executor at offset 0x0000000000010394
#03 Addr 0x000000006b3a3000 has no VMR
Unhandled user trap in vcore context from VC 0
HW TRAP frame (partial) at 0xffffffffc82cc720 on core 5
  rax  0x0000100000005d03
  rbx  0x00007f7fff9feb80
  rcx  0x0000000000000001
  rdx  0x0000100000005d00
  rbp  0x00007f7fff9feb60
  rsi  0x00001000000046c0
  rdi  0x00001000000046c0
  r8   0x0000000000000000
  r9   0x0000000000000000
  r10  0x00007f7fff9feb80
  r11  0x0000000000000200
  r12  0x0000000000000001
  r13  0x0000000000000000
  r14  0x0000000000409520
  r15  0x0000000000000000
  trap 0x0000000d General Protection
  gsbs 0x0000000000000000
  fsbs 0x0000000000000000
  err  0x--------00000000
  rip  0x00000000004005f0
  cs   0x------------0023
  flag 0x0000000000010206
  rsp  0x00007f7fff9feb48
  ss   0x------------001b
err 0x0 (for PFs: User 4, Wr 2, Rd 1), aux 0x0000000000000000
Addr 0x00000000004005f0 is in syz-executor at offset 0x00000000000005f0
VM Regions for proc 540
NR:                                     Range:       Prot,      Flags,               File,                Off
00: (0x0000000000400000 - 0x00000000004b2000): 0x00000005, 0x00000001, 0xffff800101103840, 0x0000000000000000
01: (0x00000000004b2000 - 0x00000000004b3000): 0x00000005, 0x00000002, 0xffff800101103840, 0x00000000000b2000
02: (0x00000000006b3000 - 0x00000000006b6000): 0x00000003, 0x00000002, 0xffff800101103840, 0x00000000000b3000
03: (0x00000000006b6000 - 0x0000000000925000): 0x00000003, 0x00000002, 0x0000000000000000, 0x0000000000000000
04: (0x0000100000000000 - 0x0000100000024000): 0x00000007, 0x00000022, 0x0000000000000000, 0x0000000000000000
05: (0x0000300000000000 - 0x0000300000001000): 0x00000003, 0x00000002, 0xffff800101103840, 0x0000000000000000
06: (0x0000300000001000 - 0x0000300000005000): 0x00000003, 0x00000022, 0x0000000000000000, 0x0000000000000000
07: (0x0000300000005000 - 0x0000300000007000): 0x00000007, 0x00000022, 0x0000000000000000, 0x0000000000000000
08: (0x0000300000007000 - 0x0000300000031000): 0x00000003, 0x00000022, 0x0000000000000000, 0x0000000000000000
09: (0x0000300000031000 - 0x000030000005d000): 0x00000007, 0x00000022, 0x0000000000000000, 0x0000000000000000
10: (0x00007f7fff8ff000 - 0x00007f7fff9ff000): 0x00000003, 0x00000022, 0x0000000000000000, 0x0000000000000000

Backtrace of user context on Core 5:
    Offsets only matter for shared libraries
#01 Addr 0x00000000004005f0 is in syz-executor at offset 0x00000000000005f0
#02 Addr 0x0000000000410394 is in syz-executor at offset 0x0000000000010394
#03 Addr 0x00009214b0000000 has no VMR
Unhandled user trap in vcore context from VC 0
HW TRAP frame (partial) at 0xffffffffc82cbaa0 on core 0
  rax  0x0000100000005df0
  rbx  0x00007f7fff9feaf0
  rcx  0x00000000004368ee
  rdx  0x0000100000005d00
  rbp  0x00007f7fff9fead0
  rsi  0x00001000000046c0
  rdi  0x00001000000046c0
  r8   0x0000000000000000
  r9   0x0000000000000000
  r10  0x00007f7fff9feaf0
  r11  0x0000000000000200
  r12  0x0000000000000001
  r13  0x0000000000000000
  r14  0x0000000000415400
  r15  0x0000000000000000
  trap 0x0000000d General Protection
  gsbs 0x0000000000000000
  fsbs 0x0000000000000000
  err  0x--------00000000
  rip  0x00000000004005f0
  cs   0x------------0023
  flag 0x0000000000010283
  rsp  0x00007f7fff9feab8
  ss   0x------------001b
err 0x0 (for PFs: User 4, Wr 2, Rd 1), aux 0x0000000000000000
Addr 0x00000000004005f0 is in syz-executor at offset 0x00000000000005f0
VM Regions for proc 506
NR:                                     Range:       Prot,      Flags,               File,                Off
00: (0x0000000000400000 - 0x00000000004b2000): 0x00000005, 0x00000001, 0xffff800101103840, 0x0000000000000000
01: (0x00000000004b2000 - 0x00000000004b3000): 0x00000005, 0x00000002, 0xffff800101103840, 0x00000000000b2000
02: (0x00000000006b3000 - 0x00000000006b6000): 0x00000003, 0x00000002, 0xffff800101103840, 0x00000000000b3000
03: (0x00000000006b6000 - 0x0000000000925000): 0x00000003, 0x00000002, 0x0000000000000000, 0x0000000000000000
04: (0x0000100000000000 - 0x0000100000024000): 0x00000007, 0x00000022, 0x0000000000000000, 0x0000000000000000
05: (0x0000300000000000 - 0x0000300000001000): 0x00000003, 0x00000002, 0xffff800101103840, 0x0000000000000000
06: (0x0000300000001000 - 0x0000300000005000): 0x00000003, 0x00000022, 0x0000000000000000, 0x0000000000000000
07: (0x0000300000005000 - 0x0000300000007000): 0x00000007, 0x00000022, 0x0000000000000000, 0x0000000000000000
08: (0x0000300000007000 - 0x0000300000019000): 0x00000003, 0x00000022, 0x0000000000000000, 0x0000000000000000
09: (0x00007f7fff8ff000 - 0x00007f7fff9ff000): 0x00000003, 0x00000022, 0x0000000000000000, 0x0000000000000000

Backtrace of user context on Core 0:
    Offsets only matter for shared libraries
#01 Addr 0x00000000004005f0 is in syz-executor at offset 0x00000000000005f0
#02 Addr 0x0000000000410394 is in syz-executor at offset 0x0000000000010394
#03 Addr 0x000000000c00007f has no VMR
kernel panic at kern/src/pagemap.c:222, from core 0: assertion failed: page && pm_slot_check_refcnt(*page->pg_tree_slot)
Entering Nanwan's Dungeon on Core 0 (Ints on):
Type 'help' for a list of commands.
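
When reading dumps like the ones above, it can help to check mechanically which VM region an address falls in. The following is a minimal sketch, not part of the original report. It assumes the line format shown in the logs, and it assumes that the Prot column uses the conventional bit values (read 1, write 2, exec 4), which this thread does not confirm. The helper names `parse_vmrs`, `find_vmr`, and `prot_str` are hypothetical.

```python
import re

# Matches lines such as:
# 00: (0x0000000000400000 - 0x00000000004b2000): 0x00000005, 0x00000001, ...
VMR_RE = re.compile(
    r"^\s*(\d+): \((0x[0-9a-f]+) - (0x[0-9a-f]+)\): (0x[0-9a-f]+),"
)

def parse_vmrs(text):
    """Return a list of (index, start, end, prot) tuples from a VMR dump."""
    regions = []
    for line in text.splitlines():
        m = VMR_RE.match(line)
        if m:
            regions.append((int(m.group(1)), int(m.group(2), 16),
                            int(m.group(3), 16), int(m.group(4), 16)))
    return regions

def find_vmr(regions, addr):
    """Return the region containing addr (end is exclusive), or None."""
    for region in regions:
        _, start, end, _ = region
        if start <= addr < end:
            return region
    return None

def prot_str(prot):
    # Assumption: bit 0 = read, bit 1 = write, bit 2 = exec.
    return "".join(c if prot & bit else "-"
                   for c, bit in (("r", 1), ("w", 2), ("x", 4)))

# Three lines copied from the first dump above.
DUMP = """\
00: (0x0000000000400000 - 0x00000000004b2000): 0x00000005, 0x00000001, 0xffff800101103840, 0x0000000000000000
04: (0x0000100000000000 - 0x0000100000024000): 0x00000007, 0x00000022, 0x0000000000000000, 0x0000000000000000
10: (0x00007f7fff8ff000 - 0x00007f7fff9ff000): 0x00000003, 0x00000022, 0x0000000000000000, 0x0000000000000000
"""

regions = parse_vmrs(DUMP)
rip_region = find_vmr(regions, 0x4005f0)       # faulting RIP from the trap frame
print(rip_region[0], prot_str(rip_region[3]))  # 0 r-x (under the Prot assumption)
print(find_vmr(regions, 0x6b3a3000))           # None, matching "has no VMR"
```

Under these assumptions the faulting RIP lies in region 00, and the third backtrace address lies in no region, which agrees with the "has no VMR" line in the log.
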

akaros-notifier commented 7 years ago

Hi -

On 2017-10-16 at 10:09 Dmitry Vyukov wrote:

> I am getting the following crashes. Is it a known issue? If not, and you don't see right away why it happens, I can try to create a reproducer.

I don't recognize this one. It looks like there are a couple issues.

I'd need to see a little of the ASM for the GP faults to know why userspace is faulting.

The kernel panic is also interesting. My guess is a refcounting problem. I might be able to figure it out from a backtrace ('bt' from the monitor) and the value of *page->pg_tree_slot. But a reproducer might be needed.

dvyukov commented 7 years ago

Here is the backtrace:

Stack Backtrace on Core 0:
#01 [<0xffffffffc2016024>] in mon_backtrace
#02 [<0xffffffffc2017127>] in monitor
#03 [<0xffffffffc200cbea>] in _panic
#04 [<0xffffffffc2044c9b>] in pm_load_page
#05 [<0xffffffffc205bba1>] in generic_file_read
#06 [<0xffffffffc2006d75>] in is_valid_elf
#07 [<0xffffffffc205602d>] in sys_exec
#08 [<0xffffffffc2056499>] in syscall
#09 [<0xffffffffc2056654>] in run_local_syscall
#10 [<0xffffffffc20a231a>] in sysenter_callwrapper

There is no source/line info in obj/kern/akaros-kernel, so I can't map this to lines.

dvyukov commented 7 years ago

I failed to create a C reproducer. If I am reading this correctly, sys_exec is the exec system call. The fuzzer itself does not call exec, so I wonder what does. This probably explains why I can't create a standalone repro. Can you tell from the crash message which process caused the panic?

bash-4.3$ 
bash-4.3$ Unhandled user trap in vcore context from VC 5
HW TRAP frame (partial) at 0xffffffffc82cc720 on core 5
  rax  0x0000100000011740
  rbx  0x000030000005cec0
  rcx  0x0000000000000001
  rdx  0x0000100000011740
  rbp  0x000030000005cea0
  rsi  0x000010000000f720
  rdi  0x000010000000f720
  r8   0x0000000000000000
  r9   0x0000000000000000
  r10  0x000030000005cec0
  r11  0x0000000000000200
  r12  0x0000000000000001
  r13  0x0000000000000005
  r14  0x00000000004095d0
  r15  0x0000000000000000
  trap 0x0000000e Page Fault
  gsbs 0x0000000000000000
  fsbs 0x0000000000000000
  err  0x--------00000006
  rip  0x0000000000400fff
  cs   0x------------0023
  flag 0x0000000000010a86
  rsp  0x000030000005ce88
  ss   0x------------001b
err 0x6 (for PFs: User 4, Wr 2, Rd 1), aux 0x00002fffeb89ce88
Addr 0x0000000000400fff is in syz-executor at offset 0x0000000000000fff
VM Regions for proc 44
NR:                                     Range:       Prot,      Flags,               File,                Off
00: (0x0000000000400000 - 0x00000000004b2000): 0x00000005, 0x00000001, 0xffff800100c86420, 0x0000000000000000
01: (0x00000000004b2000 - 0x00000000004b3000): 0x00000005, 0x00000002, 0xffff800100c86420, 0x00000000000b2000
02: (0x00000000006b3000 - 0x00000000006b6000): 0x00000003, 0x00000002, 0xffff800100c86420, 0x00000000000b3000
03: (0x00000000006b6000 - 0x0000000000925000): 0x00000003, 0x00000002, 0x0000000000000000, 0x0000000000000000
04: (0x0000100000000000 - 0x0000100000024000): 0x00000007, 0x00000022, 0x0000000000000000, 0x0000000000000000
05: (0x0000300000000000 - 0x0000300000001000): 0x00000003, 0x00000002, 0xffff800100c86420, 0x0000000000000000
06: (0x0000300000001000 - 0x0000300000005000): 0x00000003, 0x00000022, 0x0000000000000000, 0x0000000000000000
07: (0x0000300000005000 - 0x0000300000007000): 0x00000007, 0x00000022, 0x0000000000000000, 0x0000000000000000
08: (0x0000300000007000 - 0x0000300000031000): 0x00000003, 0x00000022, 0x0000000000000000, 0x0000000000000000
09: (0x0000300000031000 - 0x000030000005d000): 0x00000007, 0x00000022, 0x0000000000000000, 0x0000000000000000
10: (0x00007f7fff8ff000 - 0x00007f7fff9ff000): 0x00000003, 0x00000022, 0x0000000000000000, 0x0000000000000000

Backtrace of user context on Core 5:
    Offsets only matter for shared libraries
#01 Addr 0x0000000000400fff is in syz-executor at offset 0x0000000000000fff
#02 Addr 0x0000000000410444 is in syz-executor at offset 0x0000000000010444
#03 Addr 0x000000000040900c is in syz-executor at offset 0x000000000000900c
#04 Addr 0x0000000000415709 is in syz-executor at offset 0x0000000000015709
#05 Addr 0x0000000000401756 is in syz-executor at offset 0x0000000000001756
#06 Addr 0x0000100000011740's VMR has no file
Unhandled user trap in vcore context from VC 3
HW TRAP frame (partial) at 0xffffffffc82cc4a0 on core 4
  rax  0x0000100000005d00
  rbx  0x00007f7fff9feb80
  rcx  0x0000000000000002
  rdx  0x0000100000005d00
  rbp  0x00007f7fff9feb60
  rsi  0x000010000000bfa0
  rdi  0x000010000000bfa0
  r8   0x0000000000000000
  r9   0x0000000000000000
  r10  0x00007f7fff9feb80
  r11  0x0000000000000200
  r12  0x0000000000000001
  r13  0x0000000000000003
  r14  0x00000000004097d0
  r15  0x0000015f265bbbf3
  trap 0x0000000e Page Fault
  gsbs 0x0000000000000000
  fsbs 0x0000000000000000
  err  0x--------00000006
  rip  0x0000000000400fff
  cs   0x------------0023
  flag 0x0000000000010202
  rsp  0x00007f7fff9feb48
  ss   0x------------001b
err 0x6 (for PFs: User 4, Wr 2, Rd 1), aux 0x00007f7feb23eb48
Addr 0x0000000000400fff is in syz-executor at offset 0x0000000000000fff
VM Regions for proc 44
NR:                                     Range:       Prot,      Flags,               File,                Off
00: (0x0000000000400000 - 0x00000000004b2000): 0x00000005, 0x00000001, 0xffff800100c86420, 0x0000000000000000
01: (0x00000000004b2000 - 0x00000000004b3000): 0x00000005, 0x00000002, 0xffff800100c86420, 0x00000000000b2000
02: (0x00000000006b3000 - 0x00000000006b6000): 0x00000003, 0x00000002, 0xffff800100c86420, 0x00000000000b3000
03: (0x00000000006b6000 - 0x0000000000925000): 0x00000003, 0x00000002, 0x0000000000000000, 0x0000000000000000
04: (0x0000100000000000 - 0x0000100000024000): 0x00000007, 0x00000022, 0x0000000000000000, 0x0000000000000000
05: (0x0000300000000000 - 0x0000300000001000): 0x00000003, 0x00000002, 0xffff800100c86420, 0x0000000000000000
06: (0x0000300000001000 - 0x0000300000005000): 0x00000003, 0x00000022, 0x0000000000000000, 0x0000000000000000
07: (0x0000300000005000 - 0x0000300000007000): 0x00000007, 0x00000022, 0x0000000000000000, 0x0000000000000000
08: (0x0000300000007000 - 0x0000300000031000): 0x00000003, 0x00000022, 0x0000000000000000, 0x0000000000000000
09: (0x0000300000031000 - 0x000030000005d000): 0x00000007, 0x00000022, 0x0000000000000000, 0x0000000000000000
10: (0x00007f7fff8ff000 - 0x00007f7fff9ff000): 0x00000003, 0x00000022, 0x0000000000000000, 0x0000000000000000

Backtrace of user context on Core 4:
    Offsets only matter for shared libraries
#01 Addr 0x0000000000400fff is in syz-executor at offset 0x0000000000000fff
#02 Addr 0x0000000000410444 is in syz-executor at offset 0x0000000000010444
#03 Addr 0x0000000000409182 is in syz-executor at offset 0x0000000000009182
#04 Addr 0x0000000000415a01 is in syz-executor at offset 0x0000000000015a01
#05 Addr 0x0000000000401846 is in syz-executor at offset 0x0000000000001846
Unhandled user trap in vcore context from VC 0
HW TRAP frame (partial) at 0xffffffffc82cbaa0 on core 0
  rax  0x0000100000005d00
  rbx  0x00007f7fff9feaf0
  rcx  0x000000000043699e
  rdx  0x0000100000005d00
  rbp  0x00007f7fff9fead0
  rsi  0x00001000000046c0
  rdi  0x00001000000046c0
  r8   0x0000000000000000
  r9   0x0000000000000000
  r10  0x00007f7fff9feaf0
  r11  0x0000000000000200
  r12  0x0000000000000001
  r13  0x0000000000000000
  r14  0x00000000004154b0
  r15  0x0000000000000000
  trap 0x0000000e Page Fault
  gsbs 0x0000000000000000
  fsbs 0x0000000000000000
  err  0x--------00000006
  rip  0x0000000000400fff
  cs   0x------------0023
  flag 0x0000000000010202
  rsp  0x00007f7fff9feab8
  ss   0x------------001b
err 0x6 (for PFs: User 4, Wr 2, Rd 1), aux 0x00007f7feb23eab8
Addr 0x0000000000400fff is in syz-executor at offset 0x0000000000000fff
VM Regions for proc 30
NR:                                     Range:       Prot,      Flags,               File,                Off
00: (0x0000000000400000 - 0x00000000004b2000): 0x00000005, 0x00000001, 0xffff800100c86420, 0x0000000000000000
01: (0x00000000004b2000 - 0x00000000004b3000): 0x00000005, 0x00000002, 0xffff800100c86420, 0x00000000000b2000
02: (0x00000000006b3000 - 0x00000000006b6000): 0x00000003, 0x00000002, 0xffff800100c86420, 0x00000000000b3000
03: (0x00000000006b6000 - 0x0000000000925000): 0x00000003, 0x00000002, 0x0000000000000000, 0x0000000000000000
04: (0x0000100000000000 - 0x0000100000024000): 0x00000007, 0x00000022, 0x0000000000000000, 0x0000000000000000
05: (0x0000300000000000 - 0x0000300000001000): 0x00000003, 0x00000002, 0xffff800100c86420, 0x0000000000000000
06: (0x0000300000001000 - 0x0000300000005000): 0x00000003, 0x00000022, 0x0000000000000000, 0x0000000000000000
07: (0x0000300000005000 - 0x0000300000007000): 0x00000007, 0x00000022, 0x0000000000000000, 0x0000000000000000
08: (0x0000300000007000 - 0x0000300000019000): 0x00000003, 0x00000022, 0x0000000000000000, 0x0000000000000000
09: (0x00007f7fff8ff000 - 0x00007f7fff9ff000): 0x00000003, 0x00000022, 0x0000000000000000, 0x0000000000000000

Backtrace of user context on Core 0:
    Offsets only matter for shared libraries
#01 Addr 0x0000000000400fff is in syz-executor at offset 0x0000000000000fff
#02 Addr 0x0000000000410444 is in syz-executor at offset 0x0000000000010444
#03 Addr 0x00000000004369b9 is in syz-executor at offset 0x00000000000369b9
#04 Addr 0x0000000000435ea6 is in syz-executor at offset 0x0000000000035ea6
#05 Addr 0x00000000004019c9 is in syz-executor at offset 0x00000000000019c9
#06 Addr 0x0000000000000000 has no VMR
kernel panic at kern/src/pagemap.c:222, from core 0: assertion failed: page && pm_slot_check_refcnt(*page->pg_tree_slot)
Entering Nanwan's Dungeon on Core 0 (Ints on):
Type 'help' for a list of commands.
ROS(Core 0)> 
ROS(Core 0)> 
ROS(Core 0)> bt
Stack Backtrace on Core 0:
#01 [<0xffffffffc2016024>] in mon_backtrace
#02 [<0xffffffffc2017127>] in monitor
#03 [<0xffffffffc200cbea>] in _panic
#04 [<0xffffffffc2044c9b>] in pm_load_page
#05 [<0xffffffffc205bba1>] in generic_file_read
#06 [<0xffffffffc2006d75>] in is_valid_elf
#07 [<0xffffffffc205602d>] in sys_exec
#08 [<0xffffffffc2056499>] in syscall
#09 [<0xffffffffc2056654>] in run_local_syscall
#10 [<0xffffffffc20a231a>] in sysenter_callwrapper
ROS(Core 0)> ps
     PID Name                 State      Parent    
-------------------------------------------------
      13 /bin/ipconfig        WAITING         0
      47 sh                   RUNNING_S      46
      22 dropbear             WAITING         1
      45 dropbear             WAITING        22
      46 sh                   WAITING        45
      23 bash                 WAITING         1
       1 bash                 WAITING         0
      19 /bin/cs              WAITING         0
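
The trap dumps print a legend for page-fault error codes ("for PFs: User 4, Wr 2, Rd 1"). A minimal sketch of applying that legend to the `err` field follows. The bit names are taken from the log itself, and `decode_pf_err` is a hypothetical helper name. Note that on x86 bit 0 of the page-fault error code is architecturally the "present" bit; the label "Rd" below simply follows the legend in the dump.

```python
# Bit values taken from the legend printed in the trap dump:
# "for PFs: User 4, Wr 2, Rd 1".
PF_FLAGS = (("User", 0x4), ("Wr", 0x2), ("Rd", 0x1))

def decode_pf_err(err):
    """Return the names of the legend bits that are set in err."""
    return [name for name, bit in PF_FLAGS if err & bit]

print(decode_pf_err(0x6))  # ['User', 'Wr']
```

Read this way, `err 0x6` in the three page-fault frames above describes a user-mode write with bit 0 clear.
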

dvyukov commented 7 years ago

This can be reproduced by running the whole fuzzer, though. Repro steps, assuming a linux/amd64 host: download the Go 1.9.1 toolchain from https://golang.org/dl/, unpack it to a local directory, and set GOROOT to that directory.

$ go get github.com/google/syzkaller
$ cd ~/go/src/github.com/google/syzkaller
$ git checkout 8793f74c6cb46d87b53758c6d99705b8018ceeba # current HEAD
$ make stress
$ make executor TARGETOS=akaros SOURCEDIR=/bootstrapped/akaros/checkout
# assuming you have ssh with a key setup
$ scp -P 5555 -i akaros_id_rsa -o IdentitiesOnly=yes ./bin/akaros_amd64/syz-executor root@localhost:/
$ bin/linux_amd64/syz-stress -os=akaros -arch=amd64 -timeout=5s -executor "/usr/bin/ssh -p 5555 -i akaros_id_rsa -o IdentitiesOnly=yes root@localhost /syz-executor"

This crashes the kernel in under 10 seconds for me.

akaros-notifier commented 7 years ago

On 2017-10-16 at 18:09 Dmitry Vyukov wrote:

> I failed to create a C reproducer. If I am reading this correctly, sys_exec is the exec system call. The fuzzer itself does not call exec, so I wonder what does. This probably explains why I can't create a standalone repro. Can you tell from the crash message which process caused the panic?

It looks like sh, since it was the process running at the time. Do you have a bash script of some sort running to drive syz-executor?

As far as the backtrace goes, you can use:

addr2line -e obj/kern/akaros-kernel-64

Though in this case, it won't help much - I can see the codepath regardless of line numbers. It looks like we're just failing to read a file in generic_file_read().
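
As an illustration of the addr2line suggestion above, here is a minimal sketch that pulls the addresses out of a monitor `bt` dump and builds the corresponding command line. It only constructs and prints the command, so it runs without a kernel image. The helper names are hypothetical; `-e` and `-f` are standard GNU addr2line options, and the image path is the one quoted above.

```python
import re
import shlex

# Matches lines such as: #04 [<0xffffffffc2044c9b>] in pm_load_page
BT_RE = re.compile(r"#\d+ \[<(0x[0-9a-f]+)>\] in (\w+)")

def backtrace_addrs(text):
    """Return [(address_string, symbol)] from a monitor 'bt' dump."""
    return BT_RE.findall(text)

def addr2line_argv(kernel_image, frames):
    # -e selects the binary, -f also prints function names.
    return ["addr2line", "-f", "-e", kernel_image] + [a for a, _ in frames]

# Three frames copied from the backtrace earlier in the thread.
BT = """\
#03 [<0xffffffffc200cbea>] in _panic
#04 [<0xffffffffc2044c9b>] in pm_load_page
#05 [<0xffffffffc205bba1>] in generic_file_read
"""

frames = backtrace_addrs(BT)
argv = addr2line_argv("obj/kern/akaros-kernel-64", frames)
print(shlex.join(argv))
```

To resolve the lines for real, the same `argv` could be passed to `subprocess.run(argv, check=True)` on a machine that has the kernel image with debug info.
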

akaros-notifier commented 7 years ago

On 2017-10-16 at 18:17 Dmitry Vyukov wrote:

> This can be reproduced by running the whole fuzzer, though.

Thanks, I'll take a look.

dvyukov commented 7 years ago

What's strange is that all crashes mention RIP=0x0000000000400fff, but that address does not point to the start of any instruction in the binary. It falls inside init_cacheinfo, which is not my function; it is called from somewhere in pthread. It is also the last byte of the first page of the text section (which is being paged in for the first time?). Maybe this rings a bell for you.

Disassembly of section .text:
0000000000400d90 <init_cacheinfo>:
  400d90:   55                      push   %rbp
  400d91:   48 89 e5                mov    %rsp,%rbp
  400d94:   41 56                   push   %r14
  400d96:   41 55                   push   %r13
  400d98:   41 54                   push   %r12
  400d9a:   45 31 e4                xor    %r12d,%r12d
  400d9d:   53                      push   %rbx
  400d9e:   44 89 e0                mov    %r12d,%eax
  400da1:   0f a2                   cpuid  
  400da3:   81 f9 6e 74 65 6c       cmp    $0x6c65746e,%ecx
  400da9:   41 89 c4                mov    %eax,%r12d
  400dac:   40 0f 94 c6             sete   %sil
  400db0:   81 fb 47 65 6e 75       cmp    $0x756e6547,%ebx
  400db6:   0f 94 c0                sete   %al
  400db9:   40 84 c6                test   %al,%sil
  400dbc:   74 08                   je     400dc6 <init_cacheinfo+0x36>
  400dbe:   81 fa 69 6e 65 49       cmp    $0x49656e69,%edx
  400dc4:   74 2b                   je     400df1 <init_cacheinfo+0x61>
  400dc6:   81 f9 63 41 4d 44       cmp    $0x444d4163,%ecx
  400dcc:   0f 94 c1                sete   %cl
  400dcf:   81 fb 41 75 74 68       cmp    $0x68747541,%ebx
  400dd5:   0f 94 c0                sete   %al
  400dd8:   84 c1                   test   %al,%cl
  400dda:   74 0c                   je     400de8 <init_cacheinfo+0x58>
  400ddc:   81 fa 65 6e 74 69       cmp    $0x69746e65,%edx
  400de2:   0f 84 0e 01 00 00       je     400ef6 <init_cacheinfo+0x166>
  400de8:   5b                      pop    %rbx
  400de9:   41 5c                   pop    %r12
  400deb:   41 5d                   pop    %r13
  400ded:   41 5e                   pop    %r14
  400def:   5d                      pop    %rbp
  400df0:   c3                      retq   
  400df1:   44 89 e6                mov    %r12d,%esi
  400df4:   bf bc 00 00 00          mov    $0xbc,%edi
  400df9:   e8 d2 39 03 00          callq  4347d0 <handle_intel>
  400dfe:   bf c2 00 00 00          mov    $0xc2,%edi
  400e03:   44 89 e6                mov    %r12d,%esi
  400e06:   49 89 c6                mov    %rax,%r14
  400e09:   e8 c2 39 03 00          callq  4347d0 <handle_intel>
  400e0e:   48 85 c0                test   %rax,%rax
  400e11:   49 89 c5                mov    %rax,%r13
  400e14:   bf 03 00 00 00          mov    $0x3,%edi
  400e19:   0f 8e f7 01 00 00       jle    401016 <init_cacheinfo+0x286>
  400e1f:   b8 01 00 00 00          mov    $0x1,%eax
  400e24:   0f a2                   cpuid  
  400e26:   81 e1 00 02 00 00       and    $0x200,%ecx
  400e2c:   89 de                   mov    %ebx,%esi
  400e2e:   83 f9 01                cmp    $0x1,%ecx
  400e31:   19 c0                   sbb    %eax,%eax
  400e33:   83 c0 03                add    $0x3,%eax
  400e36:   41 83 fc 03             cmp    $0x3,%r12d
  400e3a:   89 05 c8 34 52 00       mov    %eax,0x5234c8(%rip)        # 924308 <__x86_preferred_memory_instruction>
  400e40:   7e 2a                   jle    400e6c <init_cacheinfo+0xdc>
  400e42:   31 c9                   xor    %ecx,%ecx
  400e44:   41 b8 04 00 00 00       mov    $0x4,%r8d
  400e4a:   eb 13                   jmp    400e5f <init_cacheinfo+0xcf>
  400e4c:   89 c2                   mov    %eax,%edx
  400e4e:   44 89 c9                mov    %r9d,%ecx
  400e51:   c1 ea 05                shr    $0x5,%edx
  400e54:   83 e2 07                and    $0x7,%edx
  400e57:   39 fa                   cmp    %edi,%edx
  400e59:   0f 84 5e 01 00 00       je     400fbd <init_cacheinfo+0x22d>
  400e5f:   44 8d 49 01             lea    0x1(%rcx),%r9d
  400e63:   44 89 c0                mov    %r8d,%eax
  400e66:   0f a2                   cpuid  
  400e68:   a8 1f                   test   $0x1f,%al
  400e6a:   75 e0                   jne    400e4c <init_cacheinfo+0xbc>
  400e6c:   c1 ee 10                shr    $0x10,%esi
  400e6f:   40 0f b6 f6             movzbl %sil,%esi
  400e73:   85 f6                   test   %esi,%esi
  400e75:   74 10                   je     400e87 <init_cacheinfo+0xf7>
  400e77:   4d 85 ed                test   %r13,%r13
  400e7a:   7e 0b                   jle    400e87 <init_cacheinfo+0xf7>
  400e7c:   4c 89 e8                mov    %r13,%rax
  400e7f:   48 99                   cqto   
  400e81:   48 f7 fe                idiv   %rsi
  400e84:   49 89 c5                mov    %rax,%r13
  400e87:   4d 85 f6                test   %r14,%r14
  400e8a:   7e 2c                   jle    400eb8 <init_cacheinfo+0x128>
  400e8c:   4c 89 f0                mov    %r14,%rax
  400e8f:   4c 89 35 9a 41 2b 00    mov    %r14,0x2b419a(%rip)        # 6b5030 <__x86_raw_data_cache_size>
  400e96:   41 80 e6 00             and    $0x0,%r14b
  400e9a:   48 d1 f8                sar    %rax
  400e9d:   4c 89 35 9c 41 2b 00    mov    %r14,0x2b419c(%rip)        # 6b5040 <__x86_data_cache_size>
  400ea4:   48 89 05 8d 41 2b 00    mov    %rax,0x2b418d(%rip)        # 6b5038 <__x86_raw_data_cache_size_half>
  400eab:   4c 89 f0                mov    %r14,%rax
  400eae:   48 d1 f8                sar    %rax
  400eb1:   48 89 05 90 41 2b 00    mov    %rax,0x2b4190(%rip)        # 6b5048 <__x86_data_cache_size_half>
  400eb8:   4d 85 ed                test   %r13,%r13
  400ebb:   0f 8e 27 ff ff ff       jle    400de8 <init_cacheinfo+0x58>
  400ec1:   4c 89 e8                mov    %r13,%rax
  400ec4:   4c 89 2d 45 41 2b 00    mov    %r13,0x2b4145(%rip)        # 6b5010 <__x86_raw_shared_cache_size>
  400ecb:   41 80 e5 00             and    $0x0,%r13b
  400ecf:   48 d1 f8                sar    %rax
  400ed2:   4c 89 2d 47 41 2b 00    mov    %r13,0x2b4147(%rip)        # 6b5020 <__x86_shared_cache_size>
  400ed9:   48 89 05 38 41 2b 00    mov    %rax,0x2b4138(%rip)        # 6b5018 <__x86_raw_shared_cache_size_half>
  400ee0:   4c 89 e8                mov    %r13,%rax
  400ee3:   48 d1 f8                sar    %rax
  400ee6:   5b                      pop    %rbx
  400ee7:   48 89 05 3a 41 2b 00    mov    %rax,0x2b413a(%rip)        # 6b5028 <__x86_shared_cache_size_half>
  400eee:   41 5c                   pop    %r12
  400ef0:   41 5d                   pop    %r13
  400ef2:   41 5e                   pop    %r14
  400ef4:   5d                      pop    %rbp
  400ef5:   c3                      retq   
  400ef6:   bf bc 00 00 00          mov    $0xbc,%edi
  400efb:   e8 f0 39 03 00          callq  4348f0 <handle_amd>
  400f00:   bf bf 00 00 00          mov    $0xbf,%edi
  400f05:   49 89 c6                mov    %rax,%r14
  400f08:   e8 e3 39 03 00          callq  4348f0 <handle_amd>
  400f0d:   bf c2 00 00 00          mov    $0xc2,%edi
  400f12:   49 89 c5                mov    %rax,%r13
  400f15:   e8 d6 39 03 00          callq  4348f0 <handle_amd>
  400f1a:   41 b8 01 00 00 00       mov    $0x1,%r8d
  400f20:   48 89 c7                mov    %rax,%rdi
  400f23:   be 00 00 00 80          mov    $0x80000000,%esi
  400f28:   44 89 c0                mov    %r8d,%eax
  400f2b:   0f a2                   cpuid  
  400f2d:   c1 e1 16                shl    $0x16,%ecx
  400f30:   89 f0                   mov    %esi,%eax
  400f32:   c1 f9 1f                sar    $0x1f,%ecx
  400f35:   83 e1 03                and    $0x3,%ecx
  400f38:   89 0d ca 33 52 00       mov    %ecx,0x5233ca(%rip)        # 924308 <__x86_preferred_memory_instruction>
  400f3e:   0f a2                   cpuid  
  400f40:   48 85 ff                test   %rdi,%rdi
  400f43:   89 c6                   mov    %eax,%esi
  400f45:   7e 2c                   jle    400f73 <init_cacheinfo+0x1e3>
  400f47:   3d 07 00 00 80          cmp    $0x80000007,%eax
  400f4c:   76 54                   jbe    400fa2 <init_cacheinfo+0x212>
  400f4e:   be 08 00 00 80          mov    $0x80000008,%esi
  400f53:   89 f0                   mov    %esi,%eax
  400f55:   0f a2                   cpuid  
  400f57:   c1 e9 0c                shr    $0xc,%ecx
  400f5a:   89 c6                   mov    %eax,%esi
  400f5c:   83 e1 0f                and    $0xf,%ecx
  400f5f:   41 d3 e0                shl    %cl,%r8d
  400f62:   44 89 c1                mov    %r8d,%ecx
  400f65:   48 89 f8                mov    %rdi,%rax
  400f68:   48 99                   cqto   
  400f6a:   48 f7 f9                idiv   %rcx
  400f6d:   48 89 c7                mov    %rax,%rdi
  400f70:   49 01 fd                add    %rdi,%r13
  400f73:   81 fe 00 00 00 80       cmp    $0x80000000,%esi
  400f79:   0f 86 08 ff ff ff       jbe    400e87 <init_cacheinfo+0xf7>
  400f7f:   b8 01 00 00 80          mov    $0x80000001,%eax
  400f84:   0f a2                   cpuid  
  400f86:   80 e5 01                and    $0x1,%ch
  400f89:   75 08                   jne    400f93 <init_cacheinfo+0x203>
  400f8b:   85 d2                   test   %edx,%edx
  400f8d:   0f 89 f4 fe ff ff       jns    400e87 <init_cacheinfo+0xf7>
  400f93:   c7 05 6f 33 52 00 ff    movl   $0xffffffff,0x52336f(%rip)        # 92430c <__x86_prefetchw>
  400f9a:   ff ff ff 
  400f9d:   e9 e5 fe ff ff          jmpq   400e87 <init_cacheinfo+0xf7>
  400fa2:   44 89 c0                mov    %r8d,%eax
  400fa5:   0f a2                   cpuid  
  400fa7:   81 e2 00 00 00 10       and    $0x10000000,%edx
  400fad:   89 c6                   mov    %eax,%esi
  400faf:   74 bf                   je     400f70 <init_cacheinfo+0x1e0>
  400fb1:   c1 eb 10                shr    $0x10,%ebx
  400fb4:   0f b6 cb                movzbl %bl,%ecx
  400fb7:   85 c9                   test   %ecx,%ecx
  400fb9:   74 b5                   je     400f70 <init_cacheinfo+0x1e0>
  400fbb:   eb a8                   jmp    400f65 <init_cacheinfo+0x1d5>
  400fbd:   c1 e8 0e                shr    $0xe,%eax
  400fc0:   25 ff 03 00 00          and    $0x3ff,%eax
  400fc5:   89 c6                   mov    %eax,%esi
  400fc7:   74 45                   je     40100e <init_cacheinfo+0x27e>
  400fc9:   41 83 fc 0a             cmp    $0xa,%r12d
  400fcd:   7e 3f                   jle    40100e <init_cacheinfo+0x27e>
  400fcf:   31 d2                   xor    %edx,%edx
  400fd1:   41 b8 0b 00 00 00       mov    $0xb,%r8d
  400fd7:   8d 7a 01                lea    0x1(%rdx),%edi
  400fda:   44 89 c0                mov    %r8d,%eax
  400fdd:   89 d1                   mov    %edx,%ecx
  400fdf:   0f a2                   cpuid  
  400fe1:   81 e1 f0 0f 00 00       and    $0xff0,%ecx
  400fe7:   0f b6 db                movzbl %bl,%ebx
  400fea:   74 22                   je     40100e <init_cacheinfo+0x27e>
  400fec:   85 db                   test   %ebx,%ebx
  400fee:   74 1e                   je     40100e <init_cacheinfo+0x27e>
  400ff0:   81 f9 00 02 00 00       cmp    $0x200,%ecx
  400ff6:   89 fa                   mov    %edi,%edx
  400ff8:   75 dd                   jne    400fd7 <init_cacheinfo+0x247>
  400ffa:   0f bd f6                bsr    %esi,%esi
  400ffd:   8d 4e 01                lea    0x1(%rsi),%ecx
  401000:   83 c8 ff                or     $0xffffffff,%eax
  401003:   83 eb 01                sub    $0x1,%ebx
  401006:   d3 e0                   shl    %cl,%eax
  401008:   89 c6                   mov    %eax,%esi
  40100a:   f7 d6                   not    %esi
  40100c:   21 de                   and    %ebx,%esi
  40100e:   83 c6 01                add    $0x1,%esi
  401011:   e9 5d fe ff ff          jmpq   400e73 <init_cacheinfo+0xe3>
  401016:   40 b7 bf                mov    $0xbf,%dil
  401019:   44 89 e6                mov    %r12d,%esi
  40101c:   e8 af 37 03 00          callq  4347d0 <handle_intel>
  401021:   bf 02 00 00 00          mov    $0x2,%edi
  401026:   49 89 c5                mov    %rax,%r13
  401029:   e9 f1 fd ff ff          jmpq   400e1f <init_cacheinfo+0x8f>
brho commented 7 years ago

Hi -

I was able to recreate the bug. I haven't solved it yet, but I have a bunch of leads.

One minor thing: I had to change the CC variable in your Makefile for the executor to build it in my setup. This way should work for everyone:

diff --git a/Makefile b/Makefile
index 4e7e7ddca99b..e58cc4985ed8 100644
--- a/Makefile
+++ b/Makefile
@@ -94,11 +94,8 @@ ifeq ("$(TARGETOS)", "fuchsia")
 endif

 ifeq ("$(TARGETOS)", "akaros")
-   # SOURCEDIR should point to bootstrapped akaros checkout.
    # There is no up-to-date Go for akaros, so building Go will fail.
-   CC = $(SOURCEDIR)/install/x86_64-ucb-akaros-gcc/bin/x86_64-ucb-akaros-g++
-   # Most likely this is incorrect (why doesn't it know own sysroot?), but worked for me.
-   ADDCFLAGS = -I $(SOURCEDIR)/tools/compilers/gcc-glibc/x86_64-ucb-akaros-gcc-stage3-builddir/x86_64-ucb-akaros/libstdc++-v3/include/x86_64-ucb-akaros -I $(SOURCEDIR)/tools/compilers/gcc-glibc/x86_64-ucb-akaros-gcc-stage3-builddir/x86_64-ucb-akaros/libstdc++-v3/include -I $(SOURCEDIR)/tools/compilers/gcc-glibc/gcc-4.9.2/libstdc++-v3/libsupc++ -L $(SOURCEDIR)/tools/compilers/gcc-glibc/x86_64-ucb-akaros-gcc-stage3-builddir/x86_64-ucb-akaros/libstdc++-v3/src/.libs
+   CC = $(AKAROS_XCC_ROOT)/bin/x86_64-ucb-akaros-g++
 endif

 ifeq ("$(TARGETOS)", "windows")

$AKAROS_XCC_ROOT is an environment variable that everyone should use to point to the toolchain installation.

I didn't need the ADDCFLAGS either, though maybe that's a peculiarity of my setup. The only other thing I do is put $(AKAROS_XCC_ROOT)/bin/ in my $PATH, though I don't see why that would help.

Anyway, thanks for the bug - I'll post more when I solve it.

dvyukov commented 7 years ago

Humm... If I remove ADDCFLAGS, make executor TARGETOS=akaros SOURCEDIR=/src/akaros fails with:

/src/akaros/install/x86_64-ucb-akaros-gcc/bin/x86_64-ucb-akaros-g++ -o ./bin/akaros_amd64/syz-executor executor/executor_akaros.cc \
        -pthread -Wall -Wframe-larger-than=8192 -Wparentheses -Werror -O1 \
        -static  -DGOOS=\"akaros\" -DGIT_REVISION=\"5b3a76c9f8b55281f244b8f81e48c5b0b935ccc3+\"
In file included from executor/executor_akaros.cc:11:0:
executor/executor.h:4:21: fatal error: algorithm: No such file or directory
 #include <algorithm>

I've bootstrapped the toolchain using these commands:

(cd $AKAROS_ROOT && make ARCH=x86 defconfig)
(cd $AKAROS_ROOT && make xcc-upgrade-from-scratch)

How is your setup different? I would definitely like to remove that ADDCFLAGS mess, but I want it to work for me as well :)

dvyukov commented 7 years ago

Re SOURCEDIR vs AKAROS_XCC_ROOT: SOURCEDIR is the env var name that we use for several things (e.g. also for extracting values of constants from OS headers), and it's a common name across multiple OSes (linux, fuchsia, freebsd, etc). You run make TARGETOS=foo SOURCEDIR=/path/to/foo/checkout, and the Makefile and other tools figure out all other locations from SOURCEDIR. Is AKAROS_ROOT also a standard env var name (I've seen it in some instructions)? If yes, then it would be reasonable to provide a default value for SOURCEDIR if AKAROS_ROOT is set, e.g. SOURCEDIR ?= $(AKAROS_ROOT) (or whatever the exact syntax for this is in Makefiles).
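
For reference, GNU make spells a conditional default as SOURCEDIR ?= $(AKAROS_ROOT); make imports environment variables as make variables, so $(AKAROS_ROOT) picks up the env var. The same fallback can also be sketched on the shell side. The helper name resolve_sourcedir below is made up for illustration and is not part of either project:

```shell
# Hypothetical helper: print the directory to pass as SOURCEDIR.
# An explicit SOURCEDIR wins; otherwise fall back to AKAROS_ROOT.
resolve_sourcedir() {
    if [ -n "${SOURCEDIR:-}" ]; then
        printf '%s\n' "$SOURCEDIR"
    elif [ -n "${AKAROS_ROOT:-}" ]; then
        printf '%s\n' "$AKAROS_ROOT"
    else
        echo "neither SOURCEDIR nor AKAROS_ROOT is set" >&2
        return 1
    fi
}

# Usage, with the make invocation quoted earlier in this thread:
#   make executor TARGETOS=akaros SOURCEDIR="$(resolve_sourcedir)"
```

Keeping the fallback outside the Makefile would leave SOURCEDIR's meaning unchanged for the other OSes, at the cost of one more wrapper.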

brho commented 7 years ago

We have two standard env variables. AKAROS_ROOT is the git repo you've downloaded. AKAROS_XCC_ROOT is the location of the toolchain installation - basically everything that gcc/binutils/glibc creates, plus our kernel headers and user libraries. SOURCEDIR sounds like AKAROS_ROOT, though things like kernel headers are also available in AKAROS_XCC_ROOT.

So far, we mostly use AKAROS_XCC_ROOT to find the cross compiler. AKAROS_ROOT is often used for installing cross-compiled binaries into the kernel file system (e.g. $AKAROS_ROOT/kern/kfs/bin, which will end up in /bin).

It seems pretty odd that your cross compiler doesn't know where to look for header files - or at least some of them?

Did you move the toolchain after building it? (That seems unlikely.)

You should have set the env variable X86_64_INSTDIR during toolchain installation too, usually in tools/compilers/gcc-glibc/Makelocal. If you skipped that, the build should have given you an error. If it didn't, that could mess things up too, since SYSROOT and the install locations derive from that variable.

One option would be to strace -e open,access your build command, and then we can see where the compiler was looking.
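
A rough sketch of that, assuming an x86_64 Linux host with strace installed. The function names trace_build and missing_paths are made up for illustration. openat is traced as well because newer hosts may use it instead of open, and -f follows the child processes that the compiler driver spawns:

```shell
# Record the file lookups made by a build command and its children.
#   $1 = log file to write, remaining args = the build command
trace_build() {
    log=$1; shift
    strace -f -e trace=open,openat,access -o "$log" "$@"
}

# From a trace log, list the paths that were probed for a given header
# name and did not exist (the syscall failed with ENOENT).
#   $1 = log file, $2 = header name, e.g. algorithm
missing_paths() {
    log=$1; header=$2
    grep -F "/$header\"" "$log" | grep -F 'ENOENT' |
        sed -n 's/^[^"]*"\([^"]*\)".*/\1/p'
}

# Usage:
#   trace_build /tmp/cc.trace make executor TARGETOS=akaros SOURCEDIR=/src/akaros
#   missing_paths /tmp/cc.trace algorithm
```

The list of missing paths shows which include directories the compiler believes it has, which can then be compared against what is actually in the install directory.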

dvyukov commented 7 years ago

It looks at multiple locations under /src/akaros/install/x86_64-ucb-akaros-gcc/, but I don't have an algorithm file anywhere under that path. I suspect the gcc install failed somewhere in the middle, after it had already installed the binaries. That would explain why I have the headers in the build dir but not in the install dir.

I guess I need to try to rebootstrap everything from scratch before we spend more time on this.

Doesn't your AKAROS_XCC_ROOT point to "$AKAROS_ROOT/install/x86_64-ucb-akaros-gcc"? If yes, then the current Makefile should work as well (provided that you run it as make TARGETOS=akaros SOURCEDIR=$AKAROS_ROOT).

brho commented 7 years ago

That sounds right - you should have those files in the toolchain. For example, $ find $AKAROS_XCC_ROOT -name algorithm should find something (like x86_64-ucb-akaros/include/c++/4.9.2/algorithm).
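
The same check can be wrapped so it works as a quick sanity test after a rebuild. This is only a sketch; toolchain_has_header is a made-up name:

```shell
# Hypothetical check: succeed if a file with the given name exists
# anywhere under the toolchain installation.
#   $1 = toolchain root (e.g. "$AKAROS_XCC_ROOT"), $2 = header name
toolchain_has_header() {
    root=$1; header=$2
    find "$root" -name "$header" 2>/dev/null | grep -q .
}

# Usage:
#   toolchain_has_header "$AKAROS_XCC_ROOT" algorithm ||
#       echo "libstdc++ headers missing from the install dir" >&2
```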

If you look in tools/compilers/gcc-glibc/build_logs/, you might find what died. If a fresh rebuild doesn't help, then you can email them to me or something and I can look for a problem.

I don't have my AKAROS_XCC_ROOT pointing to a location inside my AKAROS_ROOT. I have it set up like this:

AKAROS_ROOT -> $HOME/akaros/ros-kernel/
AKAROS_XCC_ROOT -> $HOME/ros-gcc-glibc/install-x86_64-ros-gcc/

Actually, since I have the bin directory of XCC_ROOT in my PATH, I can run make executor just like this:

ifeq ("$(TARGETOS)", "akaros")
    CC = x86_64-ucb-akaros-g++
endif
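
For that to work, the cross compiler has to resolve by its bare name. A minimal sketch of the shell side, where use_akaros_xcc is a made-up helper name:

```shell
# Hypothetical helper: put the toolchain's bin directory first on PATH
# and confirm that the cross compiler now resolves by bare name.
#   $1 = toolchain root (e.g. "$AKAROS_XCC_ROOT")
use_akaros_xcc() {
    xcc_root=$1
    PATH="$xcc_root/bin:$PATH"
    export PATH
    command -v x86_64-ucb-akaros-g++ >/dev/null
}

# Usage:
#   use_akaros_xcc "$AKAROS_XCC_ROOT" && make executor TARGETOS=akaros
```
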
brho commented 7 years ago

btw, I just tracked down the bug(s). The main one was that under rare conditions (races with page faults on the syz-executor binary), the mm code would free a page that was still in the page cache. The page would eventually get reused, which is why syz-executor would go crazy - a chunk of its .text was garbage. When the page was reused, various refcnts/flags would be wrong too, which was ultimately responsible for the panic you found. (Short version: the page was improperly decreffed, so when we later increffed it the refcnt was 0, which triggered the panic.)

Anyway, I'll have a patch out later today. With it, the stress tester ran without crashing Akaros.