Unimplemented mach trap (27)

facekapow commented 4 years ago

This is a very weird error that I'm getting with the latest code on master. It happens seemingly randomly whenever I'm in a bash shell, and it causes my entire container to crash.

Darling [/Volumes/SystemRoot/home/ariel/build/darling]$ cd DowUnimplemented mach trap (27)
Unimplemented mach trap (27)

Steps to reproduce

This is a pretty difficult bug to reproduce, as it seems to happen randomly. Basically, just open a Darling shell and run some commands.

Debugging

It seems that the Mach trap table does have an entry for trap 27:

https://github.com/darlinghq/darling/blob/3c7558316c7cb29ce24ecbb36efe4c7fcf43a1a2/src/kernel/emulation/linux/mach/mach_table.c#L22

...so I don't why it would be "unimplemented". The code that checks the table is here:

https://github.com/darlinghq/darling/blob/3c7558316c7cb29ce24ecbb36efe4c7fcf43a1a2/src/kernel/emulation/linux/mach/darling_mach_syscall.S#L7-L32

...and to me, it looks perfectly fine: it loads the table address into r10, adds the offset in rax multiplied by 8, and tests if it's NULL/0. If it is, then it calls ___unknown_mach_syscall. However, I can't see why entry 27 would be NULL in the syscall table.

I haven't been able to run xtrace or lldb on this yet because I haven't been able to reproduce it when I want to reproduce it.

LubosD commented 4 years ago

I've seen this as well, but IIRC there were also some weird lines in the kernel log - something about overlay and possible corruption. Next time please check you dmesg too.

TheBrokenRail commented 4 years ago

Just happened to me too, was just exiting and then suddenly my shell was completely filled with unimplemented mach traps (27).

r-wilbur commented 4 years ago

I got endless scrolling of this error message in my invocation of

darling shell
$ python ‘<python program with errors>’

Ctrl-C wouldn’t stop it so I killed the darling process. But not before I filled my partition with roughly equal parts /var/log/kern.log and /var/log/syslog—58GB each!

kern.log messages of the form:

May 30 17:09:17 eureka kernel: [123448.380390] not implemented: lck_mtx_destroy()
May 30 17:09:17 eureka kernel: [123448.383211] Darling TID 50115 (PID 50115) says: darling_sigexc_self()
May 30 17:09:17 eureka kernel: [123448.383335] <50115> - kobject routine: _Xhost_info+0x0/0x100 [darling_mach]
May 30 17:09:17 eureka kernel: [123448.395592] Darling TID 50115 (PID 50115) says: darling_sigexc_self()
May 30 17:09:17 eureka kernel: [123448.395678] <50115> - kobject routine: _Xhost_info+0x0/0x100 [darling_mach]
May 30 17:09:17 eureka kernel: [123448.395842] <50115> - kobject routine: _Xhost_get_clock_service+0x0/0x120 [darling_mach]
May 30 17:09:17 eureka kernel: [123448.395864] <50115> - kobject routine: _Xsemaphore_create+0x0/0x130 [darling_mach]
May 30 17:09:17 eureka kernel: [123448.396397] <50115> - kobject routine: _Xtask_get_special_port+0x0/0xf0 [darling_mach]
May 30 17:09:17 eureka kernel: [123448.397590] Darling TID 50115 (PID 50115) says: dtype for fd 0 -> /dev/pts/5
May 30 17:09:17 eureka kernel: [123448.397597] Darling TID 50115 (PID 50115) says: dtype for fd 2 -> /dev/pts/5

[This happens maybe 10 times over the next 8:17 till all hell breaks loose at 17:17:34 with:]

[A——————————————————]
May 30 17:17:34 eureka kernel: [123945.656520] Darling TID 50115 (PID 50115) says: sigexec_handler(2, 0x7FFFF7D3C770, 0x7FFFF7D3C640)

[followed by more lines of the form]
May 30 17:17:34 eureka kernel: [123945.656528] Darling TID 50115 (PID 50115) says: sigexec: <message>
[where message is]
have RIP 0x7FFFF7D27161
  gregs 0x0
  gregs 0x0
  gregs 0x0
  gregs 0x246
  gregs 0x4025C0
  gregs 0x7FFF0C0C2E70
  gregs 0x0
  gregs 0x0
  gregs 0x0
  gregs 0x7FFFFFDFCE0B
  gregs 0x7FFFFFDFCDA0
  gregs 0xCA299051359E6BAF
  gregs 0x1
  gregs 0x0
  gregs 0x7FFFF7D27163
  gregs 0x7FFFFFDFCD58
  gregs 0x7FFFF7D27161
  gregs 0x246
  gregs 0x2B000000000033
  gregs 0x0
  gregs 0x0
  gregs 0x0
  gregs 0x0
saving to kernel:
  rip: 0x7FFFF7D27161
  efl: 0x246
  rax: 0x0
  rbx: 0xCA299051359E6BAF
  rcx: 0x7FFFF7D27163
  rdx: 0x1
  rdi: 0x0
  rsi: 0x7FFFFFDFCE0B
  rbp: 0x7FFFFFDFCDA0
  rsp: 0x7FFFFFDFCD58
fcw out: 0x37F
xmm0 out: 0xFF0000
restored from kernel:
  rip: 0x7FFFF7D27161
  efl: 0x246
  rax: 0x0
  rbx: 0xCA299051359E6BAF
  rcx: 0x7FFFF7D27163
  rdx: 0x1
  rdi: 0x0
  rsi: 0x7FFFFFDFCE0B
  rbp: 0x7FFFFFDFCDA0
  rsp: 0x7FFFFFDFCD58
fcw in: 0x37F
xmm0 in: 0xFF0000
  gregs 0x0
  gregs 0x0
  gregs 0x0
  gregs 0x246
  gregs 0x4025C0
  gregs 0x7FFF0C0C2E70
  gregs 0x0
  gregs 0x0
  gregs 0x0
  gregs 0x7FFFFFDFCE0B
  gregs 0x7FFFFFDFCDA0
  gregs 0xCA299051359E6BAF
  gregs 0x1
  gregs 0x0
  gregs 0x7FFFF7D27163
  gregs 0x7FFFFFDFCD58
  gregs 0x7FFFF7D27161
  gregs 0x246
  gregs 0x2B000000000033
  gregs 0x0
  gregs 0x0
  gregs 0x0
  gregs 0x0
will forward signal to app handler (0x7FFFF7F4A920)
[A——————————————————]
[Section A repeats in 448μs]

[Near the end of the log it looks like.]
May 30 17:42:47 eureka kernel: [125458.460311] Darling TID 50115 (PID 50115) says: sigexec: <message>
[where message is]
saving to kernel:
  rip: 0x7FFFF7D2714C
  efl: 0x10202
  rax: 0x203
  rbx: 0xCA299051359E6BAF
  rcx: 0x0
  rdx: 0xF7D3A550
  rdi: 0x1000002
  rsi: 0x7FFFF7D3A440
  rbp: 0x7FFFFFDFDBE0
  rsp: 0x7FFFFFDFDB68
fcw out: 0x37F
xmm0 out: 0x0
[Then Linux kernel messages with following form:]
May 30 17:42:47 eureka kernel: [125458.460347] <message>
[where message is]
<50115> exception_deliver - host level
<50115> about to schedule - my state: 1
<2257>  about to schedule - my state: 1

May 30 17:42:47 eureka kernel: [125458.460367] Darling TID 50115 (PID 50115) says: sigexec: <message>
[where message is]
restored from kernel:
  rip: 0x7FFFF7D2714C
  efl: 0x10202
  rax: 0x203
  rbx: 0xCA299051359E6BAF
  rcx: 0x0
  rdx: 0xF7D3A550
  rdi: 0x1000002
  rsi: 0x7FFFF7D3A440
  rbp: 0x7FFFFFDFDBE0
  rsp: 0x7FFFFFDFDB68
fcw in: 0x37F
xmm0 in: 0x0
  gregs 0x0
  gregs 0x7FFFF7D2713B
  gregs 0x307
  gregs 0x4025C0
  gregs 0x7FFF0C0C2E70
  gregs 0x0
  gregs 0x0
  gregs 0x1000002
  gregs 0x7FFFF7D3A440
  gregs 0x7FFFFFDFDBE0
  gregs 0xCA299051359E6BAF
  gregs 0xF7D3A550
  gregs 0x203
  gregs 0x0
  gregs 0x7FFFFFDFDB68
  gregs 0x7FFFF7D2714C
  gregs 0x10202
  gregs 0x2B000000000033
  gregs 0x0
  gregs 0xD
  gregs 0x10002
  gregs 0x0
will forward signal to app handler (0x7FFFF7EDEBF0)
handler (11) returning
[then a line]
May 30 17:42:47 eureka kernel: [125458.460431] Darling TID 50115 (PID 50115) says: sigexec_handler(11, 0x7FFFF7D3C770, 0x7FFFF7D3C640)
[followed by more lines of the form]
May 30 17:42:47 eureka kernel: [125458.460433] Darling TID 50115 (PID 50115) says: sigexec: <message>
[where message is]
have RIP 0x7FFFF7D2714C
  gregs 0x0
[etc., etc., etc., ad nauseum]

syslog is dominated by lines of the same form as kern.log

dmesg doesn’t show anything obvious to me. What should I be looking for in dmesg?

r-wilbur commented 4 years ago

From what I've seen, it looks like when we have an unimplemented syscall [in my case lck_mtx_destroy()] it gets mapped to calling trap 27 which tries to log the call stack and the name of the function that needs to be implemented.

Is there anything else any of you would like to see from my /var/log/kern.log? I'm happy to keep the evidence around as long as it is needed. I'd also like to recover the 58GB of disk space it is currently occupying.

LubosD commented 4 years ago

lck_mtx_destroy() is not a syscall, it's just an unimplemented function in the LKM.

You could try looking for any overlay-related error messages in the log.

r-wilbur commented 4 years ago

I plead ignorance of how I should look for an "overlay-related error message". Here is what I tried so far:

grep -i error /var/log/kern.log.1 >kern_errors.txt
grep -i overlay /var/log/kern.log.1 >kern_overlay.txt
grep -i '\.ko' /var/log/kern.log.1 >kern_ko.txt
ls -l kern_*.txt
 440 May 31 10:52 kern_errors.txt
   0 May 31 10:57 kern_ko.txt
   0 May 31 10:53 kern_overlay.txt
wc -l kern_errors.txt
4 kern_errors.txt
cat kern_errors.txt
May 24 19:54:09 eureka kernel: [269545.799653] ata1: SError: { PHYRdyChg CommWake }
May 27 03:11:57 eureka kernel: [468611.790743] traps: Chrome_ChildIOT[166766] trap invalid opcode ip:55dc46a44ab5 sp:7f5231b7dee0 error:0 in chrome[55dc43d78000+7858000]
May 28 18:38:26 eureka kernel: [    0.505540] RAS: Correctable Errors collector initialized.
May 30 18:28:40 eureka kernel: [    0.533632] RAS: Correctable Errors collector initialized.

I'm open for suggestions on how and where to look.

andyneff commented 3 years ago

I'm hitting this fairly often too.

I'm able to stop it by killing the bash pid (sometimes have to kill two bash pids) from linux

At the time of the last one happening, this is what is in my /var/log/messages

/var/log/messages

``` Jan 21 19:05:39 viola kernel: <3262540> about to schedule - my state: 1 Jan 21 19:05:39 viola kernel: <3262528> - kobject routine: 00000000fbd60bda Jan 21 19:05:39 viola kernel: <3262528> - kobject routine: 00000000fbd60bda Jan 21 19:05:39 viola kernel: <3262528> - kobject routine: 00000000fbd60bda Jan 21 19:05:39 viola kernel: <3262528> - kobject routine: 00000000fbd60bda Jan 21 19:05:39 viola kernel: <3262528> about to schedule - my state: 1 Jan 21 19:05:40 viola rsyslogd[1156]: imjournal: 161535 messages lost due to rate-limiting (20000 allowed within 600 seconds) Jan 21 19:05:40 viola kernel: <3263493> calling clear_wait Jan 21 19:05:40 viola kernel: <3262551> - kobject routine: 00000000ac8cb31b Jan 21 19:05:40 viola kernel: <3262551> about to schedule - my state: 1 Jan 21 19:05:40 viola kernel: <3263493> calling clear_wait Jan 21 19:05:41 viola kernel: <3262541> - kobject routine: 00000000ac8cb31b Jan 21 19:05:41 viola kernel: <3262541> about to schedule - my state: 1 Jan 21 19:05:44 viola kernel: <3267585> exception_deliver - host level Jan 21 19:05:44 viola kernel: <3267585> about to schedule - my state: 1 Jan 21 19:05:44 viola kernel: <3154319> about to schedule - my state: 1 Jan 21 19:05:44 viola kernel: <3267585> exception_deliver - host level Jan 21 19:05:44 viola kernel: <3267585> about to schedule - my state: 1 Jan 21 19:05:44 viola kernel: <3154319> about to schedule - my state: 1 Jan 21 19:05:44 viola systemd-journald[338521]: Missed 14 kernel messages ```

Followed by a tone of Missed messages lines, like the last one

At the time, nothing else in /var/log/ has changes in it to show

I was capturing the output to dmesg too:

I was getting trap 27 message for starting at 19:20:49 (19:18:34 kernel time) exactly for <70 seconds before I was able to kill bash

I have about 38 [Thu Jan 21 19:17:46 2021] not implemented: lck_mtx_destroy() messages that all happened before the trap 27 each one corresponds with when I start a darling process (fork). So it doesn't seem to be anything to worry about?

The last process to run before the trap went like so:

dmesg of the last process

``` [Thu Jan 21 19:18:33 2021] <3271596> - kobject routine: 0000000035588fce [Thu Jan 21 19:18:33 2021] <3271596> - kobject routine: 000000004509f261 [Thu Jan 21 19:18:33 2021] <3271596> - kobject routine: 00000000fe97b957 [Thu Jan 21 19:18:33 2021] Darling TID 3271596 (PID 3271596) says: execve expand /bin/ls -> /home/andy/.darling/bin/ls, ret 0 ...env var dump... [Thu Jan 21 19:18:33 2021] not implemented: lck_mtx_destroy() [Thu Jan 21 19:18:33 2021] Darling TID 3271596 (PID 3271596) says: darling_sigexc_self() [Thu Jan 21 19:18:33 2021] <3271596> - kobject routine: 000000006bfeff07 [Thu Jan 21 19:18:33 2021] Darling TID 3271596 (PID 3271596) says: darling_sigexc_self() [Thu Jan 21 19:18:33 2021] <3271596> - kobject routine: 000000006bfeff07 [Thu Jan 21 19:18:33 2021] <3271596> - kobject routine: 0000000035588fce [Thu Jan 21 19:18:33 2021] <3271596> - kobject routine: 000000004509f261 [Thu Jan 21 19:18:33 2021] <3271596> - kobject routine: 00000000fe97b957 [Thu Jan 21 19:18:33 2021] Darling TID 3271596 (PID 3271596) says: dtype for fd 1 -> /dev/pts/14 [Thu Jan 21 19:18:33 2021] Darling TID 3271596 (PID 3271596) says: dtype for fd 1 -> /dev/pts/14 [Thu Jan 21 19:18:33 2021] Darling TID 3271596 (PID 3271596) says: dtype for fd 1 -> /dev/pts/14 [Thu Jan 21 19:18:33 2021] Darling TID 3271596 (PID 3271596) says: dtype for fd 1 -> /dev/pts/14 [Thu Jan 21 19:18:33 2021] Darling TID 3271596 (PID 3271596) says: dtype for fd 1 -> /dev/pts/14 [Thu Jan 21 19:18:34 2021] Darling TID 3271398 (PID 3271398) says: dtype for fd 255 -> /dev/pts/14 [Thu Jan 21 19:18:36 2021] <3263493> calling clear_wait [Thu Jan 21 19:18:36 2021] <3262551> - kobject routine: 00000000ac8cb31b [Thu Jan 21 19:18:36 2021] <3262551> about to schedule - my state: 1 [Thu Jan 21 19:18:41 2021] <3263493> calling clear_wait [Thu Jan 21 19:18:41 2021] <3262551> - kobject routine: 00000000ac8cb31b [Thu Jan 21 19:18:41 2021] <3262551> about to schedule - my state: 1 [Thu Jan 21 19:18:46 2021] <3263493> calling clear_wait [Thu Jan 21 19:18:46 2021] <3262551> - kobject routine: 00000000ac8cb31b [Thu Jan 21 19:18:46 2021] <3262551> about to schedule - my state: 1 ```

dmesg after last process

``` [Thu Jan 21 19:18:51 2021] Darling TID 3271398 (PID 3271398) says: sigexc_handler(2, 0x7FFFF7BAE770, 0x7FFFF7BAE640) [Thu Jan 21 19:18:51 2021] Darling TID 3271398 (PID 3271398) says: sigexc: have RIP 0x7FFFF7B99161 ... a lot of gregs numbers ... [Thu Jan 21 19:18:51 2021] Darling TID 3271398 (PID 3271398) says: sigexc: saving to kernel: ... registers ... [Thu Jan 21 19:18:51 2021] Darling TID 3271398 (PID 3271398) says: sigexc: fcw out: 0x37F [Thu Jan 21 19:18:51 2021] Darling TID 3271398 (PID 3271398) says: sigexc: xmm0 out: 0x0 [Thu Jan 21 19:18:51 2021] Darling TID 3271398 (PID 3271398) says: sigexc: restored from kernel: ... registers ... [Thu Jan 21 19:18:51 2021] Darling TID 3271398 (PID 3271398) says: sigexc: fcw in: 0x37F [Thu Jan 21 19:18:51 2021] Darling TID 3271398 (PID 3271398) says: sigexc: xmm0 in: 0x0 ... a lot of gregs numbers ... [Thu Jan 21 19:18:51 2021] Darling TID 3271398 (PID 3271398) says: sigexc: will forward signal to app handler (0x7FFFF7F4A910) # Another sigexc_handler for 2 [Thu Jan 21 19:18:51 2021] Darling TID 3271398 (PID 3271398) says: sigexc_handler(2, 0x7FFFF7BAD6F0, 0x7FFFF7BAD5C0) ... # After that 27790 repeats for signal 11 [Thu Jan 21 19:18:57 2021] Darling TID 3271398 (PID 3271398) says: sigexc_handler(11, 0x7FFFF7BAE770, 0x7FFFF7BAE640) # with some [Thu Jan 21 19:18:57 2021] <3271398> exception_deliver - host level [Thu Jan 21 19:18:57 2021] <3271398> about to schedule - my state: 1 [Thu Jan 21 19:18:57 2021] <3154319> about to schedule - my state: 1 # At the very end, when I managed to get kill -9 sent to bash [Thu Jan 21 19:19:21 2021] Darling TID 3271397 (PID 3271397) says: sigexc_handler(1, 0x7FFFF7D73770, 0x7FFFF7D73640) ```

There are no other messages, aside from the occasional /dev/kmsg buffer overrun, some messages lost. message. No mention of overlay

For the record, I'm running commands like cat <(echo hi; BASHPID=$(bash -c 'echo ${PPID}'); echo $$ $BASHPID >&2; ls -ls /proc/$BASHPID/fd >&2; ls /proc ) to debug another issue. But this command won't just cause the issue. It takes a lot of just trying/modifying it before it finally traps. I, unfortunately, do not have easily reproducible steps for this :-\

Let me know if there's anything else you would like me to look for that will help.

darlinghq / darling

Unimplemented mach trap (27) #773

Steps to reproduce

Debugging