Open facekapow opened 4 years ago
I've seen this as well, but IIRC there were also some weird lines in the kernel log - something about overlay and possible corruption. Next time please check you dmesg
too.
Just happened to me too, was just exiting and then suddenly my shell was completely filled with unimplemented mach traps (27).
I got endless scrolling of this error message in my invocation of
darling shell
$ python ‘<python program with errors>’
Ctrl-C wouldn’t stop it so I killed the darling process. But not before I filled my partition with roughly equal parts /var/log/kern.log and /var/log/syslog—58GB each!
kern.log messages of the form:
May 30 17:09:17 eureka kernel: [123448.380390] not implemented: lck_mtx_destroy()
May 30 17:09:17 eureka kernel: [123448.383211] Darling TID 50115 (PID 50115) says: darling_sigexc_self()
May 30 17:09:17 eureka kernel: [123448.383335] <50115> - kobject routine: _Xhost_info+0x0/0x100 [darling_mach]
May 30 17:09:17 eureka kernel: [123448.395592] Darling TID 50115 (PID 50115) says: darling_sigexc_self()
May 30 17:09:17 eureka kernel: [123448.395678] <50115> - kobject routine: _Xhost_info+0x0/0x100 [darling_mach]
May 30 17:09:17 eureka kernel: [123448.395842] <50115> - kobject routine: _Xhost_get_clock_service+0x0/0x120 [darling_mach]
May 30 17:09:17 eureka kernel: [123448.395864] <50115> - kobject routine: _Xsemaphore_create+0x0/0x130 [darling_mach]
May 30 17:09:17 eureka kernel: [123448.396397] <50115> - kobject routine: _Xtask_get_special_port+0x0/0xf0 [darling_mach]
May 30 17:09:17 eureka kernel: [123448.397590] Darling TID 50115 (PID 50115) says: dtype for fd 0 -> /dev/pts/5
May 30 17:09:17 eureka kernel: [123448.397597] Darling TID 50115 (PID 50115) says: dtype for fd 2 -> /dev/pts/5
[This happens maybe 10 times over the next 8:17 till all hell breaks loose at 17:17:34 with:]
[A——————————————————]
May 30 17:17:34 eureka kernel: [123945.656520] Darling TID 50115 (PID 50115) says: sigexec_handler(2, 0x7FFFF7D3C770, 0x7FFFF7D3C640)
[followed by more lines of the form]
May 30 17:17:34 eureka kernel: [123945.656528] Darling TID 50115 (PID 50115) says: sigexec: <message>
[where message is]
have RIP 0x7FFFF7D27161
gregs 0x0
gregs 0x0
gregs 0x0
gregs 0x246
gregs 0x4025C0
gregs 0x7FFF0C0C2E70
gregs 0x0
gregs 0x0
gregs 0x0
gregs 0x7FFFFFDFCE0B
gregs 0x7FFFFFDFCDA0
gregs 0xCA299051359E6BAF
gregs 0x1
gregs 0x0
gregs 0x7FFFF7D27163
gregs 0x7FFFFFDFCD58
gregs 0x7FFFF7D27161
gregs 0x246
gregs 0x2B000000000033
gregs 0x0
gregs 0x0
gregs 0x0
gregs 0x0
saving to kernel:
rip: 0x7FFFF7D27161
efl: 0x246
rax: 0x0
rbx: 0xCA299051359E6BAF
rcx: 0x7FFFF7D27163
rdx: 0x1
rdi: 0x0
rsi: 0x7FFFFFDFCE0B
rbp: 0x7FFFFFDFCDA0
rsp: 0x7FFFFFDFCD58
fcw out: 0x37F
xmm0 out: 0xFF0000
restored from kernel:
rip: 0x7FFFF7D27161
efl: 0x246
rax: 0x0
rbx: 0xCA299051359E6BAF
rcx: 0x7FFFF7D27163
rdx: 0x1
rdi: 0x0
rsi: 0x7FFFFFDFCE0B
rbp: 0x7FFFFFDFCDA0
rsp: 0x7FFFFFDFCD58
fcw in: 0x37F
xmm0 in: 0xFF0000
gregs 0x0
gregs 0x0
gregs 0x0
gregs 0x246
gregs 0x4025C0
gregs 0x7FFF0C0C2E70
gregs 0x0
gregs 0x0
gregs 0x0
gregs 0x7FFFFFDFCE0B
gregs 0x7FFFFFDFCDA0
gregs 0xCA299051359E6BAF
gregs 0x1
gregs 0x0
gregs 0x7FFFF7D27163
gregs 0x7FFFFFDFCD58
gregs 0x7FFFF7D27161
gregs 0x246
gregs 0x2B000000000033
gregs 0x0
gregs 0x0
gregs 0x0
gregs 0x0
will forward signal to app handler (0x7FFFF7F4A920)
[A——————————————————]
[Section A repeats in 448μs]
[Near the end of the log it looks like.]
May 30 17:42:47 eureka kernel: [125458.460311] Darling TID 50115 (PID 50115) says: sigexec: <message>
[where message is]
saving to kernel:
rip: 0x7FFFF7D2714C
efl: 0x10202
rax: 0x203
rbx: 0xCA299051359E6BAF
rcx: 0x0
rdx: 0xF7D3A550
rdi: 0x1000002
rsi: 0x7FFFF7D3A440
rbp: 0x7FFFFFDFDBE0
rsp: 0x7FFFFFDFDB68
fcw out: 0x37F
xmm0 out: 0x0
[Then Linux kernel messages with following form:]
May 30 17:42:47 eureka kernel: [125458.460347] <message>
[where message is]
<50115> exception_deliver - host level
<50115> about to schedule - my state: 1
<2257> about to schedule - my state: 1
May 30 17:42:47 eureka kernel: [125458.460367] Darling TID 50115 (PID 50115) says: sigexec: <message>
[where message is]
restored from kernel:
rip: 0x7FFFF7D2714C
efl: 0x10202
rax: 0x203
rbx: 0xCA299051359E6BAF
rcx: 0x0
rdx: 0xF7D3A550
rdi: 0x1000002
rsi: 0x7FFFF7D3A440
rbp: 0x7FFFFFDFDBE0
rsp: 0x7FFFFFDFDB68
fcw in: 0x37F
xmm0 in: 0x0
gregs 0x0
gregs 0x7FFFF7D2713B
gregs 0x307
gregs 0x4025C0
gregs 0x7FFF0C0C2E70
gregs 0x0
gregs 0x0
gregs 0x1000002
gregs 0x7FFFF7D3A440
gregs 0x7FFFFFDFDBE0
gregs 0xCA299051359E6BAF
gregs 0xF7D3A550
gregs 0x203
gregs 0x0
gregs 0x7FFFFFDFDB68
gregs 0x7FFFF7D2714C
gregs 0x10202
gregs 0x2B000000000033
gregs 0x0
gregs 0xD
gregs 0x10002
gregs 0x0
will forward signal to app handler (0x7FFFF7EDEBF0)
handler (11) returning
[then a line]
May 30 17:42:47 eureka kernel: [125458.460431] Darling TID 50115 (PID 50115) says: sigexec_handler(11, 0x7FFFF7D3C770, 0x7FFFF7D3C640)
[followed by more lines of the form]
May 30 17:42:47 eureka kernel: [125458.460433] Darling TID 50115 (PID 50115) says: sigexec: <message>
[where message is]
have RIP 0x7FFFF7D2714C
gregs 0x0
[etc., etc., etc., ad nauseum]
syslog is dominated by lines of the same form as kern.log
dmesg doesn’t show anything obvious to me. What should I be looking for in dmesg?
From what I've seen, it looks like when we have an unimplemented syscall [in my case lck_mtx_destroy()] it gets mapped to calling trap 27 which tries to log the call stack and the name of the function that needs to be implemented.
Is there anything else any of you would like to see from my /var/log/kern.log? I'm happy to keep the evidence around as long as it is needed. I'd also like to recover the 58GB of disk space it is currently occupying.
lck_mtx_destroy() is not a syscall, it's just an unimplemented function in the LKM.
You could try looking for any overlay-related error messages in the log.
I plead ignorance of how I should look for an "overlay-related error message". Here is what I tried so far:
grep -i error /var/log/kern.log.1 >kern_errors.txt
grep -i overlay /var/log/kern.log.1 >kern_overlay.txt
grep -i '\.ko' /var/log/kern.log.1 >kern_ko.txt
ls -l kern_*.txt
440 May 31 10:52 kern_errors.txt
0 May 31 10:57 kern_ko.txt
0 May 31 10:53 kern_overlay.txt
wc -l kern_errors.txt
4 kern_errors.txt
cat kern_errors.txt
May 24 19:54:09 eureka kernel: [269545.799653] ata1: SError: { PHYRdyChg CommWake }
May 27 03:11:57 eureka kernel: [468611.790743] traps: Chrome_ChildIOT[166766] trap invalid opcode ip:55dc46a44ab5 sp:7f5231b7dee0 error:0 in chrome[55dc43d78000+7858000]
May 28 18:38:26 eureka kernel: [ 0.505540] RAS: Correctable Errors collector initialized.
May 30 18:28:40 eureka kernel: [ 0.533632] RAS: Correctable Errors collector initialized.
I'm open for suggestions on how and where to look.
I'm hitting this fairly often too.
I'm able to stop it by killing the bash pid (sometimes have to kill two bash pids) from linux
At the time of the last one happening, this is what is in my /var/log/messages
Followed by a tone of Missed messages lines, like the last one
At the time, nothing else in /var/log/
has changes in it to show
I was capturing the output to dmesg too:
I was getting trap 27 message for starting at 19:20:49 (19:18:34 kernel time) exactly for <70 seconds before I was able to kill bash
I have about 38
[Thu Jan 21 19:17:46 2021] not implemented: lck_mtx_destroy()
messages that all happened before the trap 27
each one corresponds with when I start a darling process (fork). So it doesn't seem to be anything to worry about?
The last process to run before the trap went like so:
There are no other messages, aside from the occasional /dev/kmsg buffer overrun, some messages lost.
message. No mention of overlay
For the record, I'm running commands like cat <(echo hi; BASHPID=$(bash -c 'echo ${PPID}'); echo $$ $BASHPID >&2; ls -ls /proc/$BASHPID/fd >&2; ls /proc )
to debug another issue. But this command won't just cause the issue. It takes a lot of just trying/modifying it before it finally traps. I, unfortunately, do not have easily reproducible steps for this :-\
Let me know if there's anything else you would like me to look for that will help.
This is a very weird error that I'm getting with the latest code on master. It happens seemingly randomly whenever I'm in a bash shell, and it causes my entire container to crash.
Steps to reproduce
This is a pretty difficult bug to reproduce, as it seems to happen randomly. Basically, just open a Darling shell and run some commands.
Debugging
It seems that the Mach trap table does have an entry for trap 27:
https://github.com/darlinghq/darling/blob/3c7558316c7cb29ce24ecbb36efe4c7fcf43a1a2/src/kernel/emulation/linux/mach/mach_table.c#L22
...so I don't why it would be "unimplemented". The code that checks the table is here:
https://github.com/darlinghq/darling/blob/3c7558316c7cb29ce24ecbb36efe4c7fcf43a1a2/src/kernel/emulation/linux/mach/darling_mach_syscall.S#L7-L32
...and to me, it looks perfectly fine: it loads the table address into
r10
, adds the offset inrax
multiplied by 8, and tests if it'sNULL
/0
. If it is, then it calls___unknown_mach_syscall
. However, I can't see why entry 27 would beNULL
in the syscall table.I haven't been able to run
xtrace
orlldb
on this yet because I haven't been able to reproduce it when I want to reproduce it.