timmytofu closed this issue 8 years ago
Hi!
Please forgive the delay in answering. I have a couple of questions: does the process segfault (dump a core), or does it stop with an exit code? Does this happen right away when the process starts? (From the snippet you sent, I understand that it does.)
Right now the only idea that comes to mind is to run the process under strace.
It stops with exit code 139.
It doesn't happen right away, only when hitting the snippet above (which is called as part of a Snap application when a certain endpoint is hit).
I will try to get more info when I'm back in physical proximity to that machine.
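For reference, an exit code of 139 is usually how a shell reports death by signal: by the common convention, a status above 128 means the process was killed by signal (status - 128), and 139 - 128 = 11, which is SIGSEGV on Linux. A minimal sketch of that decoding follows; the function names are made up for illustration, and the signal table is a small Linux-specific subset, not an exhaustive list.

```haskell
module Main where

-- | Interpret a shell-style exit status. By the common shell convention,
-- a status in the range 129..255 means "killed by signal (status - 128)".
-- (Hypothetical helper, not part of any library.)
signalFromStatus :: Int -> Maybe Int
signalFromStatus status
  | status > 128 && status < 256 = Just (status - 128)
  | otherwise                    = Nothing

-- | Names for a few common Linux signal numbers (deliberately not exhaustive).
signalName :: Int -> String
signalName 6  = "SIGABRT"
signalName 9  = "SIGKILL"
signalName 11 = "SIGSEGV"
signalName 15 = "SIGTERM"
signalName n  = "signal " ++ show n

-- | Human-readable description of an exit status.
describeStatus :: Int -> String
describeStatus status =
  case signalFromStatus status of
    Just sig -> "killed by " ++ signalName sig
    Nothing  -> "exited with status " ++ show status

main :: IO ()
main = putStrLn (describeStatus 139)  -- prints "killed by SIGSEGV"
```

So "exit code 139" and "segfault" are most likely the same event seen from two sides, which fits the SIGSEGV that shows up in the strace output later in this thread.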
I'm seeing this too. Here's an strace:
[{WIFSIGNALED(s) && WTERMSIG(s) == SIGSEGV}], 0, NULL) = 17213
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=17213, si_status=SIGSEGV, si_utime=287, si_stime=24} ---
rt_sigprocmask(SIG_BLOCK, [INT], [], 8) = 0
rt_sigaction(SIGINT, {0xfe9740, [], SA_RESTORER|SA_SIGINFO, 0x7f763b4cb340}, NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [QUIT], [], 8) = 0
rt_sigaction(SIGQUIT, {SIG_DFL, [], SA_RESTORER, 0x7f763b4cb340}, NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
timer_settime(0, 0, {it_interval={0, 10000000}, it_value={0, 10000000}}, NULL) = 0
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {0, 858856499}) = 0
write(8, "\376", 1) = 1
futex(0x20e90dc, FUTEX_WAIT_PRIVATE, 35, NULL) = 0
futex(0x20e9108, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x20e91fc, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x20e91f8, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x20e9228, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x1445e60, FUTEX_WAKE_PRIVATE, 1) = 1
sched_yield() = 0
timer_settime(0, 0, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
rt_sigaction(SIGVTALRM, {SIG_IGN, [], SA_RESTORER|SA_INTERRUPT|SA_NODEFER|SA_RESETHAND, 0x7f763abb2d40}, {0xfe14d0, [], SA_RESTORER|SA_RESTART, 0x7f763b4cb340}, 8) = 0
timer_delete(0) = 0
rt_sigprocmask(SIG_BLOCK, [TTOU], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x7f763b4cb340}, NULL, 8) = 0
rt_sigaction(SIGPIPE, {SIG_DFL, [], SA_RESTORER, 0x7f763b4cb340}, NULL, 8) = 0
rt_sigaction(SIGTSTP, {SIG_DFL, [], SA_RESTORER, 0x7f763b4cb340}, NULL, 8) = 0
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {0, 859160451}) = 0
rt_sigaction(SIGSEGV, {SIG_DFL, [], SA_RESTORER, 0x7f763b4cb340}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [SEGV], NULL, 8) = 0
kill(17004, SIGSEGV) = 0
--- SIGSEGV {si_signo=SIGSEGV, si_code=SI_USER, si_pid=17004, si_uid=1000} ---
+++ killed by SIGSEGV +++
This appears to only happen when using Snap's dynamic loader.
Thank you for the strace output and for the insight about Snap. Unfortunately, I can't get more information from it. I think the -f flag would be very helpful here: it makes strace trace child processes too. If I understand this data correctly, the log says a child of this process was killed by a segfault, but it doesn't show the trace of the child itself. If you can reproduce it easily with this flag, that would be perfect, though it's quite possible I won't get any further information from that either.
I've never used Snap before. I will try to find some time to reproduce it, but an easy repro setup would be greatly appreciated. I would try to get the core dump when the segfault occurs and hope to learn something from it. That could work if it points to a C library called through the FFI, but if it points to GHC's runtime internals I would be out of luck, because I know nothing about them.
Do you have any other ideas about how we could debug this further?
For now I will close this, since I haven't been able to reproduce it. If it's still a problem and you have any suggestions on how to debug it further, I will take another look.
I know this isn't the most informative issue, but one of my services is consistently exiting with code 139 when doing anything in
withConnection host port . runTransaction
, even just … It's only happening in one service and not the others, but they all share a sandbox. Comparing the packages loaded by the two, the one that's not working has everything the working one does, at the same versions (the working one has three additional, unrelated packages).
Any ideas you have as far as debugging would be welcome.