Segfaults - Githubissues

timmytofu commented 9 years ago

I know this isn't the most informative issue, but one of my services is consistently exiting with code 139 when doing anything in withConnection host port . runTransaction, even just:

withConnection host port . runTransaction $ return 1

It's only happening in one service and not the others, even though they all share one sandbox. Comparing the packages loaded by the two, the failing service has everything the working one does, at the same versions (the working one has three additional, unrelated packages).

Any ideas you have as far as debugging would be welcome.

asilvestre commented 9 years ago

Hi!

Please forgive the delay in answering. I have a couple of questions: does the process segfault (dump a core), or does it stop with an exit code? Does this happen right away when the process starts (from the snippet you sent, I understand it does)?

Right now the only idea that comes to mind is to strace the process.

timmytofu commented 9 years ago

It stops with exit code 139.
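
For context: when a POSIX shell reports an exit status above 128, it conventionally means the process was terminated by signal number (status - 128), so 139 corresponds to signal 11, which is SIGSEGV on Linux. A minimal sketch of that arithmetic follows. It reproduces the status with a throwaway sh child that signals itself, which is an assumption made for illustration and not the service from this issue.

```shell
# Reproduce the status with a throwaway child shell that sends SIGSEGV to itself.
# This is only an illustration; it is not the service from this issue.
status=0
sh -c 'kill -s SEGV $$' || status=$?

# Shells report "killed by signal N" as exit status 128 + N.
signum=$((status - 128))
signame=$(kill -l "$signum")   # maps the signal number back to its name

echo "exit status $status = 128 + $signum (SIG$signame)"
```

So an exit code of 139 and a segfault are the same event seen from the parent's side; whether a core file is also written depends on the core size limit and the system's core handling.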

It doesn't happen right away, only when hitting the snippet above (which is called as part of a Snap application when a certain endpoint is hit).

I will try to get more info when I'm back in physical proximity to that machine.

benweitzman commented 9 years ago

I'm seeing this too. Here's an strace:

[{WIFSIGNALED(s) && WTERMSIG(s) == SIGSEGV}], 0, NULL) = 17213
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=17213, si_status=SIGSEGV, si_utime=287, si_stime=24} ---
rt_sigprocmask(SIG_BLOCK, [INT], [], 8) = 0
rt_sigaction(SIGINT, {0xfe9740, [], SA_RESTORER|SA_SIGINFO, 0x7f763b4cb340}, NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [QUIT], [], 8) = 0
rt_sigaction(SIGQUIT, {SIG_DFL, [], SA_RESTORER, 0x7f763b4cb340}, NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
timer_settime(0, 0, {it_interval={0, 10000000}, it_value={0, 10000000}}, NULL) = 0
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {0, 858856499}) = 0
write(8, "\376", 1)                     = 1
futex(0x20e90dc, FUTEX_WAIT_PRIVATE, 35, NULL) = 0
futex(0x20e9108, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x20e91fc, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x20e91f8, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x20e9228, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x1445e60, FUTEX_WAKE_PRIVATE, 1) = 1
sched_yield()                           = 0
timer_settime(0, 0, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
rt_sigaction(SIGVTALRM, {SIG_IGN, [], SA_RESTORER|SA_INTERRUPT|SA_NODEFER|SA_RESETHAND, 0x7f763abb2d40}, {0xfe14d0, [], SA_RESTORER|SA_RESTART, 0x7f763b4cb340}, 8) = 0
timer_delete(0)                         = 0
rt_sigprocmask(SIG_BLOCK, [TTOU], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x7f763b4cb340}, NULL, 8) = 0
rt_sigaction(SIGPIPE, {SIG_DFL, [], SA_RESTORER, 0x7f763b4cb340}, NULL, 8) = 0
rt_sigaction(SIGTSTP, {SIG_DFL, [], SA_RESTORER, 0x7f763b4cb340}, NULL, 8) = 0
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {0, 859160451}) = 0
rt_sigaction(SIGSEGV, {SIG_DFL, [], SA_RESTORER, 0x7f763b4cb340}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [SEGV], NULL, 8) = 0
kill(17004, SIGSEGV)                    = 0
--- SIGSEGV {si_signo=SIGSEGV, si_code=SI_USER, si_pid=17004, si_uid=1000} ---
+++ killed by SIGSEGV +++

benweitzman commented 9 years ago

This appears to only happen when using Snap's dynamic loader.

asilvestre commented 9 years ago

Thank you for the strace output and for the insight about Snap. Unfortunately, I can't get more information from it. I think the -f flag would be very helpful in this case, since it makes strace trace child processes too. If I understand this data correctly, the log says that a child of this process was killed by a segfault, but it doesn't show the trace of the child in question. If you can reproduce it easily with this flag, that would be perfect, although it's quite possible I won't get any further information from it either.
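
A rough sketch of what such a run could look like, assuming a Linux machine with strace installed. The helper names, trace.log, and ./my-service are placeholders made up for illustration, not names from this project.

```shell
# Run a command under strace, following child processes.
#   -f       trace children created by fork/vfork/clone as well
#   -o FILE  write the trace to FILE instead of stderr
# Usage (placeholder names): trace_with_children trace.log ./my-service
trace_with_children() {
    out=$1
    shift
    strace -f -o "$out" "$@"
}

# Print the lines of a -f trace that belong to one PID.
# With -f and -o, strace prefixes each line with the PID it came from.
lines_for_pid() {
    awk -v pid="$2" '$1 == pid' "$1"
}
```

The PID to filter on is the one the parent reports as si_pid in its SIGCHLD line (17213 in the trace above; it will differ on every run), for example: lines_for_pid trace.log 17213.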

I've never used Snap before. I will try to find some time to reproduce it, but an easy repro setup would be greatly appreciated. I would try to get the core dump when the segfault occurs and hope to learn something from it. That could work if it points to an FFI'ed C library or something similar; if it points to GHC's runtime internals, however, I would be out of luck, because I know nothing about them.
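
One possible way to capture and inspect a core dump, sketched under the assumption of a Linux machine with gdb installed. The function name, ./my-service, and ./core are placeholders; where the core file actually lands depends on the system's core_pattern setting.

```shell
# Allow core files in this shell and its children; the hard limit may forbid raising it.
ulimit -c unlimited 2>/dev/null || echo "could not raise the core size limit" >&2

# On Linux this file decides where cores go (a file name pattern or a pipe to a handler).
if [ -r /proc/sys/kernel/core_pattern ]; then
    echo "core_pattern: $(cat /proc/sys/kernel/core_pattern)"
fi

# Print a backtrace from a core file without an interactive session.
# Usage (placeholder names): backtrace_from_core ./my-service ./core
backtrace_from_core() {
    gdb -batch -ex bt "$1" "$2"
}
```

The service has to be started from the same shell after the limit is raised, since the limit is inherited by child processes. Top frames inside a C shared library would hint at FFI code; frames inside GHC-compiled code are likely to be much harder to interpret.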

Do you have any other ideas about how we could debug this further?

asilvestre commented 8 years ago

For now I will close this, since I haven't been able to reproduce it. If it's still a problem and you have any suggestions on how to debug it further, I will take another look.

asilvestre / haskell-neo4j-rest-client

Segfaults #20