Closed stevelinton closed 3 years ago
Same for me, after entering GASMAN("collect"); in Jupyter:
thread_suspend failed
Abort
0 gap 0x00000001029793f1 BacktraceHandler + 33
1 libsystem_platform.dylib 0x00007fff626bbb5d _sigtramp + 29
2 gap 0x0000000102a30849 ExecReturnObj + 105
3 libsystem_c.dylib 0x00007fff625756a6 abort + 127
4 libgc.1.dylib 0x000000010346b9ba GC_stop_world + 506
5 libgc.1.dylib 0x0000000103455dbe GC_stopped_mark + 46
6 libgc.1.dylib 0x0000000103455d1f GC_try_to_collect_inner + 271
7 libgc.1.dylib 0x000000010345672b GC_try_to_collect_general + 123
8 libgc.1.dylib 0x00000001034567ab GC_gcollect + 11
9 gap 0x0000000102a7db59 CollectBags + 9
10 gap 0x000000010298c205 FuncGASMAN + 165
11 gap 0x000000010299945c IntrFuncCallEnd + 1436
12 gap 0x0000000102a1ee1b EvalRef + 203
13 gap 0x0000000102a1dfb8 ReadCallVarAss + 1368
14 gap 0x0000000102a1d893 ReadAtom + 51
15 gap 0x0000000102a1d5fb ReadFactor + 139
16 gap 0x0000000102a1d45a ReadTerm + 26
17 gap 0x0000000102a1d349 ReadAri + 25
18 gap 0x0000000102a1d191 ReadRel + 129
19 gap 0x0000000102a1cfba ReadAnd + 26
20 gap 0x0000000102a1bc2a ReadExpr + 26
21 gap 0x0000000102a1ba23 ReadEvalCommand + 1139
22 gap 0x0000000102a3178c READ_ALL_COMMANDS + 268
23 gap 0x000000010298a6fc EvalFunccall4args + 876
24 gap 0x0000000102a59ca1 ExecAssLVar + 129
25 gap 0x0000000102a2e024 ExecSeqStat7 + 84
26 gap 0x0000000102a2d5dc EXEC_CURR_FUNC + 60
27 gap 0x0000000102986236 DoExecFunc1args + 438
28 gap 0x0000000102989cf4 EvalFunccall1args + 548
29 gap 0x0000000102988b08 ExecProccall3args + 712
30 gap 0x0000000102a2de64 ExecSeqStat3 + 84
31 gap 0x0000000102a2df44 ExecSeqStat5 + 84
@rbehrends @markuspf if either of you have any idea. It would be really helpful for project demos next week. Both of us are on OS X of course. I'll try on Linux.
Does not happen in the terminal. IO is claimed to be somewhat HPC-GAP compatible at https://github.com/gap-system/gap/wiki/Building-HPC-GAP, but nothing known about other dependencies of JupyterKernel.
What version are you using? I currently have a WIP pull request here where I'm working on outstanding issues with HPC-GAP. I've been testing this PR fairly constantly the past couple of weeks, and while there is a problem with some packages raising guard errors, I haven't encountered crashes. (Note that this version needs to be built with ./configure --enable-hpcgap --enable-guards for now if you want guards; the default for guards is temporarily off until we are ready to merge the unsafe functions autogenerated by unward.)
Also, are you using GAP as an application or library? I haven't tested HPC-GAP as a library yet, but part of the work in the PR above is to make this possible (configure option --with-native-tls). It may still be necessary to register threads manually if you use HPC-GAP as a library.
@rbehrends: I'm running current GAP master with packages from make bootstrap-pkg-full, except for a current git version of ZeroMQInterface. It was compiled with --enable-hpcgap but nothing about guards.
It's running as an application, using the JupyterKernel package (which uses ZeroMQInterface and IO) to run a Jupyter kernel. That all works, and I get a worksheet in which I can run parallel code and so on, except that when it garbage collects the server crashes with the stack trace above. I attempted to test on Linux, where I have all the necessary versions of things compiled and can load JupyterKernel, but I don't actually have Jupyter installed on that machine, so I still can't test it there. On either system, running GAP from a terminal, loading JupyterKernel and then garbage collecting is fine. The problem only seems to occur once it's actually running as a server.
Okay, this looks like it's connected to Jupyter somehow; this is going to take some work to figure out. I hope to have at least an idea of what's breaking here by tomorrow.
For what it's worth, HPC-GAP should work from the master branch, but it won't have unward support (and therefore no guards). But that should not crash the GC.
A side note on unward: The way it works is that we've put the guards in PTR_BAG() and CONST_PTR_BAG(). This way, every bag access is automatically guarded. Unfortunately, this is too strict, because the kernel occasionally needs to ignore region membership. Unward is a tool that recognizes regions of code that shouldn't have guards (bracketed by #ifndef WARD_ENABLED ... #endif) and rewrites any such code to use alternate versions of the functions that don't have guards.
In short, guards are the default for every object access, and unward removes them from places where they shouldn't be (i.e. the opposite of what ward did). This is considerably easier than ward, as we need only a very simplistic parser and don't have to parse all of C.
For this to work, we need a couple of changes (most importantly, the functions UNSAFE_PTR_BAG() and UNSAFE_CONST_PTR_BAG() that don't have guards) that are currently only in the PR mentioned above, but not yet on the master branch.
As a result, --enable-hpcgap works on the master branch, but does not have support for unward and therefore not for guards.
Hmm, wasn't JupyterKernel using fork() somehow? Maybe I misremember, though... but if not, could that perhaps be related?
It forks a process to answer the heartbeat messages. I wouldn’t expect that one ever to garbage collect.
Steve
Debugging this is still a bit problematic, as I can only reproduce the bug when running from the Jupyter console/notebook, but basically, the key here is the initial message, thread_suspend failed. This means that during the garbage collection, the GC was unable to pause all the threads. I'm not yet sure why that happens, but that's the underlying cause, not any actual GC problems. There may indeed be an interaction with fork() here, due to the usual frustrating interactions between fork() and threads, but I don't know yet for certain.
Update: right now, it looks like fork() may indeed be at fault here. After a fork(), only the current thread continues to exist in the forked process, but the GC doesn't know of that. The Boehm GC has hooks for that, so this part is fixable. However, any threads started by GAP will still be dead. Starting GAP with -S will have one concurrent thread started by default (I think it's a worker thread for tasks), but I think we can do something here to give HPC-GAP a startup option for single-threaded mode to allow fork() to work.
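For readers who want to see the fork() behaviour described above in isolation, here is a minimal sketch in Python (not GAP or Boehm GC code). The function name count_threads_after_fork is made up for this illustration. It starts a few idle threads, forks, and has the child report how many threads it still has; the count comes from Python's own thread bookkeeping, which is assumed here to reflect what the operating system did.

```python
import os
import threading
import warnings


def count_threads_after_fork(n_workers=3):
    """Return (threads alive in the parent before fork, threads alive in the child)."""
    stop = threading.Event()
    # Idle worker threads that simply block until told to stop.
    workers = [threading.Thread(target=stop.wait) for _ in range(n_workers)]
    for t in workers:
        t.start()
    before = threading.active_count()

    read_fd, write_fd = os.pipe()
    with warnings.catch_warnings():
        # Newer Python versions warn about forking a multi-threaded process,
        # which is exactly the hazard being demonstrated.
        warnings.simplefilter("ignore", DeprecationWarning)
        pid = os.fork()
    if pid == 0:
        # Child: only the thread that called fork() was copied.
        os.close(read_fd)
        os.write(write_fd, str(threading.active_count()).encode())
        os._exit(0)

    # Parent: read the child's report, then clean up.
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        after = int(pipe.read())
    os.waitpid(pid, 0)
    stop.set()
    for t in workers:
        t.join()
    return before, after
```

Run from a script with no other threads, the parent count is n_workers + 1 and the child count is 1. Any table of threads built before the fork (such as the one a collector uses to suspend threads) is therefore stale in the child unless something resets it.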
Update 2: Inserting GC_set_handle_fork(1); before GC_init(); in src/boehm_gc.c seems to prevent the crash, but we're still going to need to avoid the dead worker thread. I'll look into that tonight.
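As a generic illustration of what enabling fork handling buys you, here is a toy sketch in Python using os.register_at_fork. It is not Boehm GC's implementation: ToyThreadTable, TABLE, and known_threads_in_child are hypothetical names, and the negative ids are fake stand-ins for threads that exist only in the parent. The handle_fork flag loosely plays the role that GC_set_handle_fork(1) plays in the fix above.

```python
import os
import threading


class ToyThreadTable:
    """Toy stand-in for a collector's table of threads it must suspend."""

    def __init__(self):
        self.handle_fork = False  # hypothetical analogue of enabling fork handling
        self.known = set()

    def after_fork_in_child(self):
        # In the child only the forking thread exists, so drop stale entries.
        if self.handle_fork:
            self.known = {threading.get_ident()}


TABLE = ToyThreadTable()
# The hook runs in the child right after every os.fork() in this process.
os.register_at_fork(after_in_child=TABLE.after_fork_in_child)


def known_threads_in_child(handle_fork):
    """Fork once and return how many threads the child's table still lists."""
    TABLE.handle_fork = handle_fork
    # The current thread plus three fake ids standing in for other threads.
    TABLE.known = {threading.get_ident(), -1, -2, -3}

    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        os.close(read_fd)
        os.write(write_fd, str(len(TABLE.known)).encode())
        os._exit(0)

    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        count = int(pipe.read())
    os.waitpid(pid, 0)
    return count
```

With handling off, the child's table still lists four threads, three of which no longer exist, which is the situation in which a suspend request would fail. With handling on, the table lists only the surviving thread.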
Thanks. One of the two forked processes in JupyterKernel has a very limited role (answering the heartbeat), so it doesn’t matter too much if it is missing a worker thread, and it will probably never need to GC. The other one does all the work and I want to be able to use HPC-GAP functionality there. I’m not sure which is child and which is parent.
Steve
I've put up a pull request with the changes here. It is still somewhat raw, but I figured that people would like trying it out anyway.
The thread that was running in -S mode was actually the signal handler thread, not a worker thread. Worker threads for tasks are only started on demand.
@rbehrends thanks - for me it does not crash any more!
It no longer crashes after @rbehrends's PR, so I'm closing this now.
This seems to happen whenever my HPCGAP kernel tries to garbage collect (or it might be something like whenever one of the worker threads triggers a garbage collection).