gap-packages / JupyterKernel

Native Jupyter kernel for GAP
https://gap-packages.github.io/JupyterKernel/
BSD 3-Clause "New" or "Revised" License

Garbage collection causes crash in HPCGAP #109

Closed stevelinton closed 3 years ago

stevelinton commented 4 years ago

This seems to happen whenever my HPCGAP kernel tries to garbage collect (or it might be something like whenever one of the worker threads triggers a garbage collection).

thread_suspend failed
Abort
0   gap                                 0x0000000103d32ba1 BacktraceHandler + 33
1   libsystem_platform.dylib            0x00007fff7d9a0b5d _sigtramp + 29
2   ???                                 0x0000000000000000 0x0 + 0
3   libsystem_c.dylib                   0x00007fff7d85a6a6 abort + 127
4   libgc.1.dylib                       0x00000001048249aa GC_stop_world + 506
5   libgc.1.dylib                       0x000000010480edae GC_stopped_mark + 46
6   libgc.1.dylib                       0x000000010480ed0f GC_try_to_collect_inner + 271
7   libgc.1.dylib                       0x000000010480fd35 GC_collect_or_expand + 181
8   libgc.1.dylib                       0x000000010480ff9f GC_allocobj + 271
9   libgc.1.dylib                       0x00000001048165a5 GC_generic_malloc_inner + 309
10  libgc.1.dylib                       0x00000001048176b3 GC_generic_malloc_many + 947
11  libgc.1.dylib                       0x0000000104817789 GC_malloc_many + 41
12  gap                                 0x0000000103e3757c AllocateBagMemory + 252
13  gap                                 0x0000000103e37649 NewBag + 73
14  gap                                 0x0000000103d2c1b8 Cyclotomic + 2232
15  gap                                 0x0000000103e34b79 FuncPROD_VECTOR_MATRIX + 1049
16  gap                                 0x0000000103d43788 EvalFunccall2args + 632
17  gap                                 0x0000000103de9ff9 ExecReturnObj + 105
18  gap                                 0x0000000103de74d4 ExecSeqStat + 100
19  gap                                 0x0000000103de6d8c EXEC_CURR_FUNC + 60
20  gap                                 0x0000000103d3f9e6 DoExecFunc1args + 438
21  gap                                 0x0000000103d434a4 EvalFunccall1args + 548
22  gap                                 0x0000000103e142ad ExecAssList + 317
23  gap                                 0x0000000103de8a9b ExecForRange + 555
24  gap                                 0x0000000103de76f4 ExecSeqStat5 + 84
25  gap                                 0x0000000103de75a4 ExecSeqStat2 + 84
26  gap                                 0x0000000103de76f4 ExecSeqStat5 + 84
27  gap                                 0x0000000103de6d8c EXEC_CURR_FUNC + 60
28  gap                                 0x0000000103d3f9e6 DoExecFunc1args + 438
29  gap                                 0x0000000103d43788 EvalFunccall2args + 632
30  gap                                 0x0000000103e13451 ExecAssLVar + 129
31  gap                                 0x0000000103de7684 ExecSeqStat4 + 84

olexandr-konovalov commented 4 years ago

Same for me, after entering GASMAN("collect"); in Jupyter:

thread_suspend failed
Abort
0   gap                                 0x00000001029793f1 BacktraceHandler + 33
1   libsystem_platform.dylib            0x00007fff626bbb5d _sigtramp + 29
2   gap                                 0x0000000102a30849 ExecReturnObj + 105
3   libsystem_c.dylib                   0x00007fff625756a6 abort + 127
4   libgc.1.dylib                       0x000000010346b9ba GC_stop_world + 506
5   libgc.1.dylib                       0x0000000103455dbe GC_stopped_mark + 46
6   libgc.1.dylib                       0x0000000103455d1f GC_try_to_collect_inner + 271
7   libgc.1.dylib                       0x000000010345672b GC_try_to_collect_general + 123
8   libgc.1.dylib                       0x00000001034567ab GC_gcollect + 11
9   gap                                 0x0000000102a7db59 CollectBags + 9
10  gap                                 0x000000010298c205 FuncGASMAN + 165
11  gap                                 0x000000010299945c IntrFuncCallEnd + 1436
12  gap                                 0x0000000102a1ee1b EvalRef + 203
13  gap                                 0x0000000102a1dfb8 ReadCallVarAss + 1368
14  gap                                 0x0000000102a1d893 ReadAtom + 51
15  gap                                 0x0000000102a1d5fb ReadFactor + 139
16  gap                                 0x0000000102a1d45a ReadTerm + 26
17  gap                                 0x0000000102a1d349 ReadAri + 25
18  gap                                 0x0000000102a1d191 ReadRel + 129
19  gap                                 0x0000000102a1cfba ReadAnd + 26
20  gap                                 0x0000000102a1bc2a ReadExpr + 26
21  gap                                 0x0000000102a1ba23 ReadEvalCommand + 1139
22  gap                                 0x0000000102a3178c READ_ALL_COMMANDS + 268
23  gap                                 0x000000010298a6fc EvalFunccall4args + 876
24  gap                                 0x0000000102a59ca1 ExecAssLVar + 129
25  gap                                 0x0000000102a2e024 ExecSeqStat7 + 84
26  gap                                 0x0000000102a2d5dc EXEC_CURR_FUNC + 60
27  gap                                 0x0000000102986236 DoExecFunc1args + 438
28  gap                                 0x0000000102989cf4 EvalFunccall1args + 548
29  gap                                 0x0000000102988b08 ExecProccall3args + 712
30  gap                                 0x0000000102a2de64 ExecSeqStat3 + 84
31  gap                                 0x0000000102a2df44 ExecSeqStat5 + 84

stevelinton commented 4 years ago

@rbehrends @markuspf, if either of you has any idea, it would be really helpful for project demos next week. Both of us are on OS X, of course. I'll try on Linux.

olexandr-konovalov commented 4 years ago

Does not happen in the terminal. IO is claimed to be somewhat HPC-GAP compatible at https://github.com/gap-system/gap/wiki/Building-HPC-GAP, but nothing is known about the other dependencies of JupyterKernel.

rbehrends commented 4 years ago

What version are you using? I currently have a WIP pull request here where I'm working on outstanding issues with HPC-GAP. I've been testing this PR fairly constantly the past couple of weeks, and while there is a problem with some packages raising guard errors, I haven't encountered crashes. (Note that this version needs to be built with ./configure --enable-hpcgap --enable-guards for now if you want guards; the default for guards is temporarily off until we are ready to merge the unsafe functions autogenerated by unward.)

Also, are you using GAP as an application or library? I haven't tested HPC-GAP as a library yet, but part of the work in the PR above is to make this possible (configure option --with-native-tls). It may still be necessary to register threads manually if you use HPC-GAP as a library.

stevelinton commented 4 years ago

@rbehrends I'm running current GAP master with packages from make bootstrap-pkg-full, except for a current git version of ZeroMQInterface, compiled with --enable-hpcgap but nothing about guards. It's running as an application, using the JupyterKernel package (which uses ZeroMQInterface and IO) to run a Jupyter kernel. That all works, and I get a worksheet in which I can run parallel code and so on, except that when it garbage collects the server crashes with the stack trace above. I've attempted to test on Linux: I have all the necessary versions of things there, compiled, and I can load JupyterKernel, but I don't actually have Jupyter installed on that machine, so I still can't test it there. On either system, running GAP from a terminal, loading JupyterKernel and then garbage collecting is fine. The problem only seems to appear once it's actually running as a server.

rbehrends commented 4 years ago

Okay, this looks like it's connected to Jupyter somehow; this is going to take some work to figure out. I hope to have at least an idea of what's breaking here by tomorrow.

For what it's worth, HPC-GAP should work from the master branch, but it won't have unward support (and therefore no guards). But that should not crash the GC.


A side note on unward: The way it works is that we've put the guards in PTR_BAG() and CONST_PTR_BAG(). This way, every bag access is automatically guarded. Unfortunately, this is too strict, because the kernel occasionally needs to ignore region membership. Unward is a tool that recognizes regions of code that shouldn't have guards (bracketed by #ifndef WARD_ENABLED ... #endif) and rewrites any such code to use alternate versions of the functions that don't have guards.

In short, guards are the default for every object access, and unward removes them from places where they shouldn't be (i.e. the opposite of what ward did). This is considerably easier than ward, as we need only a very simplistic parser and don't have to parse all of C.

For this to work, we need a couple of changes (most importantly, the functions UNSAFE_PTR_BAG() and UNSAFE_CONST_PTR_BAG() that don't have guards) that are currently only in the PR mentioned above, but not yet on the master branch.

As a result, --enable-hpcgap works on the master branch, but does not have support for unward and therefore not for guards.
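
To make the guard idea concrete, here is a minimal C sketch. It is not GAP's code: `ToyBag`, `toy_ptr_bag()`, `toy_unsafe_ptr_bag()`, `toy_current_region` and `toy_guard_failures` are hypothetical stand-ins, and the sketch assumes that a guard is nothing more than a region-membership test on every access.

```c
#include <stddef.h>

/* Hypothetical, simplified bag: a payload plus the region that owns it. */
typedef struct {
    int   region;  /* id of the region this bag belongs to */
    void *data;    /* the bag's contents */
} ToyBag;

/* Stand-in for "the region the current thread may access". */
int toy_current_region = 1;

/* Counts rejected accesses; a real system would raise an error instead. */
int toy_guard_failures = 0;

/* Unguarded accessor: returns the payload without any region check. */
void *toy_unsafe_ptr_bag(const ToyBag *bag)
{
    return bag->data;
}

/* Guarded accessor: every access goes through the region check first. */
void *toy_ptr_bag(const ToyBag *bag)
{
    if (bag->region != toy_current_region) {
        toy_guard_failures++;
        return NULL;
    }
    return toy_unsafe_ptr_bag(bag);
}
```

In this picture, a rewriting tool in the spirit of unward would leave calls to the guarded accessor untouched everywhere except inside code marked as guard-free, where it would substitute the unguarded variant.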

fingolfin commented 4 years ago

Hmm, wasn't JupyterKernel using fork() somehow? Maybe I misremember, though... but if not, could that perhaps be related?

stevelinton commented 4 years ago

It forks a process to answer the heartbeat messages. I wouldn’t expect that one ever to garbage collect.
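
As a rough picture of that arrangement, the following C sketch forks a child whose only job is to echo messages back. It is an assumption-laden stand-in: JupyterKernel talks over ZeroMQ, whereas this sketch uses a plain `socketpair()`, and `heartbeat_loop()` and `heartbeat_roundtrip()` are made-up names. The child works entirely out of a fixed stack buffer, which is the sense in which such a process would have no reason to allocate or collect.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Child side: echo every message until the peer closes the socket. */
static void heartbeat_loop(int fd)
{
    char buf[64];
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0) {
        if (write(fd, buf, (size_t)n) != n)
            break;
    }
}

/* Forks a heartbeat responder and sends it one ping.
 * Returns 1 if the ping came back unchanged, 0 if not, -1 on error. */
int heartbeat_roundtrip(const char *ping)
{
    int fds[2];
    char reply[64];
    size_t len = strlen(ping);
    int ok = -1;

    if (len == 0 || len > sizeof reply)
        return -1;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: keep only its end of the socket and answer heartbeats. */
        close(fds[0]);
        heartbeat_loop(fds[1]);
        _exit(0);
    }

    /* Parent: send the ping and collect the echoed bytes. */
    close(fds[1]);
    if (write(fds[0], ping, len) == (ssize_t)len) {
        size_t got = 0;
        ssize_t n;
        while (got < len && (n = read(fds[0], reply + got, len - got)) > 0)
            got += (size_t)n;
        ok = (got == len && memcmp(ping, reply, len) == 0);
    }
    close(fds[0]);          /* end-of-file lets the child leave its loop */
    waitpid(pid, NULL, 0);
    return ok;
}
```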

Steve

rbehrends commented 4 years ago

Debugging this is still a bit problematic, as I can only reproduce the bug when running from the Jupyter console/notebook, but basically, the key here is the initial message, thread_suspend failed. This means that during the garbage collection, the GC was unable to pause all the threads. I'm not yet sure why that happens, but that's the underlying cause, not any actual GC problems. There may indeed be an interaction with fork() here, due to the usual frustrating interactions between fork() and threads, but I don't know yet for certain.

rbehrends commented 4 years ago

Update: right now, it looks like fork() may indeed be at fault here. After a fork(), only the current thread continues to exist in the forked process, but the GC doesn't know of that. The Boehm GC has hooks for that, so this part is fixable. However, any threads started by GAP will still be dead. Starting GAP with -S will have one concurrent thread started by default (I think it's a worker thread for tasks), but I think we can do something here to give HPC-GAP a startup option for single-threaded mode to allow fork() to work.
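
The "only the current thread continues to exist" behaviour can be observed with plain POSIX threads, independently of GAP or the GC. The sketch below is only an illustration; `ticker`, `ticks`, `stop` and `ticker_survives_fork()` are made-up names. A background thread bumps a counter, the process forks, and the child checks whether the counter still moves.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <pthread.h>
#include <stdatomic.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static atomic_long ticks;  /* bumped by the background thread */
static atomic_int  stop;   /* set to 1 to ask the thread to finish */

static void sleep_ms(long ms)
{
    struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
}

/* Background thread: stands in for any thread a collector would track. */
static void *ticker(void *arg)
{
    (void)arg;
    while (!atomic_load(&stop)) {
        atomic_fetch_add(&ticks, 1);
        sleep_ms(1);
    }
    return NULL;
}

/* Returns 1 if the ticker is still running in a forked child,
 * 0 if it is gone there, and -1 on error. */
int ticker_survives_fork(void)
{
    pthread_t tid;
    int result = -1;

    atomic_store(&ticks, 0);
    atomic_store(&stop, 0);
    if (pthread_create(&tid, NULL, ticker, NULL) != 0)
        return -1;
    while (atomic_load(&ticks) == 0)   /* wait until the thread has started */
        sleep_ms(1);

    pid_t pid = fork();
    if (pid == 0) {
        /* Child: memory was copied, but only the forking thread runs here. */
        long before = atomic_load(&ticks);
        sleep_ms(100);
        _exit(atomic_load(&ticks) != before ? 1 : 0);
    }
    if (pid > 0) {
        int status = 0;
        if (waitpid(pid, &status, 0) == pid && WIFEXITED(status))
            result = WEXITSTATUS(status);
    }

    /* Parent: the ticker is still alive here, so shut it down cleanly. */
    atomic_store(&stop, 1);
    pthread_join(tid, NULL);
    return result;
}
```

On a POSIX system `ticker_survives_fork()` is expected to return 0: the counter's value is copied into the child, but nothing advances it there. Any bookkeeping that still lists the ticker as a live thread in the child is therefore stale, which matches the `thread_suspend failed` symptom.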

rbehrends commented 4 years ago

Update 2: Inserting GC_set_handle_fork(1); before GC_init(); in src/boehm_gc.c seems to prevent the crash, but we're still going to need to avoid the dead worker thread. I'll look into that tonight.
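
The sketch below does not use the Boehm GC. It shows the general POSIX mechanism for repairing per-thread bookkeeping across `fork()`, namely `pthread_atfork()`, applied to a toy thread registry. How the collector implements fork handling on a given platform is not something this sketch claims to reproduce, and `registry_init()`, `registry_add_thread()`, `registry_count()` and `registry_count_in_child()` are hypothetical names. As with the one-line fix above, fork handling has to be requested at initialisation time.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Toy stand-in for a collector's table of threads it must suspend. */
static int registered_threads;
static pthread_mutex_t registry_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hold the lock across fork() so the child sees a consistent table. */
static void before_fork(void) { pthread_mutex_lock(&registry_lock); }
static void in_parent(void)   { pthread_mutex_unlock(&registry_lock); }

static void in_child(void)
{
    /* Only the forking thread exists in the child: forget the others. */
    registered_threads = 1;
    pthread_mutex_unlock(&registry_lock);
}

/* Resets the registry to "main thread only". With handle_fork != 0 the
 * fork handlers are installed as well. Returns 0 on success. */
int registry_init(int handle_fork)
{
    registered_threads = 1;
    if (handle_fork)
        return pthread_atfork(before_fork, in_parent, in_child);
    return 0;
}

/* Records one more thread (no real thread is started in this sketch). */
void registry_add_thread(void)
{
    pthread_mutex_lock(&registry_lock);
    registered_threads++;
    pthread_mutex_unlock(&registry_lock);
}

int registry_count(void)
{
    pthread_mutex_lock(&registry_lock);
    int n = registered_threads;
    pthread_mutex_unlock(&registry_lock);
    return n;
}

/* Forks and reports how many threads the child believes exist,
 * or -1 on error. */
int registry_count_in_child(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(registered_threads);

    int status = 0;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Without the handlers the child inherits the parent's thread count unchanged; with them it starts from a table that lists only itself. This fixes the bookkeeping only: the threads themselves are still absent in the child.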

stevelinton commented 4 years ago

Thanks. One of the two forked processes in JupyterKernel has a very limited role (answering the heartbeat), so it doesn’t matter too much if it is missing a worker thread, and it will probably never need to GC. The other one does all the work and I want to be able to use HPCGAP functionality there. I’m not sure which is child and which is parent.

Steve

rbehrends commented 4 years ago

I've put up a pull request with the changes here. It is still somewhat raw, but I figured that people would like trying it out anyway.

The thread that was running in -S mode was actually the signal handler thread, not a worker thread. Worker threads for tasks are only started on demand.

olexandr-konovalov commented 4 years ago

@rbehrends thanks - for me it does not crash any more!

olexandr-konovalov commented 3 years ago

No crashes any more after @rbehrends's PR - closing now.