flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

should we implement remote gdb? #2611

Closed garlick closed 4 years ago

garlick commented 4 years ago

I haven't been tracking the debugger integration, but did have one thought. Would it be useful as a sort of baseline debugger support or test case to offer remote gdb? The "gdbserver" is usually used with embedded systems over serial, but it seems like it could easily work under flux - maybe as a shell service?

https://sourceware.org/gdb/current/onlinedocs/gdb/Remote-Debugging.html#Remote-Debugging
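For illustration, here is a minimal sketch of what a per-task remote gdb launch could look like. It uses only the documented `gdbserver COMM PROGRAM [ARGS...]` command line and the gdb `target remote HOST:PORT` command. The helper names, the one-port-per-rank scheme, and the host names are assumptions made up for this example. Nothing here is an existing Flux shell service or API.

```python
from typing import Dict, List, Sequence, Tuple


def gdbserver_argv(port: int, program: str, args: Sequence[str] = ()) -> List[str]:
    """Build a gdbserver command line that listens on a TCP port.

    An empty host before the colon is the usual form on the target side.
    """
    return ["gdbserver", f":{port}", program, *args]


def gdb_connect_command(host: str, port: int) -> str:
    """Return the gdb command that connects to a waiting gdbserver."""
    return f"target remote {host}:{port}"


def plan_remote_debug(
    hosts_by_rank: Dict[int, str],
    base_port: int,
    program: str,
    args: Sequence[str] = (),
) -> Dict[int, Tuple[List[str], str]]:
    """Hypothetical planner: one gdbserver per task, port = base_port + rank.

    Returns, for each rank, the argv to run on the task's node and the
    command to type into a gdb session on the front end.
    """
    plan = {}
    for rank, host in sorted(hosts_by_rank.items()):
        port = base_port + rank
        plan[rank] = (
            gdbserver_argv(port, program, args),
            gdb_connect_command(host, port),
        )
    return plan


if __name__ == "__main__":
    # Hypothetical two-task job on two made-up hosts.
    for rank, (argv, connect) in plan_remote_debug(
        {0: "node1", 1: "node2"}, 2345, "./a.out"
    ).items():
        print(rank, " ".join(argv), "|", connect)
```

A real integration would also need to choose free ports and forward or secure the connections, which this sketch leaves out.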

dongahn commented 4 years ago

Sounds reasonable as a long term goal.

IMHO, our priority at this point should be to support our own HPC toolset -- more parallel tools like TotalView, DDT and STAT. Cray also has gdb4HPC, valgrind4HPC, a comparative debugger, etc., which I plan to enable by porting Cray's common tools interface (CTI) to Flux.

grondo commented 4 years ago

For testing, we might not need remote debugger support since all processes are running locally for our test_under_flux cases. Also, it seems like ptrace attach to job tasks is outside the purview of flux testing, since this is something implemented by parallel debuggers.

What we do need to test with Flux is that the MPIR support works, and we do have some built-in tests for it that do not require actually attaching to the remote tasks.

Does gdb already support parallel debugging or would this feature only work for singleton jobs?

dongahn commented 4 years ago

> Does gdb already support parallel debugging or would this feature only work for singleton jobs?

Gdb has a multi-process debugging mode, but it probably only supports processes on a single node.

The true parallel debugging support would be through Arm Allinea DDT.

grondo commented 4 years ago

Ah, I do think DDT uses gdbserver, so maybe we can eventually make the startup more efficient by implementing @garlick's idea?

dongahn commented 4 years ago

> Also, it seems like ptrace attach to job tasks is outside the purview of flux testing, since this is something implemented by parallel debuggers.

I can imagine a tester program, co-located with the MPI processes and run with stop-tasks-in-exec, that checks whether the state of the tasks is "ptrace stop" and then continues the processes with SIGCONT.
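A rough sketch of such a tester, assuming Linux and that the tester already knows the task PIDs (how it obtains them is outside the sketch). It reads the state field from `/proc/<pid>/stat`, where per proc(5) `T` means stopped on a signal and `t` means tracing stop, and sends SIGCONT to tasks found in either state. The function names are made up for this example, and this is not the existing shell MPIR test.

```python
import os
import signal


def parse_state(stat_line: str) -> str:
    """Extract the one-letter process state from a /proc/<pid>/stat line.

    The command name (field 2) is wrapped in parentheses and may itself
    contain spaces or parentheses, so split on the LAST ')' first.
    """
    return stat_line.rsplit(")", 1)[1].split()[0]


def task_state(pid: int) -> str:
    """Read the current state letter of a process (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_state(f.read())


def is_stopped(state: str) -> bool:
    """True for 'T' (stopped on a signal) or 't' (tracing stop)."""
    return state in ("T", "t")


def continue_if_stopped(pid: int) -> bool:
    """Send SIGCONT to pid if it is stopped; report whether it was."""
    if is_stopped(task_state(pid)):
        os.kill(pid, signal.SIGCONT)
        return True
    return False
```

Note that a task still attached to a tracer may not resume on SIGCONT alone. The sketch assumes the tasks were left in a plain stopped state.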

dongahn commented 4 years ago

> Ah, I do think DDT uses gdbserver, so maybe we can eventually make the startup more efficient by implementing @garlick's idea?

Not sure if they use gdb or gdbserver. When I checked this way back, they used gdb. I think this should be discussed with Allinea DDT separately, as opposed to making the decision ourselves.

grondo commented 4 years ago

> I can imagine a tester program, co-located with the MPI processes and run with stop-tasks-in-exec, that checks whether the state of the tasks is "ptrace stop" and then continues the processes with SIGCONT.

This is already essentially done in the shell mpir tests.

dongahn commented 4 years ago

@grondo: Great! One less thing to do for this PR :-)

garlick commented 4 years ago

I'm happy to close this if it is not something useful to investigate in the near term and we want to drive our work with actual use cases. Give me a thumbs up if this should be closed :-)

garlick commented 4 years ago

Thanks!