emorice / galp

Incremental distributed python runner
MIT License

Debugging support #90

Closed emorice closed 1 month ago

emorice commented 1 year ago

Today I had a segfault in a step that depended on the actual arguments of the step. There is currently simply no way to debug this. At the very least, I need a way to inspect the arguments of any step that has failed. This problem had occurred before, when only some instances of a step were getting OOM-killed.

By contrast, even Nextflow can handle that, because you can navigate to a task's folder, inspect the linked files, and re-run the step in isolation; we use that feature a lot. Here we're comparatively stuck.

As a lead going forward: one of the main pain points is that we can't access the worker, since it runs in the background. But if, on error, we raised an exception in the client carrying the whole task information, it would be possible to drop into a debugger and call galp.run or whatever at the prompt to inspect the args.
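To make the idea concrete, here is a rough sketch of what such a client-side exception could look like. Nothing here is existing galp API: `TaskFailedError`, `run_or_raise` and `debug_on_failure` are hypothetical names, and the step is simply run in-process as a stand-in for a real worker (a segfault would of course have to be reported by the worker's parent rather than caught as a Python exception).

```python
import pdb


class TaskFailedError(Exception):
    """Hypothetical client-side error carrying everything needed to debug a step."""

    def __init__(self, full_name, step, task_args, task_kwargs, reason):
        super().__init__(f"task {full_name} failed: {reason}")
        self.full_name = full_name      # full name, not the short one used in logs
        self.step = step                # the callable that failed
        self.task_args = task_args      # positional arguments of the failed call
        self.task_kwargs = task_kwargs  # keyword arguments of the failed call


def run_or_raise(full_name, step, task_args, task_kwargs):
    """Stand-in for the client call: runs the step locally and wraps any failure."""
    try:
        return step(*task_args, **task_kwargs)
    except Exception as exc:
        raise TaskFailedError(
            full_name, step, task_args, task_kwargs, repr(exc)
        ) from exc


def debug_on_failure(func, *args, **kwargs):
    """Call func; on TaskFailedError, open a post-mortem debugger, then re-raise."""
    try:
        return func(*args, **kwargs)
    except TaskFailedError as err:
        # At the prompt, go `up` to this frame: `err.task_args` and
        # `err.task_kwargs` are then available, and the step can be re-run
        # by hand with `err.step(*err.task_args, **err.task_kwargs)`.
        pdb.post_mortem(err.__traceback__)
        raise
```

With something like this, wrapping the top-level call in `debug_on_failure` would leave us at a prompt with the failed task's full name and arguments at hand.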

Also, while the short names are good in logs, on errors we absolutely need the full names so that we can use them to write debugging code.

We could also consider dumping the arguments on error. The main issue is that the dump could be huge, but we can add guards, dump to a standalone file, or whatever.
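A minimal sketch of such a guarded dump, assuming the arguments are picklable. The function names, the file naming scheme and the 10 MiB default limit are all made up for illustration and are not part of galp.

```python
import pickle
from pathlib import Path

# Hypothetical guard: refuse to write dumps larger than this many bytes.
MAX_DUMP_BYTES = 10 * 1024 * 1024


def dump_task_args(full_name, task_args, task_kwargs, dump_dir,
                   max_bytes=MAX_DUMP_BYTES):
    """Write the arguments of a failed task to a standalone file.

    Returns the path of the dump, or None if the guard rejected it.
    """
    payload = pickle.dumps(
        {"name": full_name, "args": task_args, "kwargs": task_kwargs}
    )
    if len(payload) > max_bytes:
        return None  # too big: skip rather than fill the disk
    dump_dir = Path(dump_dir)
    dump_dir.mkdir(parents=True, exist_ok=True)
    # Keep the file name filesystem-safe whatever the task name contains.
    safe = "".join(c if c.isalnum() or c in "-_." else "_" for c in full_name)
    path = dump_dir / f"{safe}.args.pkl"
    path.write_bytes(payload)
    return path


def load_task_args(path):
    """Read back a dump written by dump_task_args."""
    return pickle.loads(Path(path).read_bytes())
```

Note that this serializes everything in memory before checking the size, so the guard protects the disk but not the memory of the process doing the dump.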

In summary, we lack, in order of priority:

  1. A way to inspect the arguments of the task that failed.
  2. A way to selectively re-run a task that failed in isolation.
  3. Integration with the debugger on the client.

emorice commented 1 month ago

Isolated re-run was implemented in bacc044db0f1c3292c5e1f5194795639e11fe141. An example use with a profiler was shown in 36b9b5f0ea7bf0e5c5016c8960e1520481111f93, and debugging can be done in the same way. We have also been saving logs since 7231c4e72928356d33b768ae68e642b1101fb249.