DataBiosphere / toil

A scalable, efficient, cross-platform (Linux/macOS) and easy-to-use workflow engine in pure Python.
http://toil.ucsc-cgl.org/.
Apache License 2.0

Improve debugging experience #3060

Open adamnovak opened 4 years ago

adamnovak commented 4 years ago

Right now, to debug Toil jobs that don't work, you are limited to:

  1. Running your workflow locally at small scale.
  2. Reading the logs from failing large-scale runs to try and identify problems.
  3. Rolling your own solution to make replicating problems observed at scale easier (like toil-vg's dumping of files sent to failing child processes to its outstore).

We would like this to be easier; diekhans wants to be able to easily reproduce and fix a segfaulting command run inside a Toil job inside of a 5-day-long Cactus workflow.

There are several levels of goodness we could implement here:

  1. Document toil debug-job, which is able to download and locally run a flaky job given the job store and the job's ID. None of the Toil devs actually know much about it. There should be a debugging or troubleshooting section in the docs, maybe under "Developing Workflows", that covers it. Maybe Toil's end-of-failing-run message could even suggest using it.
  2. toil debug-job appears to run the normal worker, meaning it's going to put its temporary files in the normal work directory and delete them when it is done. If we want to rerun a failing subprocess, we might want to make it (at least by default) put the work directory in/under the current directory, and leave it behind when the job fails for user inspection.
  3. We could possibly have some machinery to walk the input pickle for a job, identify the files it has access to, and export them to a sensible directory structure. Then you could walk through what the job is going to do with the files handy.
  4. It would be nice if Toil were smarter about detecting and reporting failing subprocess calls, as well as failing jobs. We could more or less upstream toil-vg's dumping of subprocess input files, and glue no-container and Singularity support onto Toil's docker-calling system (and/or hook the subprocess module?) so we know when external processes are called. Then when one fails, we could upload all its inputs to the file store, and save an incident report that describes what we tried to run, what the inputs were, and that it didn't work. Then we'd have a bit of machinery to dump the incident reports (maybe all together at once, as well as in the logs?), or to rehydrate one by setting up the input files again so that the user can debug just that external command on their local machine, instead of the whole Toil job.
  5. Back to toil debug-job, if it were able to work for jobs that need services (by starting the services), or if it were even just able to report out how to manually start necessary services, that could help with Cactus debugging, because some Cactus jobs we want to debug actually use the service system.
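
To make idea 4 a bit more concrete, here is a minimal sketch of what an incident report for a failing external command could look like. It uses only the Python standard library and none of Toil's actual APIs. The names `run_with_incident_report` and `rehydrate`, the `report.json` file, and the `inputs/` directory layout are all hypothetical, and a real implementation would presumably upload to the file store rather than copy to a local directory.

```python
import json
import shutil
import subprocess
from pathlib import Path


def run_with_incident_report(cmd, input_files, incident_dir):
    """Run cmd; if it fails, save its inputs and a JSON report in incident_dir.

    Hypothetical helper for illustration only; not part of Toil.
    Returns the subprocess.CompletedProcess either way.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    # A nonzero code covers normal failures and signals (negative codes,
    # e.g. a segfault) alike.
    if result.returncode != 0:
        incident_dir = Path(incident_dir)
        inputs_dir = incident_dir / "inputs"
        inputs_dir.mkdir(parents=True, exist_ok=True)
        saved = []
        for path in map(Path, input_files):
            if path.is_file():
                # Simplification: files are stored by base name, so two
                # inputs with the same name would collide.
                shutil.copy2(path, inputs_dir / path.name)
                saved.append(path.name)
        report = {
            "command": [str(arg) for arg in cmd],
            "returncode": result.returncode,
            "stderr": result.stderr,
            "inputs": saved,
        }
        (incident_dir / "report.json").write_text(json.dumps(report, indent=2))
    return result


def rehydrate(incident_dir, work_dir):
    """Copy the saved inputs into work_dir and return the recorded command."""
    incident_dir = Path(incident_dir)
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    report = json.loads((incident_dir / "report.json").read_text())
    for name in report["inputs"]:
        shutil.copy2(incident_dir / "inputs" / name, work_dir / name)
    return report["command"]
```

With something like this, the user would call `rehydrate` on a downloaded incident directory and then rerun the returned command by hand under a debugger, instead of rerunning the whole Toil job.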

┆Issue is synchronized with this Jira Epic ┆Epic: Improve debugging experience ┆Issue Number: TOIL-552

adamnovak commented 4 years ago

Related issues:

mr-c commented 4 years ago

--debugWorker flag causes the job to restart infinitely #2739

Maybe we can drop --debugWorker in favor of toil debug-job only?

Does toil debug-job work with toil-cwl-runner?

adamnovak commented 4 years ago

It should. toil-cwl-runner produces Toil jobs in the job store, and I don't think its leader does any special support work for those jobs while they run. You'd still need to get the Toil job store ID of the job you want to debug, maybe from a failure message, instead of being able to just run all the jobs in-process fishing for failures, though.

DailyDreaming commented 4 years ago

Maybe we can drop --debugWorker in favor of toil debug-job only?

The --debugWorker flag allows debugging in PyCharm and with pdb. It's something I often use, particularly with toil-cwl-runner on whole workflows. I'd argue that it's pretty important to keep.

If I don't use it, I can't set breakpoints in the CWL library files. Adding CWL support for new versions would be very difficult without it.