Duke-GCB / bespin

Reproducible genomic workflows in the cloud

Track and report workflow progress #1

Open dleehr opened 6 years ago

dleehr commented 6 years ago

During testing, bioinformaticians have repeatedly asked about the status of running workflows: "It's been running for 4 days, is it stuck? What step is it on? When will it be done?"

Currently, bespin/lando report a handful of high-level job states: creating a VM, downloading data, running workflow, uploading data, terminating VM. These are visible in the API, UI, and CLI, but they don't show how far along the workflow is or when it is likely to finish.

We (IT) can log in to the VMs and tail a log file or look at the running processes/Docker containers to answer these questions. That works in the short term, but it's not a viable long-term strategy.

We considered some approaches based on the worker log files (see https://github.com/Duke-GCB/lando/issues/46 for background), but we don't want to implement this in a way that depends on a specific CWL engine.

Ideally, we could report progress in terms of a semantic unit like a "sample", rather than a file or workflow step.

Thoughts?

dleehr commented 6 years ago

Side note: I filed this issue in the bespin repo rather than a specific component repo (e.g. bespin-api, lando, or bespin-cwl) since it's too early to determine what work belongs in which repo.

dleehr commented 6 years ago

One idea to make this info more transparent/accessible to the GAAB core: give them SSH access to the VMs while workflows are running, for easy status checking. It has some merits and some demerits, but it may also help demystify bespin operations.

dleehr commented 6 years ago

Idea that I've been prototyping: annotate workflows with progress-reporting steps. Before and after each major step, the workflow emits a log message that can include meaningful metadata (e.g. the sample name and step name in context). See wrap-progress.sh
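To illustrate the wrapping idea, here is a minimal Python sketch. It is not wrap-progress.sh; the `PROGRESS ` line prefix, the JSON field names, and the function names `emit` and `run_with_progress` are all assumptions made for this example.

```python
import json
import subprocess
import sys
import time


def emit(event, step, sample, stream=sys.stderr, **extra):
    """Write one progress record as a single prefixed JSON line.

    The "PROGRESS " prefix and the field names are hypothetical.
    """
    record = {"event": event, "step": step, "sample": sample, "time": time.time()}
    record.update(extra)
    stream.write("PROGRESS " + json.dumps(record, sort_keys=True) + "\n")
    stream.flush()


def run_with_progress(step, sample, command, stream=sys.stderr):
    """Run `command` (an argv list), emitting a message before and after it."""
    emit("start", step, sample, stream=stream)
    result = subprocess.run(command)
    emit("end", step, sample, stream=stream, exit_code=result.returncode)
    return result.returncode
```

A workflow step would then invoke its tool through something like `run_with_progress("align", "sample_A", ["bwa", "mem", ...])`, where the step name, sample name, and command are placeholders. Since the messages are plain log lines, this approach may stay independent of the CWL engine running the step.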

These messages are collected by the worker VM running the job and relayed back to bespin-api, where they are parsed/processed into progress.
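As a rough sketch of the parsing side, the function below turns relayed log lines into per-sample progress. The message format (a `PROGRESS ` prefix followed by JSON with `event`, `step`, and `sample` fields) and the function name `summarize` are assumptions for illustration, not the format bespin-api actually uses.

```python
import json
from collections import defaultdict

PREFIX = "PROGRESS "  # hypothetical marker for progress lines


def summarize(log_lines):
    """Return {sample: {"done": [...], "running": [...]}} from log lines."""
    progress = defaultdict(lambda: {"done": [], "running": []})
    for line in log_lines:
        if not line.startswith(PREFIX):
            continue  # ordinary tool output, not a progress record
        try:
            record = json.loads(line[len(PREFIX):])
        except json.JSONDecodeError:
            continue  # skip malformed records rather than fail the job
        if not isinstance(record, dict):
            continue
        sample, step = record.get("sample"), record.get("step")
        if sample is None or step is None:
            continue
        state = progress[sample]
        if record.get("event") == "start":
            if step not in state["running"]:
                state["running"].append(step)
        elif record.get("event") == "end":
            if step in state["running"]:
                state["running"].remove(step)
            state["done"].append(step)
    return dict(progress)
```

Grouping by sample rather than by file or step matches the goal of reporting around a semantic unit: a status view could show, for each sample, which steps have finished and which are still running.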

dleehr commented 6 years ago

I plan to bring this up at tomorrow's CWL community meeting to ask how others are reporting progress for big workflows.

dleehr commented 6 years ago

Discussed on today's CWL community meeting; I asked whether others had experience reporting workflow progress. Some notes: