bionode / bionode-watermill

💧Bionode-Watermill: A (Not Yet Streaming) Workflow Engine
https://bionode.gitbooks.io/bionode-watermill/content/
MIT License

How can a pipeline be extended to deal with multiple samples? #47

Closed zhangzhen closed 7 years ago

zhangzhen commented 7 years ago

A pipeline is written to deal with one sample at a time. How can the pipeline be extended to deal with multiple samples?

thejmazz commented 7 years ago

Hi @zhangzhen, thanks for the question!

The pipeline returns a thunk which returns a Promise (and alternatively accepts a callback), so it would be feasible to create a pipeline for each sample and then run all of them (probably in serial rather than in parallel for now). Something like

function buildPipeline(sampleId) { ... }

// one pipeline thunk per sample
const runs = samples.map(id => buildPipeline(id))

// using bluebird's Promise.mapSeries to run the thunks one after another
const Promise = require('bluebird')

Promise.mapSeries(runs, (run) => run())
  .then((results) => console.log(results))
  .catch(console.error)

This is not using any watermill-specific API. Running a pipeline across multiple samples is something we have thought about before, but the current API is a "low level" atomic task one (join, junction, and fork also follow the task API). I think the best approach for implementing new features is to nail down the task API, build new features on top of it, try different approaches (I think Observables could be useful for this sort of thing), and then perhaps later move certain things into core.

For example, we've toyed with the idea of a stream producing a bunch of sample IDs (perhaps from filtering down a stream from bionode-ncbi) and having that stream of IDs transformed into running pipelines.
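
For illustration only, here is a rough sketch of that idea using plain Node object streams; the sample IDs and buildPipeline below are placeholders carried over from the snippet above, not watermill API:

const { Readable, Transform } = require('stream')

// a readable object stream of sample IDs (placeholder IDs for illustration)
const ids = new Readable({ objectMode: true, read () {} })
ids.push('sampleA')
ids.push('sampleB')
ids.push(null)

// transform each ID into a finished pipeline run, one at a time
const runPipelines = new Transform({
  objectMode: true,
  transform (id, enc, done) {
    buildPipeline(id)()            // thunk -> Promise, as in the snippet above
      .then(results => done(null, { id, results }))
      .catch(done)
  }
})

ids.pipe(runPipelines)
  .on('data', console.log)
  .on('error', console.error)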

zhangzhen commented 7 years ago

Thank you for your reply. There are still a few questions I would like to discuss, listed below:

  1. Multiple samples involve multiple files, including the output files generated by each task in a pipeline. In the case of multiple samples, does using glob patterns to resolve the input and output files of each task still work properly?
  2. If input data comes from another source, for example, directly from an Illumina sequencer, how is a pipeline constructed?
  3. Can the pipeline be parameterized and used from the command line?
  4. If a task is stopped unexpectedly, can it be resumed?

P.S. Your blog article "NGS Workflows" is one of the best articles that discuss and compare bioinformatics pipelines. It's very practical! I am encouraged to use watermill for my research work. Thanks for your excellent work.

thejmazz commented 7 years ago

Thank you very much for the kind remarks :)

  1. Yes, it should work properly; the directory for every task is based on that task's UID, which is built from hashes of its input, output, and params, so each task runs in its own directory (a rough sketch follows this list). The task directory can also be overridden by the user. See

  2. As long as the source is a file or a stream, it can work.

  3. Not yet - what do you have in mind for this? It's fairly simple to build Node CLIs using yargs or something similar (a minimal sketch follows this list).

  4. You can run it again, and tasks will check whether the hashes of files found on disk match the stored hashes in watermill's task metadata folder, and will skip over already completed work. However, this is not 100% functional: I think there is a case where, if you change the parameters of an upstream task, the downstream one won't necessarily be forced to rerun.
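
On question 1, a rough sketch of a task whose input and output are glob patterns, loosely following the fastq-dump example in the watermill README (the patterns and command here are illustrative, not a prescription):

const { task } = require('bionode-watermill')

// input/output are glob patterns; each task runs in its own folder derived
// from a hash of these props, so results from different runs don't collide
const fastqDump = task({
  input: '**/*.sra',
  output: '*.fastq.gz',
  name: 'fastq-dump'
}, ({ input }) => `fastq-dump --split-files --skip-technical --gzip ${input}`)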
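
On question 3, a minimal sketch of a CLI wrapper using yargs, assuming the hypothetical buildPipeline(sampleId) from earlier (the flag name is made up for illustration):

#!/usr/bin/env node
// parse command-line flags
const argv = require('yargs')
  .option('sample', { describe: 'sample ID to run the pipeline on', demandOption: true })
  .argv

// build and run the pipeline for the requested sample
buildPipeline(argv.sample)()
  .then(results => console.log(results))
  .catch(console.error)

Invoked as something like: node run-pipeline.js --sample SRR123456 (the script name and accession are placeholders).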

If you choose to use watermill in practice, please be wary that it may not cover all the use cases you desire, there will be bugs, and the tool is undergoing rapid development (we have a GSoC 2017 student now!). That being said, feedback, bug reports, suggestions for example pipelines, etc. are extremely useful.