Closed zhangzhen closed 7 years ago
Hi @zhangzhen, thanks for the question!
The pipeline returns a thunk which returns a Promise (it also alternatively accepts a callback), so it would be feasible to create a pipeline for each sample and then run all of them (probably in serial rather than parallel for now). Something like:
```js
function buildPipeline(sampleId) { ... }

const runs = samples.map(id => buildPipeline(id))

// using bluebird
Promise.mapSeries(runs, (run) => run())
  .then((results) => console.log(results))
  .catch(console.error)
```
This is not using any watermill-specific API. Running a pipeline across multiple samples is something we have thought about before, but the current API so far is a "low level" atomic task one. (`join`, `junction`, and `fork` also follow the `task` API.) I think the best approach for implementing new features is to nail down the `task` API, build new features on top of it, try different approaches (I think `Observable`s could be useful for this sort of stuff), and then perhaps later we can move certain things into core.
For example, we've mused over the idea of a stream producing a bunch of sample IDs (perhaps from filtering down a stream from `bionode-ncbi`) and having that stream of IDs transformed into running pipelines.
Thank you for your reply. There are still some questions I would like to discuss, listed below:
P.S. Your blog article "NGS Workflows" is one of the best articles that discuss and compare bioinformatics pipelines. It's very practical! I am encouraged to use your watermill to do my research work. Thanks for your excellent work.
Thank you very much for the kind remarks :)
Yes, it should work properly; the directory for every task is based on that task's UID, which is built from hashes of the input, output, and params. The task directory can also be overridden by the user. See
As long as the source is a file or a stream, it should work.
Not yet - what do you have in mind for this? It's fairly simple to build node CLIs using yargs or something similar.
You can run it again: tasks will check whether the hashes of files found on disk match the stored hashes in watermill's task metadata folder, and will skip over already completed work. However, this is not 100% functional - I think there is a case where, if you change parameters in an upstream task, the downstream one won't necessarily be forced to rerun.
If you choose to use watermill in practice, please be wary that it may not cover all the use cases you desire, there will be bugs, and the tool is undergoing rapid development (we have a GSOC 2017 student now!). That being said, feedback, bug reports, suggestions for example pipelines, etc. are extremely useful.
A pipeline is written to deal with one sample at a time. How can the pipeline be extended to deal with multiple samples?