databio / pypiper

Python toolkit for building restartable pipelines
http://pypiper.databio.org
BSD 2-Clause "Simplified" License

Can Pypiper generate a DAG to guide the execution of commands that comprise a pipeline? #189

Open zhangzhen opened 1 year ago

zhangzhen commented 1 year ago

As far as I know, Pypiper runs the commands of a pipeline sequentially, even when some of them could run concurrently. Do you plan to support concurrent execution in the near future?

Cheers, Zhen Zhang

vreuter commented 1 year ago

Hey @zhangzhen, thanks for this question and idea. I'm not currently developing pypiper, so I can't speak to definite development plans, but you're correct: there's currently no support for declaring dependencies among the steps or "stages" of a pipeline. Any concurrency would need to be implemented manually in a pipeline script. If you subclass Pipeline and define the stages, the implicit dependency structure among the steps/stages is that they're sequential and executed serially.

I could add, though, that I'd love this feature (DAG-like declaration of the relationships among the pipeline's steps, and then automatic concurrent execution where possible, based on that structure) and would definitely use it! If you're interested in prototyping, I think a PR would be welcome, certainly by me and probably by the maintainers, though that's a question for @nsheff.
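To make the DAG idea concrete, here is a minimal sketch of what "declare the relationships, then run concurrently where possible" could look like. It uses only the Python standard library (`graphlib`, available from Python 3.9) and does not touch pypiper's API. The function `run_dag`, the stage names, and `EXAMPLE_DEPENDENCIES` are all hypothetical names chosen for illustration, not anything pypiper provides.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter

# Hypothetical stage graph: each stage maps to the set of stages it depends on.
EXAMPLE_DEPENDENCIES = {
    "map": set(),
    "snv": {"map"},
    "cnv": {"map"},
    "sv": {"map"},
    "qc": {"map"},
    "report": {"snv", "cnv", "sv", "qc"},
}


def run_dag(dependencies, actions, max_workers=4):
    """Run each stage as soon as all of its prerequisites have finished.

    dependencies: dict mapping stage name -> set of prerequisite stage names
    actions: dict mapping stage name -> zero-argument callable
    Returns the stage names in the order they completed.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    finished = []
    running = {}  # Future -> stage name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Submit every stage whose prerequisites are all done.
            for stage in sorter.get_ready():
                running[pool.submit(actions[stage])] = stage
            # Block until at least one running stage finishes.
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in done:
                stage = running.pop(future)
                future.result()  # re-raise any exception from the stage
                sorter.done(stage)  # unlocks stages that depended on it
                finished.append(stage)
    return finished
```

With the example graph, `run_dag(EXAMPLE_DEPENDENCIES, actions)` would run `map` first, then the four middle stages concurrently, and `report` last. A real implementation would also need to decide how this interacts with pypiper's restart and logging behavior, which this sketch ignores.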

nsheff commented 1 year ago

You are right that pypiper is really intended to run sequentially. Our mode of operating is to parallelize by sample, rather than by task within a pipeline. This has lots of advantages and a few disadvantages, but for most of the analysis we're doing it makes a lot of sense, and you won't gain much, if any, efficiency by parallelizing by task if you're already parallelizing effectively by sample. Making your pipeline parallel by task can also add complexity to the pipeline, so it isn't always worth it.

That said, you can actually still make a pipeline parallelize tasks in pypiper if you need to; it's just not a built-in, recommended thing to do. If you want some guidance on how to do it, let me know and I can show you.

nsheff commented 1 year ago

And to directly answer your question: I am not planning to add parallelizing by task like this. But if you want to add it, I would consider a PR, as long as it's a simple solution that doesn't complicate the codebase too much.

zhangzhen commented 1 year ago

I've built bioinformatics pipelines for NGS testing in clinical oncology for more than 5 years. Pyflow and Nextflow are the pipeline frameworks I use most of the time. Pyflow is lightweight and does well in sample-level analysis, while Nextflow is heavyweight and does well in batch-level analysis. However, both adopt a monolithic approach that makes them do more than they should. The modular approach you've come up with is the better way to build pipeline frameworks. I love the philosophy behind tools such as looper, pypiper, and bulker, and it inspires me. Moreover, one of your posts helped me form a clearer picture of parallelism in bioinformatics. It's a bit of a pity that I only learned about the work you and your lab have done a few days ago.

> That said, you can actually still make a pipeline parallelize tasks in pypiper if you need to, it's just not a built-in, recommended thing to do. If you want some guidance on how to do it, let me know and I can show you.

Pipelines in clinical oncology do indeed have such needs. After read mapping, variant calling (SNV/INDEL calling, CNV calling, SV calling, etc.) and QC are often performed simultaneously. Hey @nsheff, could you please show me how to parallelize tasks within a pipeline?

Thanks a lot!
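For reference, one generic way to run independent post-mapping steps at the same time from a Python pipeline script is sketched below. This is not necessarily the approach @nsheff has in mind, and it uses only the standard library rather than pypiper's API. The helpers `run_concurrently` and `failed_tasks` are hypothetical names, and any command strings passed in stand in for real caller and QC invocations.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_concurrently(commands, max_workers=4):
    """Run independent shell commands at the same time and wait for all.

    commands: dict mapping a task label -> shell command string
    Returns a dict mapping each label -> that command's exit code.
    """
    def run_one(cmd):
        # shell=True so each string can be an ordinary command line
        return subprocess.run(cmd, shell=True).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            label: pool.submit(run_one, cmd) for label, cmd in commands.items()
        }
        # result() blocks, so this returns only after every command exits
        return {label: future.result() for label, future in futures.items()}


def failed_tasks(exit_codes):
    """Return the sorted labels of tasks that exited with a non-zero code."""
    return sorted(label for label, code in exit_codes.items() if code != 0)
```

In a script, mapping would stay an ordinary sequential step, followed by something like `codes = run_concurrently({"snv": "...", "cnv": "...", "sv": "...", "qc": "..."})` and a check that `failed_tasks(codes)` is empty before moving on. Threads suffice here because the work happens in child processes. Note that commands launched this way bypass whatever bookkeeping the pipeline framework does for its own steps.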