databio / pypiper

Python toolkit for building restartable pipelines
http://pypiper.databio.org
BSD 2-Clause "Simplified" License

how to deal with huge fastq files #229

Open · zhangzhen opened 1 month ago

zhangzhen commented 1 month ago

Nextflow uses a scatter-gather approach to process huge FASTQ files: first, split one huge FASTQ file into multiple smaller FASTQ files; then submit a job to the batch system for each smaller file; finally, merge the per-file results into the sample-level result. What is the pypiperic way to do that?
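For concreteness, here is a minimal sketch of those three steps done serially inside a single pypiper pipeline. The input path, chunk size, and the `process_chunk` command are placeholders, and pypiper itself does not fan the per-chunk jobs out to a batch system; this only illustrates the scatter/process/gather structure being asked about.

```python
#!/usr/bin/env python
"""Illustrative scatter-gather sketch inside one pypiper pipeline.

`sample.fastq` and `process_chunk` are hypothetical stand-ins for the
real input and per-chunk command; chunk size is arbitrary.
"""
import glob
import os

import pypiper

pm = pypiper.PipelineManager(name="scatter_gather_demo", outfolder="results/")

# Scatter: split the big FASTQ into chunks of 1M reads (4M lines each).
# GNU split with -d writes numeric suffixes: chunk_00, chunk_01, ...
os.makedirs("results/chunks", exist_ok=True)
pm.run("split -l 4000000 -d sample.fastq results/chunks/chunk_",
       target="results/chunks/chunk_00")

# Process: run the per-chunk command for each piece. These run serially
# here; pypiper does not submit them as separate cluster jobs.
chunk_results = []
for chunk in sorted(glob.glob("results/chunks/chunk_[0-9]*")):
    out = chunk + ".out"
    pm.run("process_chunk {} > {}".format(chunk, out), target=out)  # hypothetical tool
    chunk_results.append(out)

# Gather: merge per-chunk outputs into the sample-level result.
merged = "results/sample.merged.out"
pm.run("cat " + " ".join(chunk_results) + " > " + merged, target=merged)

pm.stop_pipeline()
```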

vreuter commented 1 month ago

Hi @zhangzhen, pypiper wasn't really designed to do partitioning and parallelism; rather, it's meant to be applied to something that's already partitioned/chunked, either naturally (e.g., biological samples) or artificially (e.g., a FASTQ you've split arbitrarily). pepkit/looper would be how you'd normally do this sort of thing (submitting a single pypiper pipeline to multiple pieces of data). @donaldcampbelljr or @nsheff may have more recent information, though, as I haven't worked in depth on the project in a while.
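To make that concrete, the thing looper would submit once per chunk (or per sample) is just an ordinary pypiper pipeline script that takes a single input. A minimal sketch, where the `--input`/`--output-parent` arguments and the `process_chunk` command are hypothetical:

```python
#!/usr/bin/env python
"""Minimal per-chunk pypiper pipeline: the kind of script a looper/PEP
setup would submit once per FASTQ chunk. Argument names and the
`process_chunk` command are hypothetical."""
import argparse

import pypiper

parser = argparse.ArgumentParser(description="Process one FASTQ chunk")
parser.add_argument("--input", required=True, help="path to one FASTQ chunk")
parser.add_argument("--output-parent", required=True, help="parent output folder")
parser = pypiper.add_pypiper_args(parser)  # adds standard pypiper options (--recover, etc.)
args = parser.parse_args()

pm = pypiper.PipelineManager(name="chunk_pipeline",
                             outfolder=args.output_parent,
                             args=args)

result = args.input + ".out"
pm.run("process_chunk {} > {}".format(args.input, result), target=result)  # hypothetical tool

pm.stop_pipeline()
```

With this layout, each chunk would be listed as its own sample (or subsample) in a PEP sample table, and looper would handle submitting one job per row to the batch system; merging the per-chunk outputs back into a sample-level result would then be a separate gather step outside this script.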