LieberInstitute / SPEAQeasy

SPEAQeasy: portable LIBD RNA-seq pipeline using Nextflow. Check http://research.libd.org/SPEAQeasy-example/ for an example on how to use this pipeline and analyze the resulting output files.
http://lieberinstitute.github.io/SPEAQeasy
MIT License

allow starting the workflow with existing read alignments (sorted BAM/CRAM files) #92

Open gpertea opened 1 year ago

gpertea commented 1 year ago

There are situations where users want to change, e.g., featureCounts options, or bring in already prepared read alignments (sorted BAMs). In such cases it would help to have the option to skip the HISAT2/STAR alignment step and use the given alignments as "input" for the other steps in the pipeline.

I am aware this would involve skipping any steps that depend on the FASTQ files (which means not (re)generating the rse_tx object, not including any fastqc metrics in colData, etc.). However, there are ways to generate the rse_tx object from the BAM files (I can help with implementing that option).

It seems that the BiocMAP workflow was in part split into 2 Nextflow scripts for a similar reason, if I am not mistaken. A similar interim/simpler solution might be the way to address this request initially: create an alternate workflow besides main.nf that would work on BAM (or CRAM) files and run only the steps related to the read alignment data (featureCounts etc.), with an option to build rse_tx from the provided alignment data.
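Since the request assumes the provided alignments are already sorted, an alternate workflow would probably want to check that before running anything downstream. The following is a minimal sketch of such a check, not part of SPEAQeasy: the function names `bam_sort_order` and `is_coordinate_sorted` are hypothetical, it only reads the sort order declared in the BAM header (it does not verify the records themselves), and it does not handle CRAM files.

```python
import gzip
import struct


def bam_sort_order(bam_path):
    """Return the SO value declared in a BAM file's @HD header line, or None.

    Hypothetical helper for illustration; it trusts the header and does not
    inspect the alignment records.
    """
    # BAM files are BGZF-compressed, i.e. a series of gzip members,
    # so the standard gzip module can decompress the stream.
    with gzip.open(bam_path, "rb") as handle:
        if handle.read(4) != b"BAM\x01":
            raise ValueError(f"{bam_path} does not look like a BAM file")
        # The magic string is followed by the little-endian length of the
        # plain-text SAM header, then the header text itself.
        (text_length,) = struct.unpack("<I", handle.read(4))
        header_text = handle.read(text_length).decode("utf-8", "replace")

    for line in header_text.rstrip("\x00").splitlines():
        if line.startswith("@HD"):
            for field in line.split("\t")[1:]:
                if field.startswith("SO:"):
                    return field[3:]
    return None


def is_coordinate_sorted(bam_path):
    """True when the header declares SO:coordinate."""
    return bam_sort_order(bam_path) == "coordinate"
```

A workflow could call `is_coordinate_sorted` on each input and stop early with a clear message, instead of failing later inside a step that expects sorted alignments.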

gpertea commented 1 year ago

I can help with most of the shell and R code necessary to implement this alternate workflow (as I already have some non-user-friendly scripts doing that), but I would need some help with the Nextflow code/implementation.

Nick-Eagles commented 1 year ago

BiocMAP was split at "the same point" mostly on the thought that GPUs (used for alignment) might not be available on the same machines where massive CPU/memory resources (used for post-alignment steps) were available. I'm a bit concerned that there are too many ways a user might want to partially run SPEAQeasy (e.g. run transcript quantification again but not alignment, only call variants, etc.), and this would be only one specific solution (and unfortunately Nextflow doesn't support this type of partial-running functionality without modifying/adding a lot of code). That said, if starting from aligned files is a repeated use case you're seeing, I can help out.

gpertea commented 1 year ago

Thank you, Nick. Perhaps the easiest approach at this point would be to help me put together a cut-down version of main.nf that can take the BAM files as input (a different samples.manifest? or just point to a directory with the sorted BAM files?) and then run only the branches of the workflow that depend on those alignments (we could even add another input to be the colData needed to (re)build rse_gene and rse_exon, I suppose).

I can take care of the R scripts there (like create_count_objects.R), making them ignore the transcript assays if they are not available, etc. The Nextflow part itself was the problem for me: my limited experience with Nextflow (and time constraints) prevented me from attempting this by myself.
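To make the "point to a directory with the sorted BAM files" option above a bit more concrete, here is a rough sketch of how such a directory could be turned into a manifest for a cut-down workflow. Everything in it is an assumption for illustration: the function names, the two-column tab-separated layout, and the rule that the sample ID is the file name without its extension. It is not the existing samples.manifest format.

```python
from pathlib import Path

# Extensions treated as alignment files (an assumption for this sketch).
ALIGNMENT_SUFFIXES = (".bam", ".cram")


def build_alignment_manifest(alignment_dir):
    """Return (path, sample_id) rows for every BAM/CRAM file in a directory.

    The sample ID is taken to be the file name without its extension, which
    is an assumption; a real manifest might carry explicit IDs instead.
    """
    rows = []
    for path in sorted(Path(alignment_dir).iterdir()):
        if path.is_file() and path.suffix.lower() in ALIGNMENT_SUFFIXES:
            rows.append((str(path), path.stem))

    # Two files such as s1.bam and s1.cram would map to the same sample ID.
    ids = [sample_id for _, sample_id in rows]
    duplicates = sorted({i for i in ids if ids.count(i) > 1})
    if duplicates:
        raise ValueError(f"duplicate sample IDs: {duplicates}")
    return rows


def write_manifest(rows, manifest_path):
    """Write one tab-separated 'path<TAB>sample_id' line per alignment file."""
    with open(manifest_path, "w") as handle:
        for path, sample_id in rows:
            handle.write(f"{path}\t{sample_id}\n")
```

Writing a manifest rather than passing the directory straight through would keep the door open for the explicit "different samples.manifest" variant, since both inputs would end up in the same form.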