bioinform / somaticseq

An ensemble approach to accurately detect somatic mutations using SomaticSeq
http://bioinform.github.io/somaticseq/
BSD 2-Clause "Simplified" License

must the mutation spike-in singularity pipeline be processed in a single HPC run? #105

Closed kiranchari closed 2 years ago

kiranchari commented 2 years ago

Hi,

I am using the singularity pipeline to spike in mutations using bamsurgeon (https://github.com/bioinform/somaticseq/tree/master/utilities/singularities/bamSimulator).

I started running a sample but the HPC job ended before the script was complete. Do I need to restart the script from scratch (i.e. delete any outputs from the first run) or can the script automatically pick up where it left off?

Any suggestions to avoid re-running the sample from scratch?

Also, the Bamsurgeon process is very slow. I am trying to spike in just a few thousand mutations in a small (<5GB) BAM file. Any tips to speed this up?

Thanks

litaifang commented 2 years ago

The bamsurgeon workflow in this repo has not been updated for a while, and I haven't had the time to make sure it works with singularity. I think the docker version should still work: https://github.com/bioinform/somaticseq/tree/master/somaticseq/utilities/dockered_pipelines/bamSimulator. You may use it to generate the scripts and then see how to modify them for singularity. The script isn't smart enough to resume where it left off. You may just find the .cmd scripts, make a copy, delete the commands that have already completed successfully, and manually run the parts that have not completed.
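
As a rough illustration of the "make a copy and delete the finished commands" step, here is a minimal Python sketch. It is not part of somaticseq; the function name `resume_script` is hypothetical. It assumes the .cmd file is a plain shell script in which each command is one logical line (optionally continued with a trailing backslash), and that you already know how many commands finished successfully. Note that it counts every non-comment line as a command, including variable assignments, so check the result by hand before running it.

```python
def resume_script(text: str, n_done: int) -> str:
    """Return a copy of a shell script with the first n_done commands removed.

    Blank lines and comment lines (including the shebang) are always kept.
    A command is its first line plus any backslash-continued lines.
    Hypothetical helper: adjust to the actual layout of your .cmd files.
    """
    kept = []
    done = 0                 # number of commands dropped so far
    in_continuation = False  # True while inside a multi-line command
    skipping = False         # True if the current command is being dropped

    for line in text.splitlines():
        stripped = line.strip()

        if in_continuation:
            # This line belongs to the command started on an earlier line.
            if not skipping:
                kept.append(line)
            in_continuation = stripped.endswith("\\")
            continue

        if not stripped or stripped.startswith("#"):
            # Keep blank lines, comments, and the shebang untouched.
            kept.append(line)
            continue

        # A new command starts here: drop it if it is among the first n_done.
        skipping = done < n_done
        if skipping:
            done += 1
        else:
            kept.append(line)
        in_continuation = stripped.endswith("\\")

    return "\n".join(kept) + "\n"
```

You would read the original .cmd file, call `resume_script(text, n_done)` with the number of commands that completed, write the result to a new file, and review that file before submitting it as a new HPC job.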

The process is quite slow even with just a few thousand mutations because each in silico mutated read needs to be aligned again. Some of the mutated reads may align somewhere else, so the whole BAM then needs to be sorted again. Sorting a BAM file is a time-consuming process.

kiranchari commented 2 years ago

Thanks for your response