bioinform / somaticseq

An ensemble approach to accurately detect somatic mutations using SomaticSeq
http://bioinform.github.io/somaticseq/
BSD 2-Clause "Simplified" License
189 stars 53 forks source link

Slow RNA variant calling #118

Open katjur01 opened 1 year ago

katjur01 commented 1 year ago

Hello,

I am trying to run SomaticSeq on RNA data (single-end reads) but it's really slow. It never finished because I had to kill it after 7 days. I have many large samples (~10 GB) and if the smallest one (3.0 GB) takes more than 7 days to finish, I can't use it like this. It could run for months. I’m using Kubernetes so I have enough computing capacity.

First, I used SplitNCigarReads tool (https://gatk.broadinstitute.org/hc/en-us/articles/360036858811-SplitNCigarReads) on my mapped RNA and after that, I ran variant callers (lofreq, mutect2, strelka, vardict, varscan). I ran SomaticSeq as the last with command: somaticseq_parallel.py --threads 20 --output-directory somatic_varcalls/sample1 --genome-reference GRCh38-p10.fa --inclusion-region wgs.bed --minimum-num-callers 0.4 single --bam-file sample1.RNAsplit.bam --mutect2-vcf somatic_varcalls/sample1/MuTect2.vcf --vardict-vcf somatic_varcalls/sample1/VarDict.vcf --lofreq-vcf somatic_varcalls/sample1/Lofreq.vcf --strelka-vcf somatic_varcalls/sample1/variants.vcf.gz --varscan-vcf somatic_varcalls/sample1/VarScan2.vcf

I also tried to run SomaticSeq only with one variant caller. First, just with Vardict and it took 51 hours to finish. Second, just with Strelka and it took 32 hours to finish. I also tried to use a smaller bed file (only a few exome positions) but nothing changed.

My theory is that SomaticSeq has a problem when it encounters heavily covered reads because when it splits bed file for parallelization, some were counted fast but some took many hours or days. I think that maybe if I will do some subset of those heavily covered areas, It could help but I still don't know how to approach this.

Do you have any idea or advice on what can I do with it? I used SomaticSeq on DNA data many times before, so I know that normally it ran from a few minutes to a few hours.

litaifang commented 1 year ago

Is there like one particular region that take forever to finish? SomaticSeq's parallelization isn't implemented most efficiently (I could improve it a little bit). When you specify 12 threads, it will split a bed file into 12 equal-sized regions, but some regions may have many times more reads than others. I think a quick solution for that is to split into like 120 equal-sized regions and have 12 threads work its way through all of them, so that each one region will be small enough that it won't take forever..... if the slowdown is caused by one vastly high depth region, or region with lots of complex alignments (chrM and alt contig alignments often have complex alignments that take time in local consensus for indel.).

katjur01 commented 1 year ago

Thank you for your reply.

There are many problematic regions, not just one. I tried to set threads to 120 but it behaves exactly the same as before. It’s 6 days now and it is still running. There were 96 regions (from the split bed file) that finished in 1 hour, 15 regions finished in 1 day, 6 regions in 3 days and 3 are still running.

Do you have any other idea what can I try?

litaifang commented 1 year ago

Can you send me one of those regions of bam file and bed file?

katjur01 commented 1 year ago

Of course. The files are too big so I'm sending you the link using Filesender. There are subset.bam with one of the problematic regions 1:103334975-155002413, the whole bam file (just for sure) and bed file.

https://filesender.cesnet.cz/?s=download&token=d17a2d32-5950-46e4-9a9a-bfd2755a01b2

litaifang commented 1 year ago

Sorry just gotten around to it and the link expired. Was VarDict particularly slow on RNAseq or was somaticseq_parallel.py slow? For VarDict, you can try to turn off indel realignment and SV calling in VarDict.

katjur01 commented 1 year ago

No problem. I'm sending you the new link: https://filesender.cesnet.cz/?s=download&token=a91a531a-6c7a-4165-89d6-8305301c6528

The SomaticSeq itself was slow. Callers took some time but it wasn't something crazy considering the size of the dataset.

katjur01 commented 1 year ago

Hello @litaifang,

I just wanted to follow up on our conversation regarding the slow performance of SomaticSeq on RNA data. I understand you may have been busy, but I wanted to check if you had the chance to look into the files I shared with you.

By the way, I wanted to mention that I managed to analyze my current RNA data using alternative methods, but I am still curious about resolving the performance issue in SomaticSeq. I wanted to keep exploring it because in the future, there might be more RNA data that I'll need to analyze.

If you've had the opportunity to review the files, I would greatly appreciate any insights or suggestions you may have.

litaifang commented 1 year ago

Just got back from vacation. I'll take a look in the next couple of weeks when I get more time.