Closed rahil19 closed 2 years ago
Hi Rahil, it really depends on your use case. If, for example, you have a viral amplification sample with ultra-high coverage (a use case LoFreq was designed for), then MarkDuplicates would mark pretty much everything and thus remove almost all hits. Even for E. coli sized genomes, MarkDuplicates can have negative effects without any positive ones. That might be different for exomes and PCR-amplified human genomes. In the latter case I would recommend running MarkDuplicates, but that is pretty much the only case.
On an unrelated note: at this point it's probably still best to use LoFreq2 (https://github.com/CSB5/lofreq). LoFreq3 is still largely untested.
Hi @andreas-wilm, I looked into this option after I ran into a problem with LoFreq v2.1.5. It reported an inaccurate AF value for one of the variant calls, which I confirmed against other variant callers and the IGV browser, so I had to calculate the frequency from DP4. Then I saw your message in https://github.com/CSB5/lofreq/issues/80, an issue you left open, where you recommended not using any quality filter because it can lead to errors in reporting. However, filtering called variants on quality is common practice in what I've seen. You therefore suggested using LoFreq v3, where pileup and variant calling are kept separate. I understand that MarkDuplicates on viral and bacterial genomes can mark an excessive number of reads and is hence not recommended. However, I was following one of the published Galaxy protocols for SARS-CoV-2 variant calling:
1. Map all reads against COVID-19 reference NC_045512.2 using bwa mem
2. Filter reads with mapping quality of at least 20, that were mapped as proper pairs
3. Mark duplicate reads with picard markduplicates
4. Perform realignments using lofreq viterbi
5. Call variants using lofreq call
6. Annotate variants using snpeff against database created from NC_045512.2 GenBank file
7. Convert VCFs into tab delimited dataset
https://covid19.galaxyproject.org/genomics/no-more-business-as-usual/#benchmarking-callers-lofreq-is-the-best-choice
I also personally checked the read stats for one of my SARS-CoV-2 Illumina NGS samples before and after removing duplicates using samtools and didn't see much difference in read count, which suggests that marking duplicates will not have a significant effect. To be on the safe side, I went with the steps that are generally used.
Thanks, Rahil
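For reference, the DP4-based frequency calculation mentioned above can be sketched as follows. This is a minimal illustration, not LoFreq's own code: it assumes the INFO column carries a DP4 tag in the usual order (ref-forward, ref-reverse, alt-forward, alt-reverse), and the function name and example INFO string are made up for the demonstration. The result can differ slightly from a caller's reported AF if the caller uses a different denominator (for example total depth including other alleles).

```python
def af_from_dp4(info: str) -> float:
    """Estimate allele frequency from the DP4 tag of a VCF INFO string.

    Assumes DP4 = ref-forward, ref-reverse, alt-forward, alt-reverse.
    """
    # Split "KEY=VALUE;KEY=VALUE;FLAG" into a dict, ignoring bare flags.
    fields = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
    ref_fwd, ref_rev, alt_fwd, alt_rev = (int(x) for x in fields["DP4"].split(","))
    total = ref_fwd + ref_rev + alt_fwd + alt_rev
    if total == 0:
        raise ValueError("DP4 counts sum to zero; frequency is undefined")
    return (alt_fwd + alt_rev) / total


# Hypothetical INFO string, for illustration only.
example_info = "DP=100;AF=0.100000;SB=0;DP4=45,45,5,5"
print(round(af_from_dp4(example_info), 4))
```

Comparing this value with the AF tag of the same record is a quick way to spot calls where the two disagree.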
@andreas-wilm In your GitHub usage documentation I don't see any mark-duplicates step before running the LoFreq modules for variant calling. May I know why? I'll be running tests on my samples to see whether adding Picard MarkDuplicates before lofreq viterbi makes any difference. Even if I don't see any differences on my test samples, I would still doubt the reliability of a workflow without MarkDuplicates. Don't you think? Also, I've never run samtools fixmate for variant calling with any other tool (LoFreq v2.1.5, bcftools, freebayes, VirVarSeq, etc.). Is it a necessary step for LoFreq v3 before running lofreq viterbi?
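One way to set up the with/without MarkDuplicates comparison described above is to generate both command sequences from a single function, so the two runs differ only in that step. The sketch below only builds the command lines for steps 1 to 5 of the protocol and does not run anything. It is written against LoFreq 2 style command lines as I understand them; the file names, the prefix scheme, the extra sorting steps, and the picard wrapper name are assumptions, so check every flag against each tool's own help output (and against LoFreq v3 if that is what you run) before using it.

```python
from typing import List


def build_commands(ref: str, r1: str, r2: str, prefix: str,
                   mark_duplicates: bool = True) -> List[List[str]]:
    """Build command lines for mapping, filtering, optional duplicate
    marking, realignment and variant calling. Nothing is executed here."""
    cmds = [
        # Step 1: map paired reads against the reference.
        ["bwa", "mem", "-o", f"{prefix}.sam", ref, r1, r2],
        # Step 2: keep proper pairs (flag 2) with mapping quality >= 20.
        ["samtools", "view", "-b", "-q", "20", "-f", "2",
         "-o", f"{prefix}.filt.bam", f"{prefix}.sam"],
        # Coordinate-sort before duplicate marking (assumed helper step).
        ["samtools", "sort", "-o", f"{prefix}.sorted.bam", f"{prefix}.filt.bam"],
    ]
    current = f"{prefix}.sorted.bam"
    if mark_duplicates:
        # Step 3: the only step that differs between the two runs.
        marked = f"{prefix}.md.bam"
        cmds.append(["picard", "MarkDuplicates", f"I={current}",
                     f"O={marked}", f"M={prefix}.md_metrics.txt"])
        current = marked
    # Step 4: realign with lofreq viterbi, then re-sort its output.
    cmds.append(["lofreq", "viterbi", "-f", ref,
                 "-o", f"{prefix}.viterbi.bam", current])
    cmds.append(["samtools", "sort", "-o", f"{prefix}.realigned.bam",
                 f"{prefix}.viterbi.bam"])
    # Step 5: call variants.
    cmds.append(["lofreq", "call", "-f", ref,
                 "-o", f"{prefix}.vcf", f"{prefix}.realigned.bam"])
    return cmds


# Hypothetical inputs; print both variants for side-by-side inspection.
for label, use_md in (("sample_md", True), ("sample_nomd", False)):
    for cmd in build_commands("NC_045512.2.fa", "R1.fastq.gz", "R2.fastq.gz",
                              label, mark_duplicates=use_md):
        print(" ".join(cmd))
```

Using distinct prefixes keeps the two sets of outputs apart, so the resulting VCFs can be compared directly.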