No Mark Duplicates mentioned in recommended lofreq v3 pipeline

rahil19 commented 2 years ago

@andreas-wilm In your Github usage documentation

obam=your-output.bam;
reffa=your-ref.fasta
fq1=your-R1.fq.gz
fq2=your-R2.fq.gz
bwa mem $reffa  $fq1 $fq2 | \
  samtools fixmate - - | \
  lofreq viterbi -f $reffa -b - | \
  samtools sort - | \
  lofreq indelqual -f $reffa -b - | \
  lofreq alnqual -f $reffa  -b - | \
  samtools view -b - -o $obam;

  lofreq call -b aln.bam -f ref.fa -r chr:s-e

I don't see any step of mark duplicates prior to running Lofreq modules for variant calling. May I know why? I'll be running tests on my samples to see any differences with and without adding PICARD Mark Duplicates prior to running lofreq viterbi. Even if I don't see any differences on my test samples, I suspect the reliability of code without Mark Duplicates. Don't you think? I've never ran samtools fixmate for variant calling using any other tool lofreq v2.1.5, bcftools, freebayes, VirVarSeq, etc. Is that one of the necessary steps for lofreq v3 before running lofreq viterbi?

andreas-wilm commented 2 years ago

Hi Rahil, it really depends on your use case. If you for example have a viral amplification sample with ultrahigh coverage (a use case LoFreq was designed for) then MarkDuplicates would mark pretty much everything and thus remove almost all hits. Even for E.coli sized genomes MarkDuplicates can have negative effects without positive ones. That might be different for exomes and PCR amplified human genome. In the latter case I would recommend to run MarkDuplicates, but it's pretty much the only one.

On an unrelated note: at this point it's probably still best to use LoFreq2 https://github.com/CSB5/lofreq. LoFreq3 is still largely untested

rahil19 commented 2 years ago

Hi @andreas-wilm, I looked into this option after I encountered problem when running Lofreq v 2.1.5. It did not generate AF value for one of variant calls accurately, after confirming with other variant callers and IGV browser. I had to calculate frequency using DP4. Then I saw your message in one of the issues you left it opened in https://github.com/CSB5/lofreq/issues/80 where you recommended not to use any quality filter as it can lead to error in reporting. However, it is common practice I've seen is to filter called variants in quality. So you suggested to use LoFreq v3 where Pileup and variant calling are kept separate. I understand MarkDuplicates in viruses and bacteria genomes can lead to marking of excessive number of reads and hence not recommended. However, I was following one of the published Galaxy protocol for SARS-CoV-2 variant calling

1. Map all reads against COVID-19 reference NC_045512.2 (opens new window)using bwa mem
2. Filter reads with mapping quality of at least 20, that were mapped as proper pairs
3. Mark duplicate reads with picard markduplicates
4. Perform realignments using lofreq viterbi
5. Call variants using lofreq call
6. Annotate variants using snpeff against database created from NC_045512.2 GenBank file
7. Convert VCFs into tab delimited dataset

https://covid19.galaxyproject.org/genomics/no-more-business-as-usual/#benchmarking-callers-lofreq-is-the-best-choice I also personally checked read stats for one of the SARS-CoV-2 Illumina NGS sample, before and after removing duplicates using samtools and didn't see much difference in read count, indicating that marking duplicates will not have significant effect. To be on a safe side I went with steps that were conducted in general.

Thanks, Rahil

andreas-wilm / lofreq3

No Mark Duplicates mentioned in recommended lofreq v3 pipeline #43