andersen-lab / ivar

iVar is a computational package that contains functions broadly useful for viral amplicon-based sequencing.
https://andersen-lab.github.io/ivar/html/
GNU General Public License v3.0
118 stars 40 forks source link

Deletions being incorporated into consensus genome when not expected #83

Open charlesfoster opened 3 years ago

charlesfoster commented 3 years ago

Describe the bug I'm working with SARS-CoV-2. I only want to incorporate indels into a variant call file with ivar variants and consensus genome generated with ivar consensus when a certain frequency of reads support the indel. I'm able to achieve this outcome withivar variants using the -t flag.

For example, consider a situation where 24% of the reads support a 10-bp deletion, and the remaining 76% of the reads support the reference bases. If I use the following command:

samtools mpileup -A -d 0 -B -Q 0 --reference NC_045512.fasta sample_1.primertrim.sorted.bam | ivar variants -p sample_1.ivar -q 20 -t 0.25 -r NC_045512.fasta -g reference_features.gff3

Then the resulting sample_1.ivar.tsv file does not include a 10-bp deletion, as desired, because of the -t 0.25 flag. However, the same does not hold true when calling the consensus genome. If I use the following command:

samtools mpileup -aa -A -d 60000 -B -Q 0 --referenceNC_045512.fasta sample_1.primertrim.sorted.bam | ivar consensus -p sample_1.consensus -n N -t 0.25

Then the resulting sample_1.consensus.fa file does include the deletion, even though I would like it not to. Is this related to #79 ?

Expected behavior I guess the expected behaviour would be to include the reference bases if enough reads support them, or to just include Ns for the stretch of the putative deletion.

Desktop (please complete the following information):

Additional context Apologies if this isn't a bug and is just me misinterpreting the settings!

Cheers, Charles

IsabelFE commented 3 years ago

Hi @charlesfoster,

I am having some similar issues. I described them in https://github.com/andersen-lab/ivar/issues/85.

Would you mind sharing what settings did you end up using for mpileup and ivar consensus/variants? Also, why are you using different mpileup settings for ivar consensus vs. ivar variants? I see that you are using --referenceNC_045512.fasta, but also -B. Aren't those supposed to cancel each other? Or I am understanding -B wrong?

Thanks

charlesfoster commented 3 years ago

Hi @IsabelFE,

Different settings/what we're using I inherited the pipeline we're using here, and it seems like the mpileup commands were just chosen based on the usage command given when running each tool & written in the documentation (https://andersen-lab.github.io/ivar/html/manualpage.html). Not sure why they were different previously. The samtools mpileup settings we're using at the moment for variant calling and consensus building, based on the documentation:

For ivar variants: samtools mpileup -aa -A -d 0 -B -Q 0 --reference [<reference-fasta] <input.bam>

For ivar consensus: samtools mpileup -aa -A -d 0 -Q 0 <input.bam>

Note: we still see issues with indel calling, as per the purpose of this Issue. I end up having to check the veracity of indels using IGV. Less than ideal.

Using -B with --reference Sadly I don't know enough about mpileup to know whether they truly do 'cancel out'/ are incompatible.

Other stuff There's a good chat around BAQ and variant calling with amplicon data here: https://www.biostars.org/p/9466154/#9466582. I'm trying to figure out what I might change in our pipeline based on the discussion.

Charles

IsabelFE commented 3 years ago

Thanks @charlesfoster,

We also have issues with indel calling and I agree that double checking them in IGV is not ideal.

I would really appreciate if anyone else have recommendations for ideal settings for mpileup and ivar variants/consensus for SARS-CoV-2 amplicon sequencing.

Isabel