andersen-lab / ivar

iVar is a computational package that contains functions broadly useful for viral amplicon-based sequencing.
https://andersen-lab.github.io/ivar/html/
GNU General Public License v3.0
115 stars 39 forks source link

ivar variants: reducing q-score threshold results in high P-value for deletion #96

Open pmenzel opened 3 years ago

pmenzel commented 3 years ago

Dear @gkarthik ,

I tried to change the option -q in ivar variants to values below the default 20 and discovered a problem with calling deletions once the q-score threshold is set to a value below 10. My data is SARS-CoV2 and the deletion in question is the del69/70 in spike at position 21764.

With this command I tried all values for -q from 0 to 20:

for i in `seq 0 20`;
do
  samtools mpileup -A -d 0 --reference nCoV-2019.reference.fasta -B -Q 0 input.bam | \
  ivar variants -r nCoV-2019.reference.fasta -m 20 -p q$i.variants -q $i -t 0.25;
done

The output for position 21764 is:

grep 21764 q*.variants.tsv | column -t -s$'\t'
q0.variants.tsv:MN908947.3   21764  A  -TACATG  4079  4070  33  4058  0  0   0.990239  4098  1  FALSE  NA  NA  NA  NA  NA
q1.variants.tsv:MN908947.3   21764  A  -TACATG  4070  4061  34  4058  0  1   0.990239  4098  1  FALSE  NA  NA  NA  NA  NA
q2.variants.tsv:MN908947.3   21764  A  -TACATG  4070  4061  34  4058  0  2   0.990239  4098  1  FALSE  NA  NA  NA  NA  NA
q3.variants.tsv:MN908947.3   21764  A  -TACATG  4070  4061  34  4058  0  3   0.990239  4098  1  FALSE  NA  NA  NA  NA  NA
q4.variants.tsv:MN908947.3   21764  A  -TACATG  4070  4061  34  4058  0  4   0.990239  4098  1  FALSE  NA  NA  NA  NA  NA
q5.variants.tsv:MN908947.3   21764  A  -TACATG  4070  4061  34  4058  0  5   0.990239  4098  1  FALSE  NA  NA  NA  NA  NA
q6.variants.tsv:MN908947.3   21764  A  -TACATG  4070  4061  34  4058  0  6   0.990239  4098  1  FALSE  NA  NA  NA  NA  NA
q7.variants.tsv:MN908947.3   21764  A  -TACATG  4070  4061  34  4058  0  7   0.990239  4098  1  FALSE  NA  NA  NA  NA  NA
q8.variants.tsv:MN908947.3   21764  A  -TACATG  4070  4061  34  4058  0  8   0.990239  4098  1  FALSE  NA  NA  NA  NA  NA
q9.variants.tsv:MN908947.3   21764  A  -TACATG  4070  4061  34  4058  0  9   0.990239  4098  1  FALSE  NA  NA  NA  NA  NA
q10.variants.tsv:MN908947.3  21764  A  -TACATG  4070  4061  34  4058  0  10  0.990239  4098  0  TRUE   NA  NA  NA  NA  NA
q11.variants.tsv:MN908947.3  21764  A  -TACATG  4070  4061  34  4058  0  11  0.990239  4098  0  TRUE   NA  NA  NA  NA  NA
q12.variants.tsv:MN908947.3  21764  A  -TACATG  4070  4061  34  4058  0  12  0.990239  4098  0  TRUE   NA  NA  NA  NA  NA
q13.variants.tsv:MN908947.3  21764  A  -TACATG  4070  4061  34  4058  0  13  0.990239  4098  0  TRUE   NA  NA  NA  NA  NA
q14.variants.tsv:MN908947.3  21764  A  -TACATG  4070  4061  34  4058  0  14  0.990239  4098  0  TRUE   NA  NA  NA  NA  NA
q15.variants.tsv:MN908947.3  21764  A  -TACATG  3891  3882  34  4058  0  15  0.990239  4098  0  TRUE   NA  NA  NA  NA  NA
q16.variants.tsv:MN908947.3  21764  A  -TACATG  3891  3882  34  4058  0  16  0.990239  4098  0  TRUE   NA  NA  NA  NA  NA
q17.variants.tsv:MN908947.3  21764  A  -TACATG  3891  3882  34  4058  0  17  0.990239  4098  0  TRUE   NA  NA  NA  NA  NA
q18.variants.tsv:MN908947.3  21764  A  -TACATG  3891  3882  34  4058  0  18  0.990239  4098  0  TRUE   NA  NA  NA  NA  NA
q19.variants.tsv:MN908947.3  21764  A  -TACATG  3891  3882  34  4058  0  19  0.990239  4098  0  TRUE   NA  NA  NA  NA  NA
q20.variants.tsv:MN908947.3  21764  A  -TACATG  3891  3882  34  4058  0  20  0.990239  4098  0  TRUE   NA  NA  NA  NA  NA

For -q 10 and above, the variant is found with P-value = 0, and for -q 9 and below, it is not found with P-value = 1.

The main difference between the files is that the column ALT_QUAL takes on exactly the value set for -q, probably because of https://github.com/andersen-lab/ivar/blob/ce1490a70806ea33b225e9cb1b89f15cf0fa0271/src/allele_functions.cpp#L90

This value is then used here https://github.com/andersen-lab/ivar/blob/ce1490a70806ea33b225e9cb1b89f15cf0fa0271/src/call_variants.cpp#L138 for calculating the P-value, and it seems that it somehow misbehaves at values below 10?

pmenzel commented 2 years ago

Seeing the same issue with the 6p deletion GAGTTCA22028G:EFR156G in Delta:

When setting -q 10, the P-value is all good:

samtools mpileup -A -d 0 --reference nCoV-2019.reference.fasta -B -Q 0 mapped.primertrimmed.sorted.bam | \
grep -w 22028 | \
ivar variants -r nCoV-2019.reference.fasta -m 20 -p variants -q 10 -t 0.25 && column -t -s$'\t' variants.tsv
[mpileup] 1 samples in 1 input files
[mpileup] Max depth set to maximum value (2147483647)
A GFF file containing the open reading frames (ORFs) has not been provided. Amino acid translation will not be done.
REGION      POS    REF  ALT      REF_DP  REF_RV  REF_QUAL  ALT_DP  ALT_RV  ALT_QUAL  ALT_FREQ  TOTAL_DP  PVAL         PASS  GFF_FEATURE  REF_CODON  REF_AA  ALT_CODON  ALT_AA
MN908947.3  22028  G    -AGTTCA  25      25      34        25      0       10        1         25        0.000322779  TRUE  NA           NA         NA      NA         NA

When setting -q 0, the P-value is set to 1:

samtools mpileup -A -d 0 --reference nCoV-2019.reference.fasta -B -Q 0 mapped.primertrimmed.sorted.bam | \
grep -w 22028 | \
ivar variants -r nCoV-2019.reference.fasta -m 20 -p variants -q 0 -t 0.25 && column -t -s$'\t' variants.tsv
[mpileup] 1 samples in 1 input files
[mpileup] Max depth set to maximum value (2147483647)
A GFF file containing the open reading frames (ORFs) has not been provided. Amino acid translation will not be done.
REGION      POS    REF  ALT      REF_DP  REF_RV  REF_QUAL  ALT_DP  ALT_RV  ALT_QUAL  ALT_FREQ  TOTAL_DP  PVAL  PASS   GFF_FEATURE  REF_CODON  REF_AA  ALT_CODON  ALT_AA
MN908947.3  22028  G    -AGTTCA  25      25      34        25      0       0         1         25        1     FALSE  NA           NA         NA      NA         NA