Open jts opened 3 years ago
Hello Jared, sorry about the delay in response, we've been occupied with epi work. Thank you for pointing out this issue.
As you stated, I would expect consensus calling to work in this case. Filtering variant calls based on the position in the read might help eliminate some of these false positives. We need to think about how to implement a little more but will keep you updated!
Hello,
I am having exactly the same issue around position 11287 of the SARS-CoV-2 genome. Based on the BAM file there is an obvious 9nts deletion. I see this in many genomes, and some of them had been rejected by GISAID because the consensus from ivar indicates only a 7-8nts gap.
Here you can see an example bam file on igv:
I calculated the consensus with several settings for mpileup and ivar consensus:
Using -f in mpileup is also helping in situations like these were a mutation is called at the very end of a read:
However, I am also finding situations where maybe using -f in mpileup is creating artifacts. This frameshift deletion in position 28271 is also present in many genomes. Here you can see how 948 reads support the deletion, while 176 call an A: In this case, only the settings using -f call an A in this position, while the rest identify the gap. I think that this gap is well supported, so I am not sure about using -f
@jts, would you mind sharing what settings did you end up using for mpileup and ivar consensus? Thanks
Hey Isabel, any chance you could share the mpileup file for that region here?
The BAQ score calculation helps but in certain cases like you pointed out introduces artifacts which is why typically I didn't include the -f
flag for consensus calling but this requires a closer look on my end.
Here you have the pileups for the conflictive regions using all the settings I tried:
So, BAQ is helping in the first 3 situations, but doesn't identify the indel in the 4th one.
Hi, I think that the problem is coming from an inconsistency between ivar consensus
and ivar variants
. If I first create a pile file with samtools mpileup
like this:
samtools mpileup -d 0 -q 30 -f wuhan_trim3polyA.fasta "2021_0405_01C21.sorted.bam" > "2021_0405_01C21.pile"
And them I pipe that file into both ivar consensus
and ivar variants
with the same settings:
< "2021_0405_01C21.pile" | ivar consensus -t 0.6 -m 10 -p "2021_0405_01C21.consensus"
< "2021_0405_01C21.pile" | ivar variants -t 0.6 -m 10 -r wuhan_trim3polyA.fasta -g GCF_009858895.2_ASM985889v3_genomic.gff -p "2021_0405_01C21.variant"
Then, the variants file looks great:
REGION POS REF ALT REF_DP REF_RV REF_QUAL ALT_DP ALT_RV ALT_QUAL ALT_FREQ TOTAL_DP PVAL PASS GFF_FEATURE REF_CODON REF_AA ALT_CODON ALT_AA
NC_045512.2 241 C T 62 18 39 6507 2431 38 0.983376 6617 0 TRUE NA NA NA NA NA
NC_045512.2 1059 C T 15 10 35 3020 2418 32 0.993421 3040 0 TRUE cds-YP_009724389.1 ACC T ATC I
NC_045512.2 1059 C T 15 10 35 3020 2418 32 0.993421 3040 0 TRUE cds-YP_009725295.1 ACC T ATC I
NC_045512.2 3037 C T 76 38 34 9375 3757 35 0.988924 9480 0 TRUE cds-YP_009724389.1 TTC F TTT F
NC_045512.2 3037 C T 76 38 34 9375 3757 35 0.988924 9480 0 TRUE cds-YP_009725295.1 TTC F TTT F
NC_045512.2 3987 C T 6 5 34 709 368 33 0.988842 717 0 TRUE cds-YP_009724389.1 ACA T ATA I
NC_045512.2 3987 C T 6 5 34 709 368 33 0.988842 717 0 TRUE cds-YP_009725295.1 ACA T ATA I
NC_045512.2 6101 G A 4 0 32 35 18 33 0.897436 39 8.225e-18 TRUE cds-YP_009724389.1 GGT G AGT S
NC_045512.2 6101 G A 4 0 32 35 18 33 0.897436 39 8.225e-18 TRUE cds-YP_009725295.1 GGT G AGT S
NC_045512.2 7201 A G 1 1 34 3125 1260 35 0.99681 3135 0 TRUE cds-YP_009724389.1 TCA S TCG S
NC_045512.2 7201 A G 1 1 34 3125 1260 35 0.99681 3135 0 TRUE cds-YP_009725295.1 TCA S TCG S
NC_045512.2 8809 C T 11 2 34 2396 784 36 0.991722 2416 0 TRUE cds-YP_009724389.1 GAC D GAT D
NC_045512.2 8809 C T 11 2 34 2396 784 36 0.991722 2416 0 TRUE cds-YP_009725295.1 GAC D GAT D
NC_045512.2 9867 T C 2 1 34 164 109 33 0.964706 170 3.37416e-97 TRUE cds-YP_009724389.1 CTT L CCT P
NC_045512.2 9867 T C 2 1 34 164 109 33 0.964706 170 3.37416e-97 TRUE cds-YP_009725295.1 CTT L CCT P
NC_045512.2 11287 G -TCTGGTTTT 506 67 30 7986 0 20 0.8543 9348 0 TRUE NA NA NA NA NA
NC_045512.2 11291 G A 11 6 32 30 12 28 0.731707 41 2.92495e-49 TRUE cds-YP_009724389.1 GGT G AGT S
NC_045512.2 11291 G A 11 6 32 30 12 28 0.731707 41 2.92495e-49 TRUE cds-YP_009725295.1 GGT G AGT S
NC_045512.2 11296 T G 22 2 23 107 23 30 0.798507 134 3.44369e-213 TRUE cds-YP_009724389.1 TTT F TTG L
NC_045512.2 11296 T G 22 2 23 107 23 30 0.798507 134 3.44369e-213 TRUE cds-YP_009725295.1 TTT F TTG L
NC_045512.2 12784 C T 1622 544 35 4358 1900 35 0.72791 5987 0 TRUE cds-YP_009724389.1 AAC N AAT N
NC_045512.2 12784 C T 1622 544 35 4358 1900 35 0.72791 5987 0 TRUE cds-YP_009725295.1 AAC N AAT N
NC_045512.2 14408 C T 7 1 32 1669 681 35 0.992861 1681 0 TRUE cds-YP_009724389.1 CCT P CTT L
NC_045512.2 16500 A C 0 0 0 685 334 34 0.991317 691 0 TRUE cds-YP_009724389.1 CAA Q CAC H
NC_045512.2 21575 C T 4 2 34 558 250 37 0.989362 564 0 TRUE cds-YP_009724390.1 CTT L TTT F
NC_045512.2 21846 C T 4 1 34 699 317 34 0.99431 703 0 TRUE cds-YP_009724390.1 ACT T ATT I
NC_045512.2 22021 G T 6 3 34 606 282 36 0.985366 615 0 TRUE cds-YP_009724390.1 ATG M ATT I
NC_045512.2 22320 A G 1 0 68 947 393 40 0.997893 949 0 TRUE cds-YP_009724390.1 GAT D GGT G
NC_045512.2 22992 G A 22 7 35 4086 2126 35 0.989826 4128 0 TRUE cds-YP_009724390.1 AGC S AAC N
NC_045512.2 23403 A G 4 0 42 5323 2094 37 0.998687 5330 0 TRUE cds-YP_009724390.1 GAT D GGT G
NC_045512.2 24432 A G 4 1 30 2437 1168 37 0.997952 2442 0 TRUE cds-YP_009724390.1 CAA Q CGA R
NC_045512.2 25517 C T 77 29 36 2877 1199 38 0.971959 2960 0 TRUE cds-YP_009724391.1 CCT P CTT L
NC_045512.2 25563 G T 2 0 34 3107 1603 39 0.994558 3124 0 TRUE cds-YP_009724391.1 CAG Q CAT H
NC_045512.2 25830 T C 12 5 40 4634 2891 36 0.995917 4653 0 TRUE cds-YP_009724391.1 TTT F TTC F
NC_045512.2 25968 A G 8 3 33 5275 2922 36 0.99773 5287 0 TRUE cds-YP_009724391.1 AAA K AAG K
NC_045512.2 27739 C T 31 26 36 4565 3542 34 0.992176 4601 0 TRUE cds-YP_009724395.1 CTC L TTC F
NC_045512.2 27925 C T 1 0 34 570 341 33 0.991304 575 0 TRUE cds-YP_009724396.1 ACA T ATA I
NC_045512.2 28270 T -A 1012 376 36 836 0 20 0.822835 1016 2.97724e-179 TRUE NA NA NA NA NA
NC_045512.2 28311 C T 11 1 50 1166 447 39 0.98563 1183 0 TRUE cds-YP_009724397.2 CCC P CTC L
NC_045512.2 28879 T G 21 5 37 6232 2168 39 0.995527 6260 0 TRUE cds-YP_009724397.2 AGT S AGG R
It identifies the indels at positions 11287 and 28270 without calling those mutations at the edges of reads on 19680 and 29738 that were not well supported.
But, the consensus read fails to call the -A at position 28270. How is this possible?
Hi @gkarthik ! Hope you are well! I have also seen these sorts of issues prop up recently for real samples from the B.1.318 lineage that I believe contain a 9bp deletion. We only really realised after the records were rejected by GSAID. As you can see in the image below, and as mentioned earlier it does seem to be an issue with the way that standard aligners choose to introduce indels around the ends of reads - Bowtie2 in this case.
This discrepancy was initially observed after running viralrecon with default parameters:
nextflow run nf-core/viralrecon \
--input samplesheet.csv \
--platform illumina \
--protocol amplicon \
--genome 'MN908947.3' \
--primer_set artic \
--primer_set_version 3 \
--skip_assembly \
-r 2.1 \
-profile singularity
I have created a minimal test example for you to be able to hopefully fully reproduce the error: test_9bp_deletion.tar.gz
Install Conda environment with ivar
and muscle
conda env create -f environment.yml
Run ivar variants
via default parameters used in viralrecon
samtools mpileup \
--count-orphans \
--no-BAQ \
--max-depth 0 \
--min-BQ 0 \
--reference nCoV-2019.reference.fasta \
test.ivar_trim.sorted.bam | \
ivar variants \
-t 0.25 \
-q 20 \
-m 10 \
-g GCA_009858895.3_ASM985889v3_genomic.200409.gff \
-r nCoV-2019.reference.fasta \
-p test
Run ivar consensus
via default parameters used in viralrecon
samtools mpileup \
--count-orphans \
--no-BAQ \
--max-depth 0 \
--min-BQ 0 \
-aa \
--reference nCoV-2019.reference.fasta \
test.ivar_trim.sorted.bam | \
ivar consensus \
-t 0.75 \
-q 20 \
-m 10 \
-n N \
-p test.consensus
Concatenate reference genome with consensus generated by ivar consensus
cat nCoV-2019.reference.fasta test.consensus.fa > multi.fa
Generate multiple-sequence alignment between the reference genome and the consensus generated by ivar consensus
muscle -in multi.fa -out msa.clustal -clw
Get list of deletions called by ivar variants
- THESE CALLS LOOK FINE BECAUSE WE HAVE A 9BP DELETION CALLED
$ awk '$4 ~ /^-/ { print $0; }' test.tsv
MN908947.3 11287 G -TCTGGTTTT 19490 7667 42 19120 0 20 0.853648 22398 0 TRUE NA NA NA NA NA
Get deletions in consensus by ivar consensus
- THESE CALLS LOOK WRONG BECAUSE WE HAVE A 4BP DELETION CALLED ADJACENT TO 5Ns
$ grep -B 2 "-" msa.clustal
MN908947.3 TAGTTTGTCTGGTTTTAAGCTAAAAGACTGTGTTATGTATGCATCAGCTGTAGTGTTACT
Consensus_test.consensus_thresho TAGTTTGNNNNN----AAGCTAAAAGACTGTGTTATGTATGCATCAGCTGTAGTGTTACT
Please let me know if you need any further information! Thanks :)
Hi @drpatelh, I also ran into this because our samples were rejected by GISAID when the gap is called with 5, 7 or 8 its, instead of the in frame 9nt deletion.
Hi,
I had a similar problem as @drpatelh and @IsabelFE with a Delta variant (B.1.617.2), which has overwhelming evidence for a 6 bp deletion after position 28247. However, I'm unsure if it's related to the alignment artifacts.
My initial command provided a reference to mpileup, which turns on BAQ:
samtools mpileup -A -d 0 -Q 0 --reference NC_045512.fasta sample.primertrim.sorted.bam | ivar consensus -p sample.consensus -t 0.9 -n N
REGION | POS | REF | ALT | REF_DP | REF_RV | REF_QUAL | ALT_DP | ALT_RV | ALT_QUAL | ALT_FREQ | TOTAL_DP | PVAL | PASS | GFF_FEATURE | REF_CODON | REF_AA | ALT_CODON | ALT_AA
NC_045512.2 | 28247 | A | -GATTTC | 2337 | 986 | 40 | 2233 | 0 | 20 | 0.900766 | 2479 | 0 | TRUE | NA | NA | NA | NA | NA
The consensus ends up with 2 gaps and 4 Ns in that region, presumably because of two weakly supported substitutions:
If I run ivar consensus
with the same command, except either skipping --reference
or adding the -b
/--no-BAQ
flag for mpileup
, I get the deletions incorporated as expected, despite a weakly supported substitution within the deletion span:
If I run ivar consensus
with -t 0
(to take the most common base at each site), and do not disable BAQ in mpileup
, the deletion is incorporated as I would expect.
samtools mpileup -A -d 0 -Q 0 --reference NC_045512.fasta sample.primertrim.sorted.bam | ivar consensus -p sample.t0.consensus -n N
However, this is not ideal, because a further deletion at position 28274 is skipped, despite being strongly supported. The skipped single base deletion is included if I disable BAQ to mpileup.
Below is an overall summary of what I get with all the different settings, with the 'correct' seqs having red boxes. It really demonstrates the unexpected differences with/without BAQ in the mpileup step.
Thanks, Charles
I am experiencing similar issue as @charlesfoster, @drpatelh and @IsabelFE. Currently spreading variants have common deletions at 21765-21770, 11288-11296, 21991-21993 and other positions. I only realised it after GISAID rejected some of our sequences. In my limited analysis, it is mostly caused by some of the reads being incorrectly mapped to the deletion region rather than the sequence before or after. I have noticed the same thing with ivar variants vs ivar consensus as @IsabelFE. Variants using same parameters as consensus identify deletions correctly.
I have seen errors and inconsistencies for indels as well e.g. 6bp INFRAME deletion at position 21766 being called a Gap of 5 nucleotide(s) (FRAMESHIFT). It is not a frameshift this is an inframe deletion.
Also, I have seen softclips under deletions that do not map to the other side of the deletion.
For small deletion calls one thing that seems easy is to make sure only reads embedded in the read sequence are used in the mpileup for calling. This can be done by limiting reads with cigar strings that match the regexp \d+M\d+D\d+M - later maybe the ones that have ends or soft-clips under the deletion can be inspected/added (similarly to things under primers) and for very large deletions maybe assembly can push the variation calling farther - I have tried these secondary steps but they extend beyond the scope of reference variant calling.
more stuff:
Example Read Sets:
bbbbbbb------bbbbbbb
bbbbbbbbsss bbbbbbbbssss -------------ssbbbbbbb
bbbbbbbbbbbbbbbbbbb
pileup: 4444444311113333333 5555555555555555555
freq=0.5
=> Wrong answer for consensus/variants bbbbbbbb-----bbbbbbb
because bbbbbbbb-----bbbbbbb
is NEVER observed!
in the case of our 5bp deletion, not a single read had that 5bp embedded in the cigar string as [0-9]+M[0-9]+D[0-9]+M!
[mcgold@grc002 Covid19]$ samtools view 768230.merged.trimmed.sorted.markeddups.bam | grep '5D' | grep -P 'GGTTCC|CCCAGAG' => nothing
many had [0-9]M6D[0-9]M and they match the indel flanks
[mcgold@grc002 Covid19]$ samtools view 768230.merged.trimmed.sorted.markeddups.bam | grep '6D' | grep -P 'GGTTCC|CCCAGAG'
MC:Z:16M6D58M MD:Z:16^TACATG58 RG:Z:AAAKJW7M5.1.GGATACCAGA.GTAAGCAACG NM:i:6 MQ:i:60 AS:i:62 XS:i:0 VH00123:1:AAAKJW7M5:1:2309:57636:13930 99 NC_045512.2 21749 60 16M6D70M15S = 21749 92 ACTTGGTTCCATGCTATCTCTGGGACCAATGGTACTAAGAGGTTTGATAACCCTGTCCTACCATTTAATGATGGTGTTTATTTTGCCTGTCTCTTATACAC CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC MC:Z:15S16M6D70M MD:Z:16^TACATG70 RG:Z:AAAKJW7M5.1.GGATACCAGA.GTAAGCAACG NM:i:6 MQ:i:60 AS:i:74 XS:i:0 VH00123:1:AAAKJW7M5:1:2403:42393:43633 1187 NC_045512.2 21749 60 16M6D54M = 21749 76 ACTTGGTTCCATGCTATCTCTGGGACCAATGGTACTAAGAGGTTTGATAACCCTGTCCTACCATTTAATG CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC MC:Z:16M6D54M MD:Z:16^TACATG54 RG:Z:AAAKJW7M5.1.GGATACCAGA.GTAAGCAACG NM:i:6 MQ:i:60 AS:i:58 XS:i:0 VH00123:1:AAAKJW7M5:1:2408:29952:32521 99 NC_045512.2 21749 60 16M6D85M = 21889 241 ACTTGGTTCCATGCTATCTCTGGGACCAATGGTACTAAGAGGTTTGATAACCCTGTCCTACCATTTAATGATGGTGTTTATTTTGCTTCCACTGAGAAGTC CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC;CCCCCCCCCCCCCCC;CCCCCCCCCCCCCCCCCC;CCCCCCCCCCCCCC MC:Z:101M MD:Z:16^TACATG85 RG:Z:AAAKJW7M5.1.GGATACCAGA.GTAAGCAACG NM:i:6 MQ:i:60 AS:i:89 XS:i:0 VH00123:1:AAAKJW7M5:1:2604:53697:12681 1123 NC_045512.2 21749 60 16M6D81M4S = 21749 103 ACTTGGTTCCATGCTATCTCTGGGACCAATGGTACTAAGAGGTTTGATAACCCTGTCCTACCATTTAATGATGGTGTTTATTTTGCTTCCACTGAGACTGT CCCCCCCCCCC;CCCCCCCC;CC;C;CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC;CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC MC:Z:4S16M6D81M MD:Z:16^TACATG81 RG:Z:AAAKJW7M5.1.GGATACCAGA.GTAAGCAACG NM:i:6 MQ:i:60 AS:i:85 XS:i:0 VH00123:1:AAAKJW7M5:1:2611:57011:1189 163 NC_045512.2 21749 60 16M6D85M = 21769 121 ACTTGGTTCCATGCTATCTCTGGGACCAATGGTACTAAGAGGTTTGATAACCCTGTCCTACCATTTAATGATGGTGTTTATTTTGCTTCCACTGAGAAGTC CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC;CCCCCCCCCCCCCCCCCCCCCCCCCCCCC MC:Z:101M MD:Z:16^TACATG85 RG:Z:AAAKJW7M5.1.GGATACCAGA.GTAAGCAACG NM:i:6 MQ:i:60 AS:i:89 XS:i:0 VH00123:1:AAAKJW7M5:1:1109:51179:47760 147 NC_045512.2 21749 60 4S16M6D81M = 21749 -103 ACAGACTTGGTTCCATGCTATCTCTGGGACCAATGGTACTAAGAGGTTTGATAACCCTGTCCTACCATTTAATGATGGTGTTTATTTTGCTTCCACTGAGA CCCCCCCCCCCCCCCCCC;C-CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC;CCCCCCCCCCC;CCCCCCCCCCCCCCCCCCCC MC:Z:16M6D81M4S MD:Z:16^TACATG81 RG:Z:AAAKJW7M5.1.GGATACCAGA.GTAAGCAACG NM:i:6 MQ:i:60 AS:i:85 XS:i:0 VH00123:1:AAAKJW7M5:1:1201:43832:35001 1107 NC_045512.2 21749 60 16M6D37M = 21749 -59 ACTTGGTTCCATGCTATCTCTGGGACCAATGGTACTAAGAGGTTTGATAACCC CCCCCCCCCCCCCCCC-C-C;CCCCCCCC;CCCCCCCCCCCCCCCCCCCCCCC MC:Z:16M6D37M MD:Z:16^TACATG37 RG:Z:AAAKJW7M5.1.GGATACCAGA.GTAAGCAACG NM:i:6 MQ:i:60 AS:i:41 XS:i:0 VH00123:1:AAAKJW7M5:1:1201:38928:35512 1107 NC_045512.2 21749 60 16M6D37M = 21749 -59 ACTTGGTTCCATGCTATCTCTGGGACCAATGGTACTAAGAGGTTTGATAACCC CCCCCCCCCCCCCC;C-CCCCCCCCCCCCCCCCCCCCCCCCC-CCCCCCCCCC MC:Z:16M6D37M MD:Z:16^TACATG37 RG:Z:AAAKJW7M5.1.GGATACCAGA.GTAAGCAACG NM:i:6 MQ:i:60 AS:i:41 XS:i:0 VH00123:1:AAAKJW7M5:1:1305:22719:38882 147 NC_045512.2 21749 60 16M6D37M = 21749 -59 ACTTGGTTCCATGCTATCTCTGGGACCAATGGTACTAAGAGGTTTGATAACCC CCCCCCCCCCCCCCCCCC;CCCCCCCCCCCCCCCCCCCCCCCCCCCCC;CCCC MC:Z:16M6D37M MD:Z:16^TACATG37 RG:Z:AAAKJW7M5.1.GGATACCAGA.GTAAGCAACG NM:i:6 MQ:i:60 AS:i:41 XS:i:0
So it was a 6bp deletion not a 5bp frameshift
We also ran into issues with gisaid submission, but as far as I can tell now, all is correct with the consensus calling. We checked the aligments in IGV and our variant calls which we do with out own pipeline MTBseq /https://github.com/ngs-fzb/MTBseq_source). We also observed 1bp Insertions as N in the consensus as well as deletions partially with Ns. Insertion was not called at all in our pipeline and deletions were called fully. But this is just a matter of cutoffs and implementation. All positions masked with N were ambigious positions not fullfilling the cutoff of 0.9 and therefore were correctly masked. There seems to be no IUAPC code for ambigious positions inlcuding INDELS. We are pretty sure that aligments are correct as part of the pipeline is a realignment step around INDELS to account for the BWA mapping problems. But matter of fact a fraction of reads is just not showing the deletion, but WT. We were also a bit confused at the beginning, but this seems to be correct.
1) It is not correct if NSIAD is reporting a frame shift of 5bp that is never observed when the right call is a 6bp inframe deletion. 2) I will take a look at your pipeline - thanks for sharing! 3) indel realignment is not appropriate for soft clips under the deletion that do not map to the other end of the deletion 4) Noone has recommended using 0.9 5) There seems to be mixed haplotypes - have you observed this?
Can we set up a test dataset and will you be able to post the optimal cutoffs? One thing that would help is if these became standardized to avoid going off on tangents and narratives that don't address the data.
All the best , Daniel
On Thu, Jul 15, 2021 at 7:57 AM cutpatel @.***> wrote:
We also ran into issues with gisaid submission, but as far as I can tell now, all is correct with the consensus calling. We checked the aligments in IGV and our variant calls which we do with out own pipeline MTBseq /https://github.com/ngs-fzb/MTBseq_source). We also observed 1bp Insertions as N in the consensus as well as deletions partially with Ns. Insertion was not called at all in our pipeline and deletions were called fully. But this is just a matter of cutoffs and implementation. All positions masked with N were ambigious positions not fullfilling the cutoff of 0.9 and therefore were correctly masked. There seems to be no IUAPC code for ambigious positions inlcuding INDELS. We are pretty sure that aligments are correct as part of the pipeline is a realignment step around INDELS to account for the BWA mapping problems. But matter of fact a fraction of reads is just not showing the deletion, but WT. We were also a bit confused at the beginning, but this seems to be correct.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/andersen-lab/ivar/issues/85#issuecomment-880763865, or unsubscribe https://github.com/notifications/unsubscribe-auth/AELLUIZOI763LTKKB4FECODTX3ZM3ANCNFSM4WVW4UGA .
-- Daniel J McGoldrick Ph.D University of Washington UW Genome Sciences Center (GRC), Nickerson Lab
Box 355065Seattle, WA 98195(206) 685-7342
Absolutely right, GISAID should not call such things frameshift deletions. But what can they do with only having the consensus submitted. In our case bbbbN[--------]bbbb. Looking at our variant calls, 75% frequency cutoff, we get the 9bp deletion. With ivar consensus and the 90% cutoff we get the above mentioned construct with the N, as one position is ambigious. By national regulations we are bound to this cutoff. If we would raise our own variant calling cutoff to 90% we also would only get the 8bp deletion, as all variant calls are position specific to the reference. So at a certain cutoff positions start to be ambigious. What would be your suggestion how to represent this in a consensus?
Sry, I totally missed the point with the soft clipping. And you are definetly right. How a consensus generator can deal with this?
Hi,
Have there been any updates to the issues discussed above?
We are currently using ivar for analyzing our national SARS-CoV-2 samples and have also experienced the same issues discussed above. That is, we have seen a large number of samples that get a frameshift in the consensus due to ivar consensus
incorporating “N”s in well supported deletions.
Thanks! Tanja
hi all, this should be addressed by PR #129 on branch variant_depth_issues. if anyone wants to test, that would be great!
Describe the bug
We have observed that
ivar variants
can generate false positive variant calls for genomes that contain insertions or deletions. Here is an example from a public genome that contains the 9bp orf1ab SGF deletion:Position 11287 has an exceptionally well supported deletion (nearly 5000x coverage) and three very weak substitutions within the deleted region.
The root cause of this is alignment errors around the ends of reads where bwa (or any other aligner) prefers to introduce mismatches rather than a long gap:
I believe this issue is much more likely to occur with nextera-based sequencing data as reads can start/end anywhere, and if you sequence deeply enough inevitably some reads will be difficult to align.
In this particular case the consensus sequence is correct and the spurious calls only appear in
.variants.tsv
. In other cases however I think alignment problems can introduceN
s in the consensus adjacent to, or within, the deletion which can cause issues with variant consequence prediction.This issue may be related to #79 and #83, although I think it is separate since it is driven by alignment artifacts.
To Reproduce
Run ivar using the standard pipeline on this data:
https://www.ebi.ac.uk/ena/browser/view/ERX4873573