andersen-lab / ivar

iVar is a computational package that contains functions broadly useful for viral amplicon-based sequencing.
https://andersen-lab.github.io/ivar/html/
GNU General Public License v3.0
118 stars 40 forks source link

spurious variants caused by alignment artifacts around indels #85

Open jts opened 3 years ago

jts commented 3 years ago

Describe the bug

We have observed that ivar variants can generate false positive variant calls for genomes that contain insertions or deletions. Here is an example from a public genome that contains the 9bp orf1ab SGF deletion:

REGION      POS    REF  ALT         REF_DP  REF_RV  REF_QUAL  ALT_DP  ALT_RV  ALT_QUAL  ALT_FREQ
MN908947.3  11287  G    -TCTGGTTTT  3914    2751    46        4907    0       20        0.960274
MN908947.3  11288  T    A           2       2       34        20      2       35        0.909091
MN908947.3  11291  G    A           9       6       48        3       0       32        0.25
MN908947.3  11296  T    G           8       6       42        29      17      50        0.74359

Position 11287 has an exceptionally well supported deletion (nearly 5000x coverage) and three very weak substitutions within the deleted region.

The root cause of this is alignment errors around the ends of reads where bwa (or any other aligner) prefers to introduce mismatches rather than a long gap:

alignment_error

I believe this issue is much more likely to occur with nextera-based sequencing data as reads can start/end anywhere, and if you sequence deeply enough inevitably some reads will be difficult to align.

In this particular case the consensus sequence is correct and the spurious calls only appear in .variants.tsv. In other cases however I think alignment problems can introduce Ns in the consensus adjacent to, or within, the deletion which can cause issues with variant consequence prediction.

This issue may be related to #79 and #83, although I think it is separate since it is driven by alignment artifacts.

To Reproduce

Run ivar using the standard pipeline on this data:

https://www.ebi.ac.uk/ena/browser/view/ERX4873573

gkarthik commented 3 years ago

Hello Jared, sorry about the delay in response, we've been occupied with epi work. Thank you for pointing out this issue.

As you stated, I would expect consensus calling to work in this case. Filtering variant calls based on the position in the read might help eliminate some of these false positives. We need to think about how to implement a little more but will keep you updated!

IsabelFE commented 3 years ago

Hello,

I am having exactly the same issue around position 11287 of the SARS-CoV-2 genome. Based on the BAM file there is an obvious 9nts deletion. I see this in many genomes, and some of them had been rejected by GISAID because the consensus from ivar indicates only a 7-8nts gap.

Here you can see an example bam file on igv:

image

I calculated the consensus with several settings for mpileup and ivar consensus:

Screen Shot 2021-06-07 at 5 57 04 PM

IsabelFE commented 3 years ago

Using -f in mpileup is also helping in situations like these were a mutation is called at the very end of a read: image image image

IsabelFE commented 3 years ago

However, I am also finding situations where maybe using -f in mpileup is creating artifacts. This frameshift deletion in position 28271 is also present in many genomes. Here you can see how 948 reads support the deletion, while 176 call an A: image In this case, only the settings using -f call an A in this position, while the rest identify the gap. image I think that this gap is well supported, so I am not sure about using -f

IsabelFE commented 3 years ago

@jts, would you mind sharing what settings did you end up using for mpileup and ivar consensus? Thanks

gkarthik commented 3 years ago

Hey Isabel, any chance you could share the mpileup file for that region here? The BAQ score calculation helps but in certain cases like you pointed out introduces artifacts which is why typically I didn't include the -f flag for consensus calling but this requires a closer look on my end.

IsabelFE commented 3 years ago

Here you have the pileups for the conflictive regions using all the settings I tried:

So, BAQ is helping in the first 3 situations, but doesn't identify the indel in the 4th one.

IsabelFE commented 3 years ago

Hi, I think that the problem is coming from an inconsistency between ivar consensus and ivar variants. If I first create a pile file with samtools mpileup like this:

samtools mpileup -d 0 -q 30 -f wuhan_trim3polyA.fasta "2021_0405_01C21.sorted.bam" > "2021_0405_01C21.pile"

And them I pipe that file into both ivar consensus and ivar variants with the same settings:

< "2021_0405_01C21.pile" | ivar consensus -t 0.6 -m 10 -p "2021_0405_01C21.consensus"
< "2021_0405_01C21.pile" | ivar variants -t 0.6 -m 10 -r wuhan_trim3polyA.fasta -g GCF_009858895.2_ASM985889v3_genomic.gff -p "2021_0405_01C21.variant"

Then, the variants file looks great:

REGION  POS REF ALT REF_DP  REF_RV  REF_QUAL    ALT_DP  ALT_RV  ALT_QUAL    ALT_FREQ    TOTAL_DP    PVAL    PASS    GFF_FEATURE REF_CODON   REF_AA  ALT_CODON   ALT_AA
NC_045512.2 241 C   T   62  18  39  6507    2431    38  0.983376    6617    0   TRUE    NA  NA  NA  NA  NA
NC_045512.2 1059    C   T   15  10  35  3020    2418    32  0.993421    3040    0   TRUE    cds-YP_009724389.1  ACC T   ATC I
NC_045512.2 1059    C   T   15  10  35  3020    2418    32  0.993421    3040    0   TRUE    cds-YP_009725295.1  ACC T   ATC I
NC_045512.2 3037    C   T   76  38  34  9375    3757    35  0.988924    9480    0   TRUE    cds-YP_009724389.1  TTC F   TTT F
NC_045512.2 3037    C   T   76  38  34  9375    3757    35  0.988924    9480    0   TRUE    cds-YP_009725295.1  TTC F   TTT F
NC_045512.2 3987    C   T   6   5   34  709 368 33  0.988842    717 0   TRUE    cds-YP_009724389.1  ACA T   ATA I
NC_045512.2 3987    C   T   6   5   34  709 368 33  0.988842    717 0   TRUE    cds-YP_009725295.1  ACA T   ATA I
NC_045512.2 6101    G   A   4   0   32  35  18  33  0.897436    39  8.225e-18   TRUE    cds-YP_009724389.1  GGT G   AGT S
NC_045512.2 6101    G   A   4   0   32  35  18  33  0.897436    39  8.225e-18   TRUE    cds-YP_009725295.1  GGT G   AGT S
NC_045512.2 7201    A   G   1   1   34  3125    1260    35  0.99681 3135    0   TRUE    cds-YP_009724389.1  TCA S   TCG S
NC_045512.2 7201    A   G   1   1   34  3125    1260    35  0.99681 3135    0   TRUE    cds-YP_009725295.1  TCA S   TCG S
NC_045512.2 8809    C   T   11  2   34  2396    784 36  0.991722    2416    0   TRUE    cds-YP_009724389.1  GAC D   GAT D
NC_045512.2 8809    C   T   11  2   34  2396    784 36  0.991722    2416    0   TRUE    cds-YP_009725295.1  GAC D   GAT D
NC_045512.2 9867    T   C   2   1   34  164 109 33  0.964706    170 3.37416e-97 TRUE    cds-YP_009724389.1  CTT L   CCT P
NC_045512.2 9867    T   C   2   1   34  164 109 33  0.964706    170 3.37416e-97 TRUE    cds-YP_009725295.1  CTT L   CCT P
NC_045512.2 11287   G   -TCTGGTTTT  506 67  30  7986    0   20  0.8543  9348    0   TRUE    NA  NA  NA  NA  NA
NC_045512.2 11291   G   A   11  6   32  30  12  28  0.731707    41  2.92495e-49 TRUE    cds-YP_009724389.1  GGT G   AGT S
NC_045512.2 11291   G   A   11  6   32  30  12  28  0.731707    41  2.92495e-49 TRUE    cds-YP_009725295.1  GGT G   AGT S
NC_045512.2 11296   T   G   22  2   23  107 23  30  0.798507    134 3.44369e-213    TRUE    cds-YP_009724389.1  TTT F   TTG L
NC_045512.2 11296   T   G   22  2   23  107 23  30  0.798507    134 3.44369e-213    TRUE    cds-YP_009725295.1  TTT F   TTG L
NC_045512.2 12784   C   T   1622    544 35  4358    1900    35  0.72791 5987    0   TRUE    cds-YP_009724389.1  AAC N   AAT N
NC_045512.2 12784   C   T   1622    544 35  4358    1900    35  0.72791 5987    0   TRUE    cds-YP_009725295.1  AAC N   AAT N
NC_045512.2 14408   C   T   7   1   32  1669    681 35  0.992861    1681    0   TRUE    cds-YP_009724389.1  CCT P   CTT L
NC_045512.2 16500   A   C   0   0   0   685 334 34  0.991317    691 0   TRUE    cds-YP_009724389.1  CAA Q   CAC H
NC_045512.2 21575   C   T   4   2   34  558 250 37  0.989362    564 0   TRUE    cds-YP_009724390.1  CTT L   TTT F
NC_045512.2 21846   C   T   4   1   34  699 317 34  0.99431 703 0   TRUE    cds-YP_009724390.1  ACT T   ATT I
NC_045512.2 22021   G   T   6   3   34  606 282 36  0.985366    615 0   TRUE    cds-YP_009724390.1  ATG M   ATT I
NC_045512.2 22320   A   G   1   0   68  947 393 40  0.997893    949 0   TRUE    cds-YP_009724390.1  GAT D   GGT G
NC_045512.2 22992   G   A   22  7   35  4086    2126    35  0.989826    4128    0   TRUE    cds-YP_009724390.1  AGC S   AAC N
NC_045512.2 23403   A   G   4   0   42  5323    2094    37  0.998687    5330    0   TRUE    cds-YP_009724390.1  GAT D   GGT G
NC_045512.2 24432   A   G   4   1   30  2437    1168    37  0.997952    2442    0   TRUE    cds-YP_009724390.1  CAA Q   CGA R
NC_045512.2 25517   C   T   77  29  36  2877    1199    38  0.971959    2960    0   TRUE    cds-YP_009724391.1  CCT P   CTT L
NC_045512.2 25563   G   T   2   0   34  3107    1603    39  0.994558    3124    0   TRUE    cds-YP_009724391.1  CAG Q   CAT H
NC_045512.2 25830   T   C   12  5   40  4634    2891    36  0.995917    4653    0   TRUE    cds-YP_009724391.1  TTT F   TTC F
NC_045512.2 25968   A   G   8   3   33  5275    2922    36  0.99773 5287    0   TRUE    cds-YP_009724391.1  AAA K   AAG K
NC_045512.2 27739   C   T   31  26  36  4565    3542    34  0.992176    4601    0   TRUE    cds-YP_009724395.1  CTC L   TTC F
NC_045512.2 27925   C   T   1   0   34  570 341 33  0.991304    575 0   TRUE    cds-YP_009724396.1  ACA T   ATA I
NC_045512.2 28270   T   -A  1012    376 36  836 0   20  0.822835    1016    2.97724e-179    TRUE    NA  NA  NA  NA  NA
NC_045512.2 28311   C   T   11  1   50  1166    447 39  0.98563 1183    0   TRUE    cds-YP_009724397.2  CCC P   CTC L
NC_045512.2 28879   T   G   21  5   37  6232    2168    39  0.995527    6260    0   TRUE    cds-YP_009724397.2  AGT S   AGG R

It identifies the indels at positions 11287 and 28270 without calling those mutations at the edges of reads on 19680 and 29738 that were not well supported.

But, the consensus read fails to call the -A at position 28270. How is this possible?

drpatelh commented 3 years ago

Hi @gkarthik ! Hope you are well! I have also seen these sorts of issues prop up recently for real samples from the B.1.318 lineage that I believe contain a 9bp deletion. We only really realised after the records were rejected by GSAID. As you can see in the image below, and as mentioned earlier it does seem to be an issue with the way that standard aligners choose to introduce indels around the ends of reads - Bowtie2 in this case.

9bp_indel_igv

This discrepancy was initially observed after running viralrecon with default parameters:

nextflow run nf-core/viralrecon \
    --input samplesheet.csv \
    --platform illumina \
    --protocol amplicon \
    --genome 'MN908947.3' \
    --primer_set artic \
    --primer_set_version 3 \
    --skip_assembly \
    -r 2.1 \
    -profile singularity

Minimal test data

I have created a minimal test example for you to be able to hopefully fully reproduce the error: test_9bp_deletion.tar.gz

Tar archive contents * `environment.yml` - Conda environment file * `GCA_009858895.3_ASM985889v3_genomic.200409.gff` - GFF annotation * `nCoV-2019.reference.fasta` - MN908947.3 reference fasta * `nCoV-2019.reference.fasta.fai` * `test.ivar_trim.sorted.bam` - Sub-sampled BAM containing reads mapping around the 9bp deletion * `test.ivar_trim.sorted.bam.bai` * `test.consensus.fa` - Output from `ivar consensus` * `test.consensus.qual.txt` - Output from `ivar consensus` * `test.tsv` - Output from `ivar variants` * `multi.fa` - Multi-fasta from `cat nCoV-2019.reference.fasta test.consensus.fa > multi.fa` * `msa.clustal` - Multiple sequence alignment generated from `multi.fa` with Muscle

Steps to reproduce

  1. Install Conda environment with ivar and muscle

    conda env create -f environment.yml
  2. Run ivar variants via default parameters used in viralrecon

    samtools mpileup \
        --count-orphans \
        --no-BAQ \
        --max-depth 0 \
        --min-BQ 0 \
        --reference nCoV-2019.reference.fasta \
        test.ivar_trim.sorted.bam | \
        ivar variants \
            -t 0.25 \
            -q 20 \
            -m 10 \
            -g GCA_009858895.3_ASM985889v3_genomic.200409.gff \
            -r nCoV-2019.reference.fasta \
            -p test
  3. Run ivar consensus via default parameters used in viralrecon

    samtools mpileup \
        --count-orphans \
        --no-BAQ \
        --max-depth 0 \
        --min-BQ 0 \
        -aa \
        --reference nCoV-2019.reference.fasta \
        test.ivar_trim.sorted.bam | \
        ivar consensus \
            -t 0.75 \
            -q 20 \
            -m 10 \
            -n N \
            -p test.consensus
  4. Concatenate reference genome with consensus generated by ivar consensus

    cat nCoV-2019.reference.fasta test.consensus.fa > multi.fa
  5. Generate multiple-sequence alignment between the reference genome and the consensus generated by ivar consensus

    muscle -in multi.fa -out msa.clustal -clw
  6. Get list of deletions called by ivar variants - THESE CALLS LOOK FINE BECAUSE WE HAVE A 9BP DELETION CALLED

    $ awk '$4 ~ /^-/ { print $0; }' test.tsv
    
      MN908947.3      11287   G       -TCTGGTTTT      19490   7667    42      19120   0       20      0.853648        22398   0       TRUE    NA      NA      NA      NA      NA
  7. Get deletions in consensus by ivar consensus - THESE CALLS LOOK WRONG BECAUSE WE HAVE A 4BP DELETION CALLED ADJACENT TO 5Ns

    $ grep -B 2 "-" msa.clustal
    
    MN908947.3                            TAGTTTGTCTGGTTTTAAGCTAAAAGACTGTGTTATGTATGCATCAGCTGTAGTGTTACT
    Consensus_test.consensus_thresho      TAGTTTGNNNNN----AAGCTAAAAGACTGTGTTATGTATGCATCAGCTGTAGTGTTACT

Please let me know if you need any further information! Thanks :)

IsabelFE commented 3 years ago

Hi @drpatelh, I also ran into this because our samples were rejected by GISAID when the gap is called with 5, 7 or 8 its, instead of the in frame 9nt deletion.

charlesfoster commented 3 years ago

Hi,

I had a similar problem as @drpatelh and @IsabelFE with a Delta variant (B.1.617.2), which has overwhelming evidence for a 6 bp deletion after position 28247. However, I'm unsure if it's related to the alignment artifacts.

My initial command provided a reference to mpileup, which turns on BAQ:

samtools mpileup -A -d 0 -Q 0 --reference NC_045512.fasta sample.primertrim.sorted.bam | ivar consensus -p sample.consensus -t 0.9 -n N

REGION | POS | REF | ALT | REF_DP | REF_RV | REF_QUAL | ALT_DP | ALT_RV | ALT_QUAL | ALT_FREQ | TOTAL_DP | PVAL | PASS | GFF_FEATURE | REF_CODON | REF_AA | ALT_CODON | ALT_AA

NC_045512.2 | 28247 | A | -GATTTC | 2337 | 986 | 40 | 2233 | 0 | 20 | 0.900766 | 2479 | 0 | TRUE | NA | NA | NA | NA | NA

image

The consensus ends up with 2 gaps and 4 Ns in that region, presumably because of two weakly supported substitutions:

image

image

If I run ivar consensus with the same command, except either skipping --reference or adding the -b/--no-BAQ flag for mpileup, I get the deletions incorporated as expected, despite a weakly supported substitution within the deletion span:

If I run ivar consensus with -t 0 (to take the most common base at each site), and do not disable BAQ in mpileup, the deletion is incorporated as I would expect.

samtools mpileup -A -d 0 -Q 0 --reference NC_045512.fasta sample.primertrim.sorted.bam | ivar consensus -p sample.t0.consensus -n N

However, this is not ideal, because a further deletion at position 28274 is skipped, despite being strongly supported. The skipped single base deletion is included if I disable BAQ to mpileup.

Below is an overall summary of what I get with all the different settings, with the 'correct' seqs having red boxes. It really demonstrates the unexpected differences with/without BAQ in the mpileup step.

image

Thanks, Charles

PovilasMat commented 3 years ago

I am experiencing similar issue as @charlesfoster, @drpatelh and @IsabelFE. Currently spreading variants have common deletions at 21765-21770, 11288-11296, 21991-21993 and other positions. I only realised it after GISAID rejected some of our sequences. In my limited analysis, it is mostly caused by some of the reads being incorrectly mapped to the deletion region rather than the sequence before or after. I have noticed the same thing with ivar variants vs ivar consensus as @IsabelFE. Variants using same parameters as consensus identify deletions correctly.

dmcgoldrick commented 3 years ago

I have seen errors and inconsistencies for indels as well e.g. 6bp INFRAME deletion at position 21766 being called a Gap of 5 nucleotide(s) (FRAMESHIFT). It is not a frameshift this is an inframe deletion.

Also, I have seen softclips under deletions that do not map to the other side of the deletion.

For small deletion calls one thing that seems easy is to make sure only reads embedded in the read sequence are used in the mpileup for calling. This can be done by limiting reads with cigar strings that match the regexp \d+M\d+D\d+M - later maybe the ones that have ends or soft-clips under the deletion can be inspected/added (similarly to things under primers) and for very large deletions maybe assembly can push the variation calling farther - I have tried these secondary steps but they extend beyond the scope of reference variant calling.

more stuff:

Example Read Sets:


bbbbbbb------bbbbbbb

bbbbbbbbsss bbbbbbbbssss -------------ssbbbbbbb

bbbbbbbbbbbbbbbbbbb


pileup: 4444444311113333333 5555555555555555555

freq=0.5

=> Wrong answer for consensus/variants bbbbbbbb-----bbbbbbb

because bbbbbbbb-----bbbbbbb

is NEVER observed!

in the case of our 5bp deletion, not a single read had that 5bp embedded in the cigar string as [0-9]+M[0-9]+D[0-9]+M!

[mcgold@grc002 Covid19]$ samtools view 768230.merged.trimmed.sorted.markeddups.bam | grep '5D' | grep -P 'GGTTCC|CCCAGAG' => nothing

many had [0-9]M6D[0-9]M and they match the indel flanks

[mcgold@grc002 Covid19]$ samtools view 768230.merged.trimmed.sorted.markeddups.bam | grep '6D' | grep -P 'GGTTCC|CCCAGAG'

MC:Z:16M6D58M MD:Z:16^TACATG58 RG:Z:AAAKJW7M5.1.GGATACCAGA.GTAAGCAACG NM:i:6 MQ:i:60 AS:i:62 XS:i:0 VH00123:1:AAAKJW7M5:1:2309:57636:13930 99 NC_045512.2 21749 60 16M6D70M15S = 21749 92 ACTTGGTTCCATGCTATCTCTGGGACCAATGGTACTAAGAGGTTTGATAACCCTGTCCTACCATTTAATGATGGTGTTTATTTTGCCTGTCTCTTATACAC CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC MC:Z:15S16M6D70M MD:Z:16^TACATG70 RG:Z:AAAKJW7M5.1.GGATACCAGA.GTAAGCAACG NM:i:6 MQ:i:60 AS:i:74 XS:i:0 VH00123:1:AAAKJW7M5:1:2403:42393:43633 1187 NC_045512.2 21749 60 16M6D54M = 21749 76 ACTTGGTTCCATGCTATCTCTGGGACCAATGGTACTAAGAGGTTTGATAACCCTGTCCTACCATTTAATG CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC MC:Z:16M6D54M MD:Z:16^TACATG54 RG:Z:AAAKJW7M5.1.GGATACCAGA.GTAAGCAACG NM:i:6 MQ:i:60 AS:i:58 XS:i:0 VH00123:1:AAAKJW7M5:1:2408:29952:32521 99 NC_045512.2 21749 60 16M6D85M = 21889 241 ACTTGGTTCCATGCTATCTCTGGGACCAATGGTACTAAGAGGTTTGATAACCCTGTCCTACCATTTAATGATGGTGTTTATTTTGCTTCCACTGAGAAGTC CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC;CCCCCCCCCCCCCCC;CCCCCCCCCCCCCCCCCC;CCCCCCCCCCCCCC MC:Z:101M MD:Z:16^TACATG85 RG:Z:AAAKJW7M5.1.GGATACCAGA.GTAAGCAACG NM:i:6 MQ:i:60 AS:i:89 XS:i:0 VH00123:1:AAAKJW7M5:1:2604:53697:12681 1123 NC_045512.2 21749 60 16M6D81M4S = 21749 103 ACTTGGTTCCATGCTATCTCTGGGACCAATGGTACTAAGAGGTTTGATAACCCTGTCCTACCATTTAATGATGGTGTTTATTTTGCTTCCACTGAGACTGT CCCCCCCCCCC;CCCCCCCC;CC;C;CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC;CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC MC:Z:4S16M6D81M MD:Z:16^TACATG81 RG:Z:AAAKJW7M5.1.GGATACCAGA.GTAAGCAACG NM:i:6 MQ:i:60 AS:i:85 XS:i:0 VH00123:1:AAAKJW7M5:1:2611:57011:1189 163 NC_045512.2 21749 60 16M6D85M = 21769 121 ACTTGGTTCCATGCTATCTCTGGGACCAATGGTACTAAGAGGTTTGATAACCCTGTCCTACCATTTAATGATGGTGTTTATTTTGCTTCCACTGAGAAGTC CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC;CCCCCCCCCCCCCCCCCCCCCCCCCCCCC MC:Z:101M MD:Z:16^TACATG85 RG:Z:AAAKJW7M5.1.GGATACCAGA.GTAAGCAACG NM:i:6 MQ:i:60 AS:i:89 XS:i:0 VH00123:1:AAAKJW7M5:1:1109:51179:47760 147 NC_045512.2 21749 60 4S16M6D81M = 21749 -103 ACAGACTTGGTTCCATGCTATCTCTGGGACCAATGGTACTAAGAGGTTTGATAACCCTGTCCTACCATTTAATGATGGTGTTTATTTTGCTTCCACTGAGA CCCCCCCCCCCCCCCCCC;C-CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC;CCCCCCCCCCC;CCCCCCCCCCCCCCCCCCCC MC:Z:16M6D81M4S MD:Z:16^TACATG81 RG:Z:AAAKJW7M5.1.GGATACCAGA.GTAAGCAACG NM:i:6 MQ:i:60 AS:i:85 XS:i:0 VH00123:1:AAAKJW7M5:1:1201:43832:35001 1107 NC_045512.2 21749 60 16M6D37M = 21749 -59 ACTTGGTTCCATGCTATCTCTGGGACCAATGGTACTAAGAGGTTTGATAACCC CCCCCCCCCCCCCCCC-C-C;CCCCCCCC;CCCCCCCCCCCCCCCCCCCCCCC MC:Z:16M6D37M MD:Z:16^TACATG37 RG:Z:AAAKJW7M5.1.GGATACCAGA.GTAAGCAACG NM:i:6 MQ:i:60 AS:i:41 XS:i:0 VH00123:1:AAAKJW7M5:1:1201:38928:35512 1107 NC_045512.2 21749 60 16M6D37M = 21749 -59 ACTTGGTTCCATGCTATCTCTGGGACCAATGGTACTAAGAGGTTTGATAACCC CCCCCCCCCCCCCC;C-CCCCCCCCCCCCCCCCCCCCCCCCC-CCCCCCCCCC MC:Z:16M6D37M MD:Z:16^TACATG37 RG:Z:AAAKJW7M5.1.GGATACCAGA.GTAAGCAACG NM:i:6 MQ:i:60 AS:i:41 XS:i:0 VH00123:1:AAAKJW7M5:1:1305:22719:38882 147 NC_045512.2 21749 60 16M6D37M = 21749 -59 ACTTGGTTCCATGCTATCTCTGGGACCAATGGTACTAAGAGGTTTGATAACCC CCCCCCCCCCCCCCCCCC;CCCCCCCCCCCCCCCCCCCCCCCCCCCCC;CCCC MC:Z:16M6D37M MD:Z:16^TACATG37 RG:Z:AAAKJW7M5.1.GGATACCAGA.GTAAGCAACG NM:i:6 MQ:i:60 AS:i:41 XS:i:0

So it was a 6bp deletion not a 5bp frameshift

cutpatel commented 3 years ago

We also ran into issues with gisaid submission, but as far as I can tell now, all is correct with the consensus calling. We checked the aligments in IGV and our variant calls which we do with out own pipeline MTBseq /https://github.com/ngs-fzb/MTBseq_source). We also observed 1bp Insertions as N in the consensus as well as deletions partially with Ns. Insertion was not called at all in our pipeline and deletions were called fully. But this is just a matter of cutoffs and implementation. All positions masked with N were ambigious positions not fullfilling the cutoff of 0.9 and therefore were correctly masked. There seems to be no IUAPC code for ambigious positions inlcuding INDELS. We are pretty sure that aligments are correct as part of the pipeline is a realignment step around INDELS to account for the BWA mapping problems. But matter of fact a fraction of reads is just not showing the deletion, but WT. We were also a bit confused at the beginning, but this seems to be correct.

dmcgoldrick commented 3 years ago

1) It is not correct if NSIAD is reporting a frame shift of 5bp that is never observed when the right call is a 6bp inframe deletion. 2) I will take a look at your pipeline - thanks for sharing! 3) indel realignment is not appropriate for soft clips under the deletion that do not map to the other end of the deletion 4) Noone has recommended using 0.9 5) There seems to be mixed haplotypes - have you observed this?

Can we set up a test dataset and will you be able to post the optimal cutoffs? One thing that would help is if these became standardized to avoid going off on tangents and narratives that don't address the data.

All the best , Daniel

On Thu, Jul 15, 2021 at 7:57 AM cutpatel @.***> wrote:

We also ran into issues with gisaid submission, but as far as I can tell now, all is correct with the consensus calling. We checked the aligments in IGV and our variant calls which we do with out own pipeline MTBseq /https://github.com/ngs-fzb/MTBseq_source). We also observed 1bp Insertions as N in the consensus as well as deletions partially with Ns. Insertion was not called at all in our pipeline and deletions were called fully. But this is just a matter of cutoffs and implementation. All positions masked with N were ambigious positions not fullfilling the cutoff of 0.9 and therefore were correctly masked. There seems to be no IUAPC code for ambigious positions inlcuding INDELS. We are pretty sure that aligments are correct as part of the pipeline is a realignment step around INDELS to account for the BWA mapping problems. But matter of fact a fraction of reads is just not showing the deletion, but WT. We were also a bit confused at the beginning, but this seems to be correct.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/andersen-lab/ivar/issues/85#issuecomment-880763865, or unsubscribe https://github.com/notifications/unsubscribe-auth/AELLUIZOI763LTKKB4FECODTX3ZM3ANCNFSM4WVW4UGA .

-- Daniel J McGoldrick Ph.D University of Washington UW Genome Sciences Center (GRC), Nickerson Lab

Box 355065Seattle, WA 98195(206) 685-7342

cutpatel commented 3 years ago

Absolutely right, GISAID should not call such things frameshift deletions. But what can they do with only having the consensus submitted. In our case bbbbN[--------]bbbb. Looking at our variant calls, 75% frequency cutoff, we get the 9bp deletion. With ivar consensus and the 90% cutoff we get the above mentioned construct with the N, as one position is ambigious. By national regulations we are bound to this cutoff. If we would raise our own variant calling cutoff to 90% we also would only get the 8bp deletion, as all variant calls are position specific to the reference. So at a certain cutoff positions start to be ambigious. What would be your suggestion how to represent this in a consensus?

Sry, I totally missed the point with the soft clipping. And you are definetly right. How a consensus generator can deal with this?

talnor commented 2 years ago

Hi,

Have there been any updates to the issues discussed above?

We are currently using ivar for analyzing our national SARS-CoV-2 samples and have also experienced the same issues discussed above. That is, we have seen a large number of samples that get a frameshift in the consensus due to ivar consensus incorporating “N”s in well supported deletions.

Thanks! Tanja

cmaceves commented 2 years ago

hi all, this should be addressed by PR #129 on branch variant_depth_issues. if anyone wants to test, that would be great!