artic-network / fieldbioinformatics

The ARTIC field bioinformatics pipeline
MIT License
110 stars 68 forks source link

Duplicate variants in VCF outputs causing bcftools consensus error #53

Open DarianHole opened 3 years ago

DarianHole commented 3 years ago

Hi,

I've run into what seems to be a rare issue (first time I've seen it in 100's of samples) where there are duplicate variants in my *.pass.vcf.gz and *.merged.vcf file which is leading to bcftools consensus to generate an error with an output of:

The site MN908947.3:27967 overlaps with another variant, skipping...
The fasta sequence does not match the REF allele at MN908947.3:28250:
     .vcf: [T]
     .vcf: [TCTG] <- (ALT)
     .fa:  [G]TTCATCTAAACGAACAAACTAAAATGTCTGATAATGGACCCCAAAATCAGCGAAATGCA...

I also get the output from vcf_merge of:

Found primer binding site mismatch: {}
Found primer binding site mismatch: {}

Issue is similar to #21 from the looks of it although I checked and I am on the latest version so it must be something else.

For reference I'm currently using the following:

Sorry I don't know too much more, I'll keep looking at it in the meantime. Thanks for any help you can offer and sorry if I missed a solution somewhere,

Darian

nickloman commented 3 years ago

Thanks for the report!

So this can happen occasionally and it's not clear what can be easily done about it. Typically this will be if you have two competing edits in your VCF file. Can you check the pass and fail VCFs and look around 27967 and see what variants are reported?

DarianHole commented 3 years ago

Thanks for the quick response!

Fails vcf has nothing but the one variant:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  sample
MN908947.3  24981   .   CT  C   564.5   PASS    TotalReads=990;SupportFraction=0.536596;SupportFractionByStrand=0.683247,0.391125;BaseCalledReadsWithVariant=304;BaseCalledFraction=0.304;AlleleCount=1;StrandSupport=156,303,337,194;StrandFisherTest=196;SOR=1.181;RefContext=TGATCCTTTGCA;Pool=nCoV-2019_1   GT  1

Pass vcf has a duplicate at that spot along with all variants after it. These duplicates are also seen in merged.vcf:

...
MN908947.3  25563   .   G   T   11645.8 PASS    TotalReads=990;SupportFraction=0.989544;SupportFractionByStrand=0.989015,0.990068;BaseCalledReadsWithVariant=883;BaseCalledFraction=0.883;AlleleCount=1;StrandSupport=5,5,488,492;StrandFisherTest=1;SOR=0.628354;RefContext=TTTCAGAGCGC;Pool=nCoV-2019_1   GT  1
MN908947.3  27967   .   G   T   10312.8 PASS    TotalReads=992;SupportFraction=0.963149;SupportFractionByStrand=0.978876,0.947486;BaseCalledReadsWithVariant=864;BaseCalledFraction=0.864;AlleleCount=1;StrandSupport=10,26,485,471;StrandFisherTest=20;SOR=0.213;RefContext=GTCATGTACTC;Pool=nCoV-2019_1   GT  1
MN908947.3  27967   .   G   T   10284.5 PASS    TotalReads=985;SupportFraction=0.966758;SupportFractionByStrand=0.967941,0.965573;BaseCalledReadsWithVariant=826;BaseCalledFraction=0.827655;AlleleCount=1;StrandSupport=16,17,477,475;StrandFisherTest=1;SOR=0.634787;RefContext=GTCATGTACTC;Pool=nCoV-2019_2  GT  1
MN908947.3  28250   .   T   TCTG    17682.8 PASS    TotalReads=991;SupportFraction=0.964904;SupportFractionByStrand=0.956657,0.973135;BaseCalledReadsWithVariant=804;BaseCalledFraction=0.804;AlleleCount=1;StrandSupport=21,13,474,483;StrandFisherTest=8;SOR=0.368682;RefContext=TTAGATTTCAT;Pool=nCoV-2019_1 GT  1
MN908947.3  28250   .   T   TCTG    17052.5 PASS    TotalReads=937;SupportFraction=0.96103;SupportFractionByStrand=0.960555,0.961519;BaseCalledReadsWithVariant=742;BaseCalledFraction=0.781053;AlleleCount=1;StrandSupport=19,18,456,444;StrandFisherTest=0;SOR=0.670323;RefContext=TTAGATTTCAT;Pool=nCoV-2019_2   GT  1
MN908947.3  28253   .   CA  C   17682.8 PASS    TotalReads=991;SupportFraction=0.984457;SupportFractionByStrand=0.973536,0.995355;BaseCalledReadsWithVariant=834;BaseCalledFraction=0.834;AlleleCount=1;StrandSupport=13,2,482,494;StrandFisherTest=24;SOR=0.0992646;RefContext=GATTTCATCTAA;Pool=nCoV-2019_1   GT  1
MN908947.3  28253   .   CA  C   17052.5 PASS    TotalReads=937;SupportFraction=0.985751;SupportFractionByStrand=0.980644,0.991002;BaseCalledReadsWithVariant=783;BaseCalledFraction=0.824211;AlleleCount=1;StrandSupport=9,4,466,458;StrandFisherTest=6;SOR=0.234992;RefContext=GATTTCATCTAA;Pool=nCoV-2019_2   GT  1
MN908947.3  28310   .   C   T   11299.3 PASS    TotalReads=990;SupportFraction=0.948394;SupportFractionByStrand=0.957024,0.939764;BaseCalledReadsWithVariant=749;BaseCalledFraction=0.749;AlleleCount=1;StrandSupport=21,30,474,465;StrandFisherTest=5;SOR=0.444288;RefContext=ATGCACCCCGC;Pool=nCoV-2019_1 GT  1
MN908947.3  28310   .   C   T   11198.6 PASS    TotalReads=925;SupportFraction=0.953539;SupportFractionByStrand=0.962119,0.944599;BaseCalledReadsWithVariant=696;BaseCalledFraction=0.741214;AlleleCount=1;StrandSupport=18,25,454,428;StrandFisherTest=7;SOR=0.500405;RefContext=ATGCACCCCGC;Pool=nCoV-2019_2  GT  1
nickloman commented 3 years ago

OK, this probably isn't the issue I was thinking about, but more relates to the duplicate indel at 28250. I'll check this is the case and it should be easy to fix. In the meantime if you remove one of those duplicate lines at 28250 in the VCF you might find that bcftools consensus can be run manually to generate a consensus genome.

DarianHole commented 3 years ago

You are correct removing one of the spots at 28250 allowed me to generate a consensus genome! Thanks!

[W::hts_idx_load3] The index file is older than the data file: nml_barcode21.pass.vcf.gz.tbi
Note: the --sample option not given, applying all records regardless of the genotype
The site MN908947.3:27967 overlaps with another variant, skipping...
The site MN908947.3:28253 overlaps with another variant, skipping...
The site MN908947.3:28310 overlaps with another variant, skipping...
Applied 16 variants
nickloman commented 3 years ago

@will-rowe I think we need an enhancement which is to systematically remove duplicate variants present in both pools before calling bcftools consensus.

DarianHole commented 3 years ago

I've been messing around a bit more with bcftools consensus and even if I do not remove any of the duplicates, I can still create a consensus file with the normal command:

bcftools consensus -f barcode21.preconsensus.fasta barcode21.pass.vcf.gz -m barcode21.coverage_mask.txt -o barcode21.consensus.fasta

So something before must be causing the issue leading up to this one

KatSteinke commented 2 years ago

Is there any chance it could be related to this issue? It got resolved in bcftools 1.11 from the release notes, but ARTIC is using a lower version...