Open DarianHole opened 3 years ago
Thanks for the report!
So this can happen occasionally and it's not clear what can be easily done about it. Typically this will be if you have two competing edits in your VCF file. Can you check the pass and fail VCFs and look around 27967 and see what variants are reported?
Thanks for the quick response!
Fails vcf has nothing but the one variant:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample
MN908947.3 24981 . CT C 564.5 PASS TotalReads=990;SupportFraction=0.536596;SupportFractionByStrand=0.683247,0.391125;BaseCalledReadsWithVariant=304;BaseCalledFraction=0.304;AlleleCount=1;StrandSupport=156,303,337,194;StrandFisherTest=196;SOR=1.181;RefContext=TGATCCTTTGCA;Pool=nCoV-2019_1 GT 1
Pass vcf has a duplicate at that spot along with all variants after it. These duplicates are also seen in merged.vcf:
...
MN908947.3 25563 . G T 11645.8 PASS TotalReads=990;SupportFraction=0.989544;SupportFractionByStrand=0.989015,0.990068;BaseCalledReadsWithVariant=883;BaseCalledFraction=0.883;AlleleCount=1;StrandSupport=5,5,488,492;StrandFisherTest=1;SOR=0.628354;RefContext=TTTCAGAGCGC;Pool=nCoV-2019_1 GT 1
MN908947.3 27967 . G T 10312.8 PASS TotalReads=992;SupportFraction=0.963149;SupportFractionByStrand=0.978876,0.947486;BaseCalledReadsWithVariant=864;BaseCalledFraction=0.864;AlleleCount=1;StrandSupport=10,26,485,471;StrandFisherTest=20;SOR=0.213;RefContext=GTCATGTACTC;Pool=nCoV-2019_1 GT 1
MN908947.3 27967 . G T 10284.5 PASS TotalReads=985;SupportFraction=0.966758;SupportFractionByStrand=0.967941,0.965573;BaseCalledReadsWithVariant=826;BaseCalledFraction=0.827655;AlleleCount=1;StrandSupport=16,17,477,475;StrandFisherTest=1;SOR=0.634787;RefContext=GTCATGTACTC;Pool=nCoV-2019_2 GT 1
MN908947.3 28250 . T TCTG 17682.8 PASS TotalReads=991;SupportFraction=0.964904;SupportFractionByStrand=0.956657,0.973135;BaseCalledReadsWithVariant=804;BaseCalledFraction=0.804;AlleleCount=1;StrandSupport=21,13,474,483;StrandFisherTest=8;SOR=0.368682;RefContext=TTAGATTTCAT;Pool=nCoV-2019_1 GT 1
MN908947.3 28250 . T TCTG 17052.5 PASS TotalReads=937;SupportFraction=0.96103;SupportFractionByStrand=0.960555,0.961519;BaseCalledReadsWithVariant=742;BaseCalledFraction=0.781053;AlleleCount=1;StrandSupport=19,18,456,444;StrandFisherTest=0;SOR=0.670323;RefContext=TTAGATTTCAT;Pool=nCoV-2019_2 GT 1
MN908947.3 28253 . CA C 17682.8 PASS TotalReads=991;SupportFraction=0.984457;SupportFractionByStrand=0.973536,0.995355;BaseCalledReadsWithVariant=834;BaseCalledFraction=0.834;AlleleCount=1;StrandSupport=13,2,482,494;StrandFisherTest=24;SOR=0.0992646;RefContext=GATTTCATCTAA;Pool=nCoV-2019_1 GT 1
MN908947.3 28253 . CA C 17052.5 PASS TotalReads=937;SupportFraction=0.985751;SupportFractionByStrand=0.980644,0.991002;BaseCalledReadsWithVariant=783;BaseCalledFraction=0.824211;AlleleCount=1;StrandSupport=9,4,466,458;StrandFisherTest=6;SOR=0.234992;RefContext=GATTTCATCTAA;Pool=nCoV-2019_2 GT 1
MN908947.3 28310 . C T 11299.3 PASS TotalReads=990;SupportFraction=0.948394;SupportFractionByStrand=0.957024,0.939764;BaseCalledReadsWithVariant=749;BaseCalledFraction=0.749;AlleleCount=1;StrandSupport=21,30,474,465;StrandFisherTest=5;SOR=0.444288;RefContext=ATGCACCCCGC;Pool=nCoV-2019_1 GT 1
MN908947.3 28310 . C T 11198.6 PASS TotalReads=925;SupportFraction=0.953539;SupportFractionByStrand=0.962119,0.944599;BaseCalledReadsWithVariant=696;BaseCalledFraction=0.741214;AlleleCount=1;StrandSupport=18,25,454,428;StrandFisherTest=7;SOR=0.500405;RefContext=ATGCACCCCGC;Pool=nCoV-2019_2 GT 1
OK, this probably isn't the issue I was thinking about, but more relates to the duplicate indel at 28250. I'll check this is the case and it should be easy to fix. In the meantime if you remove one of those duplicate lines at 28250 in the VCF you might find that bcftools consensus can be run manually to generate a consensus genome.
You are correct removing one of the spots at 28250 allowed me to generate a consensus genome! Thanks!
[W::hts_idx_load3] The index file is older than the data file: nml_barcode21.pass.vcf.gz.tbi
Note: the --sample option not given, applying all records regardless of the genotype
The site MN908947.3:27967 overlaps with another variant, skipping...
The site MN908947.3:28253 overlaps with another variant, skipping...
The site MN908947.3:28310 overlaps with another variant, skipping...
Applied 16 variants
@will-rowe I think we need an enhancement which is to systematically remove duplicate variants present in both pools before calling bcftools consensus
.
I've been messing around a bit more with bcftools consensus and even if I do not remove any of the duplicates, I can still create a consensus file with the normal command:
bcftools consensus -f barcode21.preconsensus.fasta barcode21.pass.vcf.gz -m barcode21.coverage_mask.txt -o barcode21.consensus.fasta
So something before must be causing the issue leading up to this one
Is there any chance it could be related to this issue? It got resolved in bcftools 1.11 from the release notes, but ARTIC is using a lower version...
Hi,
I've run into what seems to be a rare issue (first time I've seen it in 100's of samples) where there are duplicate variants in my
*.pass.vcf.gz
and*.merged.vcf
file which is leading to bcftools consensus to generate an error with an output of:I also get the output from vcf_merge of:
Issue is similar to #21 from the looks of it although I checked and I am on the latest version so it must be something else.
For reference I'm currently using the following:
Sorry I don't know too much more, I'll keep looking at it in the meantime. Thanks for any help you can offer and sorry if I missed a solution somewhere,
Darian