artic-network / fieldbioinformatics

The ARTIC field bioinformatics pipeline
MIT License
110 stars 69 forks source link

Overlapping variants in fail.vcf and pass.vcf cause preconsensus to be masked and thus bcftools complains when applying the "pass" variant #125

Open Sam-Sims opened 1 year ago

Sam-Sims commented 1 year ago

Hello,

I think there are edge cases that result in overlapping variants in both the pass and fail vcf files causing bcftools to complain and artic minion to fail:

My fail.vcf contains a variant at 21640:

MN908947.3 21640 . TG GA 65 PASS DP=43;AC=0,2;AM=41;MC=0;MF=0.0;MB=0.0;AQ=3.65;GM=1;PH=6.02,6.02,6.02,6.02;SC=None; GT:GQ:DP:PS:UG:UQ 0/1:32:43:.:0/1:31.97

Meanwhile the pass.vcf contains a 9nt deletion at 21632, overlapping 21640:

MN908947.3 21632 . TTACCCCCTG T 500 PASS DP=43;AC=0,43;AM=0;MC=0;MF=0.0;MB=0.0;AQ=28.59;GM=1;PH=6.02,6.02,6.02,6.02;SC=None; GT:GQ:DP:PS:UG:UQ 1/1:137:43:.:1/1:136.04

Due to the variant being present in the fail.vcf this is maksed in the preconsensus when artic_mask is run and so TG becomes NN.

When bcftools is run it expects to find TTACCCCCTG since that is the variant in the pass.vcf file - but instead is finding the masked TTACCCCCNN in the preconsensus and so complains:

The fasta sequence does not match the REF allele at MN908947.3:21632:    
   REF .vcf: [TTACCCCCTG]  
   ALT .vcf: [T]
   REF .fa : [TTACCCCCNN]

As a temp workaround I have masked the variant in the pass.vcf and re-run bcftools manually to generate a consensus sequence. Do you have any reccomendations for these edge cases?

Sam-Sims commented 1 year ago

Looking into this a bit further it appears that in this sample amplicon 72 has dropped - which is in pool 2.

Looking at the *.2.vcf file generated from medaka variant ./primer-schemes/scov2/V4.1/scov2.reference.fasta samplename.2.hdf samplename.2.vcf it contains 2 variants from the dropped amplicon 72 that overlap with the deletion found in *.1.vcf:

MN908947.3  21634   .   ACCCC   GTGG    9.798   PASS    .   GT:GQ   1:10
MN908947.3  21640   .   TG  GA  6.765   PASS    .   GT:GQ   1:7

These variants are not found in the *.1.vcf so I assume are just artifacts coming from the dropped amplicon.

Interestingly when running with the --no-longshot flag and instead use medaka tools annotate you can see the depth of each variant is 1 in the *.2.vcf which is to be expected since they came from the dropped amplicon. They are then filtered out correctly and minion completes with no errors.

MN908947.3  21634   .   ACCCC   GTGG    9.798   PASS    DP=1.0;DPS=1.0,0.0  GT:GQ   1:10
MN908947.3  21640   .   TG  GA  6.765   PASS    DP=1.0;DPS=1.0,0.0  GT:GQ   1:7

However if longshot is used it seems like the variant at 21640 is left in and so the merged.vcf contains the following entry with a depth of 43 .

MN908947.3  21640   .   TG  GA  65  PASS    DP=43;AC=0,2;AM=41;MC=0;MF=0.000;MB=0.000;AQ=3.65;GM=1;PH=6.02,6.02,6.02,6.02;SC=None;  GT:GQ:DP:PS:UG:UQ   0/1:32:43:.:0/1:31.97

This is written to the fail.vcf still - I am guessing because of the quality score (?) and then causes the issue detailed above.

Perhaps this is a bug with longshot when calculating the depth of the variant as I am unsure where the depth of 43 has come from - nothing in the BAM indicates why this would be.

Running minion with --strict works too, but the deletion is omitted from the consensus.

Admittedly this does seem to be an edge case resulting from a dropped amplicon - which isnt ideal in the first place.

knewl commented 4 months ago

Hello,

I am getting the above error in ARTIC v1.2.4 in bcftools consensus due to overlapping 'pass' and 'fail' variants.

 Running: bcftools consensus -f sample_name.preconsensus.fasta sample_name.pass.vcf.gz -m sample_name.coverage_mask.txt -o sample_name.consensus.fasta

  The fasta sequence does not match the REF allele at MN908947.3:21632:
     REF .vcf: [TTACCCCCTG]
     ALT .vcf: [T]
     REF .fa : [NNNNNNNNNN]CATAC

This is caused by an 11bp deletion in fail.vcf overlapping with a 10bp deletion in pass.vcf. The 11bp deletion is added to the mask and thus the preconsensus contains Ns which overlap with the variant which passes.

fail.vcf:

MN908947.3      21631   .       ATTACCCCCTG     A       309.485 PASS    DP=13.0;DPS=5.0,8.0;Pool=2      GT:GQ   1:309

pass.vcf:

MN908947.3      21632   .       TTACCCCCTG      T       289.132 PASS    DP=45.0;DPS=13.0,32.0;Pool=1    GT:GQ   1:289

This is a very similar issue to #21 which has been fixed . The difference is that in this case the variants do not have exactly the same starting position, they simply overlap. A solution would be to ignore failed variants which overlap with a variant which passes.