cov-lineages / pangolin

Software package for assigning SARS-CoV-2 genome sequences to global lineages.
GNU General Public License v3.0

faToVcf seems to omit insertions in vcf output (input to usher) #530

Closed smsaladi closed 1 year ago

smsaladi commented 1 year ago

It seems that faToVcf's vcf output is used as the input to usher:

https://github.com/cov-lineages/pangolin/blob/2f2756e5ddd23b10cd8a1724d71e0f92c7c5b78f/pangolin/scripts/usher.smk#L109-L110

We've been working with the vcf file and were not able to find insertions with respect to the reference in this file. It seems like faToVcf might omit them. A minimal example below.

Has this come up before? Or maybe we are misunderstanding something?

Command:

./faToVcf -includeNoAltN test.fa test.vcf

"kent source version 453".

input: test.fa

>ref
------NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
ATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTT------
GCCTGTTTTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTGG
AGGAGGTCTTATCAGAGGCACGTCAACATCTTAAAGATGGCACTTGTGGCTTA
GTAGAAGTTGAAAAAGGCGTTTTGCCTCAACTTGAACAGCCCTATGTGTTCAT
CAAACGTTCGGATGCTCGAACTGCACCTCATGGTCATGTTATGGTTGAGCTGG
>ref
------NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
ATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGGGGGG
GCCTGTTTTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTGG
AGGAGGTCTTATCAGAGGCACGTCAACATCTTAAAGATGGCACTTGTGGCTTA
GTAGAAGTTGAAAAAGGCGTTTTGCCTCAACTTGAACAGCCCTATGTGTTCAT
CAAACGTTCGGATGCTCGAACTGCACCTCATGGTCATGTTATGGTTGAGCTGG

output: test.vcf

##fileformat=VCFv4.2
##reference=test.fa:ref
##source=faToVcf test.fa out.vcf
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  ref

smsaladi commented 1 year ago

Actually, it looks like UShER doesn't support indels at this time, so it makes no difference to UShER that faToVcf drops insertions:

https://github.com/yatisht/usher/issues/186

AngieHinrichs commented 1 year ago

Right, faToVcf was written [in a hurry earlier in the pandemic] specifically for UShER, and since UShER ignores indels, I haven't bothered to implement them. Sorry about that. faToVcf should have a warning message about it!
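
For anyone who needs to know whether an alignment contains insertions before converting it, a check can be run outside faToVcf. The following is a minimal Python sketch, not part of faToVcf or pangolin. It assumes, as in the example above, that the first FASTA record is the reference and that an insertion appears as a run of `-` in the reference opposite bases in the sample. The helper names `read_fasta` and `find_insertions` are hypothetical.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def find_insertions(ref_aln, sample_aln):
    """Return (ref_pos, inserted_bases) for each run where the reference is gapped.

    ref_pos is the 1-based position of the last reference base before the
    insertion (0 if the insertion precedes the first reference base).
    Columns where both sequences have a gap are ignored.
    """
    if len(ref_aln) != len(sample_aln):
        raise ValueError("aligned sequences must have equal length")
    insertions = []
    ref_pos = 0  # number of reference bases consumed so far
    run = []     # sample bases collected opposite reference gaps
    for r, s in zip(ref_aln, sample_aln):
        if r == "-":
            if s != "-":
                run.append(s)
        else:
            if run:
                insertions.append((ref_pos, "".join(run)))
                run = []
            ref_pos += 1
    if run:
        insertions.append((ref_pos, "".join(run)))
    return insertions


if __name__ == "__main__":
    # Tiny made-up alignment, not real SARS-CoV-2 data.
    demo = ">ref\nAC--GT\n>sample\nACTTGT\n"
    (_, ref_seq), (_, sample_seq) = read_fasta(demo)
    for pos, bases in find_insertions(ref_seq, sample_seq):
        print(f"insertion after reference position {pos}: {bases}")
```

Any insertion reported this way would be absent from the VCF that faToVcf writes, so the script's output could serve as the kind of warning mentioned above.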

smsaladi commented 1 year ago

So grateful for you - thanks!