brentp / vcfanno

annotate a VCF with other VCFs/BEDs/tabixed files
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0973-5
MIT License

vcfanno doesn't annotate sites that are polymorphic in query vcf but fixed for reference allele in annotation vcf #152

Open AaronRuben opened 1 year ago

AaronRuben commented 1 year ago

Hi Brent,

I was trying to annotate 1KGP VCFs with genotype information of archaic hominins (e.g., Altai Neanderthal). These individuals have a lot of sites that are homozygous for the reference allele, for example:

20 60343 . G .

while this site is polymorphic in 1KGP:

20 60343 . G A

These sites match but are currently not annotated unless the --permissive-overlap flag is set, which isn't ideal. I know this is an edge case, and I can't simply merge the VCFs because the inclusion of archaic hominins would mess up downstream steps.

Would it be possible to handle such cases in the future?

Thanks, Aaron

brentp commented 1 year ago

Hi Aaron, the only way to do this is with --permissive-overlap, as you note. I think that's the correct behavior, as "G ." should not match "G A". If they are homozygous reference only, then the more correct representation would be "G G".

AaronRuben commented 1 year ago

Hi Brent,

Thanks for the quick response.

If it were "G G", it would still not match "G A". I also think "G ." makes more sense, as there is no alternative allele.

In either case, it would be great to allow matches of polymorphic and monomorphic sites (whether denoted by "G ." or "G G") when the reference alleles match.
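
To make the request concrete, here is a minimal Python sketch of the matching rule being asked for. It is not vcfanno's code: `alleles_match` and its `allow_monomorphic` parameter are hypothetical names made up for illustration, and the strict branch only approximates the default behavior described in this thread (same REF and a shared ALT).

```python
def alleles_match(query_ref, query_alts, anno_ref, anno_alts,
                  allow_monomorphic=False):
    """Decide whether an annotation record should annotate a query record.

    Hypothetical helper for illustration only; not part of vcfanno.
    """
    # The reference alleles must agree in every case.
    if query_ref != anno_ref:
        return False

    # Treat "." and an ALT equal to REF ("G G") as "no alternate allele".
    real_anno_alts = [a for a in anno_alts if a not in (".", anno_ref)]

    if not real_anno_alts:
        # Monomorphic annotation site: match only if explicitly allowed.
        return allow_monomorphic

    # Strict rule: at least one ALT allele must be shared.
    return any(a in query_alts for a in real_anno_alts)


# Site from this issue: 1KGP has "G A", the archaic VCF has "G .".
strict = alleles_match("G", ["A"], "G", ["."])
relaxed = alleles_match("G", ["A"], "G", ["."], allow_monomorphic=True)
print(strict, relaxed)
```

Unlike a purely positional overlap, this rule would still reject records whose reference alleles differ or whose ALT alleles are both present but disagree.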