brentp / vcfanno

annotate a VCF with other VCFs/BEDs/tabixed files
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0973-5
MIT License

Proposal for `-slightly-permissive-overlap` matching pos + REF #92

Closed chapmanb closed 6 years ago

chapmanb commented 6 years ago

Brent; @cariaso and I have been working on annotating all position-based reference calls with dbSNP rsIDs using vcfanno. We're starting with gVCFs from GATK4 called using --emit-ref-confidence BP_RESOLUTION, which gives output at reference 0/0 positions that looks like:

1 205028226 . T <NON_REF> 

When running vcfanno with a dbSNP VCF, none of the reference calls get annotated with rsIDs because the dbSNP ALTs don't match <NON_REF>. We'd like to be able to associate with SNP positions even when we don't have calls there.

To do this, we swapped to using -permissive-overlap, which mostly works but also confusingly annotates at deletions like:

1 205028226 rs398122387 TG T 

since the positions overlap but the non-padded source of the deletion does not.
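To make the padding issue concrete, here is a minimal sketch (not vcfanno code; the helper names `padded_span`, `changed_span`, and `overlaps` are made up for illustration). It assumes 1-based inclusive VCF coordinates and compares the interval covered by the REF allele as written with the interval of bases actually deleted once the shared leading base is dropped:

```python
def padded_span(pos, ref):
    """1-based inclusive interval covered by REF as written in the VCF."""
    return pos, pos + len(ref) - 1


def changed_span(pos, ref, alt):
    """1-based inclusive interval of reference bases actually altered,
    after dropping leading bases shared by REF and ALT.
    For a pure insertion no reference base changes, so start > end."""
    shared = 0
    while shared < min(len(ref), len(alt)) and ref[shared] == alt[shared]:
        shared += 1
    return pos + shared, pos + len(ref) - 1


def overlaps(span, pos):
    """True if the single position falls inside the inclusive span."""
    start, end = span
    return start <= pos <= end


query_pos = 205028226                   # the reference call from the gVCF
deletion = (205028226, "TG", "T")       # rs398122387 as quoted above

print(overlaps(padded_span(deletion[0], deletion[1]), query_pos))  # True
print(overlaps(changed_span(*deletion), query_pos))                # False
```

The padded record overlaps the query position, which is why a pure overlap test annotates it, while the deleted G sits one base downstream at 205028227.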

Have you run into this issue and have any advice/suggestions for how best to use vcfanno? We had thought of an approach to allow a new matching criterion with position + REF only, which Mike named -slightly-permissive-overlap. What do you think about that approach? Any other ideas for how to better accomplish what we're trying to do? Thanks much.
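A rough sketch of what a position + REF comparison could look like, purely as an illustration of the proposal and not an existing vcfanno option. The `Record` type and `same_pos_and_ref` function are hypothetical, and the SNP record with ID `rs_hypothetical` is invented for the example:

```python
from typing import NamedTuple


class Record(NamedTuple):
    chrom: str
    pos: int
    id: str
    ref: str
    alt: str


def same_pos_and_ref(query, db):
    """Match on chromosome, position and REF only; ALT is ignored."""
    return (query.chrom, query.pos, query.ref) == (db.chrom, db.pos, db.ref)


ref_call = Record("1", 205028226, ".", "T", "<NON_REF>")
deletion = Record("1", 205028226, "rs398122387", "TG", "T")
snp = Record("1", 205028226, "rs_hypothetical", "T", "C")  # invented record

print(same_pos_and_ref(ref_call, deletion))  # False: REF "T" != "TG"
print(same_pos_and_ref(ref_call, snp))       # True: same position and REF
```

Under this rule the padded deletion is excluded because its REF is TG, while a SNP at the same position would still contribute its rsID.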

brentp commented 6 years ago

Is it always the literal <NON_REF>? If so, perhaps that could be a special case where any alternate allele will match (including, for example, a variant that was T -> TC for your example above).
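One way to picture the wildcard idea is the sketch below. It is an assumption-laden illustration, not the change that went into vcfanno: the function name `alleles_match` is made up, positions are assumed to have been matched already, and requiring REF to agree in the wildcard case is my reading of the suggestion rather than something stated here.

```python
NON_REF = "<NON_REF>"


def alleles_match(query_ref, query_alts, db_ref, db_alts):
    """Compare alleles for two records already known to share a position.

    Assumption: REF must agree in every case. If the query carries the
    literal <NON_REF> ALT, any database ALT is accepted; otherwise at
    least one ALT must be shared.
    """
    if query_ref != db_ref:
        return False
    if NON_REF in query_alts:
        return True
    return any(alt in db_alts for alt in query_alts)


# Reference call vs. an insertion T -> TC at the same position
print(alleles_match("T", [NON_REF], "T", ["TC"]))   # True
# Reference call vs. the padded deletion TG -> T
print(alleles_match("T", [NON_REF], "TG", ["T"]))   # False
# An ordinary SNP call still needs a shared ALT
print(alleles_match("T", ["C"], "T", ["TC"]))       # False
```

Ordinary variant calls keep strict ALT matching here; only records with <NON_REF> relax it.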

-permissive-overlap is doing exactly what it should in the case you describe. I wasn't sure if your "confusingly" was referring to the behavior of vcfanno or the appearance of the result to the user.

chapmanb commented 6 years ago

Brent; Having <NON_REF> be an ALT wildcard would work great. Matching to same position insertions works okay, it's mainly the same position deletions that are off due to the padding bases used in the VCF representation.

Sorry for not writing clearly above. vcfanno -permissive-overlap is doing the right thing, it's just that the outcomes are confusing/misleading when not considering the REF allele.

Thanks again for considering this.

brentp commented 6 years ago

I have a simple change that makes this the default. I'm doing some testing to make sure it doesn't break anything and will make a new release when I'm sure it doesn't.

brentp commented 6 years ago

This seems to be working. There's an impending release of Go that gives about a 3-4% performance improvement over the current version. I'll wait for that to release the next vcfanno version.

chapmanb commented 6 years ago

Brent; Awesome, thanks so much. I'll roll a new bioconda package and test as soon as the more flexible and faster vcfanno gets released. Thanks again.

brentp commented 6 years ago

This is fixed in the latest release, v0.3.0.

chapmanb commented 6 years ago

Brilliant -- thanks so much Brent. I've updated the bioconda recipe so this is now available there. Much appreciated.