brentp / vcfanno

annotate a VCF with other VCFs/BEDs/tabixed files
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0973-5
MIT License
364 stars 56 forks source link

Annotations from Genotype fields #15

Open dgaston opened 8 years ago

dgaston commented 8 years ago

Hey Brent,

I've been testing this out recently and I really love how fast it adds the annotations. I have a particular use case, and it may be something you don't want to support, but I thought I would float it because it could be more generally useful. I've moved my day to day work from Mendelian disease to cancer, and one thing I'm doing a lot of is using multiple variant callers to call variants in our tumour-only samples. I then merge calls and add annotations, but when evaluating variants I want to evaluate things like depth and other QC metrics on a caller-by-caller basis. Right now I get the required info from the original normalized caller-produced VCF using cyvcf2. Most callers stick everything I need in the INFO string fields, but a few only put DP, AD, etc in the genotype field. It's a really specific use case, because you need to generate vcfanno conf files for each sample (although this is easy to script in an automated workflow) but I thought I would suggest it. It seems like vcfanno, if it could parse out those fields, would do my entire annotation workflow much quicker and in fewer overall steps than my current workflow.

brentp commented 8 years ago

Is it possible that adding functionality to cyvcf2 would simplify your pipeline? If that has missing features, I'd be happy to add them there.

Likely it's outside the scope of what I want to support with vcfanno which already has enough complexity; but I'll leave this open in case a solution comes to mind. I had thought that extending the lua interface so the user had full access to the variant (including samples) would be pretty cool.

dgaston commented 8 years ago

I think cyvcf2 already meets my needs in this regard, although there would be performance tweaks that would be possible (like streaming through all files in lockstep variant by variant). I am currently porting some code I wrote for taking variants from a GEMINI database, using cyvcf2 to add back in data from the individual caller VCFs, and then tossing everything into a Cassandra database to read through a snpEff + vcfanno annotated VCF instead of a snpEff + GEMINI database. While I was doing this I randomly had the thought of using vcfanno to bring in the caller-specific data during the earlier annotation stages.

Right now I also pre-parse the caller-specific vcfs with cyvcf2 into dictionaries. Because I am working with results from small targeted panels it is efficient but it won't scale to larger VCFs. I was running into some issues just using the original cyvcf to do tabix-based retrievals of the positions.

dgaston commented 8 years ago

Definitely extending the lua interface in that way would open up interesting possibilities. And let people add complexity on the user end versus within the core tool. And I get the point of not wanting to add extra complexity. I appreciate what you've been doing parsing out functionality from GEMINI into stand-alone tools and modules like this.

brentp commented 8 years ago

I think one way to do this would be to have

[[genotype]]

sections that are analogous to the `[[annotation]] sections that will write to the INFO (not to the samples)

e.g.

`[[genotype]]`
fields=['GT', 'GQ']
name=['confident_homref']
op=`lua:count_homref_with_high_gq(GT, GQ, 10)`
type='Integer'

this would incur a performance penalty from parsing genotypes, but would be a nice addition, I think.

then, it'll be useful to access genotype and info fields in the same block... so maybe requesting a genotype field would look like.

fields=['AMR_AF', 'genotype_DP', 'genotype_GQ']

where AMRAF is from the INFO and the `genotype-prefixed are from teh samples. The genotype fields would always return an array. With this setup, genotype and INFO fields could be intermixed in the original[[annotation]]` section...

dgaston commented 8 years ago

Yeah, that actually sounds like a good implementation. For my use case with this I'm always pulling from single sample VCFs anyway, and even with the performance hit it would be much faster for my overall workflow I think. Right now I am doing a merge on a single sample but multiple callers and annotating the merged VCF with GATK, snpEff, and vcfanno. Then while loading into a Cassandra database for storage I am re-looking up the caller-specific info from the individual caller VCFs. I can use VCFanno to get all of the things from the INFO field from those VCFs during the annotation stage but still need to do the lookups for DP, AD, etc for the callers that don't also put this an INFO field. What I love about VCFanno is the speed it adds annotations as well as how easy it is to add any VCF or BED files I may find or generate. I've found adding in a BED file for my amplicon targets with VCFanno is actually much faster than later separately using bedtools or pybedtools to filter on-target versus off-target calls for instance. Just makes a really nice workflow.

vladsavelyev commented 6 years ago

+1 to this request - it would be really great to have sample fields available for processing.

ohofmann commented 6 years ago

Adding a +1 to this, see our discussion with the PCGR developer at https://github.com/sigven/pcgr/issues/12. vcfanno seems ideally suited to sidestep the issue of having to develop a custom framework to clean up / postprocess different caller VCFs, particularly since we could adjust differences in, say, AF calculation on the fly.

multimeric commented 6 years ago

+1 to this

brentp commented 6 years ago

this is on my radar. but, I'm not sure this is something for vcfanno which is already quite complex.

I'll let the idea of another tool marinate any comments on functionality would be appreciated and would drive the features.