brentp / vcfanno

annotate a VCF with other VCFs/BEDs/tabixed files
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0973-5
MIT License

GTF annotation fails: index out of range error #59

Open mpschr opened 7 years ago

mpschr commented 7 years ago

Hi

I am trying to use a GTF file to annotate a VCF. I could not find an example of this use case, so I just tried my best guess:

[[annotation]]
file="/home/mpschr/bin/bcbionextgen/data/genomes/Hsapiens/GRCh37/rnaseq/ref-transcripts.gtf.gz"
fields = ["gene_name"]
ops = ["self"]
names = ["gene_name"]

This generates an index out of range error:

mpschr/Documents/projects/rnaseq-savar/test-data/PE03_ID.pax5-exons.standalone-variants.vcf 

=============================================
vcfanno version 0.2.2 [built with go1.8]

see: https://github.com/brentp/vcfanno
=============================================
vcfanno.go:114: found 16 sources from 4 files
panic: runtime error: index out of range

goroutine 108 [running]:
github.com/brentp/vcfanno/api.collect(0x7f48719e7058, 0xc420629440, 0xc42044f840, 0x4, 0x4, 0xc420091880, 0x1, 0x0, 0x0, 0x0, ...)
    /home/brentp/go/src/github.com/brentp/vcfanno/api/api.go:302 +0x1418
github.com/brentp/vcfanno/api.(*Annotator).AnnotateOne(0xc42001d4c0, 0x92d4e0, 0xc420629440, 0x795401, 0x0, 0x0, 0x0, 0xc4205bacd0, 0xc42047c680)
    /home/brentp/go/src/github.com/brentp/vcfanno/api/api.go:392 +0x1ed
github.com/brentp/vcfanno/api.(*Annotator).AnnotateEnds(0xc42001d4c0, 0x92d4e0, 0xc420629440, 0x0, 0x0, 0x10000, 0xc421306f20)
    /home/brentp/go/src/github.com/brentp/vcfanno/api/api.go:718 +0xdda
main.main.func1(0x92d4e0, 0xc420629440)
    /home/brentp/go/src/github.com/brentp/vcfanno/vcfanno.go:154 +0x71
github.com/brentp/irelate.PIRelate.func1.1(0xc4206293e0, 0xc420d2b980, 0x190, 0x190, 0xc423256300)
    /home/brentp/go/src/github.com/brentp/irelate/parallel.go:202 +0x5f
created by github.com/brentp/irelate.PIRelate.func1
    /home/brentp/go/src/github.com/brentp/irelate/parallel.go:207 +0x89

How would I use a gtf-file correctly?

brentp commented 7 years ago

For anything other than VCF, you'll have to use e.g.: columns=[8]

mpschr commented 7 years ago

OK, so the value is not selected according to the name field? In a GTF in particular, different lines may hold different values in a given column, depending on which element the line represents.

brentp commented 7 years ago

Yeah, I've thought about that, but I haven't had many people (AFAICT) using/wanting GTF, so I haven't done it. As it is now, you could get the full space-delimited field as an annotation and then grab part of it in a [[postannotation]] block.

I'll think about how to improve this. If you could explain your use-case more, it might motivate the dev.
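
For reference, "grab part of it" amounts to pulling one key out of the GTF attribute column. Here is a minimal sketch of that logic in Go. It is not vcfanno code: in vcfanno the post-processing itself would be written in the [[postannotation]] block. The function name gtfAttribute and the sample attribute string are made up for illustration, and the sketch assumes attributes of the form key "value"; with no semicolons inside the quoted values.

```go
package main

import (
	"fmt"
	"strings"
)

// gtfAttribute is a hypothetical helper (not part of vcfanno). It returns the
// value stored under key in a GTF attribute column such as
//   gene_id "G1"; gene_name "PAX5";
// The boolean is false when the key is absent.
// Assumption: values never contain a ';' inside the quotes.
func gtfAttribute(attrs, key string) (string, bool) {
	for _, field := range strings.Split(attrs, ";") {
		// Each field looks like `key "value"`; split on the first space only.
		parts := strings.SplitN(strings.TrimSpace(field), " ", 2)
		if len(parts) != 2 || parts[0] != key {
			continue
		}
		return strings.Trim(strings.TrimSpace(parts[1]), `"`), true
	}
	return "", false
}

func main() {
	// Illustrative attribute string; the IDs are placeholders.
	attrs := `gene_id "G1"; transcript_id "T1"; gene_name "PAX5";`
	name, ok := gtfAttribute(attrs, "gene_name")
	fmt.Println(name, ok) // prints: PAX5 true
}
```

A key that does not occur on the line yields an empty string and false, which is how lines for other element types (which carry different attributes) would show up.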

mpschr commented 7 years ago

Well, in my case I just thought of annotating the variants in the .vcf with the EXON_ID(s) and CCDS_ID(s) from Ensembl. So only values from the exon-element lines should be taken into consideration.
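
To make the use-case concrete, here is a rough sketch in Go of the behaviour I have in mind: only lines whose feature column is "exon" are considered, and the requested attribute is collected from those that overlap the variant position. This is a plain linear scan over in-memory lines, not how vcfanno queries tabix-indexed files. The function names (attr, exonAttr) and the sample lines are invented for illustration, and I am assuming the attribute keys are spelled exon_id and ccds_id in the file.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// attr extracts one value from a GTF attribute column (`key "value"; ...`).
// It returns "" when the key is absent. Hypothetical helper.
func attr(attrs, key string) string {
	for _, field := range strings.Split(attrs, ";") {
		parts := strings.SplitN(strings.TrimSpace(field), " ", 2)
		if len(parts) == 2 && parts[0] == key {
			return strings.Trim(strings.TrimSpace(parts[1]), `"`)
		}
	}
	return ""
}

// exonAttr returns the value of key for every "exon" line that overlaps the
// 1-based position pos on chrom. Lines for other features (gene, transcript,
// CDS, ...) are skipped, as are comments and malformed lines.
func exonAttr(gtfLines []string, chrom string, pos int, key string) []string {
	var values []string
	for _, line := range gtfLines {
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		cols := strings.Split(line, "\t")
		// GTF has 9 tab-separated columns; feature is the 3rd, attributes the 9th.
		if len(cols) < 9 || cols[0] != chrom || cols[2] != "exon" {
			continue
		}
		start, errStart := strconv.Atoi(cols[3])
		end, errEnd := strconv.Atoi(cols[4])
		// GTF intervals are 1-based and inclusive on both ends.
		if errStart != nil || errEnd != nil || pos < start || pos > end {
			continue
		}
		if v := attr(cols[8], key); v != "" {
			values = append(values, v)
		}
	}
	return values
}

func main() {
	// Made-up GTF lines; IDs are placeholders.
	lines := []string{
		"9\tsrc\tgene\t100\t900\t.\t-\t.\tgene_id \"G1\"; gene_name \"PAX5\";",
		"9\tsrc\texon\t100\t200\t.\t-\t.\tgene_id \"G1\"; exon_id \"E1\";",
		"9\tsrc\texon\t500\t600\t.\t-\t.\tgene_id \"G1\"; exon_id \"E2\";",
	}
	fmt.Println(exonAttr(lines, "9", 150, "exon_id")) // prints: [E1]
}
```

The gene line also overlaps position 150, but it is ignored because its feature type is not "exon". That filtering by element type is the part that selecting a fixed column cannot express.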

mpschr commented 7 years ago

I was looking through the code, trying to figure out which module is missing for true GTF support. Since I am not familiar with the language I got a bit confused, so let me ask: would it be enough to add an interface in the irelate repository that implements functions like setSource and BamToRelatable, or is there more to it? What I could not find out is which part of irelate is being used for the GTF right now.

Cheers

brentp commented 7 years ago

Here is where generic intervals are parsed using chrom,start,end fields gleaned from the tabix index. https://github.com/brentp/bix/blob/master/bix.go#L172

This would be a moderately involved change, but you are welcome to give it a go and ask questions. Or you could wait and I'll try to dig in by next week.

mpschr commented 7 years ago

Hi

Honestly, I feel a bit lost with the Go language and I am not sure I'd even get it working. I'll wait and see if you find a solution. I reckon parsing the GTF format is rather easy (compared with VCF): http://mblab.wustl.edu/GTF22.html

GTF support would be particularly helpful, as many tools output their results as GTF; off the top of my head, e.g. StringTie. With a StringTie GTF I would be able to easily annotate a mutation with the expression (or expressed transcripts).

Cheers