brentp / vcfanno

annotate a VCF with other VCFs/BEDs/tabixed files
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0973-5
MIT License

Trouble using 'first' op on dbNSFP txt file #129

Open IvantheDugtrio opened 3 years ago

IvantheDugtrio commented 3 years ago

I am having trouble removing multi-allelic annotations from the dbNSFP reference text file, as it seems to always parse these fields as 'self'. I think the problem has to do with limitations in parsing a text file versus a VCF. Is there a workaround?

brentp commented 3 years ago

by multi-allelic, you mean it has one line per alternate allele? i think self is the only option that will work correctly for that. what is your config?

IvantheDugtrio commented 3 years ago

By multi-allelic, I mean one column, say HGVSc_snpEff, has multiple annotations separated by commas, each corresponding to a different transcript ID, while the rest of the file is tab-delimited.

For example, one annotation line can have an HGVSc_snpEff field that looks like: c.44C>G,c.44C>G,c.57C>G,c.57C>G

My config is as follows:

file=dbNSFP.txt.gz
columns=[13,21,22]
ops=["first","first","first"]
names=["dbNSFP_genename","dbNSFP_HGVSc_snpEff","dbNSFP_HGVSp_snpEff"]

brentp commented 3 years ago

hmm. I'm not sure this can work. what does ALT look like for that dbNSFP line?

IvantheDugtrio commented 3 years ago

Oh, I misstated: the multiple annotations in the same column are separated by semicolons, not commas. An actual line from the file is in the attached T315I.txt.

The REF and the ALT are still just single entries: in this case, a C for the REF and a T for the ALT.
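
To make the column layout concrete, here is a minimal Python sketch of one possible preprocessing workaround: trimming the semicolon-separated columns to their first value before the file is handed to vcfanno. This is not something vcfanno does itself. The function names `keep_first_value` and `trim_lines` are hypothetical, and the sketch assumes that the column numbers are 1-based, that header lines start with `#`, and that keeping the first transcript's annotation is acceptable.

```python
def keep_first_value(line, columns, sep=";"):
    """Cut each listed column of a tab-delimited line to its first value.

    `columns` holds 1-based column numbers (an assumption of this sketch).
    Columns beyond the end of the line are ignored.
    """
    fields = line.rstrip("\n").split("\t")
    for col in columns:
        idx = col - 1
        if 0 <= idx < len(fields):
            # "a;b;c" -> "a"; a field without `sep` is left unchanged.
            fields[idx] = fields[idx].split(sep, 1)[0]
    return "\t".join(fields)


def trim_lines(lines, columns, sep=";"):
    """Yield trimmed lines; header lines starting with '#' pass through as-is."""
    for line in lines:
        if line.startswith("#"):
            yield line
        else:
            yield keep_first_value(line, columns, sep) + "\n"
```

For the config above, `columns` would be `[13, 21, 22]`. The trimmed output would still need to be compressed with bgzip and indexed with tabix again before vcfanno could read it.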

liserjrqlxue commented 3 years ago

dbNSFP has multiple records for the same variant, each with a different amino acid change. This may also cause unexpected results.
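
A quick way to check whether this applies to a given region is to count records per variant. The sketch below is only an illustration: `duplicated_variants` is a hypothetical helper, and it assumes that the first four columns are chromosome, position, REF, and ALT, and that header lines start with `#`.

```python
from collections import Counter


def duplicated_variants(lines, key_columns=(1, 2, 3, 4)):
    """Return {variant key: record count} for variants seen more than once.

    `key_columns` are 1-based column numbers identifying a variant
    (assumed here to be chromosome, position, REF, ALT).
    """
    counts = Counter()
    for line in lines:
        if line.startswith("#"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        key = tuple(fields[c - 1] for c in key_columns)
        counts[key] += 1
    return {key: n for key, n in counts.items() if n > 1}
```

Because the counter keeps every variant key in memory, this is better suited to a subset of the file (for example, lines pulled for one region) than to the whole of dbNSFP.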