brentp / geneimpacts

prioritize effects of variant annotations from VEP, SnpEff, et al.
MIT License
32 stars 15 forks source link

Fix SnpEff exonic property #19

Open wm75 opened 4 years ago

wm75 commented 4 years ago

A Gemini user on usegalaxy.eu reported yesterday that gemini's is_exonic column is wrongly populated for SnpEff-annotated composite effects. I think this simple typo here is the reason.

wm75 commented 4 years ago

In addition, I think the consequence property of OldSnpEff may be wrong, too. The old EFF annotations are using +, not &, right?

wm75 commented 4 years ago

Ah, OldSnpEff does not even have a self.effects['Transcript_BioType'] (hence the now failing test). Also, SnpEff.coding seems to suffer from the same consequence vs consequences mix-up.

I could fix those, but maybe tell me if I'm completely wrong with my assumptions first.

brentp commented 4 years ago

yes. looks like you're right. not sure how this has escaped noticed for so long. that'd be great if you updated to include OldSnpEff.

wm75 commented 4 years ago

Ok, so this was a bit more work than I intended originally, but I cleaned up the class hierarchy of Effect and its subclasses quite a bit and fixed a number of bugs, mostly around the detection of exonic and coding effects of variants:

wm75 commented 4 years ago

The thing is though that after all this cleanup work it becomes apparent that most of it won't help Gemini because from https://github.com/arq5x/gemini/blob/5db2e52ae4aee413a8780df538565c90a38e4b11/gemini/gemini_load_chunk.py#L498 onwards Gemini only uses the top impact effect and its top consequence to populate is_exonic, is_coding, is_lof and is_splicing. At least for is_exonic and is_coding this is clearly against what the documentation of the Gemini database scheme says. I'm just not sure which behavior, the documented or the implemented one, is the desired one.

If you want I can raise this as a separate issue in the Gemini repo.

brentp commented 4 years ago

hi, thanks for looking into this.

re the gemini note. I think it's doing the right thing. the top_impact is stored in the variants table. and it will have the correct is_exonic, etc for that chosen impact. all other impacts are stored in the variant_impacts table with is_exonic et al set according to each impact.

wm75 commented 4 years ago

yeah, I guess you can argue like that, which would mean that the docs need an update.

wm75 commented 4 years ago

The counter argument would be one of practicality: what's documented - is_exonic = 1 if any effect on any transcript is an exonic one - is just very hard to derive from the variant_impacts table. Haven't actually verified that it's possible at all.