Closed hattrill closed 1 year ago
Some issues:
Just checking - we're recording issues with changing to direct import of InterPro2GO annotations from P2GO here? (rather than recording issues with the current pipeline?)
Yes (this is mainly so I can document the reasoning behind the specs). But I am going to note some changes that I expect to see/issues that get fixed here (or at least a linked doc), as it's going to make life hard sorting out what gets fixed by new pipeline. But the file from GO is on their tracker and if I need to point HarvDev to a new InterPro2go file, I'll do that on JIRA and the specs for the devs will go on JIRA.....I've renamed the ticket to help clarify
For completeness, link to issue about stale InterPro2go file on GO site
Current IEA pipeline: gene2go gives gene -> InterPro domain; interpro2go gives domain -> GO term. For each term, check whether it is already associated with the gene via IEA, then add or remove the IEA depending on the InterPro2go file, unless the gene has a manual annotation to the term or a child term, in which case do not add or remove the IEA.
Rfam2go: not implemented.
IBA: take all 'positive' annotations from the P2GO import; exclude any with a negation (NOT) qualifier.
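A minimal Python sketch of the IEA rule as described above, for a single gene. This is only my reading of the description, not the actual FB pipeline code: the function name, the argument names and the `descendants` lookup (term -> all child terms) are assumptions made for illustration.

```python
from typing import Dict, Set


def update_iea_terms(
    gene_domains: Set[str],
    interpro2go: Dict[str, Set[str]],
    current_iea: Set[str],
    manual_terms: Set[str],
    descendants: Dict[str, Set[str]],
) -> Set[str]:
    """Return the IEA term set for one gene after an update run (hypothetical helper)."""
    # Terms predicted for this gene via its InterPro domains.
    predicted: Set[str] = set()
    for ipr in gene_domains:
        predicted |= interpro2go.get(ipr, set())

    def manually_covered(term: str) -> bool:
        # True if the gene has a manual annotation to the term or to a child term.
        return bool(({term} | descendants.get(term, set())) & manual_terms)

    result: Set[str] = set()
    for term in predicted | current_iea:
        if manually_covered(term):
            # Hands off: neither add nor remove the IEA.
            if term in current_iea:
                result.add(term)
        elif term in predicted:
            # Add a new IEA, or keep one still supported by InterPro2go.
            result.add(term)
        # Otherwise the IEA is stale (no longer in InterPro2go) and is dropped.
    return result
```

The point of writing it out is that the 'manual annotation' check sits in front of both the add and the remove branch, which is why a stale IEA can survive next to a manual annotation.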
For the record, I'm going to itemize the issues I have with the current InterPro2GO pipeline (specifically), which I hope/think can get fixed by switching to direct import from protein2GO:
1. InterPro2GO annotations aren't currently shown in FB if there's a 'better' annotation to the same term. But it's very useful to see these 'redundant' annotations because they lend weight to any experimental/IBA annotation. And from curator point of view, you can't tell from looking at FB if there's a 'hidden' InterPro2GO annotation that needs fixing in addition to an EXP/IBA one!
2. Following from point 1, InterPro2GO annotations in FB (or Amigo/Alliance etc) don't match those at UniProt/QuickGO (or P2GO, from curator point of view), which is confusing.
3. The internal FB InterPro2GO pipeline seems quite complicated and has several dependencies, making it rather fragile when something changes.
4. The current FB InterPro2GO pipeline adds annotations to non-Dmel genes as well as Dmel genes. This creates two problems: i) There's a very biased set of GO annotations present for non-Dmel genes (~zero experimental annotations, and no IBA annotations) that aren't really comparable to the Dmel set - researchers interested in functional annotation of non-Dmel genomes would be better going to UniProt/QuickGO, where they also get additional computational annotations (UniRule, EC2GO etc). ii) Users wanting to search FB for Dmel GO annotations (99% of users?) get confused/distracted by GO searches (QS/Vocabularies) that return hits from (InterPro2GO-annotated) non-Dmel species.
I think that if we want to use manual NOTs as a basis for excluding IEAs, we should also do this for IBA. Want: add all 'positive' IEAs and IBAs (never NOTs) regardless of manual annotation, except where there is a manual negative (NOT) annotation. I propose that this works as follows:
Adding handy fig to illustrate. Do you agree @sjm41 - with the logic not the handiness of the figure ;-)
Adding list of things to decide on:
@hattrill I agree with your logic (and I also like the figure!). But I'm wondering if this really needs to be done at the FB end? Do all other databases also have to write custom code to deal with NOTs, or do they just live with a mixture of positive/negative annotations. Seems that the logic you describe could/should function at the P2GO (or Noctua) level, rather than each DB having to do it - or maybe that's what you're thinking?? (And for IBA stuff, I thought the NOTs are meant to be handled correctly now within PAINT??)
WRT 'not applicable' genes, I think it would be absolutely fine for these to lack any GO annotation. In fact, this might be helpful as a user won't get them returned in GO searches (which is probably the desired outcome, as they can't do any experiments/analysis on 'not applicable' genes.)
@hattrill I agree with your logic (and I also like the figure!). But I'm wondering if this really needs to be done at the FB end? Do all other databases also have to write custom code to deal with NOTs, or do they just live with a mixture of positive/negative annotations.
This has never been agreed on universally, despite many, many discussions. Ideally, this would work across the board, but that is not a thing.
Seems that the logic you describe could/should function at the P2GO (or Noctua) level, rather than each DB having to do it - or maybe that's what you're thinking??
Noctua: can't handle this complexity. P2GO: not sure if there is the bandwidth at the moment, or whether UniProt would want this. For our users, it makes sense - particularly with searches - but there is the argument that there are not enough NOTs to warrant special coding. And, realistically, if we want it, we will have to do it. I will give it more thought.....might just be my unwillingness to let go of what we have.
(And for IBA stuff, I thought the NOTs are meant to be handled correctly now within PAINT??)
Yes - I was thinking that too - would be nice not to have special rules here....perhaps I need a better system for making sure that these are spotted and fixed, as the problem seems to reappear.
WRT 'not applicable' genes, I think it would be absolutely fine for these to lack any GO annotation. In fact, this might be helpful as a user won't get them returned in GO searches (which is probably the desired outcome, as they can't do any experiments/analysis on 'not applicable' genes.)
I am glad you think so - I was giving myself a headache with all the permutations! Solved - ignore them!
Note: pipeline for InterPro2GO could go ahead without complex filtering if we can get an agreement on how to deal with IEAs across GOC - noting that there is a difference between groups' display choices and GAF output.
Note: A survey across GOC about InterPro2GO could be useful.
[x] Finish analysis of unfiltered InterPro2GO
Looking at the P2GO output (although Alex has not yet confirmed), entries with a manual NOT seem to block the InterPro2GO 'positive' annotation - so that's good news (however, this doesn't seem to happen with other IEA pipelines at GOA). The bad news: any other FBgn:UniProtKB pair without the NOT will still get the positive IEA.
When there is no UniProtKB-FBgn mapping in the file, we do not update the entry, as this could have arisen from a mapping error rather than from all annotations being removed. A manual check that removals are OK is manageable with the set we have, but adding IEAs to the mix is going to make things harder to police, I think. So we need to spec what should happen when an FBgn with GO data has no data in the GPAD from P2GO - there is no way to tell whether this is because of the way the gpi is generated (using IDs from the GPAD) or because there is no mapping.
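To make the 'do not update, but flag' idea concrete, here is a small Python sketch that splits FBgns with existing GO data into those present in the incoming GPAD and those absent from it. The function name and the two inputs are hypothetical; how the FBgn sets are actually pulled from chado and the GPAD is not shown and would need to be specced.

```python
from typing import Iterable, List, Set, Tuple


def split_by_gpad_presence(
    fbgns_with_go: Iterable[str], gpad_fbgns: Set[str]
) -> Tuple[List[str], List[str]]:
    """Split genes into (update, flag) lists (hypothetical helper).

    fbgns_with_go: FBgn IDs that currently carry GO data in FB.
    gpad_fbgns:    FBgn IDs that have at least one line in the P2GO GPAD.
    """
    to_update: List[str] = []
    to_flag: List[str] = []
    for fbgn in sorted(set(fbgns_with_go)):
        if fbgn in gpad_fbgns:
            to_update.append(fbgn)
        else:
            # Absent from the GPAD: could be a mapping problem or a genuine
            # removal, so leave the existing annotations and report the gene.
            to_flag.append(fbgn)
    return to_update, to_flag
```

The flagged list would then be the input for a manual check, rather than anything being deleted automatically.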
Gil uploaded file for genes2go run from http://www.ebi.ac.uk/interpro/download/ as GO website still lagging. Attaching GAF for FB2022_02 and genes2go run file gene_association.fb.gz gene2go_fb_2022_02.out.gz
Check example: SM "Here’s just one example: I asked that glutathione oxidoreductase activity (GO:0097573) be added to IPR002109 in March 2021, and that got added to InterPro v85.0 (InterPro is now at v87.0): P2GO annotations reflect this, e.g. for members of FBgg0001699: Q9VJZ6, Q9W4S1, Q8SXQ5, Q9VNL4, Q9VVT6, Q9W2D1, Q9V420. (Though I can’t tell when P2GO started showing these….) But these annotations to GO:0097573 aren’t shown in FB"
grep'd IPR002109 from files from FB2022_02 release and this example looks fine:
From gene_association:
FB FBgn0051559 CG31559 is_active_in GO:0005575 FB:FBrf0159398|GO_REF:0000015 ND C 31559|CG11461|CG15584 protein taxon:7227 20080922 UniProt
FB FBgn0051559 CG31559 enables GO:0097573 FB:FBrf0174215|GO_REF:0000002 IEA InterPro:IPR002109 F 31559|CG11461|CG15584 protein taxon:7227 20220311 InterPro
[kochabpdncamacuk:GO/GAF_RELEASE/FB2022_02] hla28% grep IPR002109 gene_association.fb
FB FBgn0029662 CG12206 enables GO:0097573 FB:FBrf0174215|GO_REF:0000002 IEA InterPro:IPR002109 F protein taxon:7227 20220311 InterPro
FB FBgn0030584 CG14407 enables GO:0097573 FB:FBrf0174215|GO_REF:0000002 IEA InterPro:IPR002109 F BcDNA:RH03087|Grx5|Mitochondrial monothiol glutaredoxin-5 protein taxon:7227 20220311 InterPro
FB FBgn0051559 CG31559 enables GO:0097573 FB:FBrf0174215|GO_REF:0000002 IEA InterPro:IPR002109 F 31559|CG11461|CG15584 protein taxon:7227 20220311 InterPro
FB FBgn0032509 CG6523 enables GO:0097573 FB:FBrf0174215|GO_REF:0000002 IEA InterPro:IPR002109 F Grx3|Grx4 protein taxon:7227 20220311 InterPro
FB FBgn0036820 Grx1 enables GO:0097573 FB:FBrf0174215|GO_REF:0000002 IEA InterPro:IPR002109 F Glutaredoxin 1 BcDNA:GH24739|CG6852 protein taxon:7227 20220311 InterPro
FB FBgn0034658 Grx1t enables GO:0097573 FB:FBrf0174215|GO_REF:0000002 IEA InterPro:IPR002109 F Glutaredoxin 1, testis-specific CG7975|Grx-1 protein taxon:7227 20220311 InterPro
FB FBgn0004465 Su(P) enables GO:0097573 FB:FBrf0174215|GO_REF:0000002 IEA InterPro:IPR002109 F Suppressor of ref(2)P sterility CG4086 protein taxon:7227 20220311 InterPro
From gene2go file (the run output repeats this line many times):
ADDED GO:0097573 - glutathione oxidoreductase activity due to InterPro:IPR002109
@sjm41 if there's anything you want to check, the files are attached.
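For anyone repeating this kind of check on the attached GAF without relying on a plain grep, here is a minimal Python sketch that matches only on the With/From column. It assumes the standard tab-delimited GAF layout (With/From is column 8); the function name is made up for illustration.

```python
import re
from typing import Iterable, Iterator, List

WITH_FROM_COL = 7  # GAF column 8 (With/From), as a zero-based index


def gaf_rows_with(lines: Iterable[str], with_from_id: str) -> Iterator[List[str]]:
    """Yield GAF rows whose With/From column contains with_from_id (hypothetical helper)."""
    for line in lines:
        if line.startswith("!"):
            continue  # header/comment line
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 15:
            continue  # not a complete annotation line
        # With/From may hold several IDs separated by '|' or ','.
        if with_from_id in re.split(r"[|,]", cols[WITH_FROM_COL]):
            yield cols
```

The lines could come from something like `gzip.open("gene_association.fb.gz", "rt")`, and column 2 of each returned row is the FBgn, so the set of genes carrying the IPR002109-derived annotation can be compared directly against the P2GO list.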
Thanks Helen. I just did several spot-checks on several delayed InterPro2GO annotations among the oxidoreductases, and they all look like they are fixed! It will be easier for me to have a thorough check once the new web release is out, so I'll report any outstanding issues then.
- [x] One more thing to check: multigene protein families - do I need to have a separate editing route for these?
CHECKED and fixed; examined, and it is OK for annotations on one UniProt ID that maps to >1 FBgn to be propagated. Would be good if annotations of member genes of an umbrella group were mapped to the parent group.
[x] Check genes with 'Not Applicable' gene model status. For these, it is OK to update where there is a UniProtKB:FBgn mapping - remove the rule that blocks this in the pipeline.
[x] Check unmapped genes. So, while it's OK to update those with 'Not Applicable' gene model status, those with 'Unannotated' gene model status should not be updated. These should be reviewed and removed if possible, as most are misleading/not useful.
[x] Work out how to deal with pipeline so that genes that don't get updated are flagged - see flow scheme
[x] Check annotations to non-dmel species - plan to move any worth saving to EBI GOA DB (https://flybase.atlassian.net/browse/DB-762) {sent email to FB to check if ok with this on 12th April}
GAF for non-Dmel species nondmel_gaf.fb.gz
Just confirming that the InterPro2GO annotations for enzymes in FB2022_02 have all been updated. :-)
There seems to be an issue with our NOT exclusion rules - if we move to getting IEAs from P2GO, this is not an issue. For FBrf0227974, PTPases with 'NOT enables phosphatase activity' (IDA/IKR) are also getting 'enables protein tyrosine phosphatase activity' (IEA) - check other examples.
Here are 2 more examples: FBgn0033673/CG8298 and FBgn0035392/CG1271. Both have "NOT enables transferase activity" but are getting positive child terms from InterPro2GO.
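A minimal Python sketch of the behaviour these examples suggest we want: a manual NOT on a term should block positive IEAs to that term and to its child terms. The function name and the hand-written `descendants` map are assumptions for illustration only; a real implementation would need the ontology's own parent/child relationships.

```python
from typing import Dict, Set


def drop_iea_blocked_by_not(
    positive_iea: Set[str],
    manual_not: Set[str],
    descendants: Dict[str, Set[str]],
) -> Set[str]:
    """Remove positive IEA terms contradicted by a manual NOT (hypothetical helper)."""
    blocked: Set[str] = set()
    for term in manual_not:
        # A NOT on a parent term also rules out all of its child terms.
        blocked.add(term)
        blocked |= descendants.get(term, set())
    return positive_iea - blocked


# Toy ontology fragment, written by hand for this example only.
toy_descendants = {
    "phosphatase activity": {"protein tyrosine phosphatase activity"},
}
kept = drop_iea_blocked_by_not(
    {"protein tyrosine phosphatase activity", "protein binding"},
    {"phosphatase activity"},
    toy_descendants,
)
```

Here `kept` retains only the term that is unrelated to the NOT, which is the outcome expected for the PTPase case; whether the same rule should apply to IBA is a separate decision.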
Not sure why this has happened - will get P2GO import route done by next release.
Added back redundant 'lower' tree annotations to fix this issue, as the P2GO import route is prob not going to get done until _06. In P2GO: added redundancy to fix the downstream issue to: PMID:22825871 PMID:21466698 PMID:10827089 PMID:10913309 PMID:16545593 PMID:19849829. Added Ticket pending: Postpipeline NOT retrofit
Made pull request for Rfam FBrf -> GO-ref mapping pull/1915
InterPro2GO annotations with old datestamp to be manually removed from chado in ha9116.edit. 20221004_IEA.txt.gz
(note: fixed Rfam xref in external DBxrefs:
DB1a Rfam
DBid 61
DB2a The Rfam database is a collection of RNA families, each represented by multiple sequence alignments, consensus secondary structures and covariance models (CMs).
DB3a https://rfam.org/
DB3b https://rfam.org/family/)
This is a record of the issues that we have with updating the InterPro2GO pipeline, and with implementing Rfam2go and improving the IBA pipelines, with respect to making Protein2GO the source.