FlyBase / GO-curation

For projects related to GO curation in FlyBase
MIT License

InterPro2GO/IBAs/Rfam2GO: summary of issue for external import #6

Closed hattrill closed 1 year ago

hattrill commented 2 years ago

This is a record of the issues that we have with updating the InterPro2GO pipeline, implementing Rfam2GO, and improving the IBA pipeline, with respect to making Protein2GO the source.

hattrill commented 2 years ago

Some issues:

sjm41 commented 2 years ago

Just checking - we're recording issues with changing to direct import of InterPro2GO annotations from P2GO here? (rather than recording issues with the current pipeline?)

hattrill commented 2 years ago

Yes (this is mainly so I can document the reasoning behind the specs). But I am going to note here (or at least in a linked doc) some changes that I expect to see and issues that should get fixed, as it's otherwise going to be hard to sort out what gets fixed by the new pipeline. The file from GO is on their tracker, and if I need to point HarvDev to a new InterPro2GO file, I'll do that on JIRA; the specs for the devs will also go on JIRA. I've renamed the ticket to help clarify.

hattrill commented 2 years ago

For completeness, link to issue about stale InterPro2go file on GO site

hattrill commented 2 years ago

Current pipelines:

IEA (InterPro2GO): gene2go takes the gene-to-InterPro assignments and the interpro2go file, then checks whether each mapped term is already associated with the gene via IEA. The IEA is added or removed according to the interpro2go file, unless the gene has a manual annotation to the term or to a child term, in which case the IEA is neither added nor removed.

Rfam2go: not implemented.

IBA: take all 'positive' annotations from the P2GO import; exclude any with the negation (NOT) qualifier.
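One reading of the InterPro2GO rule above can be sketched as follows. This is only an illustration, not the actual gene2go code: the function name `plan_iea_changes`, its arguments, and the toy term IDs are all hypothetical, and `descendants` is assumed to be a precomputed map from each GO term to all of its child terms.

```python
from typing import Dict, Set, Tuple


def plan_iea_changes(
    current_iea: Set[str],
    mapped_terms: Set[str],
    manual_terms: Set[str],
    descendants: Dict[str, Set[str]],
) -> Tuple[Set[str], Set[str]]:
    """Return (terms to add as IEA, IEA terms to remove) for one gene.

    current_iea  : terms the gene already has via IEA
    mapped_terms : terms implied by the gene's InterPro entries via interpro2go
    manual_terms : terms the gene has via manual annotation
    descendants  : hypothetical precomputed map, term -> all child terms
    """

    def is_protected(term: str) -> bool:
        # A manual annotation to the term itself or to any child term
        # means the IEA is neither added nor removed.
        return bool(manual_terms & ({term} | descendants.get(term, set())))

    to_add = {t for t in mapped_terms - current_iea if not is_protected(t)}
    to_remove = {t for t in current_iea - mapped_terms if not is_protected(t)}
    return to_add, to_remove
```

Under this reading, a 'hidden' IEA is never shown when a manual annotation to the same or a child term exists, which is the behaviour that makes FB differ from P2GO/QuickGO.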

sjm41 commented 2 years ago

For the record, I'm going to itemize the issues I have with the current InterPro2GO pipeline (specifically), which I hope/think can get fixed by switching to direct import from protein2GO:

  1. InterPro2GO annotations aren't currently shown in FB if there's a 'better' annotation to the same term. But it's very useful to see these 'redundant' annotations because they lend weight to any experimental/IBA annotation. And from curator point of view, you can't tell from looking at FB if there's a 'hidden' InterPro2GO annotation that needs fixing in addition to an EXP/IBA one!

  2. Following from point 1, InterPro2GO annotations in FB (or Amigo/Alliance etc) don't match those at UniProt/QuickGO (or P2GO, from curator point of view), which is confusing.

  3. The internal FB InterPro2GO pipeline seems quite complicated and has several dependencies, making it rather fragile when something changes.

  4. The current FB InterPro2GO pipeline adds annotations to non-Dmel genes as well as Dmel genes. This creates two problems: i) There's a very biased set of GO annotations present for non-Dmel genes (~zero experimental annotations, and no IBA annotations) that aren't really comparable to the Dmel set - researchers interested in functional annotation of non-Dmel genomes would be better going to UniProt/QuickGO, where they also get additional computational annotations (UniRule, EC2GO etc). ii) Users wanting to search FB for Dmel GO annotations (99% of users?) get confused/distracted by GO searches (QS/Vocabularies) that return hits from (InterPro2GO-annotated) non-Dmel species.

hattrill commented 2 years ago

I think that if we want to use manual NOTs as a basis for excluding IEAs, we should also do this for IBAs. What I want: add all 'positive' IEAs and IBAs (never NOTs) regardless of manual annotation, except where there is a manual negative (NOT) annotation. I propose that this works as follows:

  1. where the manual NOT and IEA/IBA positive annotated term are the same, do not add the IEA/IBA
  2. where the manual NOT is more specific, add the less specific positive IEA/IBA (as this could be true - a manual same/higher level term should be added to show at what level the gene product does not perform the function)
  3. where the manual NOT is less specific, do not add the more specific IEA/IBA (as the parent NOT should propagate down the tree).

(Figure: NOT_logic)

Adding handy fig to illustrate. Do you agree @sjm41 - with the logic not the handiness of the figure ;-)
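The three proposed rules can be sketched as a single decision function. This is a minimal illustration under stated assumptions, not an agreed spec: `should_add_positive` and the `ancestors` map (term -> all less specific terms, assumed precomputed from the ontology) are hypothetical names, and term names are used as keys purely for readability.

```python
from typing import Dict, Set


def should_add_positive(
    term: str,
    manual_not_terms: Set[str],
    ancestors: Dict[str, Set[str]],
) -> bool:
    """Decide whether a positive IEA/IBA to `term` should be added.

    manual_not_terms : terms the gene has a manual NOT annotation to
    ancestors        : hypothetical map, term -> all less specific terms
    """
    for not_term in manual_not_terms:
        if not_term == term:
            return False  # rule 1: same term, do not add
        if not_term in ancestors.get(term, set()):
            return False  # rule 3: NOT is less specific, it propagates down
        # rule 2: NOT is more specific (or unrelated), so it does not block
    return True


# Toy ontology fragment: the child term points at its less specific parent.
ANCESTORS = {
    "protein tyrosine phosphatase activity": {"phosphatase activity"},
}
```

With this fragment, a manual NOT to "phosphatase activity" blocks a positive IEA to "protein tyrosine phosphatase activity", while a manual NOT to the child term does not block a positive annotation to the parent.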

hattrill commented 2 years ago

Adding list of things to decide on:

sjm41 commented 2 years ago

@hattrill I agree with your logic (and I also like the figure!). But I'm wondering if this really needs to be done at the FB end? Do all other databases also have to write custom code to deal with NOTs, or do they just live with a mixture of positive/negative annotations. Seems that the logic you describe could/should function at the P2GO (or Noctua) level, rather than each DB having to do it - or maybe that's what you're thinking?? (And for IBA stuff, I thought the NOTs are meant to be handled correctly now within PAINT??)

WRT 'not applicable' genes, I think it would be absolutely fine for these to lack any GO annotation. In fact, this might be helpful as a user won't get them returned in GO searches (which is probably the desired outcome, as they can't do any experiments/analysis on 'not applicable' genes.)

hattrill commented 2 years ago

> @hattrill I agree with your logic (and I also like the figure!). But I'm wondering if this really needs to be done at the FB end? Do all other databases also have to write custom code to deal with NOTs, or do they just live with a mixture of positive/negative annotations.

This has never been agreed on universally, despite many, many discussions. Ideally, this would work across the board, but that is not a thing.

> Seems that the logic you describe could/should function at the P2GO (or Noctua) level, rather than each DB having to do it - or maybe that's what you're thinking??

Noctua: can't handle this complexity. P2GO: not sure if there is the bandwidth at the moment, or whether UniProt would want this. For our users, it makes sense, particularly with searches, but there is the argument that there are not enough NOTs to warrant special coding. And, realistically, if we want it, we will have to do it. I will give it more thought... it might just be my unwillingness to let go of what we have.

> (And for IBA stuff, I thought the NOTs are meant to be handled correctly now within PAINT??)

Yes - I was thinking that too - would be nice not to have special rules here....perhaps I need a better system for making sure that these are spotted and fixed, as it seems to re-appear.

> WRT 'not applicable' genes, I think it would be absolutely fine for these to lack any GO annotation. In fact, this might be helpful as a user won't get them returned in GO searches (which is probably the desired outcome, as they can't do any experiments/analysis on 'not applicable' genes.)

I am glad you think so - I was giving myself a headache with all the permutations! Solved - ignore them!

hattrill commented 2 years ago

Looking at the P2GO output, although Alex has not yet confirmed this, entries with a manual NOT seem to block the InterPro2GO 'positive' annotation, so that's good news (however, this doesn't seem to happen with other IEA pipelines at GOA). The bad news: any other FBgn:UniProtKB pairing without the NOT will still get the positive IEA.

hattrill commented 2 years ago

When there is no UniProtKB-FBgn mapping in the file, we do not update the entry, as this could have arisen from a mapping error rather than from all annotations being removed. Manually checking that the removal of annotations is OK is manageable with the set we have, but adding IEAs to the mix is going to make things harder to police, I think. So we need to spec what should happen when an FBgn with GO data has no data in the GPAD from P2GO: there is no way to tell whether this is because of the way the gpi is generated (using IDs from the GPAD) or because there is no mapping.
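The conservative behaviour described above (do not touch a gene that is missing from the incoming file, but flag it) could look something like this. It is a sketch only: `split_by_gpad_presence` is a hypothetical name, and both inputs are assumed to be plain sets of FBgn IDs that have already been extracted from chado and from the GPAD.

```python
from typing import List, Set, Tuple


def split_by_gpad_presence(
    fbgns_with_go: Set[str],
    fbgns_in_gpad: Set[str],
) -> Tuple[List[str], List[str]]:
    """Split genes that currently have GO data into two lists.

    to_update : present in the incoming GPAD, safe to refresh
    to_review : absent from the GPAD; left untouched and flagged, since the
                absence may be a mapping error rather than a real removal
    """
    to_update = sorted(fbgns_with_go & fbgns_in_gpad)
    to_review = sorted(fbgns_with_go - fbgns_in_gpad)
    return to_update, to_review
```

The `to_review` list is what would need manual policing; its size once IEAs are included is the open question for the spec.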

hattrill commented 2 years ago

Gil uploaded the file for the genes2go run from http://www.ebi.ac.uk/interpro/download/ as the GO website is still lagging. Attaching the GAF for FB2022_02 and the genes2go run file: gene_association.fb.gz, gene2go_fb_2022_02.out.gz

hattrill commented 2 years ago

Check example: SM "Here’s just one example: I asked that glutathione oxidoreductase activity (GO:0097573) be added to IPR002109 in March 2021, and that got added to InterPro v85.0 (InterPro is now at v87.0): P2GO annotations reflect this, e.g. for members of FBgg0001699: Q9VJZ6, Q9W4S1, Q8SXQ5, Q9VNL4, Q9VVT6, Q9W2D1, Q9V420. (Though I can’t tell when P2GO started showing these….) But these annotations to GO:0097573 aren’t shown in FB"

grep'd IPR002109 from files from FB2022_02 release and this example looks fine:

From gene_association:

FB FBgn0051559 CG31559 is_active_in GO:0005575 FB:FBrf0159398|GO_REF:0000015 ND C 31559|CG11461|CG15584 protein taxon:7227 20080922 UniProt
FB FBgn0051559 CG31559 enables GO:0097573 FB:FBrf0174215|GO_REF:0000002 IEA InterPro:IPR002109 F 31559|CG11461|CG15584 protein taxon:7227 20220311 InterPro

[kochabpdncamacuk:GO/GAF_RELEASE/FB2022_02] hla28% grep IPR002109 gene_association.fb
FB FBgn0029662 CG12206 enables GO:0097573 FB:FBrf0174215|GO_REF:0000002 IEA InterPro:IPR002109 F protein taxon:7227 20220311 InterPro
FB FBgn0030584 CG14407 enables GO:0097573 FB:FBrf0174215|GO_REF:0000002 IEA InterPro:IPR002109 F BcDNA:RH03087|Grx5|Mitochondrial monothiol glutaredoxin-5 protein taxon:7227 20220311 InterPro
FB FBgn0051559 CG31559 enables GO:0097573 FB:FBrf0174215|GO_REF:0000002 IEA InterPro:IPR002109 F 31559|CG11461|CG15584 protein taxon:7227 20220311 InterPro
FB FBgn0032509 CG6523 enables GO:0097573 FB:FBrf0174215|GO_REF:0000002 IEA InterPro:IPR002109 F Grx3|Grx4 protein taxon:7227 20220311 InterPro
FB FBgn0036820 Grx1 enables GO:0097573 FB:FBrf0174215|GO_REF:0000002 IEA InterPro:IPR002109 F Glutaredoxin 1 BcDNA:GH24739|CG6852 protein taxon:7227 20220311 InterPro
FB FBgn0034658 Grx1t enables GO:0097573 FB:FBrf0174215|GO_REF:0000002 IEA InterPro:IPR002109 F Glutaredoxin 1, testis-specific CG7975|Grx-1 protein taxon:7227 20220311 InterPro
FB FBgn0004465 Su(P) enables GO:0097573 FB:FBrf0174215|GO_REF:0000002 IEA InterPro:IPR002109 F Suppressor of ref(2)P sterility CG4086 protein taxon:7227 20220311 InterPro
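For anyone wanting a column-aware version of the grep above, here is a rough sketch. It assumes the attached file is a standard tab-delimited GAF (the tabs are flattened to spaces in the listing here), with evidence code in column 7 and With/From in column 8; the function names `iea_rows_for_interpro` and `read_gaf` are made up for this example.

```python
import gzip
import re
from typing import Iterable, List, Tuple


def iea_rows_for_interpro(
    lines: Iterable[str], ipr_id: str
) -> List[Tuple[str, str, str, str]]:
    """Return (FBgn, symbol, qualifier, GO ID) for IEA rows citing ipr_id."""
    wanted = f"InterPro:{ipr_id}"
    hits = []
    for line in lines:
        if line.startswith("!"):
            continue  # GAF header/comment line
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 15:
            continue  # not a complete annotation row
        # With/From may hold several IDs separated by '|' or ','
        with_from = re.split(r"[|,]", cols[7])
        if cols[6] == "IEA" and wanted in with_from:
            hits.append((cols[1], cols[2], cols[3], cols[4]))
    return hits


def read_gaf(path: str) -> List[str]:
    """Read a gzipped GAF file into a list of lines."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return handle.readlines()
```

Something like `iea_rows_for_interpro(read_gaf("gene_association.fb.gz"), "IPR002109")` should then list the same genes as the grep, without also matching the ID if it happens to appear in another column.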

From gene2go file (the same line is repeated many times in the run output; shown once here):
ADDED GO:0097573 - glutathione oxidoreductase activity due to InterPro:IPR002109

@sjm41 if there's anything you want to check, the files are attached.

sjm41 commented 2 years ago

Thanks Helen. I just did several spot-checks on several delayed InterPro2GO annotations among the oxidoreductases, and they all look like they are fixed! It will be easier for me to have a thorough check once the new web release is out, so I'll report any outstanding issues then.

hattrill commented 2 years ago

CHECKED and fixed. Examined: it is OK for annotations on one UniProt ID that maps to >1 FBgn to be propagated to each of those FBgns. It would be good if annotations of member genes of an umbrella group were mapped to the parent group.
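As an illustration of the one-to-many case, propagating annotations from a UniProt ID to every FBgn it maps to might look like the sketch below. The function name `propagate_to_fbgns`, the input shapes, and the placeholder IDs in the test are assumptions for this example, not the real import code or real mappings.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def propagate_to_fbgns(
    annotations: Iterable[Tuple[str, str]],
    uniprot_to_fbgns: Dict[str, List[str]],
) -> Tuple[Dict[str, Set[str]], Set[str]]:
    """Copy each (UniProt ID, GO ID) annotation to every mapped FBgn.

    Returns the per-gene GO terms plus the set of UniProt IDs that had
    no FBgn mapping at all (these are skipped, not guessed at).
    """
    by_gene: Dict[str, Set[str]] = defaultdict(set)
    unmapped: Set[str] = set()
    for accession, go_id in annotations:
        fbgns = uniprot_to_fbgns.get(accession)
        if not fbgns:
            unmapped.add(accession)
            continue
        for fbgn in fbgns:
            by_gene[fbgn].add(go_id)
    return dict(by_gene), unmapped
```

Returning the unmapped accessions separately keeps the mapping gaps visible instead of silently dropping them.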

hattrill commented 2 years ago

GAF for non-Dmel species nondmel_gaf.fb.gz

Summary of annotations for non-Dmel species

sjm41 commented 2 years ago

Just confirming that the InterPro2GO annotations for enzymes in FB2022_02 have all been updated. :-)

hattrill commented 2 years ago

There seems to be an issue with our NOT exclusion rules (if we move to getting IEAs from P2GO, this is not an issue). For FBrf0227974, PTPases with 'NOT enables phosphatase activity' (IDA/IKR) are also getting 'enables protein tyrosine phosphatase activity' (IEA). Check other examples.

sjm41 commented 2 years ago

Here are 2 more examples: FBgn0033673/CG8298 and FBgn0035392/CG1271. Both have "NOT enables transferase activity" but are getting positive child terms from InterPro2GO.

hattrill commented 2 years ago

Not sure why this has happened - will get P2GO import route done by next release.

hattrill commented 2 years ago

Added back redundant 'lower' tree annotations to fix this issue, as it's probably not going to get done until _06. In P2GO, added redundancy to fix the downstream issue to: PMID:22825871, PMID:21466698, PMID:10827089, PMID:10913309, PMID:16545593, PMID:19849829. Added 'Ticket pending:Postpipeline NOT retrofit'.

hattrill commented 2 years ago

Made pull request for Rfam FBrf -> GO-ref mapping pull/1915

hattrill commented 2 years ago

InterPro2GO annotations with old datestamp to be manually removed from chado in ha9116.edit. 20221004_IEA.txt.gz

hattrill commented 1 year ago

(note: fixed Rfam xref in external DBxrefs:
DB1a Rfam
DBid 61
DB2a The Rfam database is a collection of RNA families, each represented by multiple sequence alignments, consensus secondary structures and covariance models (CMs).
DB3a https://rfam.org/
DB3b https://rfam.org/family/)