geneontology / minerva

BSD 3-Clause "New" or "Revised" License

Update GPAD generation for models involving complexes #322

Open goodb opened 5 years ago

goodb commented 5 years ago
ukemi commented 5 years ago

I know that I am beating glycolysis to death, but I plan to look at that in NYC at the end of the week with.

ukemi commented 5 years ago

Out of scope for now?

goodb commented 5 years ago

Why is this no longer in scope? Before your work this summer I thought you were still keen to do it. It shouldn't be very difficult.

ukemi commented 5 years ago

It would be great if we can do it. It would require an additional step of converting the reactome entities to human genes. If I understand correctly, this would happen outside of the model because in the model the classes of reactome entities are instantiated. If you think it's straightforward, let's keep it as part of this project.

deustp01 commented 5 years ago

The protein:gene mapping is generated as a cross-reference at release time and displayed on our website as shown in the screenshot. The mapping is attached to the protein, I think, not to the reaction the protein mediates, so I'm not sure where it would show up in the BioPAX.

[Screenshot: ALDOB search results page (Screen Shot 2019-08-07 at 9 14 57 AM)]

goodb commented 5 years ago

The BioPAX only seems to contain the UniProt record id for the proteins. I can use the reactome mapping to get to any of the other ids. Where can I find the correct GPAD file from reactome for the comparison?

ukemi commented 5 years ago

Should be here: http://current.geneontology.org/annotations/index.html

goodb commented 5 years ago

@ukemi want to have a look at the gpad for BMP?

Here is what I see from the provided reactome gpad file when filtered for DBReferences = REACTOME:R-HSA-201451: bmp-provided.gpad.txt

Here is what Minerva produces right now, given the go-cam based on the Reactome entity ontology (which does contain some direct references to uniprot ids): bmp.gpad.txt

And here is what I get when I map the reactome ids to UniProt ids - for Sets, we get one new annotation per member of the set. For complexes, we get 'contributes_to' for all the parts that enable something (this is actually already done by minerva). All the other annotations (e.g. part of, involved in) are passed directly on to the parts of the complexes. Note that the qualifier field is still using the reactome entity namespace. This can be fixed but will result in a small combinatorial explosion because of the Set/Set combinations that would be entailed.

bmp-mapped.gpad.txt

I think for comparative purposes, e.g. for the manuscript, we can safely ignore the extensions. Whether or not we want to tackle them robustly depends on whether you and @deustp01 want to use the go-cam-generated gpad to feed the GO pipeline.

???

ukemi commented 5 years ago

Let's look at this together on the call next Wednesday. I think ideally we would get the Reactome annotations from the models, but we can open that up for discussion. I think this leads to a wider discussion about the GPADs generated from models that is long overdue. @vanaukenk would you agree?

goodb commented 5 years ago

I've been looking at this today. Quick summary for the BMP Signaling pathway, limiting the comparison to gene-GOterm pairs (ignoring evidence etc.).

Reactome provides 16 annotations that reference this pathway. Of these, 2 are recapitulated exactly in the go-cam-gpad for this pathway.
The gocam-gpad has a total of 44 annotations.

Note that the 44 is only counting unique gene-GOTerm pairs - there are 152 total uniprot-centric annotations produced from 63 reactome-centric. Have not checked for true duplicates yet.

The annotations from reactome that are missing from the gocam-gpad fall into two main categories: 1) gocam-gpad provides a more specific annotation. e.g. reactome provides 'gene involved_in pathway' while gocam-gpad provides 'gene contributes_to function in pathway'. and 2) the gocam-gpad does not contain annotations for genes that are just inputs or outputs to reactions. Reactome provides all of the inputs and outputs with "involved in" annotations.

gpad_comparison_bmp.xlsx
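
A minimal sketch of the pair-level comparison described above, in Python. It assumes tab-separated GPAD lines whose first four columns are db, object id, qualifier and GO id, and that header lines start with `!`. The function names and the sample lines are illustrative only; they are not the actual BMP data.

```python
def gene_term_pairs(lines):
    """Collect unique (db:object_id, go_id) pairs, ignoring qualifier and evidence."""
    pairs = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("!"):
            continue  # skip blank lines and header/comment lines
        cols = line.split("\t")
        if len(cols) < 4:
            continue  # not enough columns to be an annotation row
        db, object_id, _qualifier, go_id = cols[:4]
        pairs.add((f"{db}:{object_id}", go_id))
    return pairs


def compare_pairs(provided, generated):
    """Split the two pair sets into shared and source-specific groups."""
    return {
        "shared": provided & generated,
        "only_provided": provided - generated,
        "only_generated": generated - provided,
    }


# Illustrative input (assumption): two tiny GPAD-like files.
provided_lines = [
    "!illustrative header line",
    "UniProtKB\tP12643\tinvolved_in\tGO:0030509",
    "UniProtKB\tQ13873\tinvolved_in\tGO:0030509",
]
generated_lines = [
    "UniProtKB\tP12643\tinvolved_in\tGO:0030509",
    "UniProtKB\tQ13705\tcontributes_to\tGO:0016502",
]

report = compare_pairs(gene_term_pairs(provided_lines), gene_term_pairs(generated_lines))
```

Because the qualifier is dropped before comparing, a more specific annotation in the go-cam gpad (category 1 above) shows up as one pair in `only_provided` and a different pair in `only_generated`, so those still need manual review.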

ukemi commented 5 years ago

Let's look at this together on the call tomorrow. At first thought, it looks like the GO-CAM annotations are better. We still need to do a sanity check. I think that would be the next step. @goodb can you give a 5-10 min overview of exactly how the annotations are generated from the models via the Reactome entities?

goodb commented 5 years ago

As it's a little complicated, jotting down the protocol here:

Generating GPAD from reactome GO-CAM

First keep in mind that all of the physical entities from Reactome (e.g. proteins, complexes, sets, small molecules) are represented in Noctua as instances of classes from a new ontology. For example, in Reactome we have the complex entity R-HSA-201477 named ‘BMP:p-BMPR:Endofin:SMAD1/5/8’. This is converted into an OWL class that is the intersection of other complexes and proteins (('has part' some SMAD1/5/8) and ('has part' some BMP:BMPRII:P-BMPRI) and ('has part' some ZFYVE16)).

Each protein has its own Reactome class, e.g. SMAD5 = R-HSA-201431, and, wherever it is provided by Reactome, these are asserted to be equivalent to classes from uniprot such as Q99717. (The uniprot classes are the same as those that appear in neo.)

The GO-CAM models use the entity classes just as they are in Reactome. If Reactome indicates that a complex contributes to a function, then we create one individual, link it to the class via rdf:type, and then make the contributes_to assertion. E.g.

    Instance1 rdf:type R-HSA-201477
    Instance1 contributes_to MF_instance1
    MF_instance1 rdf:type MF_Class1

[Screenshot: Screen Shot 2019-08-20 at 10 47 59 AM]

When the GPAD for these models is created by Minerva, the class identifiers from the Reactome entity ontology are used where you would expect to find gene identifiers. You get annotations like:

    db       objectid      qualifier       goid
    gomodel  R-HSA-201477  contributes_to  GO:0016502
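
To make the step from model triples to a row like the one above concrete, here is a rough Python sketch. It is not Minerva's implementation: it treats the model as plain (subject, predicate, object) tuples, and it substitutes GO:0016502 for the placeholder MF_Class1 (in the real model that type may be asserted or inferred). The function names are made up for illustration.

```python
# The three triples from the example, as plain tuples (assumption: GO:0016502
# stands in for MF_Class1).
triples = [
    ("Instance1", "rdf:type", "R-HSA-201477"),
    ("Instance1", "contributes_to", "MF_instance1"),
    ("MF_instance1", "rdf:type", "GO:0016502"),
]


def types_of(individual, triples):
    """Return every class the individual is typed with."""
    return [o for s, p, o in triples if s == individual and p == "rdf:type"]


def gpad_like_rows(triples, db="gomodel"):
    """Turn each instance-level relation into (db, entity class, qualifier, term class)."""
    rows = []
    for subject, predicate, obj in triples:
        if predicate == "rdf:type":
            continue  # typing triples are only used for lookups
        for entity_class in types_of(subject, triples):
            for term_class in types_of(obj, triples):
                rows.append((db, entity_class, predicate, term_class))
    return rows


rows = gpad_like_rows(triples)
```

The point of the sketch is only that the object id column is filled from the class of the subject individual, which is why a Reactome entity id ends up where a gene id is expected.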

Now to convert these into gene annotations, we map from the Reactome class ids to UniProt ids. For R-HSA-201477 above, we find that it contains: P12643, Q13705, P36894, Q99717, Q7Z3T8, P27037, Q13873, O00238, O15198, Q15797.

Now, depending on the annotation qualifier and the type of the reactome entity (e.g. complex, set, protein), we generate the uniprot-centric annotations. In the case of complexes, if the complex is said to ‘enable’ a function, the member proteins would get ‘contributes to’ annotations. Right now, in all other cases, the member proteins receive all of the annotations of their parent molecule.

So one of the annotations generated from the above would be:

    UniProtKB  Q13705  contributes_to  GO:0016502
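
The mapping rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code in Minerva: the lookup tables are hand-written here, the `Annotation` type and `expand` function are hypothetical names, and the set id `EXAMPLE-SET-1` is invented to show the non-complex case.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    db: str
    object_id: str
    qualifier: str
    go_id: str


# Members of R-HSA-201477 as listed above; EXAMPLE-SET-1 is a made-up set.
MEMBERS = {
    "R-HSA-201477": [
        "P12643", "Q13705", "P36894", "Q99717", "Q7Z3T8",
        "P27037", "Q13873", "O00238", "O15198", "Q15797",
    ],
    "EXAMPLE-SET-1": ["P12643", "Q13873"],
}
ENTITY_TYPE = {"R-HSA-201477": "complex", "EXAMPLE-SET-1": "set"}


def expand(annotation, members=MEMBERS, entity_type=ENTITY_TYPE):
    """Produce one UniProt-centric annotation per member of the entity."""
    qualifier = annotation.qualifier
    # Complex that enables a function: members only contribute to it.
    if entity_type.get(annotation.object_id) == "complex" and qualifier == "enables":
        qualifier = "contributes_to"
    # In all other cases the members inherit the parent's annotation unchanged.
    return [
        Annotation("UniProtKB", uniprot_id, qualifier, annotation.go_id)
        for uniprot_id in members.get(annotation.object_id, [])
    ]


source = Annotation("gomodel", "R-HSA-201477", "contributes_to", "GO:0016502")
expanded = expand(source)
```

With the ten members listed above, the single Reactome-centric annotation becomes ten UniProt-centric ones, which is the kind of growth behind the 63 to 152 count mentioned earlier.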

When reviewing these annotations it is also important to keep in mind that they contain inferred annotations as well as asserted. For example, the annotation in the example here is to GO:0016502 'nucleotide receptor activity'. This is inferred by Arachne before the GPAD is produced (see it in the square brackets).

goodb commented 5 years ago

Request to generate gpad outputs for a subset of interesting pathways and to add columns with term labels to make interpretation easier.

ukemi commented 5 years ago

We will have a closer look at the BMP pathway results. Let's start with these since we have looked at the pathways geneontology/pathways2GO#40. @deustp01 and I will add others as we see fit and as we free up time to look.

- @huaiyumi - Signaling by BMP (R-HSA-201451.4); BMP signaling pathway (GO:0030509)
- @huaiyumi - MAPK1/MAPK3 signaling (R-HSA-5684996.4)
- @ukemi and @deustp01 - Glycolysis (R-HSA-70171); canonical glycolysis (GO:0061621)
- @ukemi and @deustp01 - Gluconeogenesis (R-HSA-70263); gluconeogenesis (GO:0006094)
- @vanaukenk - TCF dependent signaling in response to WNT (R-HSA-201681); canonical Wnt signaling pathway (GO:0060070). Not xref'd but generic pathway is.
- @vanaukenk - Unfolded Protein Response (UPR) (R-HSA-381119); endoplasmic reticulum unfolded protein response (GO:0030968)
- @ukemi and @deustp01 - GABA degradation (R-HSA-916853); gamma-aminobutyric acid catabolic process (GO:0009450)
- [x] @ukemi and @deustp01 - PINK/PARKIN Mediated Autophagy (R-HSA-5205685); mitophagy (GO:0000423) or macroautophagy (GO:0016236).

goodb commented 4 years ago

@ukemi could you have a look at this gpad output? It is for gomodel:R-HSA-1971475 http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-1971475

I'm attaching one generated using the reactome entities (tetra_gag_gpad_reactome_entities.txt) and another that uses only entities that are in go-lego (uniprot, chebi) (tetra_gag_gpad_uniprot_entities.txt). This is now happening automatically in my dev Minerva.

One weirdness I'm seeing in both is extensions referencing the root protein in PRO, e.g. has_output(PR:000000001). Weird because I never use that term in REO or in the go-cam instances. I use the chebi protein uri when needed. @balhoff any idea there?

The annotation extensions for this model are rather enormous. This is a result of the use of Set objects in inputs and outputs. Expanding these out into all of their members grows things quickly.
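
To give a feel for how quickly this grows, here is a small Python sketch. It assumes that expanding Sets means taking every combination of one member per extension slot (a cross product); that is my reading of the behaviour, not a confirmed description of the Minerva code. The member names are placeholders.

```python
from itertools import product
from math import prod


def expand_extension_sets(slots):
    """slots: list of (relation, [member ids]); return every member combination."""
    relations = [relation for relation, _ in slots]
    member_lists = [members for _, members in slots]
    return [
        tuple(f"{relation}({member})" for relation, member in zip(relations, combo))
        for combo in product(*member_lists)
    ]


# Placeholder Sets (assumption): 3 possible inputs and 4 possible outputs.
slots = [
    ("has_input", ["in_1", "in_2", "in_3"]),
    ("has_output", ["out_1", "out_2", "out_3", "out_4"]),
]
combos = expand_extension_sets(slots)
expected_count = prod(len(members) for _, members in slots)  # 3 * 4
```

Under that assumption the number of expanded extensions is the product of the Set sizes, so two modest Sets already give 12 combinations and a third slot multiplies it again.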

goodb commented 4 years ago

I enabled the uniprot-chebi centric gpad conversion live on noctua-dev. @ukemi @deustp01 @vanaukenk please have a look at the reactome models of your choice and let me know what you think.

ukemi commented 4 years ago

Sweet. I won't have time this weekend, but hopefully will get to it this week. The extensions are out of control for other models too, not just the Reactome ones. As @balhoff suggested on the call, this is probably due to idiosyncrasies in the ontologies. Looking at the ones coming from conventional annotations in the past I don't think they are incorrect, but I don't think we want them all. We need to revisit the idea of white-listing and black-listing extensions in general.
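
As a starting point for that discussion, a white-list/black-list filter could look roughly like the Python sketch below. It is only an illustration: it assumes extensions are written as `relation(target)`, and the function name, the list contents, and the choice to block `PR:000000001` are assumptions made for the example, not agreed policy.

```python
import re

# Matches the relation(target) form, e.g. has_output(PR:000000001).
EXTENSION = re.compile(r"^([^()]+)\(([^()]+)\)$")


def filter_extensions(extensions, allowed_relations=None, blocked_targets=frozenset()):
    """Keep extensions whose relation is allowed and whose target is not blocked."""
    kept = []
    for extension in extensions:
        match = EXTENSION.match(extension)
        if match is None:
            kept.append(extension)  # leave anything we cannot parse untouched
            continue
        relation, target = match.groups()
        if allowed_relations is not None and relation not in allowed_relations:
            continue  # relation is not on the white list
        if target in blocked_targets:
            continue  # target is on the black list
        kept.append(extension)
    return kept


# Illustrative extensions (assumption), including the PRO root term seen earlier.
example = ["has_output(PR:000000001)", "has_input(CHEBI:15377)", "occurs_in(GO:0005829)"]
filtered = filter_extensions(
    example,
    allowed_relations={"has_input", "has_output"},
    blocked_targets={"PR:000000001"},
)
```

Passing `allowed_relations=None` turns the white list off, so the same function can be used with only a black list while the lists are being worked out.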

goodb commented 4 years ago

Renamed this ticket such that it is closable. Once @ukemi and @deustp01 are satisfied with the GPAD generated from the reactome models, close this issue. In the meantime, GPAD focused discussions go here. (Noting that these should be specifically about Reactome idiosyncrasies rather than generalized go-cam to gpad problems.)

goodb commented 4 years ago

(moving note from 2019 NYU meeting here)

check property chains and, if needed:

- request has_part o enables -> capable_of property chain
- capable_of o part_of -> capable_of_part_of
- contributes_to o part_of -> involved_in
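
For reference, this is what those three chains would entail on a toy graph, sketched in Python. Whether the chains actually exist in the ontology is exactly what the note asks to check; the code just applies them as written, and the individual names are invented.

```python
# The chains from the note: (first relation, second relation) -> inferred relation.
CHAINS = {
    ("has_part", "enables"): "capable_of",
    ("capable_of", "part_of"): "capable_of_part_of",
    ("contributes_to", "part_of"): "involved_in",
}


def apply_chains(triples, chains=CHAINS):
    """Return the triples newly inferred by applying the chains to a fixpoint."""
    known = set(triples)
    changed = True
    while changed:
        changed = False
        for a, r1, b in list(known):
            for b2, r2, c in list(known):
                if b2 != b:
                    continue  # the two triples must share the middle node
                r3 = chains.get((r1, r2))
                if r3 is not None and (a, r3, c) not in known:
                    known.add((a, r3, c))
                    changed = True
    return known - set(triples)


# Toy graph (assumption): a complex with one part, a function, and a process.
toy = {
    ("complex1", "has_part", "protein1"),
    ("protein1", "enables", "mf1"),
    ("mf1", "part_of", "bp1"),
    ("protein2", "contributes_to", "mf1"),
}
inferred = apply_chains(toy)
```

On this graph the complex is inferred to be capable_of the function and capable_of_part_of the process, and the contributing protein becomes involved_in the process, which is the behaviour the GPAD export would rely on.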

goodb commented 4 years ago

@ukemi @balhoff Are we good with the objectives in the tickboxes now in the description of this issue? Should I work on them?
I'd like to either finish this or move it out of this project.

vanaukenk commented 1 week ago

@ukemi - do you know if there is still work to be done on this ticket?

@pgaudet and I were looking at GPAD output for protein-containing complexes and their parts this morning so were wondering.

ukemi commented 1 week ago

I believe the answer is yes. I don't see the contributes_to annotations as output and it is not clear to me how the evidence that a subunit is part of a complex that enables a function will resolve as an output.

ukemi commented 1 week ago

Also, for this ticket specifically, we need to have a more 'GO-compliant' representation of the imported Reactome complexes.