KG V2.0.0 - Finalizing Knowledge Representation

callahantiff commented 4 years ago

Extending Knowledge Representation for current KG

Current Release: v2.0.0

Description
Adding the following entities/data sources to the current KG build:

Variants via Clinvar
Proteins via PRO
New connections between existing ChEBI, GO, and Reactome concepts and proteins and genes:
- protein-protein, RO_0002434 (interacts with)
- protein-gobp, RO_0000056 (participates in)
- protein-gomf, RO_0000085 (has function)
- protein-gocc, RO_0001025 (located in)
- protein-cofactor/catalyst (ChEBI), RO_0002436 (molecularly interacts with)
- protein-complex (reactome), RO_0002436 (molecularly interacts with)
- gene-protein, RO_0002211 (regulates)
- chemical (ChEBI)-complex, RO_0002436 (molecularly interacts with)
- complex (reactome)-complex (reactome), RO_0002436 (molecularly interacts with)
- protein-pathway (reactome), RO_0000056 (participates in)
- protein-reaction (reactome), RO_0000056 (participates in)
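
The edge list above can be written down as a lookup table, which makes it easy to check that each edge type maps to exactly one relation. This is only a minimal sketch: the names `EDGE_RELATIONS` and `relation_iri` and the key spellings are illustrative assumptions, not identifiers from the PheKnowLator code base. The relation identifiers are copied from the list above, and the prefix is the OBO PURL base used in the links elsewhere in this thread.

```python
# Hypothetical lookup table; keys and names are illustrative, RO identifiers
# are copied from the edge list in this issue.
OBO_PURL = "http://purl.obolibrary.org/obo/"

EDGE_RELATIONS = {
    "protein-protein": "RO_0002434",            # interacts with
    "protein-gobp": "RO_0000056",               # participates in
    "protein-gomf": "RO_0000085",               # has function
    "protein-gocc": "RO_0001025",               # located in
    "protein-cofactor/catalyst": "RO_0002436",  # molecularly interacts with
    "protein-complex": "RO_0002436",            # molecularly interacts with
    "gene-protein": "RO_0002211",               # regulates
    "chemical-complex": "RO_0002436",           # molecularly interacts with
    "complex-complex": "RO_0002436",            # molecularly interacts with
    "protein-pathway": "RO_0000056",            # participates in
    "protein-reaction": "RO_0000056",           # participates in
}


def relation_iri(edge_type: str) -> str:
    """Return the full OBO PURL for the relation assigned to an edge type."""
    return OBO_PURL + EDGE_RELATIONS[edge_type]
```

Keeping the table in one place means a later change of relation (for example, for protein-protein edges) only has to be made once.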

TODO 📋 💻 📝

[x] Create edge types to connect variants to KG
[x] Verify ontological assumptions for edges provided by @ignaciot to ensure satisfiability and consistency with existing KR
[x] Investigate which version of PRO to download, specifically searching for one which only includes human proteins
[x] Update KR schema and verify it
[x] Update input documentation
[x] Add new data sources to wiki

@callahantiff Due Dates:

[x] Have KR and Wiki updated and finalized by 10/23/19
[ ] Begin building KG v2.1.0 by 10/23/19

ignaciot commented 4 years ago

Also:

protein-complex, RO_0002436

callahantiff commented 4 years ago

Also:

protein-complex, RO_0002436

Good catch, I will add to the original issue, thanks!

callahantiff commented 4 years ago

@ignaciot and @bill-baumgartner - the updated KR is shown below (click on image to enlarge). Note that the main data types are ontologies (yellow), open data sources (purple), and experimental data (blue). This has also been verified by Adrianne.

Screen Shot 2019-11-25 at 09 57 47

You will notice that I have added the Cell Ontology and BRENDA in addition to experimental data in order to satisfy a component of my comps: creating a KG that actually includes the central dogma. GTEx is a great source to start with since it includes many disease types, has the results of both microarray and RNA-seq (for several types of samples), and includes connections to phenotypes.

Anywho, happy to talk about this more this afternoon!

ignaciot commented 4 years ago

This is fantastic! Yes, let’s chat this afternoon.

callahantiff commented 4 years ago

UPDATES: Incorporating feedback from @bill-baumgartner and @LEHunter resulted in the updated KR shown below.

Some important questions that I would like help answering:

@ignaciot - you've asked that I use the relation molecularly interacts with, and the definition of this relation includes alternative terms which specify that it specifically means binding interactions.
- Did you intend it to be used that way?
- If you did intend this relation to be used, then it might make sense to change the protein-protein and gene-gene interactions (or at least the gene-gene interactions) to genetically interacts with -- @LEHunter, thoughts?

@LEHunter - @bill-baumgartner and I were talking about the possibility of adding the inverse properties. What are your thoughts on this? Do you think it would have negative ramifications for things like random walk?

@bill-baumgartner - I kept the edge causally influences. I was unable to find anything in the RO for options like alters or has variant.

Any other errors or anything weird that needs editing?

I'd like to substitute the following BFO terms for the proposed RO terms, do you agree?:
- BFO:realizes to RO:realized in response to
- EXAMPLE: Biological Process realized in a pathway
- BFO:has component to RO:has function
- EXAMPLE: Pathway has function molecular function
- BFO:has part to RO:has component
- EXAMPLE: Pathway has component cellular location

ignaciot commented 4 years ago

Yup, that is the reason I chose molecularly interacts with (binding). Looking at the definition of that other relation (An interaction that holds between two genetic entities (genes, alleles) through some genetic interaction (e.g. epistasis)) I don't think that is appropriate for protein-protein interactions. It's probably fine for gene-gene interactions.

Maybe we could generate two graphs, one with the inverse relations where applicable and another without, to assess how much it could mess with random walk-based algorithms? I do suspect it may affect the node2vec results.

The rest looks fine to me (BTW, I like the new diagram!).

callahantiff commented 4 years ago

Yup, that is the reason I chose molecularly interacts with (binding). Looking at the definition of that other relation (An interaction that holds between two genetic entities (genes, alleles) through some genetic interaction (e.g. epistasis)) I don't think that is appropriate for protein-protein interactions. It's probably fine for gene-gene interactions.

OK, great. I also agree about the protein-protein interactions. Maybe we do this:

- gene-gene genetically interacts with
- protein-protein molecularly interacts with
- chemical-gene interacts with
- protein-cofactor/catalyst molecularly interacts with
- chemical-complex molecularly interacts with
- protein-complex molecularly interacts with
- complex-complex molecularly interacts with

Maybe we could generate two graphs, one with the inverse relations where applicable and another without, to assess how much it could mess with random walk-based algorithms? I do suspect it may affect the node2vec results.

Yep, that's what Bill and I were thinking too :D.

The rest looks fine to me (BTW, I like the new diagram!).

Great, thanks for your feedback!

ignaciot commented 4 years ago

Thanks, great! I think molecularly interacts with is fine for protein-protein interactions, as that interaction implies physical binding.

ignaciot commented 4 years ago

And for chemical-gene it is probably better to leave it as the more generic interacts with, as you suggested, because there are many possible ways for the chemical to affect expression (which is what we imply by this interaction).

LEHunter commented 4 years ago

I suggest getting Mike to weigh in on these. I would like to be consistent with what he is doing with CRAFT for relations.

On Nov 26, 2019, at 3:26 PM, Tiffany J. Callahan notifications@github.com wrote:

Yup, that is the reason I chose molecularly interacts with (binding). Looking at the definition of that other relation (An interaction that holds between two genetic entities (genes, alleles) through some genetic interaction (e.g. epistasis)) I don't think that is appropriate for protein-protein interactions. It's probably fine for gene-gene interactions.

OK, great. I also agree about the protein-protein interactions. Maybe we do this:

gene-gene genetically interacts with (https://www.ebi.ac.uk/ols/ontologies/ro/properties?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FRO_0002435)
protein-protein interacts with (https://www.ebi.ac.uk/ols/ontologies/ro/properties?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FRO_0002434)
chemical-gene interacts with (https://www.ebi.ac.uk/ols/ontologies/ro/properties?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FRO_0002434)
protein-cofactor/catalyst molecularly interacts with (https://www.ebi.ac.uk/ols/ontologies/ro/properties?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FRO_0002436)
chemical-complex molecularly interacts with (https://www.ebi.ac.uk/ols/ontologies/ro/properties?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FRO_0002436)
protein-complex molecularly interacts with (https://www.ebi.ac.uk/ols/ontologies/ro/properties?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FRO_0002436)
complex-complex molecularly interacts with (https://www.ebi.ac.uk/ols/ontologies/ro/properties?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FRO_0002436)

Maybe we could generate two graphs, one with the inverse relations where applicable and another without, to assess how much it could mess with random walk-based algorithms? I do suspect it may affect the node2vec results.

Yep, that's what Bill and I were thinking too :D.

The rest looks fine to me (BTW, I like the new diagram!).

Great, thanks for your feedback!


LEHunter commented 4 years ago

On Nov 26, 2019, at 12:38 PM, Tiffany J. Callahan notifications@github.com wrote:

@LEHunter - @bill-baumgartner and I were talking about the possibility of adding the inverse properties. What are your thoughts on this? Do you think it would have negative ramifications for things like random walk?

Inverse relations are a good idea. You can try random walk with and without them to see if it’s having any impact. Maybe also approaches like prohibiting the walk from traversing an edge twice.
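
A rough way to try this out is a uniform random walk with a switch that prevents reusing an edge. The sketch below is an assumption-laden illustration, not the node2vec procedure mentioned earlier: `build_adjacency` and `random_walk` are hypothetical names, the walk picks neighbours uniformly, and a forward edge and its inverse are treated as the same edge by keying on the unordered node pair (which also ignores the relation type).

```python
import random
from collections import defaultdict


def build_adjacency(edges):
    """Map each node to its outgoing (relation, neighbour) pairs."""
    adjacency = defaultdict(list)
    for subj, rel, obj in edges:
        adjacency[subj].append((rel, obj))
    return adjacency


def random_walk(adjacency, start, length, rng, allow_edge_reuse=True):
    """Uniform random walk of at most `length` steps.

    With allow_edge_reuse=False, an edge (and its inverse, since the key is
    the unordered node pair) is never traversed twice within one walk.
    """
    walk = [start]
    used = set()
    node = start
    for _ in range(length):
        options = [
            (rel, nxt)
            for rel, nxt in adjacency.get(node, [])
            if allow_edge_reuse or frozenset((node, nxt)) not in used
        ]
        if not options:
            break  # dead end: no untraversed edge leaves this node
        rel, nxt = rng.choice(options)
        used.add(frozenset((node, nxt)))
        walk.append(nxt)
        node = nxt
    return walk


# Toy graph: one forward edge plus its inverse (illustrative identifiers).
toy_edges = [("P1", "RO_0000056", "GO1"), ("GO1", "RO_0000057", "P1")]
toy_adjacency = build_adjacency(toy_edges)
```

On the toy graph, a walk that allows reuse bounces between the two nodes, while the restricted walk stops after one step. Comparing walks generated from a graph built with inverse edges against one built without them is the experiment proposed above.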

callahantiff commented 4 years ago

I suggest getting Mike to weigh in on these. I would like to be consistent with what he is doing with CRAFT for relations.

OK, I will reach out to Mike. Thanks @LEHunter!

callahantiff commented 4 years ago

@LEHunter - I'd like to substitute the following BFO terms for the proposed RO terms, do you agree?:

BFO:realizes to RO:realized in response to
- EXAMPLE: Biological Process realized in a pathway
BFO:has component to RO:has function
- EXAMPLE: Pathway has function molecular function
BFO:has part to RO:has component
- EXAMPLE: Pathway has component cellular location

LEHunter commented 4 years ago

Seems reasonable to me, but please do check with Mike. It’s important to be consistent with CRAFT. And Mike has thought a lot about this stuff

L

On Nov 26, 2019, at 5:53 PM, Tiffany J. Callahan notifications@github.com wrote:

@LEHunter - I'd like to substitute the following BFO terms for the proposed RO terms, do you agree?:

BFO:realizes (https://www.ebi.ac.uk/ols/ontologies/ro/properties?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FBFO_0000055) to RO:realized in response to (https://www.ebi.ac.uk/ols/ontologies/ro/properties?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FRO_0009501)
- EXAMPLE: Biological Process realized in a pathway
BFO:has component (https://www.ebi.ac.uk/ols/ontologies/ro/properties?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FBFO_0000055) to RO:has function (https://www.ebi.ac.uk/ols/ontologies/ro/properties?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FRO_0000085)
- EXAMPLE: Pathway has function molecular function
BFO:has part (https://www.ebi.ac.uk/ols/ontologies/ro/properties?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FBFO_0000051) to RO:has component (https://www.ebi.ac.uk/ols/ontologies/ro/properties?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FRO_0002180)
- EXAMPLE: Pathway has component cellular location


callahantiff commented 4 years ago

Seems reasonable to me, but please do check with Mike. It’s important to be consistent with CRAFT. And Mike has thought a lot about this stuff L

Sounds good, I will follow up with him. Thanks!

callahantiff commented 4 years ago

UPDATE: Mike has been emailed to ask about the BFO-RO and interaction triples. In the meantime, I am going to move forward with the representation shown below. Will also create a Bada-version 😉, once I hear back from him.

NOTE. For space reasons, I am not showing all edges with labels, but am suggesting there are inverse edges via the inclusion of a dotted line.

bill-baumgartner commented 4 years ago

@bill-baumgartner - I kept the edge causally influences. I was unable to find anything in the RO for options like alters or has variant.

Turns out the relation I was thinking of is in the Sequence Ontology and not the RO: https://www.ebi.ac.uk/ols/ontologies/so/properties?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2Fso%23variant_of

callahantiff commented 4 years ago

@bill-baumgartner - I kept the edge causally influences. I was unable to find anything in the RO for options like alters or has variant.

Turns out the relation I was thinking of is in the Sequence Ontology and not the RO: https://www.ebi.ac.uk/ols/ontologies/so/properties?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2Fso%23variant_of

Thanks for letting me know! In that case, unless you are opposed, I will keep causally influences.

callahantiff commented 4 years ago

@ignaciot - would you mind telling me how you found these?

protein-protein
protein-cofactor/catalyst (ChEBI)
protein-complex (reactome)
chemical (ChEBI)-complex
complex (reactome)-complex (reactome)
protein-pathway (reactome)
protein-reaction (reactome)

ignaciot commented 4 years ago

protein-protein

These came from Uniprot, and they are sourced from the IntAct database monthly (so it would be a good idea to keep those from STRING as well).

protein-cofactor/catalyst (ChEBI), protein-complex (reactome), chemical (ChEBI)-complex, complex (reactome)-complex (reactome), protein-pathway (reactome), protein-reaction (reactome)

All of those came from Reactome, where they specified the participants (from Uniprot or ChEBI) to pathways, complexes and reactions.

callahantiff commented 4 years ago

protein-protein

These came from Uniprot, and they are sourced from the IntAct database monthly (so it would be a good idea to keep those from STRING as well).

Thanks! Update on protein-protein interactions. It looks like we will only need STRING as they already cover the data in IntAct (see screenshot below) 🎉

protein-cofactor/catalyst (ChEBI), protein-complex (reactome), chemical (ChEBI)-complex, complex (reactome)-complex (reactome), protein-pathway (reactome), protein-reaction (reactome)

All of those came from Reactome, where they specified the participants (from Uniprot or ChEBI) to pathways, complexes and reactions.

Perfect! I will update the file parsers.

ignaciot commented 4 years ago

Awesome! One less thing to worry about, then.

callahantiff commented 4 years ago

@ignaciot - to build the gene - has_gene_product - protein triples, we need to identify all human protein coding genes and their gene products.

To get this, I'm using the human proteome (reviewed Swiss-Prot version) from Uniprot (here). Sound OK to you?

ignaciot commented 4 years ago

Yup! That makes sense.

callahantiff commented 4 years ago

@ignaciot - can you confirm that you agree with how I am building the protein-complex and complex-complex edges from the ComplexParticipantsPubMedIdentifiers_human.txt file:

FILE DATA

File Columns: [0] identifier; [1] name; [2] participants; [3] participatingComplex; [4] pubMedIdentifiers

Example File Output: NOTE. "|" substituted for ";" to build table in example below

identifier	name	participants	participatingComplex	pubMedIdentifiers
R-HSA-1006173	"CFH:Host cell surface [plasma membrane]"	uniprot:P08603;chebi:24505;chebi:28879	R-ALL-1006146	762425;16192651
R-HSA-1008206	"NF-E2:Promoter region of beta-globin [nucleoplasm]"	uniprot:Q16621;uniprot:Q9ULX9;uniprot:O15525;uniprot:O60675	R-HSA-1008229	8816476

BUILDING EDGES

Protein-Complex:

complexes are parsed from column [0]
proteins are parsed from column [2]

Example from file (from table above):

complex	protein
R-HSA-1006173	uniprot:P08603
R-HSA-1008206	uniprot:Q16621 uniprot:Q9ULX9 uniprot:O15525 uniprot:O60675

Complex-Complex:

complex_i parsed from column [0]
complex_j parsed from column [3]

Example from file (from table above):

complex	complex
R-HSA-1006173	R-ALL-1006146
R-HSA-1008206	R-HSA-1008229
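
The parsing rules described above can be sketched in a few lines. This is a hedged illustration rather than the actual PheKnowLator parser: the function name `parse_complex_participants` is hypothetical, and the code assumes the file is tab-delimited with a header row, that participants are separated by ";", and that column [3] may hold more than one complex separated by ";". Non-UniProt participants (for example, ChEBI identifiers) are skipped for the protein-complex edges, matching the example table above.

```python
import csv
import io


def parse_complex_participants(handle):
    """Build (complex, protein) and (complex_i, complex_j) pairs.

    Assumes a tab-delimited file with a header row and the column layout
    listed in this issue: [0] identifier, [2] participants,
    [3] participatingComplex.
    """
    protein_complex, complex_complex = [], []
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row
    for row in reader:
        if len(row) < 4:
            continue  # ignore malformed or empty lines
        complex_id = row[0]
        for participant in row[2].split(";"):
            if participant.startswith("uniprot:"):
                protein_complex.append((complex_id, participant))
        for other in row[3].split(";"):
            if other:
                complex_complex.append((complex_id, other))
    return protein_complex, complex_complex


# The two example rows from this issue, rebuilt as tab-delimited text.
SAMPLE = "\n".join([
    "\t".join(["identifier", "name", "participants",
               "participatingComplex", "pubMedIdentifiers"]),
    "\t".join(["R-HSA-1006173", '"CFH:Host cell surface [plasma membrane]"',
               "uniprot:P08603;chebi:24505;chebi:28879",
               "R-ALL-1006146", "762425;16192651"]),
    "\t".join(["R-HSA-1008206",
               '"NF-E2:Promoter region of beta-globin [nucleoplasm]"',
               "uniprot:Q16621;uniprot:Q9ULX9;uniprot:O15525;uniprot:O60675",
               "R-HSA-1008229", "8816476"]),
])

protein_complex, complex_complex = parse_complex_participants(io.StringIO(SAMPLE))
```

For a real run, the `io.StringIO(SAMPLE)` argument would be replaced by an open handle on ComplexParticipantsPubMedIdentifiers_human.txt.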

ignaciot commented 4 years ago

These all look correct!

callahantiff commented 4 years ago

Good news: the draft list of data sources for the edges and the documentation of those sources are complete. The edge counts will be updated and the files listed on the release page will be added as the KG is built.

@ignaciot - would you mind taking a gander at the following pages and letting me know if anything seems incorrect?

Release V2.0.0
- All sources here are hyperlinked to the page in the next bullet.
Release V2.0.0 Knowledge Graph Data Sources
- The build of all sources in the mapping and filtering section of this page are now fully automated in a Jupyter Notebook, which is in progress (pending completion of PRO filtering), but can be viewed here if you are curious.

What's Left before KG Build:

Creating Human PRO ➞ In progress, running BFS for human protein nodes as I type
Adding f() for:
- Generate RO inverse edges ➞ will complete tomorrow, just needs testing
- Label instances ➞ will complete tomorrow, just needs testing
- Option to build a normal or abnormal KG ➞ will complete tomorrow, just needs testing

Once I confirm a few last details with @bill-baumgartner tomorrow (who was super helpful today, thanks Bill!), I will begin the build!
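
For the inverse-edge step, one possible shape is a function that appends the inverse of every edge whose relation has a known inverse. This is a sketch under stated assumptions: `add_inverse_edges`, `INVERSE_RELATIONS`, and `SYMMETRIC_RELATIONS` are hypothetical names, and the inverse pairs and symmetry flags below are written from memory of the RO and should be verified against the ontology before use.

```python
# Assumed inverse pairs (verify against RO before relying on them).
INVERSE_RELATIONS = {
    "RO_0000056": "RO_0000057",  # participates in -> has participant
    "RO_0000085": "RO_0000079",  # has function -> function of
    "RO_0001025": "RO_0001015",  # located in -> location of
}

# Relations treated here as symmetric (also an assumption to verify).
SYMMETRIC_RELATIONS = {
    "RO_0002434",  # interacts with
    "RO_0002436",  # molecularly interacts with
}


def add_inverse_edges(edges):
    """Return the edges plus one inverse edge wherever an inverse is known.

    Edges are (subject, relation, object) tuples. Relations without a known
    inverse are left as they are, and no duplicate edges are added.
    """
    seen = set(edges)
    result = list(edges)
    for subj, rel, obj in edges:
        if rel in SYMMETRIC_RELATIONS:
            inverse = (obj, rel, subj)
        elif rel in INVERSE_RELATIONS:
            inverse = (obj, INVERSE_RELATIONS[rel], subj)
        else:
            continue
        if inverse not in seen:
            seen.add(inverse)
            result.append(inverse)
    return result
```

Because existing edges are tracked in `seen`, running the function on a graph that already contains an edge's inverse does not add it a second time, which keeps edge counts stable when comparing builds with and without inverses.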

ignaciot commented 4 years ago

This looks AWESOME!! I'm glad those Uniprot/Reactome triples turned out to be useful, can't wait to play with the built graph. I went through all of the above and didn't see any errors. Thanks for adding those inverse relations, too!

Happy to help check the items above that need testing. Maybe we could think of a set of unit tests to write for each build, too (can wait until the next subrelease).

callahantiff commented 4 years ago

This looks AWESOME!! I'm glad those Uniprot/Reactome triples turned out to be useful, can't wait to play with the built graph. I went through all of the above and didn't see any errors. Thanks for adding those inverse relations, too!

Happy to help check the items above that need testing. Maybe we could think of a set of unit tests to write for each build, too (can wait until the next subrelease).

Thanks so much for your help @ignaciot! I think writing tests is a great idea. I also think it would be great if we added continuous integration. Perhaps we can chat more about this on Monday?

callahantiff commented 4 years ago

@bill-baumgartner and @ignaciot - the human version of the PRO is finally done! 🎉

Things to keep in mind:

It was created by building a new graph from forward and reverse breadth-first searches over all human PRO classes. Human PRO classes were identified by querying the ontology for all classes with the restriction only_in_taxon some Homo sapiens (n=61,064 classes).
- @ignaciot - I found a way to do this such that it includes all of the things we discussed needed to be included (e.g. PR_000000019) 😄
The human version contains a single connected component
It includes pr:lacks_part and owl:disjointWith axioms. Given the articles I have read, I think this is totally acceptable for now. I will still remove owl:disjointWith axioms from the full KG before closing it, but until that point, I will keep the individual ontologies as complete as possible. This ensures the new human PRO is useful for others who might want to use it.
The human PRO has been deductively closed using HermiT. I compared the results to running ELK, and both reported that the ontology was consistent and produced the same set of inferred axioms (n=174).
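
The forward and reverse breadth-first search can be illustrated on a toy graph. The thread does not say exactly which edges were traversed, so this sketch assumes plain directed (child, parent) edges; the function names and the toy class names are hypothetical and do not come from PRO.

```python
from collections import defaultdict, deque


def bfs(start_nodes, neighbours):
    """Return every node reachable from start_nodes via the neighbours map."""
    seen = set(start_nodes)
    queue = deque(start_nodes)
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def human_subgraph(edges, human_classes):
    """Keep edges whose endpoints are reachable from the human classes,
    searching forward (towards parents) and in reverse (towards children)."""
    forward, reverse = defaultdict(set), defaultdict(set)
    for child, parent in edges:
        forward[child].add(parent)
        reverse[parent].add(child)
    keep = bfs(human_classes, forward) | bfs(human_classes, reverse)
    return [(c, p) for c, p in edges if c in keep and p in keep]


def is_single_component(edges):
    """Check that the edges form one connected component (ignoring direction)."""
    undirected = defaultdict(set)
    for a, b in edges:
        undirected[a].add(b)
        undirected[b].add(a)
    if not undirected:
        return True
    start = next(iter(undirected))
    return bfs([start], undirected) == set(undirected)


# Toy data: one human class, its ancestors, one descendant, and a
# non-human sibling that should be dropped.
toy_edges = [
    ("human_A", "generic_parent"),
    ("generic_parent", "root"),
    ("human_A_isoform", "human_A"),
    ("mouse_B", "generic_parent"),
]
subgraph = human_subgraph(toy_edges, {"human_A"})
```

In the toy example the non-human sibling is excluded because it is neither an ancestor nor a descendant of a human class, and the remaining edges still form a single connected component.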

If you'd like to use it, you can download the closed and unclosed versions here:

human_pro.owl
human_pro_closed.owl

@LEHunter and @bill-baumgartner - should we offer this version and/or the script used to create it to the ProConsortium?

📢Now that the core ontologies are good to go, I will begin building KG. Updates to follow!

ignaciot commented 4 years ago

@ignaciot - I found a way to do this such that it includes all of the things we discussed needed to be included (e.g. PR_000000019) :-)

Awesome!! This is really cool.

I take it the pr:lacks_part and owl:disjointWith had to be kept to keep it a single connected component? (my only concern would be the incorrect assumption of a relation when creating node embeddings)

callahantiff commented 4 years ago

@ignaciot - I found a way to do this such that it includes all of the things we discussed needed to be included (e.g. PR_000000019) :-)

Awesome!! This is really cool.

I take it the pr:lacks_part and owl:disjointWith had to be kept to keep it a single connected component? (my only concern would be the incorrect assumption of a relation when creating node embeddings)

Thanks! 😄 After reading a bunch of articles, I chose to leave both types in for now, as this is the way to create the most "correct"/authentic version of the human PRO. There is a way to keep a single connected component while removing each type. The owl:disjointWith axioms will continue to be removed prior to closing the graph (we have been doing that since the first release, and the reasoners ignore this axiom anyway), so they should not give you trouble with the embeddings. I believe the pr:lacks_part axioms should be OK. I read some of Hoehndorf's papers about the pr:lacks_part axioms, and given how they are constructed in PRO and CL, we should be fine. The real problem with including them arises when they are constructed the way PATO uses them. Either way, I'm happy to discuss this again once the KG is built 😄
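
As a small illustration of the disjointness step, triples whose predicate is owl:disjointWith can be filtered out before the graph is closed. This is a sketch over plain (subject, predicate, object) tuples, not the project's actual code; `drop_disjointness` is a hypothetical name, and the sketch only handles the pairwise owl:disjointWith predicate, not other ways of stating disjointness.

```python
OWL_DISJOINT_WITH = "http://www.w3.org/2002/07/owl#disjointWith"


def drop_disjointness(triples):
    """Return the triples with every owl:disjointWith statement removed."""
    return [t for t in triples if t[1] != OWL_DISJOINT_WITH]


# Toy triples with illustrative class names.
toy_triples = [
    ("class_A", "http://www.w3.org/2000/01/rdf-schema#subClassOf", "class_C"),
    ("class_A", OWL_DISJOINT_WITH, "class_B"),
]
filtered = drop_disjointness(toy_triples)
```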

ignaciot commented 4 years ago

Oh, absolutely, I don't think this is a reason to halt building the next KG version! I'm excited to try this out once it's built!

callahantiff commented 4 years ago

Oh, absolutely, I don't think this is a reason to halt building the next KG version! I'm excited to try this out once it's built!

Awesome! I’ll let you know when it’s ready for you!

callahantiff commented 4 years ago

Final KR for V2.0 builds are shown below.

Instance-Based Construction
PheKnowLator_v2 0 0_KnowledgeRepresentation_Instance

Subclass-Based Construction
PheKnowLator_v2 0 0_KnowledgeRepresentation_subclass

Closing this issue.

callahantiff / PheKnowLator