Implement import from Monarch Neo4J

cmungall commented 6 years ago

Advice: @kshefchek

aka "monarch lite transform"

will assign @deepakunni3 @yy20716

cmungall commented 6 years ago

We will also want to eliminate pseudo-blank nodes

cmungall commented 6 years ago

Discussed on call with @kshefchek @mbrush @putmantime @deepakunni3 @yy20716:

We will go via an intermediary format, but need to be conscious of I/O and file sizes.
We may eventually do link simplification centrally, but for now we will duplicate it.

cmungall commented 6 years ago

Scenario: a reasoner team wants to bring in monarch d2p edges into their graph, together with properties about the diseases.

They run the kgx export command-line tool (possibly via Docker) and specify subjcat=disease, objcat=phenotype, and the desired output format (GraphML, CSV, RDF, or Neo4j dump format). They can then load the output directly, or possibly massage it first.
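
A minimal sketch of the filtering step in this scenario, assuming edges are available as plain dictionaries. The field names, the placeholder `EX:` identifiers, and the functions `filter_edges` and `to_csv` are illustrative assumptions, not the kgx command-line interface or its schema.

```python
import csv
import io

# Hypothetical edge records with placeholder identifiers.
EDGES = [
    {"subject": "EX:disease1", "subject_category": "disease",
     "predicate": "RO:0002200",
     "object": "EX:phenotype1", "object_category": "phenotype"},
    {"subject": "EX:gene1", "subject_category": "gene",
     "predicate": "RO:0002326",
     "object": "EX:disease1", "object_category": "disease"},
]


def filter_edges(edges, subject_category, object_category):
    """Keep only edges whose endpoints have the requested categories."""
    return [
        e for e in edges
        if e["subject_category"] == subject_category
        and e["object_category"] == object_category
    ]


def to_csv(edges):
    """Serialize the subject/predicate/object columns as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=["subject", "predicate", "object"],
        extrasaction="ignore",   # drop the category columns
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(edges)
    return buf.getvalue()


d2p = filter_edges(EDGES, "disease", "phenotype")
print(to_csv(d2p))
```

The other output formats mentioned (GraphML, RDF, Neo4j dump) would each need their own serializer; CSV is shown only because it needs nothing beyond the standard library.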

cmungall commented 6 years ago

I note the monarch-lite made before the last hackathon has equiv edges

we want to avoid this here (the same info can be put in as node properties)

kshefchek commented 6 years ago

I've made some progress cleaning up the scigraph.ncats.io graph

Created clique arrays as node properties, e.g. someNode.clique = [foo,bar]
Removed eq|sameAs edges and nodes
Added inheritance labels, e.g. diseaseNode.inheritance = "Autosomal Recessive"
Added frequency and age of onset IRIs and labels (frequency, frequency_label, onset, onset_label)
Added publications and evidence codes as node properties (reified edges still there), e.g. node.sources = [], node.evidence = []
Added inferred edges for gene to disease and some gene to phenotype (human), aggregating source and evidence lists in the process
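
One way the clique step could be reproduced outside the graph is sketched below: equivalence (eq|sameAs) edges are merged with a small union-find, and each node receives the sorted members of its clique as a property value. The function name `build_cliques`, the input shapes, and the choice to include the node itself in its own clique are assumptions made for illustration; the actual cleanup was done on the scigraph.ncats.io graph and may differ.

```python
def build_cliques(node_ids, equivalence_edges):
    """Map each node ID to the sorted list of IDs in its equivalence clique.

    node_ids: iterable of node identifiers
    equivalence_edges: iterable of (a, b) pairs joined by eq/sameAs
    """
    parent = {n: n for n in node_ids}

    def find(n):
        # Follow parent pointers to the clique representative,
        # shortening the path as we go.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in equivalence_edges:
        # Tolerate endpoints that were not listed in node_ids.
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    members = {}
    for n in parent:
        members.setdefault(find(n), []).append(n)
    return {n: sorted(members[find(n)]) for n in parent}


cliques = build_cliques(["foo", "bar", "baz"], [("foo", "bar")])
print(cliques["foo"])  # the value that would be stored as someNode.clique
```

Once the clique lists are stored on the nodes, the equivalence edges carry no extra information and can be dropped, which matches the intent described above.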

For example to get gene to disease:

(:gene)-[edge:`http://purl.obolibrary.org/obo/RO_0002326`]->(:disease)

gene to phenotype

(:gene)-[edge:`http://purl.obolibrary.org/obo/RO_0002200`]->(:phenotype)

The inferences for human G2P are more liberal than what we index in Solr for Monarch. There's also mouse and zebrafish data.

disease to phenotype:

(:disease)-[edge:`http://purl.obolibrary.org/obo/RO_0002200`]->(:phenotype)
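
The three patterns above differ only in their labels and predicate IRI, so a small helper can generate the full queries. This is a sketch: the function name `match_query`, the variable names `s` and `o`, and the `RETURN` clause are additions for illustration, and the helper does no escaping, so it assumes trusted label and IRI strings.

```python
HAS_PHENOTYPE = "http://purl.obolibrary.org/obo/RO_0002200"
GENE_TO_DISEASE = "http://purl.obolibrary.org/obo/RO_0002326"


def match_query(subject_label, predicate_iri, object_label):
    """Build a Cypher MATCH for one edge pattern.

    The backticks let Cypher accept a full IRI as a relationship type.
    """
    return (
        f"MATCH (s:{subject_label})-[edge:`{predicate_iri}`]->(o:{object_label}) "
        "RETURN s, edge, o"
    )


for subject, predicate, obj in [
    ("gene", GENE_TO_DISEASE, "disease"),
    ("gene", HAS_PHENOTYPE, "phenotype"),
    ("disease", HAS_PHENOTYPE, "phenotype"),
]:
    print(match_query(subject, predicate, obj))
```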

cmungall commented 6 years ago

Great

Remember the spec has snake case for props


kshefchek commented 6 years ago

All new props are snake case. These are all the new ones: clique, frequency, frequency_label, onset, onset_label, sources, evidence, inheritance

kshefchek commented 6 years ago

After chatting with @putmantime, I realized I'd forgotten the edge property isDefinedBy, which is generated at load time by SciGraph. I'll rerun and make these changes:

sources - source of the data (ontology, rdf), replaces isDefinedBy
publications - literature references, replaces what I was calling "sources"
evidence - ECO codes, no change from current
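
The rename has one subtlety: "sources" is both an old name (becoming publications) and a new name (replacing isDefinedBy), so the two renames must not overwrite each other. A minimal sketch, assuming properties are held in a plain dictionary; `RENAMES`, `rename_properties`, and the placeholder values are hypothetical.

```python
# Old property name -> new property name, per the plan above.
RENAMES = {
    "sources": "publications",   # literature references
    "isDefinedBy": "sources",    # source of the data (ontology, rdf)
}


def rename_properties(props):
    """Return a new dict with the renames applied in a single pass.

    Building a fresh dict avoids the old "sources" value being
    overwritten by the value moved over from isDefinedBy.
    """
    return {RENAMES.get(key, key): value for key, value in props.items()}


old = {
    "isDefinedBy": ["http://example.org/source.ttl"],  # placeholder
    "sources": ["EX:publication1"],                    # placeholder
    "evidence": ["EX:evidence1"],                      # placeholder
}
print(rename_properties(old))
```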

cmungall commented 6 years ago

Added inheritance labels, e.g. diseaseNode.inheritance = "Autosomal Recessive"

@mbrush we should add this as a node property under disease in the model

cmungall commented 6 years ago

how is this coming along?

looking at http://neo4j.monarchinitiative.org/

seems we are

lacking names
lacking CURIE IDs
only have G2P2D?

kshefchek commented 6 years ago

lacking names

every node should have a name property, unless the rdfs:label was null, can you give me an example?

lacking CURIE IDs

I won't be able to add curie IDs with the current approach

only have G2P2D?

This is what I was able to do as a first pass, but we can add more before the hackathon

cmungall commented 6 years ago

On 8 May 2018, at 17:33, Kent Shefchek wrote:

lacking names every node should have a name property, unless the rdfs:label was null, can you give me an example?

see screenshot

lacking CURIE IDs I won't be able to add curie IDs with the current approach

Hmm, should we explore going the original route of querying into memory or files and transforming those?

only have G2P2D? This is what I was able to do as a first pass, but we can add more before the hackathon

ok!
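
If the transform were done in memory or on files, as suggested in this comment, CURIE IDs could be derived from the IRIs during the transform. The sketch below handles the OBO PURL convention seen in the edge IRIs earlier in the thread (`.../obo/RO_0002200` becomes `RO:0002200`) and falls back to a prefix map. The `PREFIX_MAP` entry and the function name `contract` are hypothetical; a real prefix map would come from the project's own configuration.

```python
OBO_BASE = "http://purl.obolibrary.org/obo/"

# Hypothetical extra namespace, for illustration only.
PREFIX_MAP = {
    "http://example.org/gene/": "EXGENE",
}


def contract(iri):
    """Contract an IRI to a CURIE, or return it unchanged if unknown."""
    if iri.startswith(OBO_BASE):
        local = iri[len(OBO_BASE):]
        # OBO local names look like PREFIX_LOCALID.
        prefix, sep, local_id = local.partition("_")
        if sep and local_id:
            return f"{prefix}:{local_id}"
    for base, prefix in PREFIX_MAP.items():
        if iri.startswith(base):
            return f"{prefix}:{iri[len(base):]}"
    return iri


print(contract("http://purl.obolibrary.org/obo/RO_0002200"))
```

Returning unknown IRIs unchanged keeps the transform lossless; a stricter variant could raise an error instead so that unmapped namespaces are noticed early.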


kshefchek commented 6 years ago

That node looks okay to me:

[screenshot: node]

deepakunni3 commented 5 years ago

Keeping this issue open until we can have KGX read directly from Monarch and create a BioLink compliant Monarch KG.

TomConlin commented 4 years ago

@deepakunni3 Could this ticket please get some breadcrumbs dropped to help trace the artifacts leading to its resolution?

sierra-moxon commented 2 years ago

Closing for now, as I believe there is an SRI Reference KG in KGEA. It is also going through some revisions and refactoring alongside the Dipper refactor.

https://github.com/Knowledge-Graph-Hub/sri-reference-kg
https://archive.translator.ncats.io/
