Add a loader for blazegraph

cmungall commented 6 years ago

cc @balhoff

yy20716 commented 6 years ago

Chris, if you don't mind, can you clarify what the loader is supposed to do here? In readme.MD, I saw that the internal graph model will be a networkx MultiDiGraph, so will this loader convert the graph to RDF and store it in blazegraph? If you could let us know your overall goals and workflow, it would be very helpful. Thank you.

cmungall commented 6 years ago

Yes, I was unclear. This should be broken into two parts.

The first is a direct translation from the networkx property graph to RDF, employing relevant CURIE expansion, and selecting a reification model (OBAN by default, but this could be configurable; e.g. RdR in blazegraph).
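A minimal sketch of that first part, assuming edges are plain dicts with `subject`, `predicate` and `object` CURIEs and that triples can be represented as tuples of IRI strings. The prefix map, the OBAN namespace IRI and the function names are illustrative assumptions, not the project's actual code; a real version would take its prefixes from a shared registry and emit rdflib terms.

```python
# Hypothetical prefix map; the OBAN namespace IRI in particular is an assumption.
PREFIXES = {
    "NCBIGene": "http://www.ncbi.nlm.nih.gov/gene/",
    "RO": "http://purl.obolibrary.org/obo/RO_",
    "OBAN": "http://purl.org/oban/",
    "MONARCH": "https://monarchinitiative.org/MONARCH_",
}
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def expand(curie):
    """Expand a CURIE to a full IRI; return the input unchanged if the prefix is unknown."""
    prefix, sep, local = curie.partition(":")
    if sep and prefix in PREFIXES:
        return PREFIXES[prefix] + local
    return curie


def edge_to_oban(assoc_curie, edge):
    """Reify one property-graph edge as an OBAN association (a list of triples)."""
    assoc = expand(assoc_curie)
    return [
        (assoc, RDF_TYPE, expand("OBAN:association")),
        (assoc, expand("OBAN:association_has_subject"), expand(edge["subject"])),
        (assoc, expand("OBAN:association_has_predicate"), expand(edge["predicate"])),
        (assoc, expand("OBAN:association_has_object"), expand(edge["object"])),
    ]
```

Keeping the reification step in one function like this is what would make the model swappable: another function with the same signature could emit a different reification instead.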

For the second part we don't actually need any code. We can just have a blazegraph conf file and do everything via docker, see: https://github.com/monarch-initiative/mondo/blob/749cd1104d6b195d3860a7cf439c44c5dc86732f/src/ontology/Makefile#L610-L638

yy20716 commented 6 years ago

Chris, before extending the test case (test_load in test_rdf.py), I ran it and saw that the internal graph object was empty because the parse() function is incomplete: the line calling self.load_edges(rdfgraph) was commented out. I re-ran the test after uncommenting that line and saw no errors. I just wonder whether there was any reason to comment out that line. I see some edges whose two nodes are identical (e.g., ('NCBIGene:6908', 'NCBIGene:6908')), which could be the reason, but any explanation would be helpful. Thank you.

cmungall commented 6 years ago

> wonder whether there were any reasons

probably not, this is mostly stub code to get things off the ground

yy20716 commented 6 years ago

Chris, thank you for your clarification. Let me ask another question, if you don't mind. Looking at the load_edges function in ObanRdfTransformer, I see that the dummy subject of each association is discarded rather than stored. This becomes a problem when we need to store the graph again, because the subject information is missing. For example,

<https://monarchinitiative.org/MONARCH_729e0868993f591188f8409a5eeaa64a70ec27b7> a OBAN:association ;
    OBO:RO_0002558 OBO:ECO_0000085 ;
    dc:source <http://www.ncbi.nlm.nih.gov/pubmed/11279055> ;
    OBAN:association_has_object <http://www.ncbi.nlm.nih.gov/gene/4591> ;
    OBAN:association_has_predicate OBO:RO_0002434 ;
    OBAN:association_has_subject <http://www.ncbi.nlm.nih.gov/gene/4591> .

is represented as an item of an adjacency list, i.e.

NCBIGene:4591 {0: {'predicate': 'RO:0002434', rdflib.term.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'): 'OBAN:association', 'subject': 'NCBIGene:4591', 'provided_by': 'tests/resources/monarch/biogrid_test.ttl', 'object': 'NCBIGene:4591', rdflib.term.URIRef('http://purl.obolibrary.org/obo/RO_0002558'): 'ECO:0000085', rdflib.term.URIRef('http://purl.org/dc/elements/1.1/source'): 'PMID:11279055'}}

and as you can see, the item does not include https://monarchinitiative.org/MONARCH_729e0868993f591188f8409a5eeaa64a70ec27b7.

So I wonder what we should do. Which would be the better solution? I guess we could (i) modify the load_edge function to include the dummy subject as part of the metadata, so that it can be recovered, or (ii) generate a new dummy subject as a random string with https://monarchinitiative.org/MONARCH as a prefix. Any suggestions would be appreciated.

cmungall commented 6 years ago

Yes, preserve as id. See also #21.
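A rough sketch of what preserving the association node as the edge id could look like on the loading side, assuming triples arrive as plain tuples of IRI strings. The `load_edges` below is a simplified stand-in rather than the actual ObanRdfTransformer method, and the OBAN namespace IRI is an assumption.

```python
OBAN = "http://purl.org/oban/"  # assumed namespace IRI for the OBAN vocabulary
HAS_SUBJECT = OBAN + "association_has_subject"
HAS_PREDICATE = OBAN + "association_has_predicate"
HAS_OBJECT = OBAN + "association_has_object"


def load_edges(triples):
    """Collect OBAN associations from (subject, predicate, object) triples.

    Returns one dict per association and keeps the association node as 'id',
    so the same node can be reused when the graph is written back out.
    """
    # Group property values by the node they are attached to.
    by_node = {}
    for s, p, o in triples:
        by_node.setdefault(s, {})[p] = o

    edges = []
    for node, props in by_node.items():
        # Skip nodes that are not complete associations.
        if not all(k in props for k in (HAS_SUBJECT, HAS_PREDICATE, HAS_OBJECT)):
            continue
        edges.append({
            "id": node,  # the former "dummy subject", now preserved
            "subject": props[HAS_SUBJECT],
            "predicate": props[HAS_PREDICATE],
            "object": props[HAS_OBJECT],
        })
    return edges
```

Contracting the stored IRI back to a CURIE is left out here; that step depends on which prefix map the project settles on.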

yy20716 commented 6 years ago

Chris, it seems that my patch may need additional work. When I checked the spec doc, I saw that

id [required]: MUST be a CURIE, MUST use translator-mandated prefix

I hadn't considered this requirement, so I tried to fix it, but I got stuck because I am not sure what "translator-mandated prefix" means (Eric actually asked the same question in the doc). Could you please let me know where I can find these prefixes? For now, do I just need to use the ones in prefixcommons (the one used for curie_util.py)? Thank you.
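For illustration only, the two requirements on id could be checked along these lines. The set of allowed prefixes is a placeholder, since the actual translator-mandated list is exactly what is unknown here, and `check_id` is a hypothetical helper name.

```python
import re

# Loose CURIE shape: a prefix, a colon, and a non-empty local part.
CURIE_RE = re.compile(r"^([A-Za-z_][A-Za-z0-9_.-]*):(\S+)$")

# Placeholder only: the real translator-mandated prefixes are not known here.
ALLOWED_PREFIXES = {"NCBIGene", "MONARCH"}


def check_id(value, allowed=ALLOWED_PREFIXES):
    """Return None if value is a CURIE with an allowed prefix, else a short reason."""
    match = CURIE_RE.match(value)
    # Full IRIs such as "https://..." also fit the loose pattern, so reject them here.
    if match is None or match.group(2).startswith("//"):
        return "not a CURIE"
    if match.group(1) not in allowed:
        return "unknown prefix"
    return None
```

With such a check, an association id stored as a full monarchinitiative.org IRI would be flagged as not a CURIE and would have to be contracted first.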

biolink / kgx

Add a loader for blazegraph #12