dhimmel commented 3 years ago

This query returns results (online explorer):

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT *
WHERE {
  ?s rdfs:subClassOf ?p .
}

This query returns no results (online explorer):

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT *
FROM <http://id.nlm.nih.gov/mesh>
WHERE {
  ?s rdfs:subClassOf ?p .
}

The difference is that the latter query specifies FROM <http://id.nlm.nih.gov/mesh>. Using FROM <http://id.nlm.nih.gov/mesh/2020> also returns no results.

The original query run via rdflib after loading ftp://ftp.nlm.nih.gov/online/mesh/rdf/2020/mesh2020.nt also returns no results.

I think this is the same issue as https://github.com/HHS/meshrdf/issues/65, but it wasn't clear to me why this is or how to get rdfs:subClassOf relationships.

Thanks for the help... am new to accessing MeSH via SPARQL / RDF.

danizen commented 3 years ago

Behind the UI, we use Virtuoso, the open-source version. As you've seen, it is really a quadstore: it stores tuples of the form <graph, subject, property, object>. The graph with IRI http://id.nlm.nih.gov/mesh/vocab stores the vocabulary itself, which can then be used as the RDFS ruleset for the other graphs. You can tease out the graphs by adding a GRAPH clause to your query:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT *
WHERE {
  GRAPH ?g {  
    ?s rdfs:subClassOf ?p .
  }
}
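To illustrate why adding FROM hides these triples, here is a minimal sketch of a quad store in plain Python. It is a toy model, not how Virtuoso works internally: the `match` helper, the prefixed names, and the two data rows are assumptions made up for illustration. The point is only that rdfs:subClassOf statements live in the vocabulary graph, so restricting the query to the data graph finds none.

```python
# Toy quad store: each entry is (graph, subject, predicate, object).
# Names are abbreviated with prefixes; the rows are illustrative only.
SUBCLASS = "rdfs:subClassOf"

quads = [
    ("mesh:vocab", "meshv:TopicalDescriptor", SUBCLASS, "meshv:Descriptor"),
    ("mesh:vocab", "meshv:Descriptor", SUBCLASS, "owl:Thing"),
    ("mesh:", "mesh:D000001", "rdf:type", "meshv:TopicalDescriptor"),
    ("mesh:", "mesh:D000001", "rdfs:label", "Calcimycin"),
]


def match(quads, predicate, graph=None):
    """Return (graph, subject, object) for quads with the given predicate.

    graph=None searches every graph, roughly like GRAPH ?g { ... }.
    Passing a graph name restricts the search, roughly like FROM <graph>.
    """
    return [
        (g, s, o)
        for g, s, p, o in quads
        if p == predicate and (graph is None or g == graph)
    ]


everywhere = match(quads, SUBCLASS)                      # both vocabulary rows
data_only = match(quads, SUBCLASS, graph="mesh:")        # empty list
vocab_only = match(quads, SUBCLASS, graph="mesh:vocab")  # both vocabulary rows

print(len(everywhere), len(data_only), len(vocab_only))
```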

Does this answer your questions? I'm really glad you are benefiting from it - it does not get as much manual traffic as you might think, even though there is a lot of API usage of the system.

dhimmel commented 3 years ago

it is really a quadstore, so that it stores tuples of the form <graph, subject, property, object>

I see! I was initially looking at https://hhs.github.io/meshrdf/descriptors and I assumed all visualized nodes were from the same graph.

So if I want to query a MeSH release with SPARQL, but serve the database locally, would I need to load both of these files from the FTP site?

ftp://ftp.nlm.nih.gov/online/mesh/rdf/2020/vocabulary_1.0.0.ttl
ftp://ftp.nlm.nih.gov/online/mesh/rdf/2020/mesh2020.nt.gz

Would it be okay to load both of these files into a single rdflib merged graph? My goal is to write queries that can access the rdfs:subClassOf relationships as well as the MeSH data.

You can tease out the graphs by adding that to your query

Ah good to know. Pasting the results from that query below as a reference:

g	s	p
http://www.w3.org/ns/ldp#	http://www.w3.org/ns/ldp#DirectContainer	http://www.w3.org/ns/ldp#Container
http://www.w3.org/ns/ldp#	http://www.w3.org/ns/ldp#BasicContainer	http://www.w3.org/ns/ldp#Container
http://www.w3.org/ns/ldp#	http://www.w3.org/ns/ldp#IndirectContainer	http://www.w3.org/ns/ldp#Container
mesh:vocab	meshv:Concept	owl:Thing
mesh:vocab	meshv:SCR_Chemical	meshv:SupplementaryConceptRecord
mesh:vocab	meshv:SCR_Disease	meshv:SupplementaryConceptRecord
mesh:vocab	meshv:TreeNumber	owl:Thing
mesh:vocab	meshv:SCR_Organism	meshv:SupplementaryConceptRecord
mesh:vocab	meshv:AllowedDescriptorQualifierPair	meshv:DescriptorQualifierPair
mesh:vocab	meshv:DisallowedDescriptorQualifierPair	meshv:DescriptorQualifierPair
mesh:vocab	meshv:GeographicalDescriptor	meshv:Descriptor
mesh:vocab	meshv:PublicationType	meshv:Descriptor
mesh:vocab	meshv:TopicalDescriptor	meshv:Descriptor
mesh:vocab	meshv:CheckTag	meshv:Descriptor
mesh:vocab	meshv:SCR_Protocol	meshv:SupplementaryConceptRecord
mesh:vocab	owl:Thing	owl:Thing
mesh:vocab	meshv:Descriptor	owl:Thing
mesh:vocab	meshv:DescriptorQualifierPair	owl:Thing
mesh:vocab	meshv:SupplementaryConceptRecord	owl:Thing
mesh:vocab	meshv:Qualifier	owl:Thing
mesh:vocab	meshv:Term	owl:Thing

Does this answer your questions? I'm really glad you are benefiting from it

Thanks! My current goal is to load MeSH into a Python networkx directed graph (using nxontology). Basically, I want a single directed acyclic graph of concepts. I'm thinking that means I want to add meshv:Descriptor and meshv:SupplementaryConceptRecord records as nodes. Feel free to point me to any complementary resources or efforts.

danizen commented 3 years ago

You can certainly do that. How you make use of the vocabulary depends a lot on how your triple store does inference, and on whether your research needs inference at all. I've used rdflib for little things, but never for the full model, so I don't feel like I am the expert to tell you what to do.

I can, however, expand a bit on inference. Inference makes a property statement such as "?d a meshv:Descriptor" work. Without it, you must be very explicit, maybe using SPARQL UNION queries. So, in general, you can always rewrite queries to get around a lack of inference in a bespoke system, but it limits things if you are, for instance, implementing a question answering system.

Different triple stores do inference differently. Virtuoso uses separate graphs as a set of rules (and only does RDFS inference). Oracle Spatial and Graph calculates an "entailment", which is the full set of inferred triples; those are then loaded into another graph, and you define a union graph with some sort of aliasing. A quick web search finds https://github.com/RDFLib/OWL-RL, which does limited OWL inferencing as well as RDFS inferencing. So, that would be enough, but I'm not sure whether this is the leading way to do inferencing with rdflib, or whether you need inferencing.
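As a rough illustration of what an "entailment" means here, the following sketch computes the rdf:type statements implied by rdfs:subClassOf in plain Python. It is not how Virtuoso, Oracle, or OWL-RL implement inference; the function names and the tiny data set are assumptions for illustration. It shows why "?d a meshv:Descriptor" matches nothing until the inferred triples are added.

```python
def superclasses(cls, subclass_pairs):
    """All classes reachable from cls via rdfs:subClassOf, excluding cls itself."""
    found = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for child, parent in subclass_pairs:
            if child == current and parent != cls and parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def entail_types(type_pairs, subclass_pairs):
    """Return the asserted (subject, class) pairs plus those implied by subclassing."""
    inferred = set(type_pairs)
    for subject, cls in type_pairs:
        for parent in superclasses(cls, subclass_pairs):
            inferred.add((subject, parent))
    return inferred


# Illustrative data: a slice of the vocabulary and one asserted type.
subclass_pairs = [
    ("meshv:TopicalDescriptor", "meshv:Descriptor"),
    ("meshv:Descriptor", "owl:Thing"),
    ("owl:Thing", "owl:Thing"),  # self-loop, as in the vocabulary graph
]
asserted = {("mesh:D000001", "meshv:TopicalDescriptor")}

inferred = entail_types(asserted, subclass_pairs)
without_inference = sorted(s for s, c in asserted if c == "meshv:Descriptor")
with_inference = sorted(s for s, c in inferred if c == "meshv:Descriptor")
print(without_inference, with_inference)
```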

danizen commented 3 years ago

Since you explicitly want to calculate the extra nodes you need in order to take the data into a DAG system such as networkx, you can ignore the vocabulary file and create your own "entailment", adding the triples you need by doing something like this:

PREFIX meshv: <http://id.nlm.nih.gov/mesh/vocab#>
SELECT ?d
FROM <http://id.nlm.nih.gov/mesh>
WHERE {
  { ?d a meshv:TopicalDescriptor } 
  UNION { ?d a meshv:GeographicalDescriptor }
  UNION { ?d a meshv:PublicationType }
  UNION { ?d a meshv:CheckTag }
}

Use the results to generate the new nodes you need and insert them into your graph. You can do a similar thing with other relationships you need.

I caution that networkx will certainly scale to MeSH RDF, but if you are thinking of adding something bigger such as PubChem RDF or SNOMED CT, you may want to think about a DAG system such as neo4j. Using a system like that will give you hosting options if you are going beyond research to a production system.

danizen commented 3 years ago

One more comment - the reason we have our own vocabulary rather than using something like OWL is that MeSH cannot be properly represented as a DAG. You should find the motivating paper by Olivier Bodenreider before proceeding to "flatten" it into a DAG. It may of course work for a specific purpose, but our goal is to fully represent MeSH in RDF without loss of semantic richness.

danizen commented 3 years ago

I misspoke in my previous comment. MeSH RDF cannot be represented as a tree, but should be able to be represented as a DAG.
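The tree versus DAG distinction can be checked mechanically. Below is a minimal sketch using Python's standard `graphlib` module (Python 3.9+). The parent map is a small hand-written example loosely modeled on a descriptor that sits under two branches; treat the names and the helper functions as illustrative assumptions, not as MeSH data.

```python
from graphlib import CycleError, TopologicalSorter

# Map each node to the set of its parents (illustrative placeholder names).
parents = {
    "Eye": {"Face", "Sense Organs"},  # two parents: fine in a DAG, not in a tree
    "Face": {"Head"},
    "Head": {"Body Regions"},
    "Sense Organs": set(),
    "Body Regions": set(),
}


def is_dag(parents):
    """True if the parent map contains no directed cycle."""
    try:
        # static_order() raises CycleError if no topological order exists.
        tuple(TopologicalSorter(parents).static_order())
    except CycleError:
        return False
    return True


def is_forest(parents):
    """True if the map is acyclic and every node has at most one parent."""
    return is_dag(parents) and all(len(p) <= 1 for p in parents.values())


print(is_dag(parents), is_forest(parents))
```

A node with several parents keeps the structure acyclic but rules out a tree, which is the situation described above.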

dhimmel commented 3 years ago

you can ignore the vocabulary file and create your own "entailment"

This is probably the easiest solution, since we can list all the classes we're interested in. Then there are a few ways to structure the SPARQL query.
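One way to structure such a query is a SPARQL 1.1 VALUES clause in place of a chain of UNION blocks. The sketch below only builds the query text; `build_class_query` is a hypothetical helper, and it assumes the target store supports VALUES and that class names come from a trusted list.

```python
MESHV = "http://id.nlm.nih.gov/mesh/vocab#"


def build_class_query(class_names, graph="http://id.nlm.nih.gov/mesh"):
    """Build a query selecting every resource typed as one of the listed classes."""
    values = " ".join(f"meshv:{name}" for name in class_names)
    return (
        f"PREFIX meshv: <{MESHV}>\n"
        "SELECT ?d ?class\n"
        f"FROM <{graph}>\n"
        "WHERE {\n"
        f"  VALUES ?class {{ {values} }}\n"
        "  ?d a ?class .\n"
        "}\n"
    )


descriptor_classes = [
    "TopicalDescriptor",
    "GeographicalDescriptor",
    "PublicationType",
    "CheckTag",
]
query = build_class_query(descriptor_classes)
print(query)
```

Returning ?class alongside ?d keeps the specific subclass of each node, which can then be stored as a node attribute.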

We really only need two queries: one for nodes and one for relationships. But rdflib is struggling here, running indefinitely on queries for which https://id.nlm.nih.gov/mesh/query returns results within seconds.

So it might be nice to query a more performant database. You mentioned Virtuoso and neo4j. My main goals are SPARQL support and ease-of-setup. I like neo4j, but it probably isn't the right tool as it's not a native triplestore. I'd also be fine running our queries on the NLM Virtuoso instance, but I couldn't figure out how to access the full results when there were over 1000 results: see #150.
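A common workaround for a per-request row cap is to page through results with the standard SPARQL LIMIT and OFFSET modifiers. The sketch below only generates the paged query strings. Whether the NLM endpoint honors OFFSET inside the query text is an assumption that would need checking, and `paged_queries` is a hypothetical helper. The base query should include ORDER BY so that pages do not overlap.

```python
def paged_queries(query, page_size=1000, max_pages=3):
    """Yield copies of query with LIMIT/OFFSET appended, one per page."""
    for page in range(max_pages):
        yield f"{query.rstrip()}\nLIMIT {page_size}\nOFFSET {page * page_size}"


base_query = """\
PREFIX meshv: <http://id.nlm.nih.gov/mesh/vocab#>
SELECT ?d
FROM <http://id.nlm.nih.gov/mesh>
WHERE { ?d a meshv:TopicalDescriptor }
ORDER BY ?d
"""

pages = list(paged_queries(base_query))
print(pages[1])
```

In practice the loop would stop as soon as a page returns fewer than `page_size` rows rather than after a fixed number of pages.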

You should find the motivating paper by Olivier Bodenreider

Okay, the following papers look relevant. Will review:

Desiderata for an authoritative Representation of MeSH in RDF
Rainer Winnenburg, Olivier Bodenreider
AMIA (2014-11-14) https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4419968/
PMID: 25954433 · PMCID: PMC4419968

Transforming the Medical Subject Headings into Linked Data: Creating the Authorized Version of MeSH in RDF
Barbara Bushman, David Anderson, Gang Fu
Journal of library metadata (2015) https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4749162/
DOI: 10.1080/19386389.2015.1099967 · PMID: 26877832 · PMCID: PMC4749162

dhimmel commented 3 years ago

rdfs:subClassOf graph

Would it be okay to load both of these files into a single rdflib merged graph?

I loaded vocabulary_1.0.0.ttl into rdflib and was able to access the rdfs:subClassOf relationships. Here's a graph of all the rdfs:subClassOf relationships in the mesh vocab:

mesh-subclassof

Also available as SVG at https://bit.ly/36W5up9.

python source & output graphviz dot

## python source

```python
import pandas as pd
import fsspec
import rdflib
import networkx as nx
from networkx.drawing.nx_pydot import write_dot


def sparql_results_to_df(results):
    """Convert rdflib SPARQL results to a pandas DataFrame.

    This helper was not shown in the original comment; this body is an
    assumed implementation.
    """
    return pd.DataFrame(
        data=[[None if x is None else x.toPython() for x in row] for row in results],
        columns=[str(x) for x in results.vars],
    )


rdf = rdflib.Graph()
# load MeSH vocabulary
url = "ftp://ftp.nlm.nih.gov/online/mesh/rdf/2020/vocabulary_1.0.0.ttl"
with fsspec.open(url, "rt") as src:
    # https://github.com/HHS/meshrdf/issues/153
    rdf.parse(source=src, format="n3")

query = '''
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?subject_suffix ?object_suffix
WHERE {
  ?subject rdfs:subClassOf ?object .
  BIND( STRAFTER(STR(?subject), "#") AS ?subject_suffix)
  BIND( STRAFTER(STR(?object), "#") AS ?object_suffix)
}
ORDER BY ?subject_suffix ?object_suffix
'''
results = rdf.query(query)
subclass_df = sparql_results_to_df(results)
subclass_df.head(2)

graph = nx.DiGraph()
for row in subclass_df.itertuples():
    # edges point from superclass to subclass
    graph.add_edge(row.object_suffix, row.subject_suffix)
write_dot(graph, "mesh-subclassof.dot")
```

## graphviz source

```dot
# Medical Subject Headings (MeSH) Vocabulary rdfs:subClassOf graph
digraph {
DescriptorQualifierPair;
AllowedDescriptorQualifierPair;
Descriptor;
CheckTag;
Thing;
Concept;
DisallowedDescriptorQualifierPair;
GeographicalDescriptor;
PublicationType;
Qualifier;
SupplementaryConceptRecord;
SCR_Chemical;
SCR_Disease;
SCR_Organism;
SCR_Protocol;
Term;
TopicalDescriptor;
TreeNumber;
DescriptorQualifierPair -> AllowedDescriptorQualifierPair;
DescriptorQualifierPair -> DisallowedDescriptorQualifierPair;
Descriptor -> CheckTag;
Descriptor -> GeographicalDescriptor;
Descriptor -> PublicationType;
Descriptor -> TopicalDescriptor;
Thing -> Concept;
Thing -> Descriptor;
Thing -> DescriptorQualifierPair;
Thing -> Qualifier;
Thing -> SupplementaryConceptRecord;
Thing -> Term;
Thing -> Thing;
Thing -> TreeNumber;
SupplementaryConceptRecord -> SCR_Chemical;
SupplementaryConceptRecord -> SCR_Disease;
SupplementaryConceptRecord -> SCR_Organism;
SupplementaryConceptRecord -> SCR_Protocol;
}
```

I am going to close this issue since my original question has been answered. But happy to continue discussion on my subsequent questions.

danizen commented 3 years ago

Very cool - when they ask why we need "the software architect" maintaining this software, I may point to this discussion and ask whether they'd rather have a "principal investigator" from the group that does the science. Feel free to open an issue just to report back how it worked out.

HHS / meshrdf

rdfs:subClassOf relationships missing from MeSH RDF #153
