Closed dhimmel closed 3 years ago
Behind the UI, we use Virtuoso, the open-source version. As you've seen, it is really a quadstore, so that it stores tuples of the form <graph, subject, property, object>. The graph with IRI http://id.nlm.nih.gov/mesh/vocab stores the vocabulary itself, which can then be used as the RDFS ruleset for the other graphs. You can tease out the graphs by adding that to your query:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT *
WHERE {
GRAPH ?g {
?s rdfs:subClassOf ?p .
}
}
Does this answer your questions? I'm really glad you are benefiting from it - it does not get as much manual traffic as you might think, even though there is a lot of API usage of the system.
> it is really a quadstore, so that it stores tuples of the form <graph, subject, property, object>
I see! I was initially looking at https://hhs.github.io/meshrdf/descriptors and I assumed all visualized nodes were from the same graph.
So if I want to query a MeSH release with SPARQL, but serve the database locally, I would need to load both of these files from the ftp site?
ftp://ftp.nlm.nih.gov/online/mesh/rdf/2020/vocabulary_1.0.0.ttl
ftp://ftp.nlm.nih.gov/online/mesh/rdf/2020/mesh2020.nt.gz
Would it be okay to load both of these files into a single rdflib merged graph? My goal is to write queries that can access the rdfs:subClassOf relationships as well as the MeSH data.
> You can tease out the graphs by adding that to your query
Ah good to know. Pasting the results from that query below as a reference:
g | s | p |
---|---|---|
http://www.w3.org/ns/ldp# | http://www.w3.org/ns/ldp#DirectContainer | http://www.w3.org/ns/ldp#Container |
http://www.w3.org/ns/ldp# | http://www.w3.org/ns/ldp#BasicContainer | http://www.w3.org/ns/ldp#Container |
http://www.w3.org/ns/ldp# | http://www.w3.org/ns/ldp#IndirectContainer | http://www.w3.org/ns/ldp#Container |
mesh:vocab | meshv:Concept | owl:Thing |
mesh:vocab | meshv:SCR_Chemical | meshv:SupplementaryConceptRecord |
mesh:vocab | meshv:SCR_Disease | meshv:SupplementaryConceptRecord |
mesh:vocab | meshv:TreeNumber | owl:Thing |
mesh:vocab | meshv:SCR_Organism | meshv:SupplementaryConceptRecord |
mesh:vocab | meshv:AllowedDescriptorQualifierPair | meshv:DescriptorQualifierPair |
mesh:vocab | meshv:DisallowedDescriptorQualifierPair | meshv:DescriptorQualifierPair |
mesh:vocab | meshv:GeographicalDescriptor | meshv:Descriptor |
mesh:vocab | meshv:PublicationType | meshv:Descriptor |
mesh:vocab | meshv:TopicalDescriptor | meshv:Descriptor |
mesh:vocab | meshv:CheckTag | meshv:Descriptor |
mesh:vocab | meshv:SCR_Protocol | meshv:SupplementaryConceptRecord |
mesh:vocab | owl:Thing | owl:Thing |
mesh:vocab | meshv:Descriptor | owl:Thing |
mesh:vocab | meshv:DescriptorQualifierPair | owl:Thing |
mesh:vocab | meshv:SupplementaryConceptRecord | owl:Thing |
mesh:vocab | meshv:Qualifier | owl:Thing |
mesh:vocab | meshv:Term | owl:Thing |
> Does this answer your questions? I'm really glad you are benefiting from it
Thanks! My current goal is to load MeSH into a Python networkx directed graph (using nxontology). Basically, I want a single directed acyclic graph of concepts. I'm thinking that means I want to add meshv:Descriptor and meshv:SupplementaryConceptRecord records as nodes. Feel free to point me to any complementary resources or efforts.
You can certainly do that. How you make use of the vocabulary depends a lot on how your triple store does inference, and on your research need for inference, e.g. whether you need it at all. I've used rdflib for little things, but never for the full model, and so I don't feel like I am the expert to tell you what to do.
I can however expand a bit on inference. Inference makes a property statement such as "?d a meshv:Descriptor" work. Without it, you must be very explicit, maybe using SPARQL UNION queries. So, in general, you can always rewrite queries to get around a lack of inference in a bespoke system, but it limits things if you are for instance implementing a question answering system.
Different triple stores do inference differently. Virtuoso uses separate graphs as a set of rules (and only does RDFS inference). Oracle Spatial and Graph calculates an "entailment", which is the full set of inferred triples; those are then loaded into another graph, and you define a union graph with some sort of aliasing. A quick web search finds https://github.com/RDFLib/OWL-RL, which does limited OWL inferencing as well as RDFS inferencing. So, that would be enough, but I'm not sure whether this is the leading way to do inferencing with rdflib, or whether you need inferencing.
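To make "entailment" concrete, here is a minimal pure-Python sketch of the two RDFS rules that matter for class membership: rdfs:subClassOf is transitive (rdfs11), and an instance of a class is also an instance of its superclasses (rdfs9). It assumes triples are plain 3-tuples of prefixed names; `ex:record1` is a made-up record, and a real triple store or OWL-RL would do considerably more than this.

```python
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"

def rdfs_type_entailment(triples):
    """Return the rdf:type triples implied by rdfs:subClassOf but not asserted."""
    triples = set(triples)

    # Map each class to its direct superclasses, skipping reflexive
    # statements such as owl:Thing rdfs:subClassOf owl:Thing.
    parents = {}
    for s, p, o in triples:
        if p == SUBCLASS and s != o:
            parents.setdefault(s, set()).add(o)

    def ancestors(cls):
        # Walk upward through the hierarchy (rdfs11, transitivity).
        seen, stack = set(), [cls]
        while stack:
            for parent in parents.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    # Propagate each asserted type to every superclass (rdfs9).
    inferred = set()
    for s, p, o in triples:
        if p == TYPE:
            for ancestor in ancestors(o):
                inferred.add((s, TYPE, ancestor))
    return inferred - triples

triples = [
    ("meshv:TopicalDescriptor", SUBCLASS, "meshv:Descriptor"),
    ("meshv:Descriptor", SUBCLASS, "owl:Thing"),
    ("ex:record1", TYPE, "meshv:TopicalDescriptor"),  # hypothetical record
]
inferred = rdfs_type_entailment(triples)
```

With the inferred triples added to the data, a plain "?d a meshv:Descriptor" pattern would match `ex:record1` without any inference support in the store.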
Since you explicitly want to calculate the extra nodes you need before taking the data into a DAG system such as networkx, you can ignore the vocabulary file and create your own "entailment", adding the triples you need by doing something like this:
PREFIX meshv: <http://id.nlm.nih.gov/mesh/vocab#>
SELECT ?d FROM <http://id.nlm.nih.gov/mesh>
WHERE {
{ ?d a meshv:TopicalDescriptor }
UNION { ?d a meshv:GeographicalDescriptor }
UNION { ?d a meshv:PublicationType }
UNION { ?d a meshv:CheckTag }
}
Use the results to generate the new nodes you need and insert them into your graph. You can do a similar thing with other relationships you need.
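The UNION query above can also be generated instead of written by hand. The sketch below is one possible approach, not an established recipe: it takes rdfs:subClassOf pairs copied from the result table earlier in this thread, collects every subclass of a root class, and emits one UNION branch per subclass. The helper names `descendants` and `union_query` are made up for illustration.

```python
# (child, parent) pairs copied from the rdfs:subClassOf table in this thread.
SUBCLASS_PAIRS = [
    ("meshv:SCR_Chemical", "meshv:SupplementaryConceptRecord"),
    ("meshv:SCR_Disease", "meshv:SupplementaryConceptRecord"),
    ("meshv:SCR_Organism", "meshv:SupplementaryConceptRecord"),
    ("meshv:SCR_Protocol", "meshv:SupplementaryConceptRecord"),
    ("meshv:GeographicalDescriptor", "meshv:Descriptor"),
    ("meshv:PublicationType", "meshv:Descriptor"),
    ("meshv:TopicalDescriptor", "meshv:Descriptor"),
    ("meshv:CheckTag", "meshv:Descriptor"),
    ("meshv:Descriptor", "owl:Thing"),
    ("meshv:SupplementaryConceptRecord", "owl:Thing"),
    ("owl:Thing", "owl:Thing"),
]

def descendants(root, subclass_pairs):
    """Return every direct or indirect subclass of `root`."""
    children = {}
    for child, parent in subclass_pairs:
        if child != parent:  # ignore reflexive statements
            children.setdefault(parent, []).append(child)
    found, stack = [], [root]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in found:
                found.append(child)
                stack.append(child)
    return found

def union_query(var, classes, graph="http://id.nlm.nih.gov/mesh"):
    """Build a SELECT with one UNION branch per class."""
    branches = "\n  UNION ".join(f"{{ ?{var} a {cls} }}" for cls in classes)
    return (
        "PREFIX meshv: <http://id.nlm.nih.gov/mesh/vocab#>\n"
        f"SELECT ?{var} FROM <{graph}>\n"
        "WHERE {\n"
        f"  {branches}\n"
        "}"
    )

classes = descendants("meshv:Descriptor", SUBCLASS_PAIRS)
query = union_query("d", classes)
```

Calling `descendants("meshv:SupplementaryConceptRecord", SUBCLASS_PAIRS)` would give the matching branch list for supplementary concept records.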
I caution that networkx will certainly scale to MeSH RDF, but if you are thinking of adding something bigger such as PubChem RDF or SNOMED CT, you may want to think about a DAG system such as neo4j. Using a system like that will give you hosting options if you are going beyond research to a production system.
One more comment - the reason we have our own vocabulary rather than using something like OWL is that MeSH cannot be properly represented as a DAG. You should find the motivating paper by Olivier Bodenreider before proceeding to "flatten" it into a DAG. It may of course work for a specific purpose, but our goal is to fully represent MeSH in RDF without loss of semantic richness.
I misspoke in my previous comment. MeSH RDF cannot be represented as a tree, but it should be possible to represent it as a DAG.
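The tree-versus-DAG distinction can be checked mechanically once parent links have been extracted. The following is a small sketch using only the Python standard library (`graphlib`, Python 3.9+). It assumes the hierarchy is held as a dict mapping each node to the set of its direct parents, for example harvested from a "broader" relation; the node names are placeholders, not real MeSH identifiers.

```python
from graphlib import CycleError, TopologicalSorter

def classify_hierarchy(parents):
    """Classify a {node: set_of_direct_parents} mapping.

    Returns "cyclic" if the links contain a cycle, "dag" if they are acyclic
    but some node has more than one parent, and "forest" if they are acyclic
    and every node has at most one parent.
    """
    try:
        # Consuming static_order() raises CycleError when a cycle exists.
        list(TopologicalSorter(parents).static_order())
    except CycleError:
        return "cyclic"
    if any(len(direct) > 1 for direct in parents.values()):
        return "dag"
    return "forest"

# Placeholder hierarchy: "child" sits under two different parents,
# so it cannot be drawn as a tree but is still acyclic.
example = {
    "child": {"parentA", "parentB"},
    "parentA": {"root"},
    "parentB": {"root"},
}
kind = classify_hierarchy(example)
```

A result of "dag" here is the situation described above: a single node reachable through several parents rules out a tree without ruling out a DAG.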
> you can ignore the vocabulary file and create your own "entailment"
This is probably the easiest solution, since we can list all the classes we're interested in. Then there are a few ways to structure the SPARQL query.
We really only need two queries: one for nodes and one for relationships. But rdflib is struggling here: it runs indefinitely on queries that https://id.nlm.nih.gov/mesh/query answers within seconds.
So it might be nice to query a more performant database. You mentioned Virtuoso and neo4j. My main goals are SPARQL support and ease-of-setup. I like neo4j, but it probably isn't the right tool as it's not a native triplestore. I'd also be fine running our queries on the NLM Virtuoso instance, but I couldn't figure out how to access the full results when there were over 1000 results: see #150.
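One generic way around a per-request row cap is to page through results with the standard SPARQL LIMIT and OFFSET modifiers. The sketch below only illustrates that paging loop. Whether the NLM endpoint honors an in-query OFFSET past its cap is an assumption that has not been verified here, and `run_query` is a hypothetical callable that sends a query string to some endpoint and returns a list of rows. Paging is only stable if the query has an ORDER BY.

```python
import re

def fetch_all(query, run_query, page_size=1000):
    """Collect every row by re-running `query` with an increasing OFFSET."""
    rows, offset = [], 0
    while True:
        page = run_query(f"{query}\nLIMIT {page_size} OFFSET {offset}")
        rows.extend(page)
        if len(page) < page_size:  # a short (or empty) page means we are done
            return rows
        offset += page_size

def fake_endpoint(data):
    """Stand-in for a real endpoint: slices `data` according to LIMIT/OFFSET."""
    def run_query(sparql):
        match = re.search(r"LIMIT (\d+) OFFSET (\d+)", sparql)
        limit, offset = int(match.group(1)), int(match.group(2))
        return data[offset:offset + limit]
    return run_query

# ORDER BY keeps page boundaries deterministic between requests.
QUERY = "SELECT ?d WHERE { ?d a meshv:TopicalDescriptor } ORDER BY ?d"
rows = fetch_all(QUERY, fake_endpoint(list(range(2500))))
```

In real use, `fake_endpoint(...)` would be swapped for a function that performs the HTTP request and parses the response.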
> You should find the motivating paper by Olivier Bodenreider
Okay, the following papers look relevant. Will review:
Desiderata for an authoritative Representation of MeSH in RDF
Rainer Winnenburg, Olivier Bodenreider
AMIA (2014-11-14) https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4419968/
PMID: 25954433 · PMCID: PMC4419968
Transforming the Medical Subject Headings into Linked Data: Creating the Authorized Version of MeSH in RDF
Barbara Bushman, David Anderson, Gang Fu
Journal of library metadata (2015) https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4749162/
DOI: 10.1080/19386389.2015.1099967 · PMID: 26877832 · PMCID: PMC4749162
> Would it be okay to load both of these files into a single rdflib merged graph?
I loaded vocabulary_1.0.0.ttl into rdflib and was able to access the rdfs:subClassOf relationships. Here's a graph of all the rdfs:subClassOf relationships in the mesh vocab:
Also available as SVG at https://bit.ly/36W5up9.
I am going to close this issue since my original question has been answered. But happy to continue discussion on my subsequent questions.
Very cool - when they ask why we need "the software architect" maintaining this software, I may point to this discussion and ask whether they'd rather have a "principal investigator" from the group that does the science. Feel free to open an issue just to report back how it worked out.
This query returns results (online explorer):
This query returns no results (online explorer):
The difference being that the latter query specifies FROM <http://id.nlm.nih.gov/mesh>. Using FROM <http://id.nlm.nih.gov/mesh/2020> also returns no results.
The original query run via rdflib after loading ftp://ftp.nlm.nih.gov/online/mesh/rdf/2020/mesh2020.nt also returns no results.
I think this is the same issue as https://github.com/HHS/meshrdf/issues/65, but it wasn't clear to me why this is or how to get rdfs:subClassOf relationships.
Thanks for the help... am new to accessing MeSH via SPARQL / RDF.