ad-freiburg / qlever

Very fast SPARQL Engine, which can handle very large knowledge graphs like the complete Wikidata, offers context-sensitive autocompletion for SPARQL queries, and allows combination with text search. It's faster than engines like Blazegraph or Virtuoso, especially for queries involving large result sets.
Apache License 2.0

Using QLever for the PubChem RDF data #711

Open donpellegrino opened 2 years ago

donpellegrino commented 2 years ago

When indexing is performed with IndexBuilderMain by streaming RDF serialized in Turtle, indexing will fail if the input has @prefix directives after triples.

Example input: https://ftp.ncbi.nlm.nih.gov/pubchem/RDF/bioassay/pc_bioassay.ttl.gz

The first three lines are @prefix directives. Line 1,465,988 has another @prefix directive. The prefixes are declared before they are referenced, but they are not all at the top of the file.

A workaround is to grep all the input files, aggregate all the @prefix lines, and then ensure they are passed into IndexBuilderMain before any triples.
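The workaround can be sketched as a small shell function. This is only a minimal sketch: the function name `collect_prefixes` is made up for illustration, it only recognises `@prefix` at the very start of a line, and it relies on `zcat -f` from GNU gzip.

```shell
# Print every distinct @prefix line found in the given Turtle files.
# With GNU gzip, `zcat -f` also passes uncompressed files through unchanged.
collect_prefixes() {
    zcat -f "$@" |
        grep -E '^@prefix[[:space:]]' |
        # Collapse runs of blanks so that the same declaration written
        # with different spacing is only emitted once.
        sed -E 's/[[:space:]]+/ /g; s/ $//' |
        sort -u
}
```

For example, `collect_prefixes *.ttl.gz > all-prefixes.ttl` writes the aggregated declarations to a file (the file name is arbitrary), which can then be fed to the indexer ahead of the data files.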

An improvement would be to handle the @prefix statements as they occur, eliminating the manual preprocessing necessary for the workaround.

Additionally, the current error message could be modified to make instances of this case clear to the user.

hannahbast commented 2 years ago

Yes, and the problem is that the input stream is parsed in parallel, so that it can happen that a triple is parsed before the respective prefix definition has been parsed. This is on our TODO list and will be fixed soon. In the meantime, you can simply do something like this:

ulimit -Sn 1048576; zcat -f pubchem.prefix-definitions $(ls *.ttl.gz) | IndexBuilderMain -f - -i pubchem -s pubchem.settings.json

I am using this pubchem.settings.json (reduce the batch size if you run out of memory):

{ "ascii-prefixes-only": false, "num-triples-per-batch": 10000000 }

Here are the contents of pubchem.prefix-definitions (which I computed automatically from the input files, but there is no need to do that again for every index build or index build trial):

@prefix bao:    <http://www.bioassayontology.org/bao#> .
@prefix bioassay:   <http://rdf.ncbi.nlm.nih.gov/pubchem/bioassay/> .
@prefix bp: <http://www.biopax.org/release/biopax-level3.owl#> .
@prefix chemblchembl:   <http://linkedchemistry.info/chembl/chemblid/> .
@prefix chembl: <http://rdf.ebi.ac.uk/resource/chembl/molecule/> .
@prefix cito:   <http://purl.org/spar/cito/> .
@prefix compound:   <http://rdf.ncbi.nlm.nih.gov/pubchem/compound/> .
@prefix concept:    <http://rdf.ncbi.nlm.nih.gov/pubchem/concept/> .
@prefix conserveddomain:    <http://rdf.ncbi.nlm.nih.gov/pubchem/conserveddomain/> .
@prefix dcterms:    <http://purl.org/dc/terms/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix descriptor: <http://rdf.ncbi.nlm.nih.gov/pubchem/descriptor/> .
@prefix disease:    <http://rdf.ncbi.nlm.nih.gov/pubchem/disease/> .
@prefix endpoint:   <http://rdf.ncbi.nlm.nih.gov/pubchem/endpoint/> .
@prefix ensembl:    <http://rdf.ebi.ac.uk/resource/ensembl/> .
@prefix fabio:  <http://purl.org/spar/fabio/> .
@prefix foaf:   <http://xmlns.com/foaf/0.1/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix freq: <http://purl.org/cld/freq/> .
@prefix gene:   <http://rdf.ncbi.nlm.nih.gov/pubchem/gene/> .
@prefix : <http://rdf.ncbi.nlm.nih.gov/pubchem/void.ttl#> .
@prefix inchikey:   <http://rdf.ncbi.nlm.nih.gov/pubchem/inchikey/> .
@prefix measuregroup:   <http://rdf.ncbi.nlm.nih.gov/pubchem/measuregroup/> .
@prefix mesh:   <http://id.nlm.nih.gov/mesh/> .
@prefix nci:    <http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#> .
@prefix ns0:    <http://data.epo.org/linked-data/def/patent/> .
@prefix ns100:  <http://www.mycomagic.com.tw/> .
@prefix ns101:  <https://www.nature.com/> .
@prefix ns102:  <http://www.gene.com/scientists/our-scientists/> .
@prefix ns103:  <https://www.inserm.fr/> .
@prefix ns104:  <https://www.ed.ac.uk/> .
@prefix ns105:  <https://www.expasy.org/resources/> .
@prefix ns106:  <https://lincsproject.org/LINCS/data/> .
@prefix ns107:  <http://lincsportal.ccs.miami.edu/dcic-portal/#/> .
@prefix ns108:  <https://www.hiv.gov/about-us/> .
@prefix ns109:  <https://www.marinespecies.org/about.php#> .
@prefix ns10:   <https://www.novartis.com/our-science/> .
@prefix ns110:  <https://ecmdb.ca/> .
@prefix ns111:  <https://www.alliancegenome.org/> .
@prefix ns112:  <http://www.hmdb.ca/> .
@prefix ns113:  <https://commonfund.nih.gov/molecularlibraries/> .
@prefix ns114:  <https://www.thesgc.org/> .
@prefix ns115:  <https://www.tcichemicals.com/US/> .
@prefix ns116:  <https://horizondiscovery.com/products/custom-synthesis/> .
@prefix ns117:  <https://www.guidetopharmacology.org/about.jsp#> .
@prefix ns118:  <https://www.nist.gov/srd/> .
@prefix ns11:   <https://www.petermac.org/research/core-facilities/> .
@prefix ns12:   <http://www.petermac.org/research/enabling-research/> .
@prefix ns13:   <http://cmld.ku.edu/> .
@prefix ns14:   <http://identifiers.org/wikipathways/> .
@prefix ns14:   <https://www.ucl.ac.uk/wolfson-institute-biomedical-research/> .
@prefix ns15:   <https://biochem.unl.edu/> .
@prefix ns16:   <http://plantreactome.gramene.org/content/detail/> .
@prefix ns16:   <https://www.vttresearch.com/> .
@prefix ns17:   <http://pathbank.org/view/> .
@prefix ns17:   <https://www.epa.gov/> .
@prefix ns18:   <https://www.fda.gov/industry/structured-product-labeling-resources/> .
@prefix ns18:   <https://www.pharmgkb.org/pathway/> .
@prefix ns19:   <http://identifiers.org/biocyc/HUMANCYC:> .
@prefix ns19:   <https://www.fda.gov/drugs/drug-approvals-and-databases/> .
@prefix ns1:    <http://data.epo.org/linked-data/def/patent/> .
@prefix ns1:    <http://rdf.ncbi.nlm.nih.gov/pubchem/patentassignee/> .
@prefix ns1:    <http://rdf.ncbi.nlm.nih.gov/pubchem/patentinventor/> .
@prefix ns1:    <http://rdf.ncbi.nlm.nih.gov/pubchem/taxonomy/> .
@prefix ns20:   <http://identifiers.org/biocyc/METACYC:> .
@prefix ns20:   <https://www.ncbi.nlm.nih.gov/> .
@prefix ns21:   <http://identifiers.org/biocyc/ECOCYC:> .
@prefix ns21:   <https://en.wikipedia.org/wiki/> .
@prefix ns22:   <http://identifiers.org/biocyc/YEASTCYC:> .
@prefix ns22:   <https://www.niaid.nih.gov/research/> .
@prefix ns23:   <http://identifiers.org/biocyc/ARACYC:> .
@prefix ns23:   <https://www.cancer.gov/policies/> .
@prefix ns24:   <http://identifiers.org/biocyc/LEISHCYC:> .
@prefix ns24:   <https://www.drugbank.ca/legal/> .
@prefix ns25:   <http://identifiers.org/biocyc/TRYPANOCYC:> .
@prefix ns25:   <https://www.epa.gov/privacy/> .
@prefix ns26:   <http://identifiers.org/biocyc/VCHOCYC:> .
@prefix ns26:   <https://www.fda.gov/about-fda/about-website/website-policies#> .
@prefix ns27:   <http://identifiers.org/biocyc/SHIGELLACYC:> .
@prefix ns27:   <https://www.dol.gov/general/aboutdol/> .
@prefix ns28:   <http://identifiers.org/biocyc/PLASMOCYC:> .
@prefix ns28:   <https://echa.europa.eu/web/guest/> .
@prefix ns29:   <http://identifiers.org/biocyc/MTBH37RVCYC:> .
@prefix ns2:    <http://data.epo.org/linked-data/def/cpc/> .
@prefix ns2:    <http://data.epo.org/linked-data/def/ipc/> .
@prefix ns2:    <http://data.epo.org/linked-data/def/patent/> .
@prefix ns2:    <http://purl.bioontology.org/ontology/NDFRT/> .
@prefix ns2:    <http://rdf.ncbi.nlm.nih.gov/pubchem/patentassignee/> .
@prefix ns2:    <http://rdf.ncbi.nlm.nih.gov/pubchem/patentinventor/> .
@prefix ns30:   <http://identifiers.org/biocyc/MTBCDC1551CYC:> .
@prefix ns30:   <https://www.targetmol.com/> .
@prefix ns31:   <http://identifiers.org/biocyc/HPYCYC:> .
@prefix ns31:   <https://uk.linkedin.com/in/> .
@prefix ns32:   <http://biochemistry.med.uky.edu/users/> .
@prefix ns32:   <http://identifiers.org/biocyc/ECOO157CYC:> .
@prefix ns33:   <http://biomed.skku.edu/> .
@prefix ns33:   <http://identifiers.org/biocyc/ECOL199310CYC:> .
@prefix ns34:   <http://identifiers.org/biocyc/CAULOCYC:> .
@prefix ns34:   <https://unidirectory.auckland.ac.nz/profile/> .
@prefix ns35:   <http://identifiers.org/biocyc/ANTHRACYC:> .
@prefix ns35:   <https://www.cancer.gov/about-cancer/treatment/drugs/> .
@prefix ns36:   <http://identifiers.org/biocyc/MOUSECYC:> .
@prefix ns36:   <https://www.ams.usda.gov/datasets/> .
@prefix ns37:   <http://identifiers.org/biocyc/SMANCYC:> .
@prefix ns37:   <https://www.usgs.gov/centers/> .
@prefix ns38:   <http://identifiers.org/biocyc/SCOCYC:> .
@prefix ns38:   <https://www.phmsa.dot.gov/hazmat/erg/> .
@prefix ns39:   <http://identifiers.org/biocyc/BSUBCYC:> .
@prefix ns39:   <https://www.ed.ac.uk/cancer-centre/research/> .
@prefix ns40:   <http://identifiers.org/biocyc/FLYCYC:> .
@prefix ns40:   <https://ncats.nih.gov/> .
@prefix ns41:   <http://identifiers.org/biocyc/PCHRCYC:> .
@prefix ns41:   <https://www.nist.gov/> .
@prefix ns42:   <http://identifiers.org/biocyc/SYNELCYC:> .
@prefix ns42:   <https://www.usgs.gov/centers/nmic/> .
@prefix ns43:   <http://identifiers.org/biocyc/ECOL316407CYC:> .
@prefix ns43:   <https://www.wikipathways.org/index.php/> .
@prefix ns44:   <http://identifiers.org/biocyc/ECOL413997CYC:> .
@prefix ns44:   <https://b-u.ac.in/19/> .
@prefix ns45:   <http://identifiers.org/biocyc/CAULONA1000CYC:> .
@prefix ns45:   <https://www.cancer.gov/about-cancer/treatment/> .
@prefix ns46:   <http://identifiers.org/biocyc/MOB3BCYC:> .
@prefix ns46:   <https://www.hopaxfc.com/> .
@prefix ns47:   <http://identifiers.org/biocyc/10403SCYC:> .
@prefix ns47:   <https://www.grapeking.com.tw/en/rd/> .
@prefix ns48:   <http://identifiers.org/biocyc/THAPSCYC:> .
@prefix ns48:   <https://medicine.iu.edu/research-centers/> .
@prefix ns49:   <http://identifiers.org/biocyc/CLOSSACCYC:> .
@prefix ns49:   <https://www.who.int/about/policies/publishing/> .
@prefix ns4:    <http://identifiers.org/glytoucan/> .
@prefix ns4:    <http://purl.bioontology.org/ontology/NDFRT/> .
@prefix ns4:    <http://rdf.ncbi.nlm.nih.gov/pubchem/taxonomy/> .
@prefix ns4:    <https://www.lsi.umich.edu/science/centers-technologies/> .
@prefix ns50:   <http://identifiers.org/biocyc/CALBICYC:> .
@prefix ns50:   <https://www.usgs.gov/> .
@prefix ns51:   <http://identifiers.org/biocyc/BTHECYC:> .
@prefix ns51:   <https://www.uniprot.org/help/> .
@prefix ns52:   <http://identifiers.org/biocyc/ERECCYC:> .
@prefix ns52:   <https://omim.org/help/> .
@prefix ns53:   <http://identifiers.org/biocyc/CORYNECYC:> .
@prefix ns53:   <https://clinicaltrials.gov/ct2/about-site/terms-conditions#> .
@prefix ns54:   <http://identifiers.org/biocyc/DESPIGERCYC:> .
@prefix ns54:   <https://bard.nih.gov/BARD/about/> .
@prefix ns55:   <http://identifiers.org/biocyc/MSMI420247CYC:> .
@prefix ns55:   <https://www.jlab.org/> .
@prefix ns56:   <https://www.nist.gov/pml/> .
@prefix ns57:   <http://www.dgidb.org/> .
@prefix ns58:   <https://www.rcsb.org/pages/> .
@prefix ns59:   <http://pfam.xfam.org/> .
@prefix ns5:    <http://rdf.ncbi.nlm.nih.gov/pubchem/cell/> .
@prefix ns5:    <http://rdf.ncbi.nlm.nih.gov/pubchem/reaction/> .
@prefix ns5:    <http://rdf.ncbi.nlm.nih.gov/pubchem/taxonomy/> .
@prefix ns5:    <https://www.mpi-dortmund.mpg.de/> .
@prefix ns60:   <https://scholars.uab.edu/display/> .
@prefix ns61:   <https://www.epa.gov/risk/> .
@prefix ns62:   <https://www.ugent.be/we/orgchem/nmr-structure-analysis/> .
@prefix ns63:   <http://www.cvmbs.colostate.edu/DirectorySearch/Search/MemberProfile/cvmbs/3484/Belisle/> .
@prefix ns64:   <https://www.ibch.ru/en/structure/groups/> .
@prefix ns65:   <https://www.ema.europa.eu/> .
@prefix ns66:   <https://www.cancer.umn.edu/for-researchers/shared-resources/> .
@prefix ns67:   <https://www.dea.gov/> .
@prefix ns68:   <https://www.researchgate.net/profile/> .
@prefix ns69:   <https://med.nyu.edu/faculty/> .
@prefix ns6:    <https://www.catalogueoflife.org/data/taxon/> .
@prefix ns6:    <https://www.niddk.nih.gov/research-funding/research-programs/> .
@prefix ns70:   <https://dehs.umn.edu/department-environmental-health-safety/> .
@prefix ns71:   <http://chemical.milliken.com/categories/> .
@prefix ns72:   <https://wwwen.uni.lu/lcsb/research/> .
@prefix ns73:   <https://www.wikidata.org/wiki/Wikidata:> .
@prefix ns74:   <https://icahn.mssm.edu/profiles/> .
@prefix ns75:   <https://medicine.temple.edu/> .
@prefix ns76:   <http://www.hanmipharm.com/ehanmi/handler/> .
@prefix ns77:   <https://www.funakoshi.co.jp/> .
@prefix ns78:   <https://lnv293.wixsite.com/> .
@prefix ns79:   <https://mona.fiehnlab.ucdavis.edu/documentation/> .
@prefix ns7:    <http://rdf.ncbi.nlm.nih.gov/pubchem/taxonomy/> .
@prefix ns7:    <https://www.epa.gov/chemical-research/> .
@prefix ns80:   <https://www.ema.europa.eu/en/about-us/> .
@prefix ns81:   <https://reactome.org/> .
@prefix ns82:   <http://pathbank.org/> .
@prefix ns83:   <https://plantcyc.org/downloads/> .
@prefix ns84:   <https://www.pharmgkb.org/page/> .
@prefix ns85:   <https://www.justice.gov/> .
@prefix ns86:   <https://www.rhea-db.org/help/> .
@prefix ns87:   <https://www.npatlas.org/> .
@prefix ns88:   <https://glycosmos.org/> .
@prefix ns89:   <https://creativecommons.org/licenses/by/4.0/> .
@prefix ns8:    <http://rdf.ncbi.nlm.nih.gov/pubchem/taxonomy/> .
@prefix ns8:    <http://www.thermofisher.com/> .
@prefix ns90:   <https://publications.iarc.fr/> .
@prefix ns91:   <https://b-u.ac.in/users/> .
@prefix ns92:   <https://case.edu/medicine/pathology/faculty/> .
@prefix ns93:   <https://www.fda.gov/science-research/liver-toxicity-knowledge-base-ltkb/> .
@prefix ns94:   <https://www.who.int/groups/expert-committee-on-selection-and-use-of-essential-medicines/> .
@prefix ns95:   <https://abdn.pure.elsevier.com/en/equipments/> .
@prefix ns96:   <https://medicine.iu.edu/faculty-labs/> .
@prefix ns97:   <https://www.biosustain.dtu.dk/Research/Application-Areas/> .
@prefix ns98:   <https://www.fda.gov/animal-veterinary/products/> .
@prefix ns99:   <https://clinicalinfo.hiv.gov/en/> .
@prefix ns9:    <http://www.kahedu.edu.in/academic/faculty-of-arts-science-and-humanities/> .
@prefix obo:    <http://purl.obolibrary.org/obo/> .
@prefix owl:    <http://www.w3.org/2002/07/owl#> .
@prefix patentcpc:  <http://rdf.ncbi.nlm.nih.gov/pubchem/patentcpc/> .
@prefix patent: <http://rdf.ncbi.nlm.nih.gov/pubchem/patent/> .
@prefix patentipc:  <http://rdf.ncbi.nlm.nih.gov/pubchem/patentipc/> .
@prefix pathway:    <http://rdf.ncbi.nlm.nih.gov/pubchem/pathway/> .
@prefix pav:    <http://purl.org/pav/> .
@prefix pav: <http://purl.org/pav/2.0/> .
@prefix pdbo:   <http://rdf.wwpdb.org/schema/pdbx-v40.owl#> .
@prefix protein:    <http://rdf.ncbi.nlm.nih.gov/pubchem/protein/> .
@prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix reactome:   <http://identifiers.org/reactome/> .
@prefix reference:  <http://rdf.ncbi.nlm.nih.gov/pubchem/reference/> .
@prefix sio:    <http://semanticscience.org/resource/> .
@prefix skos:   <http://www.w3.org/2004/02/skos/core#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix source: <http://rdf.ncbi.nlm.nih.gov/pubchem/source/> .
@prefix substance:  <http://rdf.ncbi.nlm.nih.gov/pubchem/substance/> .
@prefix synonym:    <http://rdf.ncbi.nlm.nih.gov/pubchem/synonym/> .
@prefix uniprot:    <http://purl.uniprot.org/uniprot/> .
@prefix up: <http://purl.uniprot.org/core/> .
@prefix vcard2006:  <http://www.w3.org/2006/vcard/ns#> .
@prefix voag: <http://voag.linkedmodel.org/schema/voag#> .
@prefix vocab:  <http://rdf.ncbi.nlm.nih.gov/pubchem/vocabulary#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix wikidata:   <http://www.wikidata.org/entity/> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
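
Some prefix names in this list are bound to more than one IRI; ns1, for instance, appears with four different IRIs. Whether that matters depends on how the parser treats redefinitions once all declarations are moved to the front, which this thread does not settle. A minimal sketch for spotting such names follows; the function name `conflicting_prefixes` is hypothetical.

```shell
# Read @prefix lines on stdin and print each prefix name that is bound
# to more than one distinct IRI, followed by the number of distinct IRIs.
conflicting_prefixes() {
    awk '$1 == "@prefix" {
             key = $2 SUBSEP $3
             if (!(key in seen)) { seen[key] = 1; count[$2]++ }
         }
         END { for (name in count) if (count[name] > 1) print name, count[name] }' |
        sort
}
```

Running `conflicting_prefixes < pubchem.prefix-definitions` lists the affected names. Declarations that differ only in spacing are counted once, because awk splits fields on any run of whitespace.
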
hannahbast commented 2 years ago

Incidentally, which subfolders of https://ftp.ncbi.nlm.nih.gov/pubchem/RDF should we consider? I am currently considering (I left out reaction because it's empty):

bioassay compound/general concept conserveddomain descriptor/compound descriptor/substance disease endpoint gene inchikey measuregroup patent patent/cpc patent/ipc pathway protein reference source substance synonym taxonomy

Confusingly, https://pubchemdocs.ncbi.nlm.nih.gov/rdf$_7 gives a shorter list. It also says there that the total volume is 40 GB (compressed). The total volume of the files I downloaded is 81G (compressed).

donpellegrino commented 2 years ago

I use the following bash function in my scripts:

function pubchemrdf {
    # Collect PubChemRDF triples from NCBI.

    export PUBCHEMRDFDIR="${HOME}"/data/pubchemrdf/20220608

    mkdir --parents "${PUBCHEMRDFDIR}"

    # Implementation is based on script documented in
    # https://pubchemdocs.ncbi.nlm.nih.gov/rdf$_7

    cd "${PUBCHEMRDFDIR}" || return

    wget --mirror \
         --user-agent="My Collector Identifier <myemail@email.address>" \
         --exclude-directories=/pubchem/RDF/compound/nbr2d,/pubchem/RDF/compound/nbr3d \
         --no-host-directories \
         --no-parent \
         --cut-dirs=2 \
         ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/
}

Rather than enumerate all the subfolders to collect, I take the inverse approach and explicitly list what I want to exclude. I don't know what is going on with reaction. I get 81G as well:

$ du -h
13M     ./protein
2.3G    ./endpoint
27M     ./gene
4.4M    ./conserveddomain
7.9G    ./compound/general
7.9G    ./compound
64K     ./source
52M     ./bioassay
12G     ./synonym
12K     ./reaction
3.8M    ./patent/ipc
18M     ./patent/cpc
16G     ./patent
25G     ./descriptor/compound
1.6G    ./descriptor/substance
26G     ./descriptor
156K    ./concept
500K    ./disease
4.1G    ./inchikey
2.7G    ./reference
1.2G    ./measuregroup
11G     ./substance
6.7M    ./pathway
6.9M    ./taxonomy
81G     .

I have found the NCBI PubChem Support team to be very helpful and fast to respond. It may be useful to report the inconsistencies you found to them (info@ncbi.nlm.nih.gov).

donpellegrino commented 2 years ago

For PubChemRDF handling, I created a script to manually collect the OWL files from around the Internet for the referenced vocabularies. This is more relevant when executing SPARQL dependent on inference.

How to best handle inference in combination with QLever indexes would be a more complex problem. Applying inference using Apache Jena ARQ on this dataset is not performant. Different SPARQL inference cases have various performance characteristics, but in my experience, it seems to be a problem needing more research.

NCBI seems to be running Virtuoso as their internal triple store implementation.

donpellegrino commented 2 years ago

"I am using this pubchem.settings.json (reduce the batch size if you run out of memory):"

{ "ascii-prefixes-only": false, "num-triples-per-batch": 10000000 }

Would you suggest making the PubChemRDF Subdomain Namespaces (https://pubchemdocs.ncbi.nlm.nih.gov/rdf$_3 - table 2) as external vocabulary as described in https://github.com/ad-freiburg/qlever/wiki/Internal-and-external-vocabulary ?

hannahbast commented 2 years ago

@ external vocabulary: I can tell you more when my current build has finished. By default, all literals longer than 1024 bytes and all language literals (with an @.. suffix) are placed in the external vocabulary. Maybe that's already enough for PubChem, I will know later today.

I have another question concerning the way PubChemRDF uses reification. Here is a typical example: https://pubchem.ncbi.nlm.nih.gov/rest/rdf/compound/CID60823.html . Instead of there being predicates for "molecular weight", "total formal charge", "tautomer count", etc., there is one universal predicate "has-attribute" with objects that are a combination of the subject ID and the property, for example: https://pubchem.ncbi.nlm.nih.gov/rest/rdf/descriptor/CID60823_Molecular_Weight.html .

This has two big disadvantages. First, you have a huge and very unspecific predicate "has-attribute", which is inefficient to handle and encompasses all kinds of properties. Second, it makes it quite complicated to formulate queries (or parts of queries) that, for example, constrain the molecular weight.
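To make the second point concrete, here is a rough sketch of what a molecular-weight constraint looks like under this modeling. The names sio:has-attribute, sio:has-value and CHEMINF_000334 are the ones used elsewhere in this thread; the exact IRIs and the literal datatype in a given PubChemRDF release are assumptions, and the function name is made up.

```shell
# Print a SPARQL query that lists compounds whose molecular weight is
# below the given threshold. Because every descriptor hangs off the one
# generic predicate, the query needs three triple patterns plus a FILTER
# instead of a single pattern on a dedicated "molecular weight" predicate.
molecular_weight_query() {
    cat <<EOF
PREFIX sio: <http://semanticscience.org/resource/>
SELECT ?compound ?weight WHERE {
  ?compound   sio:has-attribute ?descriptor .
  ?descriptor a sio:CHEMINF_000334 ;       # assumed class: "molecular weight"
              sio:has-value ?weight .
  FILTER (?weight < $1)
}
LIMIT 10
EOF
}
```

The query could then be sent to any SPARQL endpoint, for example with `curl -s "$ENDPOINT" --data-urlencode "query=$(molecular_weight_query 200)"`, where `ENDPOINT` is a placeholder for the endpoint URL.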

I wonder why the PubChem people modeled their data that way, given that there is a rather obvious alternative. Didn't they know better or is there any advantage of the current modeling?

donpellegrino commented 2 years ago

I have only been a user of PubChemRDF, so I am unable to speak authoritatively on the design choices made by the implementers. There are a few publications with background on the use of 'has-attribute' that I list below. In addition, my recent communications with the PubChemRDF team indicate that they are currently doing redesign work. Now would be an excellent time to provide recommendations on changes to the reification.

I suspect that much of the early reification design work on PubChemRDF, and even current design work on other large triple collections, could benefit from design patterns that relate the choice of predicates to query-time performance by way of its consequences for indexing. It could be that some of the historical choices were shaped by RDBMS/SQL-backed triple store implementations and federated analysis approaches, rather than more specialized SPO-permutation/SPARQL-optimized approaches.

The issue seems to be most directly addressed in Fu, 2015:

"The chemical descriptors serve as quantified attributes to describe PubChem Compound and Substance records. The PubChemRDF design utilizes object properties sio:has-attribute and sio:has-value to specify the relations between the chemical entities and the associated descriptors. SIO is developed to support knowledge representation and reasoning in the scientific research, and the same design pattern has been implemented in the Bio2RDF mash-up system [22, 23] and the Semantic Automated Discovery and Integration (SADI) [50, 51] web service. Re-use of such design patterns across multiple Semantic Web offerings reduces the effort it takes to construct federated queries." [Fu, 2015, p. 12]

PubChemRDF also relates into OBO Foundry. Uses of 'has-attribute' from SIO can be seen in the "Ontologies that use the ObjectProperty" section of https://ontobee.org/ontology/SIO?iri=http://semanticscience.org/resource/SIO_000008. Historically, I recall it was conflated with CHEMINF_200, but the CHEMINF_200 uses may have been cleaned up since I last encountered them.

References:

hannahbast commented 2 years ago

@donpellegrino My index build is now finished and the respective QLever instance is online here: https://qlever.cs.uni-freiburg.de/pubchem . The index build took 20 hours and the number of triples is 14.2 B. The Qleverfile containing all information (for downloading the data, building the index, and starting the server) is here: https://github.com/ad-freiburg/qlever-control/blob/main/Qleverfiles/Qleverfile.pubchem . I have two questions, maybe you can help:

  1. I have added the ten example queries from https://pubchemdocs.ncbi.nlm.nih.gov/rdf$_8 as example queries (icon "Examples" in the QLever instance). However, almost all of them give an empty result. For the very first query, already the first two triples give an empty result: ?sub rdf:type obo:CHEBI_53289 ; obo:RO_0000056 ?mg . Am I overlooking something here?

  2. When playing around with the QLever instance, I realized that the names of most of the entities are not part of the dataset. For example, compound:CID60823 ("atorvastatin") or CHEBI_53289 ("donepezil") or attribute types like resource:CHEMINF_000334 ("molecular weight"). Is there a list of RDF datasets somewhere that contain these labels? It would be very useful for the QLever UI, which could then show the names of things in a query or when suggesting something.

PS: Most suggestions in the QLever UI are currently context-insensitive (shown in blue), that is, they do not depend on the parts of the query already typed. The reason is the huge predicates like sio:has-attribute and rdf:type, as discussed above. We will look more into that, it's an interesting new use case for us.

donpellegrino commented 2 years ago

@hannahbast - Congratulations on the index build! That is a significant achievement. The PubChemRDF core subset is beyond the limits of many other triple stores.

Building within Wall-Clock/Memory/CPU/Interconnect constraints

I would love to reproduce your result, but the cluster I have been using limits jobs to 6 hours of wall-clock time. Nodes have 192 GB RAM, 1 GbE storage I/O, and 2x Intel(R) Xeon(R) Gold 6128 CPUs. I could try loading separate subsets into separate indexes and federating QLever's SPARQL endpoints per Issue #710. Please let me know if you have any other methodology to recommend. I have to study the indexing code, but if you can recommend a function point to look at that could potentially be decomposed and parallelized with MPI, that might be another approach I could experiment with.

Scaling RDF indexes across nodes was done with Cray Graph Engine and their technique is described in:

PubChemRDF Use Cases/Regression Testing

I had similar issues with reproducibility of the "PubChemRDF Use Cases" example queries. I have a request pending with info@ncbi.nlm.nih.gov that example results be provided for the current release, but that has not yet been implemented. You may want to email them as well to highlight interest in that work. The Use Cases could be a good way to verify and baseline bulk loading and query operations.

Query/OWL/Inference-dependencies

For CID60823, getting to "atorvastatin" is a traversal through the synonym subdomain: https://pubchem.ncbi.nlm.nih.gov/rest/rdf/synonym/MD5_034e43bbfe9cef861db628ca90afb16f.html . The synonym class is of type CHEMINF_000562 ("International Non-proprietary Name"). This is also a good example of how inference is necessary for some queries. CHEMINF_000562 is a subclass of CHEMINF_000041 ("molecular entity name"). It could be useful to users to query on CHEMINF_000041 and get the 'has-value' literals for all synonym individuals in that type hierarchy.
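Without a reasoner, one way to approximate this is a SPARQL property path over rdfs:subClassOf. The sketch below only works if the CHEMINF/SIO class hierarchy is loaded into the same dataset. The linking predicate sio:is-attribute-of is an assumption about how synonyms point to compounds, and the function name is hypothetical.

```shell
# Print a SPARQL query returning all names of a compound whose synonym
# type is CHEMINF_000041 ("molecular entity name") or any subclass of it.
compound_names_query() {
    cat <<EOF
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX sio: <http://semanticscience.org/resource/>
PREFIX compound: <http://rdf.ncbi.nlm.nih.gov/pubchem/compound/>
SELECT ?name ?type WHERE {
  ?synonym sio:is-attribute-of compound:$1 ;   # assumed linking predicate
           sio:has-value ?name ;
           a ?type .
  ?type rdfs:subClassOf* sio:CHEMINF_000041 .  # walks the class hierarchy
}
EOF
}
```

Calling `compound_names_query CID60823` prints the query for the atorvastatin example.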

Also note that some of the "PubChemRDF Use Cases" mention:

"Note: CHEBI ontology should be downloaded, and loaded into a separate graph, i.e. http://rdf.ncbi.nlm.nih.gov/pubchem/ruleset"

To address these cases using triple stores / SPARQL engines that have inference support, I manually collect all the OWL files I can find that are referenced by PubChemRDF and load them as additional input.

Wiki page

I added a Wiki page that may be useful for distilling the method and results.

hannahbast commented 2 years ago

@donpellegrino Great reply, thanks! You say that you manually "collect all the OWL files [you] can find that are referenced by PubChem RDF". That makes a lot of sense. Can you post a list of these files here? I would then download and add them to the PubChem instance. The URL https://pubchem.ncbi.nlm.nih.gov/rest/rdf/ruleset.html you mention does not work for me ("Bad Query URL"). I will write more about your other questions (in particular: running QLever on machines that have a time limit) later.

donpellegrino commented 2 years ago

I have not yet fully automated the collection, but the following list includes my notes on public URLs I found for various ontologies. There may be ontologies that I missed. The vocabulary, prefix, and filename fields are just for my own tracking and do not have any operational consequence on the workflow. Ideally, the list of URLs below would cross-reference with the list of prefixes and namespaces at https://github.com/ad-freiburg/qlever/wiki/Using-QLever-for-PubChem#references, but I have not gotten those fully aligned. The fact that SIO and CHEMINF share the same namespace creates some confusion and prevents a one-to-one mapping.

A better way to maintain this list would probably be to use the same technique as Protege (https://protegewiki.stanford.edu/wiki/Importing_Ontologies_in_P41#Protege_and_XML_Catalogs) and perform collection, update, and import operations based on reads of the catalog file.

# Collect with "curl -O -L -J <URL>" if no filename specified.
# Otherwise, "curl -o <filename> -L <URL>"
#
Vocabulary,Prefix,URL,Filename
BAO - BioAssay Ontology,bao,https://data.bioontology.org/ontologies/BAO/submissions/41/download?apikey=8b5b7825-538d-40e0-9e9e-5ab9274a9aeb
BFO - Basic Formal Ontology,bfo,http://purl.obolibrary.org/obo/bfo.owl
BioPAX - biological pathway data,bp,http://www.biopax.org/release/biopax-level3.owl
CHEMINF - Chemical Information Ontology,cheminf,http://purl.obolibrary.org/obo/cheminf.owl
ChEBI - Chemical Entities of Biological Interest,chebi,http://purl.obolibrary.org/obo/chebi.owl
CiTO,cito,http://purl.org/spar/cito.nt
DCMI Terms,dcterms,https://www.dublincore.org/specifications/dublin-core/dcmi-terms/dublin_core_terms.nt
FaBiO,fabio,http://purl.org/spar/fabio.nt
GO - Gene Ontology,go,http://purl.obolibrary.org/obo/go.owl
IAO - Information Artifact Ontology,iao,http://purl.obolibrary.org/obo/iao.owl
NCIt,ncit,http://purl.obolibrary.org/obo/ncit.owl
NDF-RT,ndfrt,https://data.bioontology.org/ontologies/NDF-RT/submissions/1/download?apikey=8b5b7825-538d-40e0-9e9e-5ab9274a9aeb
OBI - Ontology for Biomedical Investigations,obi,http://purl.obolibrary.org/obo/obi.owl
OWL,owl,http://www.w3.org/2002/07/owl,owl.ttl
PDBo,pdbo,http://rdf.wwpdb.org/schema/pdbx-v40.owl
PR - PRotein Ontology (PRO),pr,http://purl.obolibrary.org/obo/pr.owl
RDF Schema,rdfs,https://www.w3.org/2000/01/rdf-schema,rdf-schema.ttl
RDF,rdf,http://www.w3.org/1999/02/22-rdf-syntax-ns,22-rdf-syntax-ns.ttl
RO - Relation Ontology,ro,http://purl.obolibrary.org/obo/ro.owl
SIO - Semanticscience Integrated Ontology,sio,http://semanticscience.org/ontology/sio.owl
SKOS,skos,http://www.w3.org/TR/skos-reference/skos.rdf
SO - Sequence types and features ontology,so,http://purl.obolibrary.org/obo/so.owl
UO - Units of measurement ontology,uo,http://purl.obolibrary.org/obo/uo.owl
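
The collection step described in the comment lines at the top of the list could be automated roughly as follows. This is a sketch, not the script I actually use: the function name is made up, and it assumes that no field contains a comma, which holds for the rows above.

```shell
# Read the CSV on stdin and print one curl command per vocabulary.
# Nothing is downloaded here; pipe the output to `sh` to run the commands.
print_download_commands() {
    while IFS=, read -r vocabulary prefix url filename || [ -n "$vocabulary" ]; do
        case "$vocabulary" in
            ''|'#'*|Vocabulary) continue ;;   # blank line, comment, header
        esac
        if [ -n "$filename" ]; then
            printf "curl -L -o '%s' '%s'\n" "$filename" "$url"
        else
            printf "curl -L -O -J '%s'\n" "$url"
        fi
    done
}
```

With the list saved as, say, `ontologies.csv`, running `print_download_commands < ontologies.csv` shows what would be fetched before anything touches the network.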

The QLever examples I have seen so far require normalizing to a Turtle serialization before passing to the indexer. Therefore, it may be necessary to run all the files through Apache Jena's riot tool or some other format converter, unless I misunderstood QLever's ability to process multi-format input.
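
A conversion pass with riot might look like the sketch below. It assumes riot is on the PATH, that it accepts `--output=turtle` (check `riot --help` for your Jena version), and that every input file name has an extension riot can use to detect the format. The function name `to_turtle` is hypothetical.

```shell
# Convert each given RDF file to Turtle, writing NAME.ttl next to the input.
to_turtle() (
    for input in "$@"; do
        case "$input" in
            *.ttl) continue ;;   # already Turtle; also avoids overwriting the input
        esac
        riot --output=turtle "$input" > "${input%.*}.ttl" || exit 1
    done
)
```

For example, `to_turtle *.owl *.nt *.rdf` leaves the originals in place and adds one `.ttl` file per input.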

I ignored XML Schema since that seems to be handled internally by triple stores.

XML Schema,http://www.w3.org/2001/XMLSchema#,xsd,http://www.w3.org/2001/XMLSchema#

For some ontologies, I opted for the version published by the OBO Foundry/Ontobee rather than the original publisher. I failed to make note of where I found that decision to be applicable. Adding a field to track that decision point would improve the fidelity of the list.

hannahbast commented 2 years ago

Here are a few comments concerning your question about how to build a QLever index when machine time is limited:

  1. This is a rather unusual constraint. Have you considered simply buying your own machine? One of the design principles of QLever is to be as efficient as possible and not to depend on expensive resources. Our standard machine for testing (with an AMD Ryzen 9 5900X 12-core processor) cost just around 1000 €, plus some money for the disks (which are cheap, HDDs are good enough). That's in the range of what most people can afford, even privately.

  2. Federation is something we are very interested in anyway and this looks like a great use case (though, of course, a single index is always better if you can build it). It will not happen in the next few days, but we will hopefully start working on this in the coming weeks. And my hope is that it shouldn't be too much work to get a first version working. The challenge will be to do it efficiently. One of course wants to avoid having to send and parse large amounts of verbose RDF data between instances.

  3. QLever's index building proceeds in stages, and each stage writes files, which the next stage then reads. So in principle, it should not be too hard to extend the index builder so that one can suspend it after any given stage and then resume it again. That way, each stage could be completed in a separate job. However, it's important that the (many and big) files are preserved from one stage to the next. If this is very important for you, you might want to try to write a PR for it.

  4. Distributing the data over several machines is actually something we actively try to avoid with QLever. When the data is enormous, this can be unavoidable, but when the data fits on a single machine and you distribute it over several machines anyway, you pay a rather large performance penalty. And we haven't yet encountered a dataset that doesn't fit on a single machine, often even a quite normal one. For example, the PubChem core dataset can be built on a 1000 € machine (see above) using only around 40 GB of RAM. Same for the complete Wikidata.
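
The suspend-and-resume idea from point 3 can be illustrated with a generic wrapper that records which stages have finished. This is only a sketch of the pattern: the index builder does not currently expose its stages this way, and `run_stages` and the stage names are hypothetical.

```shell
# Run the named stage functions in order, skipping any stage whose marker
# file already exists. The state directory must live on storage that
# survives between jobs, together with the files the stages write.
run_stages() (
    state_dir=$1
    shift
    mkdir -p "$state_dir"
    for stage in "$@"; do
        if [ -e "$state_dir/$stage.done" ]; then
            echo "skip $stage"
            continue
        fi
        echo "run $stage"
        "$stage" || exit 1          # stop here; a later job resumes at this stage
        : > "$state_dir/$stage.done"
    done
)
```

Each job would call, for example, `run_stages /persistent/state stage_one stage_two stage_three`, where the stage names are shell functions defined by the caller. A job that is killed partway through leaves the finished stages marked, so the next job picks up at the first unfinished one.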