eMetaboHUB / Forum-DiseasesChem

A Knowledge Graph from public databases and scientific literature to extract associations between chemicals and diseases.

FORUM Knowledge graph Database

FORUM is an open knowledge network that aims to support the interpretation of metabolomics results in biomedical sciences and automated reasoning. Containing more than 8 billion statements, it federates data from several life-science resources such as PubMed, ChEBI and PubChem. Leveraging the bridges in this linked dataset, we automatically extract associations between compounds and biomedical concepts, using literature metadata enrichment. Find out more about the method in our preprint here: bioRxiv

The FORUM content can be exploited through this portal in two ways:

The endpoint can also be accessed programmatically and built from source to support further developments. The source code for the Knowledge Network creation and computation of the association can be found on this repo here. Do not hesitate to contact us at [semantics-metabolomics AT inrae DOT fr] for more information.

The FORUM project is supported by INRAE, France's National Research Institute for Agriculture, Food and Environment, The H2020 Goliath project and the INRAE CATI EMPREINTE.

The FORUM team:

About the association results

FORUM provides well-grounded associations between MeSH terms and compounds, through their PubChem Compound identifier (CID). FORUM also provides associations with chemical classes using the ChEBI and ChemOnt ontologies (note that classes describing a single compound are ignored, as are the broadest ones). FORUM retains only the strongest associations by applying stringent inclusion criteria, so please bear in mind that the absence of an association does not mean that the compound and the term are unrelated.

The strength of an association is estimated from how often mentions of a compound co-occur with a biomedical topic in PubMed articles. We test for independence using a right-tailed Fisher exact test, adjust for multiple comparisons using the Benjamini-Hochberg procedure, and report the resulting q-value.

We also report the Odds ratio to gauge the relative effect size, as well as the raw number of papers mentioning both the compound and the biomedical topic.
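
The statistics described above can be sketched in pure Python. This is a minimal illustration and not the FORUM implementation: the function names and the 2x2 table layout (a = articles mentioning both the compound and the topic, b = compound only, c = topic only, d = neither) are assumptions made for the example.

```python
from math import comb


def fisher_right_tail(a, b, c, d):
    """Right-tailed Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Assumed layout: a = articles mentioning both compound and topic,
    b = compound only, c = topic only, d = neither.
    """
    n_compound = a + b   # articles mentioning the compound
    n_topic = a + c      # articles mentioning the topic
    n_total = a + b + c + d
    upper = min(n_compound, n_topic)
    # P(X >= a) under the hypergeometric null of independence
    tail = sum(
        comb(n_topic, k) * comb(n_total - n_topic, n_compound - k)
        for k in range(a, upper + 1)
    )
    return tail / comb(n_total, n_compound)


def odds_ratio(a, b, c, d):
    """Sample odds ratio; infinite when an off-diagonal cell is zero."""
    if b * c == 0:
        return float("inf")
    return (a * d) / (b * c)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg q-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values
```

With millions of PubMed articles the exact sum becomes slow, so a production pipeline would normally rely on an optimized statistics library rather than this direct summation.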

We identify weak associations by computing a confidence interval on the co-occurrence proportion. For identified weak associations, you can get more details by hovering over the (i) icon, which displays a measure of their weakness: the minimum number of supporting articles that would have to be withdrawn for the association to fall below our inclusion criteria. See our preprint for more details.

The results provide associations with most domains of the MeSH thesaurus. The MeSH root makes it easy to filter by top-level category:

The remaining categories are ignored during computation, but can nonetheless appear in the results for terms belonging to multiple categories that include at least one of the above. More fine-grained filtering can be done from the MeSH tree numbers provided in the export file.

License

The FORUM association dataset is publicly available without license restrictions. Licensing information of external resources used by FORUM can be accessed at the links below:

Acknowledgement

We thank the teams behind PubChem, PubMed, MeSH, ChEBI, MetaNetX and ChemOnt for generously providing open data as well as great support for their use, which has made FORUM possible. The FORUM project is not affiliated with any of the cited sources. The FORUM project is supported by INRAE, France's National Research Institute for Agriculture, Food and Environment, The H2020 Goliath project and the INRAE CATI EMPREINTE.

Technical information

1 - Install environment

1.1 - Install Docker:

Follow instructions at https://docs.docker.com/engine/install/ubuntu/

1.2 - Install Virtuoso Docker container :

Before building the triplestore, you should create 5 directories:

So, for instance, you can execute:

mkdir data
mkdir -p docker-virtuoso/share
mkdir logs
mkdir config

There are two ways to build the triplestore:

If you want to use the Docker image, first build it:

docker build -t forum/processes \
  --build-arg USER_ID=$(id -u) \
  --build-arg GROUP_ID=$(id -g) .

This builds an image whose permissions match those of the local host user (https://vsupalov.com/docker-shared-permissions/).

In this container, three directories are intended to be bound to the host:

Then, you can launch it using:

docker run --name forum_scripts --rm -it --network="host" \
-v /path/docker-virtuoso/share:/workdir/share-virtuoso \
-v /path/to/data/dir:/workdir/out \
-v /path/to/log/dir:/workdir/logs-app \
-v /path/to/config/dir:/workdir/config \
forum/processes bash

or in detached mode:

docker run --detach --name forum_scripts --rm -t --network="host" \
-v /path/docker-virtuoso/share:/workdir/share-virtuoso \
-v /path/to/data/dir:/workdir/out \
-v /path/to/log/dir:/workdir/logs-app \
-v /path/to/config/dir:/workdir/config \
forum/processes bash

In detached mode, the container runs in the background, which is convenient to avoid Broken pipe errors, for instance if you are working on a server. You can open an interactive bash shell in the container running in the background with:

docker exec -it forum_scripts bash

You can then navigate in the container (as in any Docker container) to modify configuration files, test scripts, check mounted directories, etc.

Finally, all commands can be launched in detached mode from the host, for example:

docker exec --detach forum_scripts ./command -param v1 -param2 v2 ...

eg.

docker exec --detach forum_scripts ./workflow/w_computation.sh -v version -m /path/to/config/Compound2MeSH -t path/to/config/triplesConverter/Compound2MeSH -u CID_MESH -d /path/to/data/dir -s /path/to/virtuoso/share/dir -l /path/to/log/dir

This command will be executed in the container, running in the background.

Finally, the forum/processes container must not be used to start/stop/clean the Virtuoso triplestore (See 3.1)

Warnings: Be sure to map the docker-virtuoso/share and data directories inside your forum/processes container. Also, if you use the forum/processes container, your commands should use the directories bound to the previously created ones (out and share-virtuoso) instead of data and docker-virtuoso/share in the following examples.

If you want to restart an analysis from scratch, be sure to remove all logs before!

2 - Prepare the triplestore

Building the KG from scratch can take several days, including the download of the raw data and the computation of relations between chemical compounds/classes and MeSH descriptors. The KG was built on a server with 189 GB of memory and 12 CPUs.

We deployed the FORUM KG using Virtuoso on a server with 16 CPUs and 128 GB of memory. We strongly recommend deploying it on SSD storage, as deployment can take more than 20 days on conventional disk storage (that's not a joke).

Also, all metadata related to the FORUM KG are provided in a VoID file accessible at https://forum.semantic-metabolomics.fr/.well-known/void and directly queryable on the SPARQL endpoint.

To build the initial triplestore, you can use the scripts provided in the build directory or directly download the compressed share directory of the current release from the FTP server.

2.1 - Build the triplestore

2.1.1 - The core triplestore

To build the triplestore, several scripts are available, each dedicated to a specific FORUM resource. Workflow scripts describing all the steps used to build the current release are also available in the workflow directory.

All the scripts used to create or import resources create at least:

In the following sections, all example commands are given as they are used in the forum_scripts Docker container.

Configuration files for all scripts used in the current release are provided in the config directory.

The vocabularies:

The vocabulary directory contains the schema files of the ontologies used. They can be downloaded using the docker resource directory or at:

The MeSH vocabulary file has been downloaded from the 2021 release of MeSH.

The ChEBI ontology file is updated frequently. The version used in the triplestore is ChEBI Release 205 (released 03 Nov. 2021), as referenced in the URI of the ChEBI graph in FORUM.

Warnings: For ChemOnt, the ontology file was downloaded from http://classyfire.wishartlab.com/downloads, but it must be converted from .obo to another format before it can be loaded into Virtuoso. ChemOnt_2_1.obo was converted to Turtle (ChemOnt_2_1.ttl) using Protege (https://protege.stanford.edu/). The ChemOnt ontology seems to be stable.

To download the vocabulary files along with their upload file, use import_vocabulary.sh

The upload file also loads the namespaces in Virtuoso.

bash app/build/import_vocabulary.sh -s /workdir/share-virtuoso -f ftp.semantic-metabolomics.org:upload_2021.tar.gz -u forum -p Forum2021Cov!

MetaNetX

To import the MetaNetX resource in FORUM, use: import_MetaNetX.py

python3 -u app/build/import_MetaNetX.py --config="/workdir/config/release-2021/import_MetaNetX.ini" --out="/workdir/share-virtuoso" --log="/workdir/logs-app"

Config

MeSH

To import the MeSH resource in FORUM, use: import_MeSH.py

python3 -u app/build/import_MeSH.py --config="/workdir/config/release-2021/import_MeSH.ini" --out="/workdir/share-virtuoso" --log="/workdir/logs-app"

Config

PubChem

See https://pubchemdocs.ncbi.nlm.nih.gov/rdf

PubChem data are divided into subsets: compounds, descriptors, reference ...

Each subset is relevant but, depending on the queries, not all subsets (nor all data files in a subset) need to be loaded at the same time. The script import_PubChem.py can download several PubChem subsets but load only some of them, selected with a file mask.

This script is therefore used twice:

python3 -u app/build/import_PubChem.py --config="/workdir/config/release-2021/import_PubChem_min.ini" --out="/workdir/share-virtuoso" --log="/workdir/logs-app"

Config

PMID-CID

To integrate the PMID-CID LinkSet, which provides triples stating that an article (PMID) discusses a PubChem compound (CID), use: import_PMID_CID.py

Being a LinkSet, a valid path to the directories of the targeted PubChem compound and reference graph must also be provided.

This script produces two subsets:

There are 3 types of contributors (source https://doi.org/10.1186/s13321-016-0142-6):

python3 -u app/build/import_PMID_CID.py --config="/workdir/config/release-2021/import_PMID_CID.ini" --out="/workdir/share-virtuoso" --log="/workdir/logs-app"

As it is a long process, you can use the following commands inside the detached container (forum_scripts) to write STDOUT to a log file.

echo "" > logs-app/global_log_PMID_CID.log
python3 -u app/build/import_PMID_CID.py --config="/workdir/config/release-2021/import_PMID_CID.ini" --out="/workdir/share-virtuoso" --log="/workdir/logs-app" 2>&1 | tee -a logs-app/global_log_PMID_CID.log &

Config

Chemont

To provide a MeSH enrichment from ChemOnt classes, this procedure retrieves the ChemOnt classes associated with PubChem compounds, using only the compounds for which literature is available. The literature information is extracted from the PMID-CID graphs, and the InchiKey annotations from the PubChem InchiKey graphs.

A valid path to the directories of the targeted PubChem compound and PubChem InchiKey for annotations must be provided.

ChemOnt classes associated to a PubChem compound are accessible through their InchiKey at the URL http://classyfire.wishartlab.com/entities/INCHIKEY.json
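
As a minimal sketch of how that URL can be built and queried, the snippet below validates an InChIKey and substitutes it into the URL pattern given above. The helper names are hypothetical and are not taken from import_Chemont.py, and the structure of the returned JSON is not assumed here.

```python
import json
import re
import urllib.request

# URL pattern quoted from the text above; INCHIKEY is replaced by the key.
CLASSYFIRE_ENTITY_URL = "http://classyfire.wishartlab.com/entities/{inchikey}.json"

# Standard InChIKey layout: 14 letters, 10 letters, 1 letter, hyphen-separated.
INCHIKEY_PATTERN = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def classyfire_url(inchikey):
    """Return the ClassyFire entity URL for a syntactically valid InChIKey."""
    key = inchikey.strip().upper()
    if not INCHIKEY_PATTERN.match(key):
        raise ValueError(f"not a valid InChIKey: {inchikey!r}")
    return CLASSYFIRE_ENTITY_URL.format(inchikey=key)


def fetch_classification(inchikey, timeout=30):
    """Download and decode the JSON document for one compound (network call)."""
    with urllib.request.urlopen(classyfire_url(inchikey), timeout=timeout) as response:
        return json.load(response)
```

For example, `classyfire_url("BSYNRYMUTXBXSQ-UHFFFAOYSA-N")` builds the URL for aspirin. When querying many compounds, adding a delay between requests is a sensible precaution for a public web service.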

For a molecule, ChemOnt classes are organised in 2 main categories:

Both types of classes are stored separately, in two different graphs.

Djoumbou Feunang, Y., Eisner, R., Knox, C., et al., 2016. ClassyFire: automated chemical classification with a comprehensive, computable taxonomy. J Cheminform 8, 61. https://doi.org/10.1186/s13321-016-0174-y

How to run:

python3 -u app/build/import_Chemont.py --config="config/release-2021/import_Chemont.ini" --out="/workdir/share-virtuoso" --log="/workdir/logs-app"

Config file

2.1.2 - Integration of metabolic networks.

See docs/sbml.md

An example workflow file from the current release, which imports an SBML file with all the required annotation graphs into Virtuoso, is provided in the workflow directory.

2.2 - Or ... Download RDF files from FTP

example :

sftp forum@ftp.semantic-metabolomics.org:/dumps/2021/share.tar.gz

All data and results can be downloaded from the sftp server.

We plan to update the FORUM Knowledge graph every year.

3 - Compute chemical entities to MeSH associations

Once the initial data of the triplestore have been created, an initial session of the Virtuoso triplestore must be started in order to compute associations between chemical entities and MeSH descriptors.

Recommendations:

You may need to disable "Strict checking of void variables" in the SPARQL query editor when you use transitivity in queries.

3.1 - Virtuoso Triple store

3.1.1 - Initialize the Virtuoso session

Warning: The Virtuoso management script, w_virtuoso.sh, must be run directly on the host, not from the forum docker (forum/processes). Because the forum/processes container is started with the option --network="host", it shares the host's networking.

To start the virtuoso session, use:

bash workflow/w_virtuoso.sh -d /path/to/virtuoso/dir -s share -c start upload1.sh upload2.sh ...

e.g

The current configuration deploys a Virtuoso triplestore with 64 GB of memory (see NumberOfBuffers and MaxDirtyBuffers) and dedicates 8 GB per SPARQL query to computation processes (see MaxQueryMem). This configuration can be modified in the w_virtuoso.sh script.

Warnings: In the provided configuration, the port used by the docker-compose holding the Virtuoso triplestore is 9980. The URL used to query the KG during the computation is therefore http://localhost:9980/sparql/. If you change the port in docker-compose.yml, be sure to also change it in the configuration file used to query the endpoint.
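
To check that the endpoint answers on the configured port, a small query can be sent over the standard SPARQL protocol. The sketch below assumes the default URL quoted above; the helper names and the sanity-check query are illustrative and are not part of the FORUM scripts.

```python
import json
import urllib.parse
import urllib.request

# Endpoint URL from the provided configuration (port 9980).
LOCAL_ENDPOINT = "http://localhost:9980/sparql/"

# Illustrative sanity-check query: fetch a handful of arbitrary triples.
SAMPLE_QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"


def build_sparql_request(query, endpoint=LOCAL_ENDPOINT):
    """Build a GET request carrying the query in the 'query' parameter."""
    params = urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        f"{endpoint}?{params}",
        headers={"Accept": "application/sparql-results+json"},
    )


def run_select(query, endpoint=LOCAL_ENDPOINT, timeout=60):
    """Run a SELECT query and return the list of result bindings."""
    request = build_sparql_request(query, endpoint)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        results = json.load(response)
    return results["results"]["bindings"]
```

Calling `run_select(SAMPLE_QUERY)` requires a running triplestore; if the port was changed in docker-compose.yml, pass the matching URL as `endpoint`.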

When using start to create a new triplestore, pass the list of upload files for the data you want to load to the command.

For instance, to load the vocabulary, MeSH, PubChem and PMID_CID, from the current release configuration files and compute associations, use:

bash workflow/w_virtuoso.sh -d /path/to/virtuoso/dir -s share -c start upload.sh upload_PMID_CID.sh upload_MeSH.sh upload_PubChem_minimal.sh

3.2 - Set configuration files:

For each analysis, there are two main configuration files:

3.3 - Computation

To compute associations between chemical entities and MeSH descriptors, you can use: w_computation.sh

Option details:

For each analysis, all results and intermediate data are exported to a dedicated sub-directory named after the resource (option u). In this sub-directory, you can find count data associated with chemical entities, MeSH descriptors and their co-occurrences (e.g. directory MESH_PMID). In the results sub-directory, you can find the table that summarizes the counts for each association. This table is used later to compute Fisher exact tests on each association. Between this table and the final result table containing all statistical values, several intermediate files are produced.

These intermediary files are:

At the end of this procedure, all significant associations, according to the threshold in the configuration file, are converted into triples to be instantiated in the knowledge graph. See details of the procedure in the README of Analyzes/Enrichment_to_graph.

Note: To release the memory used by Virtuoso temp files and others, you can also stop and restart the triplestore using the w_virtuoso.sh script between computations.

Some example commands for each analysis are shown below, using default values for options c, p, o, i:

For more details on the computation processes see computation.md in docs.

Virtuoso checkpointing can disrupt the querying processes, because the database is inaccessible during a checkpoint.

To prevent this while computing associations, you should disable checkpointing.

A recommended plan is to:

docker exec -it $CONTAINER_NAME bash
isql-v -U dba -P FORUM
checkpoint_interval(-1);
exit;
docker exec -it $CONTAINER_NAME bash
isql-v -U dba -P FORUM
checkpoint;
exit;

When the triplestore is restarted to compute other associations, there should be no roll forward.

3.3.1 - Compute PubChem compounds - MeSH associations

./workflow/w_computation.sh -v version -m /path/to/config/Compound2MeSH -t path/to/config/triplesConverter/Compound2MeSH -u CID_MESH -d /path/to/data/dir -s /path/to/virtuoso/share/dir -l /path/to/log/dir

eg.

./workflow/w_computation.sh -v 2021 -m config/release-2021/computation/CID_MESH/config.ini -t config/release-2021/enrichment_analysis/config_CID_MESH.ini -u CID_MESH -d /workdir/out -s /workdir/share-virtuoso -l /workdir/logs-app

3.3.2 - Compute ChEBI - MeSH associations

./workflow/w_computation.sh -v version -m /path/to/config/ChEBI2MeSH -t path/to/config/triplesConverter/ChEBI2MeSH -u CHEBI_MESH -d /path/to/data/dir -s /path/to/virtuoso/share/dir -l /path/to/log/dir

eg.

./workflow/w_computation.sh -v 2021 -m config/release-2021/computation/CHEBI_MESH/config.ini -t config/release-2021/enrichment_analysis/config_CHEBI_MESH.ini -u CHEBI_MESH -d /workdir/out -s /workdir/share-virtuoso -l /workdir/logs-app

3.3.3 - Compute Chemont - MeSH associations

./workflow/w_computation.sh -v version -m /path/to/config/Chemont2MeSH -t path/to/config/triplesConverter/Chemont2MeSH -u CHEMONT_MESH -d /path/to/data/dir -s /path/to/virtuoso/share/dir -l /path/to/log/dir

eg.

./workflow/w_computation.sh -v 2021 -m config/release-2021/computation/CHEMONT_MESH/config.ini -t config/release-2021/enrichment_analysis/config_CHEMONT_MESH.ini -u CHEMONT_MESH -d /workdir/out -s /workdir/share-virtuoso -l /workdir/logs-app

3.3.4 - Compute MeSH - MeSH associations

./workflow/w_computation.sh -v version -m /path/to/config/MeSH2MeSH -t path/to/config/triplesConverter/MeSH2MeSH -u MESH_MESH -d /path/to/data/dir -s /path/to/virtuoso/share/dir -l /path/to/log/dir

eg.

./workflow/w_computation.sh -v 2021 -m config/release-2021/computation/MESH_MESH/config.ini -t config/release-2021/enrichment_analysis/config_MESH_MESH.ini -u MESH_MESH -d /workdir/out -s /workdir/share-virtuoso -l /workdir/logs-app

Note: The computation of relations between MeSH descriptors is a special case, for which the SPARQL query imposes additional filters. First, we only compute associations for MeSH descriptors that belong to a subset of MeSH trees that represent neither chemicals, as this would be redundant with the CID-MESH analysis, nor organisms, as only a few entities are correctly represented in our KG. The list of MeSH tree codes is C|A|G|F|I|J|D20|D23|D26|D27. Second, we only keep relations that do not involve a parent-child relation (in either direction) between the queried MeSH descriptor and the MeSH descriptor found.
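
The two filters can be illustrated on MeSH tree numbers in plain Python. This is a sketch under assumptions, not the SPARQL query FORUM uses: the function names are hypothetical, and a parent-child relation is approximated here by one tree number being a dotted prefix of another (which also catches more distant ancestors).

```python
# Tree-code prefixes quoted from the note above.
ALLOWED_TREE_PREFIXES = ("C", "A", "G", "F", "I", "J", "D20", "D23", "D26", "D27")


def in_allowed_trees(tree_numbers, prefixes=ALLOWED_TREE_PREFIXES):
    """True if at least one tree number of a descriptor is in an allowed tree."""
    return any(tn.startswith(prefixes) for tn in tree_numbers)


def is_ancestor(tree_a, tree_b):
    """True if tree_a is a strict dotted prefix of tree_b (e.g. C04 and C04.588)."""
    return tree_b.startswith(tree_a + ".")


def hierarchically_related(trees_a, trees_b):
    """True if any pair of tree numbers is in an ancestor relation, either way."""
    return any(
        is_ancestor(a, b) or is_ancestor(b, a)
        for a in trees_a
        for b in trees_b
    )


def keep_pair(trees_a, trees_b):
    """Apply both filters to a candidate pair of MeSH descriptors."""
    return (
        in_allowed_trees(trees_a)
        and in_allowed_trees(trees_b)
        and not hierarchically_related(trees_a, trees_b)
    )
```

A descriptor can carry several tree numbers, which is why each function takes a list; a pair is rejected as soon as one combination of tree numbers is hierarchically related.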

3.3.5 - Compute SPECIE - MeSH associations

Uses metabolic network data (see workflow/w_upload_metabolic_network.sh):

workflow/w_computation.sh -v version -m path/to/Specie2MeSH/config/file -t path/to/config/triplesConverter/Specie2MeSH -u SPECIE_MESH -d /path/to/data/dir -s /path/to/virtuoso/share/dir -l /path/to/log/dir

eg.

workflow/w_computation.sh -v 2021 -m config/release-2021/computation/SPECIE_MESH_Thesaurus/config.ini -t config/release-2021/enrichment_analysis/config_SPECIE_MESH.ini -u SPECIE_MESH -d /workdir/out -s /workdir/share-virtuoso -l /workdir/logs-app
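
The release-2021 example commands in sections 3.3.1 to 3.3.5 follow a regular pattern, which the sketch below reproduces so that they can be generated rather than copied by hand. The helper name and its defaults are hypothetical; only the flags and paths shown in the examples above are used.

```python
import shlex

# Resource name (option -u) -> directory name under <release>/computation/,
# as they appear in the release-2021 examples above.
COMPUTATION_CONFIG_DIR = {
    "CID_MESH": "CID_MESH",
    "CHEBI_MESH": "CHEBI_MESH",
    "CHEMONT_MESH": "CHEMONT_MESH",
    "MESH_MESH": "MESH_MESH",
    "SPECIE_MESH": "SPECIE_MESH_Thesaurus",
}


def computation_command(
    resource,
    version="2021",
    release_dir="config/release-2021",
    data_dir="/workdir/out",
    share_dir="/workdir/share-virtuoso",
    log_dir="/workdir/logs-app",
):
    """Return the w_computation.sh argument list for one analysis."""
    config_dir = COMPUTATION_CONFIG_DIR[resource]
    return [
        "./workflow/w_computation.sh",
        "-v", version,
        "-m", f"{release_dir}/computation/{config_dir}/config.ini",
        "-t", f"{release_dir}/enrichment_analysis/config_{resource}.ini",
        "-u", resource,
        "-d", data_dir,
        "-s", share_dir,
        "-l", log_dir,
    ]
```

`shlex.join(computation_command("CID_MESH"))` gives a shell-ready string, and the list form can be passed directly to `subprocess.run` inside the forum_scripts container.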

3.4 - Shutdown Virtuoso session

When all computations are complete, the temporary Virtuoso session can be shut down using:

./workflow/w_virtuoso.sh -d /path/to/virtuoso/dir -s /path/to/share/dir/from/virtuoso/dir stop
./workflow/w_virtuoso.sh -d /path/to/virtuoso/dir -s /path/to/share/dir/from/virtuoso/dir clean

eg.:

workflow/w_virtuoso.sh -d ./docker-virtuoso -s share stop
workflow/w_virtuoso.sh -d ./docker-virtuoso -s share clean

Note: the Virtuoso session can be stopped directly after the counts calculation, as it is not needed for the post-processing steps.

A new directory, EnrichmentAnalysis, should have been created at the end of the process in the Virtuoso shared directory. It contains the associations between chemical entities and MeSH as triples, which can then be loaded into a Virtuoso triplestore to explore relations.

In the data directory, you can also retrieve all processed results, such as the final results table, r_fisher_q_w.csv, in each related directory.

4 - Create the master VoID

Use create_master_void.py

python3 -u app/build/create_master_void.py --config="/workdir/config/release-2021/master_void.ini" --out="/workdir/share-virtuoso"

Config file

5 - Monitoring

Several checks can be used to ensure that the loading was done correctly:

1) At the end of each loading file, Virtuoso executes the command select * from DB.DBA.LOAD_LIST where ll_error IS NOT NULL;. It asks Virtuoso to return the graphs for which there was an error during RDF loading. Check that this query doesn't return any results (Virtuoso Bulk Loading RDF).

The FORUM triplestore is built both from triples created and collected from web services (e.g. PMID_CID, CHEMONT) and from triples aggregated from different external resources (e.g. PubChem). Inconsistencies in the data can therefore have several causes. Some advice is provided to detect and quantify potential errors or gaps in the data:

6 - MeSH, Chemont, ChEBI and CID labels

Identifiers are not always convenient for exploring results, so labels of MeSH descriptors, Chemont and ChEBI classes, or PubChem compounds can be more useful. To retrieve labels of MeSH descriptors and SCR, Chemont and ChEBI classes, you can send requests to the SPARQL endpoint as indicated in the labels.rq file. Unfortunately, this can't be done for PubChem compounds, as labels are not part of the PubChem RDF data (only the IUPAC name is specified), but they can be retrieved using the PubChem Identifier Exchange service. Extract all the PubChem identifiers for which there is literature (and so potentially associations) and upload them to the service to get labels. Label files are also provided on the sftp server (see the web portal).