This Groovy 4.0 script creates a Derby file for BridgeDb [1,2] for use in PathVisio, etc.
The script has been tested with HMDB 5.0 [3,4,5,6], ChEBI 211 [7], and Wikidata [8,9] from 2022.
I'm indebted to all who worked on identifier mappings in these projects:
Everyone can contribute ID mappings to the latter project (Wikidata).
The files are released via the BridgeDb Website: https://bridgedb.github.io/data/gene_database/
The mapping files are also archived on Figshare: https://figshare.com/projects/BridgeDb_metabolites/28500
This repository: New BSD.
Derby License -> http://db.apache.org/derby/license.html
BridgeDb License -> http://www.bridgedb.org/browser/trunk/LICENSE-2.0.txt
Update the createDerby.groovy file with the new version numbers of ChEBI and HMDB (the "DATASOURCEVERSION" field). This information is stored as metadata and is needed, for example, by the BridgeDb webservice to correctly display which data is in the mapping file.
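To locate the exact place to edit, a simple text search for the field name works (assuming the field name appears literally in the script):

grep -n DATASOURCEVERSION createDerby.groovy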
Make sure the HMDB data file is saved as hmdb_metabolites.zip, and create a new zip file with each metabolite in a separate XML file:
mkdir hmdb
wget http://www.hmdb.ca/system/downloads/current/hmdb_metabolites.zip
unzip hmdb_metabolites.zip
cd hmdb
cp ../hmdb_metabolites.xml .
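# split the combined XML into one file per metabolite (xml_split ships with Perl's XML::Twig)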
xml_split -v -l 1 hmdb_metabolites.xml
rm hmdb_metabolites.xml
cd ..
zip -r hmdb_metabolites_split.zip hmdb
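As an optional sanity check, count the XML files in the hmdb/ folder; there should be one per metabolite:

ls hmdb | wc -l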
mkdir data
cd data
wget ftp://ftp.ebi.ac.uk/pub/databases/chebi/Flat_file_tab_delimited/names.tsv.gz
gunzip names.tsv.gz
mv names.tsv chebi_names.tsv
wget ftp://ftp.ebi.ac.uk/pub/databases/chebi/Flat_file_tab_delimited/database_accession.tsv
mv database_accession.tsv chebi_database_accession.tsv
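Optionally, verify that the ChEBI files were downloaded and unpacked correctly, for example by inspecting the first lines and the line counts:

head -n 3 chebi_names.tsv
wc -l chebi_names.tsv chebi_database_accession.tsv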
4.1 ID mappings
A set of SPARQL queries has been compiled and saved in the wikidata/ folder. These queries can be executed manually at http://query.wikidata.org/. They download mappings from Wikidata for CAS registry numbers (cas.rq), ChemSpider (cs.rq), PubChem (pubchem.rq), KEGG compounds (kegg.rq), KNApSAcK IDs (knapsack.rq) [10], and LIPID MAPS (lm.rq) [11].
Alternatively, you can use the curl command line operations below:
curl -H "Accept: text/csv" --data-urlencode query@wikidata/cas.rq -G https://qlever.cs.uni-freiburg.de/api/wikidata -o cas2wikidata.csv
curl -H "Accept: text/csv" --data-urlencode query@wikidata/cs.rq -G https://qlever.cs.uni-freiburg.de/api/wikidata -o cs2wikidata.csv
curl -H "Accept: text/csv" --data-urlencode query@wikidata/pubchem.rq -G https://qlever.cs.uni-freiburg.de/api/wikidata -o pubchem2wikidata.csv
curl -H "Accept: text/csv" --data-urlencode query@wikidata/chebi.rq -G https://qlever.cs.uni-freiburg.de/api/wikidata -o chebi2wikidata.csv
curl -H "Accept: text/csv" --data-urlencode query@wikidata/kegg.rq -G https://qlever.cs.uni-freiburg.de/api/wikidata -o kegg2wikidata.csv
curl -H "Accept: text/csv" --data-urlencode query@wikidata/hmdb.rq -G https://qlever.cs.uni-freiburg.de/api/wikidata -o hmdb2wikidata.csv
curl -H "Accept: text/csv" --data-urlencode query@wikidata/lm.rq -G https://qlever.cs.uni-freiburg.de/api/wikidata -o lm2wikidata.csv
curl -H "Accept: text/csv" --data-urlencode query@wikidata/knapsack.rq -G https://qlever.cs.uni-freiburg.de/api/wikidata -o knapsack2wikidata.csv
curl -H "Accept: text/csv" --data-urlencode query@wikidata/comptox.rq -G https://qlever.cs.uni-freiburg.de/api/wikidata -o comptox2wikidata.csv
curl -H "Accept: text/csv" --data-urlencode query@wikidata/iuphar.rq -G https://qlever.cs.uni-freiburg.de/api/wikidata -o gpl2wikidata.csv
curl -H "Accept: text/csv" --data-urlencode query@wikidata/chembl.rq -G https://qlever.cs.uni-freiburg.de/api/wikidata -o chembl2wikidata.csv
curl -H "Accept: text/csv" --data-urlencode query@wikidata/drugbank.rq -G https://qlever.cs.uni-freiburg.de/api/wikidata -o drugbank2wikidata.csv
curl -H "Accept: text/csv" --data-urlencode query@wikidata/swisslipids.rq -G https://qlever.cs.uni-freiburg.de/api/wikidata -o swisslipids2wikidata.csv
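A quick way to confirm that every query returned data is to check the line counts of the downloaded files:

wc -l *2wikidata.csv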
Thanks to Jerven Bolleman for their SPARQL endpoint work at SIB (which we used before) and to the Freiburg QLever team.
4.2 Get compound labels and InChIKeys
With similar SPARQL queries, the compound labels (English only, names.rq) and the InChIKeys (inchikey.rq) can be downloaded. The labels are saved as a simple TSV file, "names2wikidata.tsv" (note that this file is tab-separated):
curl -H "Accept: text/tab-separated-values" --data-urlencode query@wikidata/names.rq -G https://qlever.cs.uni-freiburg.de/api/wikidata -o names2wikidata.tsv
curl -H "Accept: text/csv" --data-urlencode query@wikidata/inchikey.rq -G https://qlever.cs.uni-freiburg.de/api/wikidata -o inchikey2wikidata.csv
groovy createDerby.groovy
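If the script runs out of memory on the full HMDB and ChEBI inputs, the Java heap size can be increased; the Groovy launcher picks up the JAVA_OPTS environment variable, so for example (4G is just an illustrative value):

JAVA_OPTS="-Xmx4G" groovy createDerby.groovy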
Test the resulting Derby file by opening it in PathVisio
Use the BridgeDb QC tool to compare it with the previous mapping file
The BridgeDb repository has a tool to perform quality control (qc) on ID mapping files:
sh qc.sh old.bridge new.bridge
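For example, assuming the report is written to standard output, it can be captured for the release notes (the file names here are placeholders):

sh qc.sh metabolites_old.bridge metabolites_new.bridge > qc_report.txt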
Upload the data to Figshare and update the following page of the BridgeDb (GitHub) website (examples provided below):
Tag this repository with the DOI of the latest release.
To ensure we know exactly which repository version was used to generate a specific release, the latest commit used for that release is tagged with the DOI on Figshare. To list all current tags:
git tag
To make a new tag, run:
git tag $DOI
where $DOI is replaced with the DOI of the release.
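For example, with a hypothetical Figshare DOI (the tag also needs to be pushed to GitHub):

git tag 10.6084/m9.figshare.1234567
git push origin 10.6084/m9.figshare.1234567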
At least the following projects need to be informed about the availability of the new mapping database: