This repository contains the workflow for generating the SRI Reference Knowledge Graph, a combination of the integrated Monarch KG and other relevant data sources.
The purpose of this KG is to serve the following communities,
There are several ways of building the graph.
One way is to parse the N-Triples through KGX.
The transform.yaml lists all the sources that are transformed as part of this workflow. Each source has its own specific properties to facilitate the parsing of the N-Triples by KGX.
The transform.yaml can be used to generate a set of TSVs for each source in the KGX interchange format.
The merge.yaml lists all the sources in TSV format (as generated by KGX) which are used in the merge process to generate an integrated KG.
First, create a folder called data:
mkdir data && cd data
Then download all the required N-Triples to the data folder:
wget -r -nd "https://archive.monarchinitiative.org/@DATA_VERSION@/rdf/blcategories/"
Here, @DATA_VERSION@ must be replaced with a valid data version from archive.monarchinitiative.org.
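The substitution can be scripted instead of editing the URL by hand. The following is a minimal sketch, assuming the archive URL layout shown above. The function name monarch_rdf_url is hypothetical, and 202009 is used only because it is the Makefile's default DATA_VERSION. The script prints the wget command for review instead of running it.

```shell
# Hypothetical helper: build the N-Triples archive URL for a given data version.
monarch_rdf_url() {
    printf 'https://archive.monarchinitiative.org/%s/rdf/blcategories/\n' "$1"
}

# 202009 is the default DATA_VERSION in the Makefile; substitute the version you need.
DATA_VERSION=${DATA_VERSION:-202009}

# Print the download command for review; remove the echo to actually run it.
echo wget -r -nd "$(monarch_rdf_url "$DATA_VERSION")"
```

Keeping the version in a single variable makes it easier to use the same value for the download and for the DATA_VERSION argument passed to make.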
Also be sure to get Monarch Ontologies in OBOGraph JSON form:
wget https://ci.monarchinitiative.org/view/pipelines/job/monarch-ontology-json-sri/lastSuccessfulBuild/artifact/build/monarch-ontology-sri-translator.json
And ChEBI in OBOGraph JSON form:
wget http://kg-hub.berkeleybop.io/frozen_incoming_data/chebi.json.gz
Then, compress all the files in the data folder:
pigz -p 2 -9r *
First, set up a virtual environment. Note that the kgx merge step requires Python >= 3.8:
# create a new virtual environment
python3.8 -m venv env
# activate the virtual environment
source env/bin/activate
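Since the kgx merge step requires Python >= 3.8, it may help to confirm the interpreter version before installing anything. This is a sketch only: py_version_ok is a hypothetical helper, not part of this repository, and it compares just the major and minor components of the version string.

```shell
# Hypothetical helper: succeed if a version string such as "3.8.10" is at least 3.8.
py_version_ok() {
    major=${1%%.*}     # text before the first dot
    rest=${1#*.}       # text after the first dot
    minor=${rest%%.*}  # second component
    [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 8 ]; }
}

# Check the python3 on PATH, if there is one.
if command -v python3 >/dev/null 2>&1; then
    version=$(python3 -c 'import platform; print(platform.python_version())')
    if py_version_ok "$version"; then
        echo "Python $version is new enough"
    else
        echo "Python $version is too old for kgx merge" >&2
    fi
fi
```

Run inside the activated environment, this reports the version that pip and kgx will actually use.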
Then install the dependencies listed in requirements.txt,
pip install -r requirements.txt
There is a Makefile that runs the following workflow,
kgx transform
kgx merge
kgx neo4j-upload
To run the workflow,
make all
The Makefile relies on a set of arguments that drive its behavior, with the following defaults:
DATA_DIR=data
OUTPUT_DIR=data-parsed
PROCESSES=1
NEO4J_DATA_DIR=`pwd`/neo_data
SUFFIX=build
DATA_VERSION=202009
KG_VERSION=0.3.0
To override the defaults,
make all SUFFIX=build_20201021 PROCESSES=4 DATA_DIR=monarch-data OUTPUT_DIR=sri-reference-kg-0.3.0 KG_VERSION=0.3.0
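Variables given on the make command line take precedence over ordinary assignments inside a Makefile, which is why the invocation above works without editing the file. The SUFFIX in that example looks like a date stamp, so, assuming that convention, a small wrapper could generate it. The function build_make_cmd below is hypothetical, and it only prints the command instead of running it.

```shell
# Hypothetical helper: print a make invocation with a dated SUFFIX.
# $1 is a date stamp (YYYYMMDD), $2 is the number of processes.
build_make_cmd() {
    printf 'make all SUFFIX=build_%s PROCESSES=%s\n' "$1" "$2"
}

# Use today's date and 4 processes; the command is printed, not executed.
build_make_cmd "$(date +%Y%m%d)" 4
```

Any argument left out of the command line keeps its default from the Makefile.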
Note: To run the pipeline end-to-end, you need a machine with at least 8 CPU cores and 100 GB of memory.
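A rough pre-flight check for these requirements might look like the sketch below. It assumes a Linux host where /proc/meminfo is readable, and meets_requirements is a hypothetical helper. The kernel reports slightly less memory than is physically installed, and the value is truncated to whole gigabytes, so a machine sold as 100 GB can fall just under the threshold. Treat the result as a hint.

```shell
# Hypothetical helper: succeed if the given core count and memory (in GB)
# meet the stated minimum of 8 cores and 100 GB.
meets_requirements() {
    [ "$1" -ge 8 ] && [ "$2" -ge 100 ]
}

# Gather values on Linux; /proc/meminfo reports MemTotal in kB.
if [ -r /proc/meminfo ] && command -v getconf >/dev/null 2>&1; then
    cores=$(getconf _NPROCESSORS_ONLN)
    mem_gb=$(awk '/^MemTotal:/ { print int($2 / 1024 / 1024) }' /proc/meminfo)
    if meets_requirements "$cores" "$mem_gb"; then
        echo "OK: $cores cores, $mem_gb GB"
    else
        echo "Below the recommended minimum: $cores cores, $mem_gb GB" >&2
    fi
fi
```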