ExposuresProvider / icees-kg

Integrated Clinical and Environmental Exposures Service (ICEES) Knowledge Graph

1. Overview

This tool generates p-values and chi-squared values between CURIEs (compact URIs). These values are computed from the correlations between features in the ICEES patient/feature database.
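As a rough illustration of the kind of statistic involved, the sketch below computes a Pearson chi-squared value and its p-value for a 2x2 table of co-occurrence counts between two binary features. It is not the code used by this repository: the function name and the counts are hypothetical, and it assumes every row and column total is non-zero.

```python
import math


def chi_squared_2x2(table):
    """Pearson chi-squared statistic and p-value for a 2x2 contingency table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of the two features.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # A 2x2 table has 1 degree of freedom, for which the chi-squared
    # survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: rows = feature A absent/present,
# columns = feature B absent/present.
chi2, p_value = chi_squared_2x2([[30, 10], [10, 30]])
print(f"chi_squared={chi2:.3f} p_value={p_value:.3g}")
```

For larger tables or more degrees of freedom, a statistics library would be the more practical choice; the closed form above only holds for the 2x2 case.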

This tool uses two different repositories: 1) Data Services (https://github.com/RENCI-AUTOMAT/Data_services) 2) Plater (https://github.com/TranslatorSRI/Plater)

The Data Services repo is used to help generate the tsv files that define the nodes and edges of a graph. In this graph, the CURIEs are the nodes, and the edges link the nodes and carry a "p_value" property. Plater is used to create a neo4j database that has the CURIEs as nodes and the p_value properties attached to the edges between them.
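To make the graph shape concrete, here is a minimal sketch of two nodes and one edge with a "p_value" property, serialized as tab-separated text. The CURIEs, the column names (id, name, category, subject, predicate, object) and the predicate are assumptions chosen for illustration; the files produced by this repository may use different columns.

```python
import csv
import io

# Hypothetical CURIEs and column names, chosen only for illustration.
nodes = [
    {"id": "EXAMPLE:0001", "name": "feature A", "category": "biolink:NamedThing"},
    {"id": "EXAMPLE:0002", "name": "feature B", "category": "biolink:NamedThing"},
]
edges = [
    {
        "subject": "EXAMPLE:0001",
        "predicate": "biolink:correlated_with",
        "object": "EXAMPLE:0002",
        "p_value": 0.003,
    },
]


def to_tsv(rows):
    """Serialize a list of same-shaped dicts as tab-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0]), delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


nodes_tsv = to_tsv(nodes)
edges_tsv = to_tsv(edges)
print(nodes_tsv)
print(edges_tsv)
```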

These scripts were tested, and their assumptions written, for macOS and Python 3.9, but they should also work on Ubuntu and with other versions of Python. If an older version of Python is used, the version numbers of some packages in the ./requirements.txt file will likely need to be downgraded.

This should work on Windows as well, but file paths will need to be modified.

2. Getting Started

This section describes the steps that need to be performed before any p-values are computed or any neo4j databases are created.

a. Create a Virtual Python Environment

First, create, activate, and update a virtual environment.

python3.9 -m venv <path_to_venv>
source <path_to_venv>/bin/activate
pip install --upgrade pip

<path_to_venv> is usually set to ~/.venvs/

Next, install all the requirements needed to run the p-value scripts.

cd <path_to_icees_kg_folder>
pip install -r requirements.txt
pip install ./Plater/PLATER --no-dependencies

b. Install Docker

Follow the instructions at https://docs.docker.com/get-docker/ to install Docker on your machine.

c. Create .env file

Create a .env file in the root directory and add the following variables:

DATA_PATH="FILL_THIS_IN"
FEATURES_YAML="FILL_THIS_IN"
IDENTIFIERS_YAML="FILL_THIS_IN"
NODE_NORM="FILL_THIS_IN"
NAME_RESOLVER="FILL_THIS_IN"
DATASET_NAME="FILL_THIS_IN"
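Before running the scripts, it can help to confirm that no variable was left at its placeholder. The sketch below parses KEY="value" pairs with the standard library and lists the variables that are missing or still set to FILL_THIS_IN. The helper names and the sample path are hypothetical, and this is not how the repository itself loads the file.

```python
import shlex

REQUIRED_VARS = [
    "DATA_PATH",
    "FEATURES_YAML",
    "IDENTIFIERS_YAML",
    "NODE_NORM",
    "NAME_RESOLVER",
    "DATASET_NAME",
]


def parse_env_text(text):
    """Parse KEY="value" pairs separated by newlines or spaces; '#' starts a comment."""
    values = {}
    for token in shlex.split(text, comments=True):
        key, sep, value = token.partition("=")
        if sep:
            values[key] = value
    return values


def missing_or_placeholder(values):
    """Return the required variables that are absent, empty, or still FILL_THIS_IN."""
    return [
        name
        for name in REQUIRED_VARS
        if values.get(name, "") in ("", "FILL_THIS_IN")
    ]


# Hypothetical .env contents; a real check would read the .env file from disk.
sample = 'DATA_PATH="/tmp/icees"\nFEATURES_YAML="FILL_THIS_IN"\n'
config = parse_env_text(sample)
print("still to fill in:", missing_or_placeholder(config))
```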

3. Generate P-Values tsv file

Two scripts need to be run to generate the tsv files that the PLATER CLI tools use to create a neo4j database with CURIE p-values. Both scripts live in the ./tsv_maker/ folder.

First, set the environment variables before running any of the tsv_maker scripts.

cd ./tsv_maker
chmod +x ./set_up_test_env.sh
source ./set_up_test_env.sh

Next, the node and edge json files are created using the make_jsons.py script, run from the ./tsv_maker folder entered above. Note: This takes a long time to run, so it is best to leave it running overnight.

python make_jsons.py

The make_jsons.py script creates two files: p_val_edges.json and p_val_nodes.json. These files are then converted to .tsv files with the following script.

python jsons_to_tsv.py

The jsons_to_tsv.py script creates two files: p_val_edges.tsv and p_val_nodes.tsv. These files are used in the next steps to populate the neo4j database.
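Because the upload step takes a while, a quick consistency check on the two tsv files can save a failed run. The sketch below reports edges whose endpoints do not appear in the node table. It assumes the node file has an "id" column and the edge file has "subject" and "object" columns, which may not match the actual headers; the sample rows are hypothetical.

```python
import csv
import io


def dangling_edges(nodes_tsv_text, edges_tsv_text):
    """Return the edge rows whose subject or object is not listed in the node table."""
    node_reader = csv.DictReader(io.StringIO(nodes_tsv_text), delimiter="\t")
    node_ids = {row["id"] for row in node_reader}
    edge_reader = csv.DictReader(io.StringIO(edges_tsv_text), delimiter="\t")
    return [
        row
        for row in edge_reader
        if row["subject"] not in node_ids or row["object"] not in node_ids
    ]


# Hypothetical file contents; real files would be read with open(path).read().
nodes_text = (
    "id\tcategory\n"
    "EXAMPLE:0001\tbiolink:NamedThing\n"
    "EXAMPLE:0002\tbiolink:NamedThing\n"
)
edges_text = (
    "subject\tpredicate\tobject\tp_value\n"
    "EXAMPLE:0001\tbiolink:correlated_with\tEXAMPLE:0002\t0.003\n"
    "EXAMPLE:0001\tbiolink:correlated_with\tEXAMPLE:0003\t0.2\n"
)
bad = dangling_edges(nodes_text, edges_text)
print(len(bad), "edge(s) reference unknown nodes")
```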

Local Development:

If you want to run everything locally, your local instance of neo4j needs to have the apoc plugin installed. Run the following command to start a Docker container with the plugin enabled:

sudo docker run -d --name icees_kg \
    -p 7474:7474 \
    -p 7687:7687 \
    -e NEO4J_AUTH=neo4j/test \
    -e NEO4JLABS_PLUGINS=\[\"apoc\"\] \
    -e NEO4J_apoc_export_file_enabled=true \
    -e NEO4J_apoc_import_file_enabled=true \
    -e NEO4J_apoc_import_file_use__neo4j__config=true \
    -v $PWD/data:/data \
    -v $PWD/backups:/backups \
    neo4j:4.2

Use PLATER to stand up API

Now that you have a neo4j database up and running, PLATER can be used to spin up a TRAPI API.

Navigate to the Plater folder and run the main plater script.

cd ../Plater
chmod +x main.sh
./main.sh

If you would like to use a neo4j database on another port, with a different name, or with a different password, modify the .env file in the Plater folder.

This spins up a PLATER API that can be accessed at port 8080 (as defined in the .env file). The API documentation can be found at http://localhost:8080/docs.
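Once the API is up, a one-hop TRAPI query can be sent to it. The sketch below builds a query graph asking for anything connected to a given CURIE and shows how it could be posted with the standard library. The CURIE is hypothetical, and the /query path is an assumption: check the documentation page at /docs for the route your PLATER version actually exposes.

```python
import json
import urllib.request


def build_trapi_query(curie):
    """Build a one-hop TRAPI query: any node connected to the given CURIE."""
    return {
        "message": {
            "query_graph": {
                "nodes": {"n0": {"ids": [curie]}, "n1": {}},
                "edges": {"e0": {"subject": "n0", "object": "n1"}},
            }
        }
    }


def post_query(payload, url="http://localhost:8080/query"):
    """POST the payload as JSON and return the decoded response.

    The /query path is an assumption, not confirmed by this README.
    """
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


payload = build_trapi_query("EXAMPLE:0001")  # hypothetical CURIE
print(json.dumps(payload, indent=2))
# With PLATER running, the query could be sent with: result = post_query(payload)
```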

4. Create and update a neo4j database using KGX

First, create a neo4j database Docker container:

sudo docker run -d --name icees_kg \
    -p 7474:7474 \
    -p 7687:7687 \
    -e NEO4J_AUTH=neo4j/test \
    -v $PWD/data:/data \
    -v $PWD/backups:/backups \
    neo4j:4.2
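The database can take a little while to start after the container is created, so it may be worth checking that it responds before moving on. The sketch below polls the HTTP port published by the command above until it answers. The function names, timeouts, and polling interval are hypothetical choices, and an HTTP response only suggests that the server is up, not that it is ready for a large upload.

```python
import http.client
import time
import urllib.request


def http_ready(url, timeout_s=2.0):
    """Return True if a GET request to url returns a successful response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return 200 <= response.status < 300
    except (OSError, http.client.HTTPException):
        # Covers refused or reset connections, timeouts, and HTTP error statuses.
        return False


def wait_until_ready(url, total_timeout_s=120.0, interval_s=3.0):
    """Poll url until it responds or total_timeout_s elapses."""
    deadline = time.monotonic() + total_timeout_s
    while True:
        if http_ready(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# One-shot check of the HTTP port (7474) published by the docker command above.
print("neo4j responding:", http_ready("http://localhost:7474"))
```

Calling wait_until_ready("http://localhost:7474") instead blocks until the server answers or the timeout expires.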

Second, use kgx to populate the neo4j database with the .tsv files created in the previous section. NOTE: This takes about 1 hour to run.

kgx neo4j-upload --uri bolt://localhost:7687 --username neo4j --password test --input-format tsv ./build/p_val_nodes.tsv ./build/p_val_edges.tsv

5. Create dump file of neo4j database

In order to dump the database, the docker container needs to be stopped:

docker stop icees_kg

To dump the database, run:

sudo docker run -i -t --rm \
    -v $PWD/data:/data \
    -v $PWD/backups:/backups \
    --entrypoint /bin/bash \
    neo4j:4.2

This opens a shell in a temporary container based on the neo4j image. In that shell, run:

neo4j-admin dump --to=/backups/icees_kg.dump

6. Take the dump file and upload it to Kubernetes