Enhance NLP Interaction Network

This repository contains the code used to collect the information required to analyze Reactome failed queries.

Requirements

Extracting MeSH terms requires a UMLS license and account. If you do not have an account, register at https://utslogin.nlm.nih.gov/cas/login and set the credentials in the configuration YAML file.

Notebooks

Python - Reactome_PMID_Metadata_Extraction: generates reactome_pmid_metadata.tsv, which contains metadata for the PMIDs present in Reactome.
Python - Reactome_Failed_Query_Analysis: generates failed_query_analysis_output.tsv, which contains details about the failed query terms.
R - Reactome_Analysis: performs the analysis using the files generated above. If those files are not available, they are downloaded.

Supporting files

The MTI Web API is used to get MeSH terms through its batch processing. Its client code is written in Java, so pyjnius is used to run the JAR files, which are located in /lib.
These JAR files can be found in the ziy/skr-webapi repository.

The following files are generated by the Python notebooks. If you only want to run the analysis with the R code, they are downloaded automatically from these links:

File	Generated by	Source
reactome_pmid_metadata.tsv	Reactome_PMID_Metadata_Extraction.ipynb	Link
failed_query_analysis_output.tsv	Reactome_Failed_Query_Analysis.ipynb	Link

Steps to follow

Make a copy of parameters_sample.yml named parameters.yml and set the configuration in it. The following parameters must be changed in the YML file:
- MTI credentials (register at https://utslogin.nlm.nih.gov/cas/login)
```
mti:
  email_id : "example@example.com"
  username : "username"
  password : "password"
```
- INDRA Database REST URL
  indra_db_rest_url : "SET_INDRA_DB_URL"
- Reactome Parameters
  reactome_organism: "Homo sapiens"
- User Query
  query: "MATN2"
Please note: If you want to skip creating the metadata files and only run the analysis, skip steps 3 and 4 and continue from step 5; the required files will be downloaded.
Execute Reactome_PMID_Metadata_Extraction.ipynb. This generates reactome_pmid_metadata.tsv, which is used in step 5.
Execute Reactome_Failed_Query_Analysis.ipynb. This generates failed_query_analysis_output.tsv, which is required in step 5.

Do NOT perform step 5 with partially generated output files from steps 3 and 4. If you have partial files, delete them; the Rmd code will download preprocessed versions of any missing files.
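As a rough aid for spotting partial output, the sketch below checks each expected TSV for a header, at least one data row, and a consistent number of tab-separated fields per row. This is only a heuristic under stated assumptions: that the files have a header row and a fixed column count, neither of which is confirmed here. The helper names `looks_complete` and `suspicious_outputs` are hypothetical, and passing the check does not prove that a notebook ran to completion.

```python
import csv
from pathlib import Path

# Output file names as given in this README.
EXPECTED_FILES = ("reactome_pmid_metadata.tsv", "failed_query_analysis_output.tsv")


def looks_complete(path: Path) -> bool:
    """Heuristic check: header plus >= 1 data row, all rows as wide as the header."""
    if not path.is_file():
        return False
    with path.open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader, None)
        if header is None:  # empty file
            return False
        data_rows = 0
        for row in reader:
            if len(row) != len(header):  # truncated or malformed row
                return False
            data_rows += 1
    return data_rows > 0


def suspicious_outputs(directory: str = ".") -> list:
    """Names of expected TSV files that exist but fail the heuristic."""
    base = Path(directory)
    return [
        name
        for name in EXPECTED_FILES
        if (base / name).exists() and not looks_complete(base / name)
    ]


# Files listed here are candidates for deletion before running the Rmd.
print(suspicious_outputs("."))
```

A file that is absent is not reported, since missing files are downloaded by the Rmd code anyway.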

Curators' UI

*Please note:* This step requires the complete TSV files generated by steps 3 and 4. If these files are not present in your directory, or you skipped steps 3 and 4, they will be downloaded.
In the RStudio console, enter the following:
rmarkdown::render('Reactome_Analysis.Rmd', output_file = 'analysis_output.nb.html')
OR
Open [Reactome_Analysis.Rmd](./Reactome_Analysis.Rmd) in RStudio and run all the chunks with Ctrl + Alt + R to generate the analysis.

Output Files:

indra_output.html
Contains Statements from INDRA containing interactions for the query term
analysis_output.nb.html
Contains the analysis performed using Rmd file.
This file is not generated if you used the 'Run All' approach in the previous step. To get the HTML output in that case, use the rmarkdown::render command shown above.

To run all notebooks and R code

Installation (required when running without Docker)

pip install --no-cache-dir -r ./dependencies/requirements.txt
R -e 'source("./dependencies/installPackages.R")'

Make a copy of parameters_sample.yml named parameters.yml and set the configuration in it. The following parameters must be changed in the YML file:
- MTI credentials (register at https://utslogin.nlm.nih.gov/cas/login)
```
mti:
  email_id : "example@example.com"
  username : "username"
  password : "password"
```
- INDRA Database REST URL
  indra_db_rest_url : "SET_INDRA_DB_URL"
- Reactome Parameters
  reactome_organism: "Homo sapiens"
- User Query
  query: "MATN2"
Execute the Python Notebooks and R file
bash startup.sh path/to/parameters.yml

Output Files:

indra_output.html
Contains Statements from INDRA containing interactions for the query term
analysis_output.nb.html
Contains the analysis performed using Rmd file.

How to run locally using the Docker image pritishaw/reactome-failed-query-analysis

Pull Docker Image
docker pull pritishaw/reactome-failed-query-analysis:latest
Start Notebooks
docker run --name reactome-failed-query-analysis pritishaw/reactome-failed-query-analysis:latest
Follow the sequence of execution described above

How to run locally using jupyter/repo2docker (Docker)

Installation
pip install jupyter-repo2docker
Build and Start Notebooks
jupyter-repo2docker https://github.com/cannin/enhance_nlp_interaction_network_gsoc2020
Note: Docker must be running on the local machine.
A URL with a token will be printed in the terminal. Use that link to access Jupyter Notebooks and RStudio as follows:
Jupyter Notebooks: Open the link directly; all notebooks are visible at /notebooks
RStudio: Go to /rstudio to open RStudio
Follow the sequence of execution described above

Parameters

A sample file is available as parameters_sample.yml. The following settings can be configured in the file. To test the Python notebooks, you can use the template parameters_test.yml, which is configured to process a small subset of the query terms.

# PYTHON NOTEBOOK PARAMETERS ----
# Register at https://utslogin.nlm.nih.gov/cas/login for MTI credentials
mti:
  email_id : "example@example.com"
  username : "username"
  password : "password"

pmid_threshold : 20
indra_db_rest_url : "SET_INDRA_DB_URL"

reactome_failed_terms_link : "https://gist.githubusercontent.com/PritiShaw/03ce10747835390ec8a755fed9ea813d/raw/cc72cb5479f09b574e03ed22c8d4e3147e09aa0c/Reactome.csv"
failed_query_threshold : null # null Indicates all terms will be processed
failed_query_hits_threshold : 10

reactome_pmid_url : "https://reactome.org/download/current/ReactionPMIDS.txt"

failed_query_output_file_path : "failed_query_analysis_output.tsv"

pmid_chunk_limit : 0
pmid_metadata_output_path : "reactome_pmid_metadata.tsv"

# R NOTEBOOK (Rmd) PARAMETERS ----

# Notebook
max_dt_table_display : 100

# Python environment
python_virtualenv : "/srv/venv"

# General
min_failed_search_hits : 10

# Rank Terms
top_n_reactome_journals : 10
min_indra_query_term_count : 0
min_indra_statement_count : 0
min_pmc_citation_count : 0
min_oc_citation_count : 0

# Reactome Parameters
reactome_organism: "Homo sapiens"

# User Query
query: "MATN2"

# Output
all_mesh_by_top_level_pathways_file : "all_mesh_by_top_level_pathways_full.txt"
top_level_pathways_file : "top_level_pathways.txt"
indra_stmt_html_file : "indra_output.html"
indra_stmt_json_file : "indra_output.json"
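
Before running the notebooks, it can help to confirm that the mandatory parameters no longer hold the sample placeholders. The following is a minimal sketch, not part of the repository: the function name `find_unset_parameters` is hypothetical, and the placeholder values are copied from the sample configuration shown above. It operates on an already parsed dictionary, so it needs only the standard library.

```python
# Placeholder values as they appear in the sample configuration above.
PLACEHOLDERS = {
    ("mti", "email_id"): "example@example.com",
    ("mti", "username"): "username",
    ("mti", "password"): "password",
    ("indra_db_rest_url",): "SET_INDRA_DB_URL",
}


def find_unset_parameters(params: dict) -> list:
    """Return dotted names of mandatory parameters that are missing or unchanged."""
    unset = []
    for path, placeholder in PLACEHOLDERS.items():
        value = params
        for key in path:
            # Walk nested mappings; anything missing collapses to None.
            value = value.get(key) if isinstance(value, dict) else None
        if value in (None, "", placeholder):
            unset.append(".".join(path))
    return unset


# Example configuration (hypothetical values) with two parameters left unset.
sample = {
    "mti": {"email_id": "me@lab.org", "username": "username", "password": "s3cret"},
    "indra_db_rest_url": "SET_INDRA_DB_URL",
    "query": "MATN2",
}
print(find_unset_parameters(sample))  # ['mti.username', 'indra_db_rest_url']
```

Assuming PyYAML is installed, the dictionary could be obtained with `yaml.safe_load` on the contents of parameters.yml before calling the function.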

How to use papermill

Papermill is used to parameterize the Python notebooks. To use it, follow the steps below:

Install from requirements.txt
pip install --no-cache-dir -r ./dependencies/requirements.txt
Set up the config YAML file
Create a copy of parameters_sample.yml and make your changes.
Run the notebooks
papermill Reactome_Failed_Query_Analysis.ipynb failed_query_analysis.ipynb --log-output -k python3 -f PATH/TO/CONFIG/FILE.yml
papermill Reactome_PMID_Metadata_Extraction.ipynb pmid_metadata.ipynb --log-output -k python3 -f PATH/TO/CONFIG/FILE.yml
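
If you prefer to drive both runs from a script, the two commands above can be assembled programmatically. The sketch below only builds and prints the argument lists, using the same flags as the commands above; the function name `papermill_commands` is hypothetical, and actually executing the commands assumes papermill is installed and on the PATH.

```python
import shlex

# Input notebook -> executed output notebook, mirroring the two commands above.
NOTEBOOKS = {
    "Reactome_Failed_Query_Analysis.ipynb": "failed_query_analysis.ipynb",
    "Reactome_PMID_Metadata_Extraction.ipynb": "pmid_metadata.ipynb",
}


def papermill_commands(config_path: str) -> list:
    """Build one papermill argument list per notebook for the given config file."""
    return [
        ["papermill", source, target, "--log-output", "-k", "python3", "-f", config_path]
        for source, target in NOTEBOOKS.items()
    ]


for argv in papermill_commands("parameters.yml"):
    # Print the shell-quoted command; to execute it instead, one option is
    # subprocess.run(argv, check=True).
    print(shlex.join(argv))
```

Building argument lists rather than a single shell string avoids quoting problems when the config path contains spaces.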
