HazyResearch / dd-genomics

The Genomics DeepDive project
Apache License 2.0

TO-DO LIST

See Milestones/Issues.

SETUP:

Setting up the dd-genomics repo:

  1. Initialize & update submodules: git submodule update --init. (Note: you will need an SSH key for the computer being used set up with GitHub, as well as permission to access the submodule repos.)

  2. Define a db.url file, such as: [postgres|greenplum]://localhost:6432/genomics_tpalo (a combined sketch of steps 2-4 follows this list).

  3. Copy the template file env.sh to env_local.sh and modify it with your local settings (it's ignored by git, and preferred by the run script). Make sure to set your PATH so that the correct version of psql is on it. In particular, be sure to define the variable APP_HOME as the path to your dd-genomics repo.

  4. Install NLTK: sudo pip install nltk. Then download the WordNet corpus: in Python, run import nltk; nltk.download() and select the wordnet corpus.

  5. Fetch and process ontology files. All the code to build the files in onto is in make_all.sh (in principle), but the script is extremely brittle. In practice, delete or rename your own onto directory and copy Johannes' main onto directory from /lfs/raiders7/0/jbirgmei/onto instead.

  6. Pre-process & load the data: See the Parser README for detailed instructions; then save the output table to input/sentences_input.* (or copy an existing sentences_input table to this location).

  7. Source the environment vars: source env_local.sh. NOTE that this should be done before any deepdive run or action!

  8. Compile the application: deepdive compile
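
A minimal shell sketch of steps 2-4, assuming a Greenplum database named genomics_$USER on port 6432 (the database name, port, and paths are placeholders; substitute your own):

# Step 2: database URL (placeholder name and port)
echo "greenplum://localhost:6432/genomics_$USER" > db.url

# Step 3: local environment file (ignored by git); edit it to set APP_HOME
# to your dd-genomics checkout and to put the right psql on your PATH
cp env.sh env_local.sh

# Step 4: NLTK plus the WordNet corpus, downloaded non-interactively
sudo pip install nltk
python -c "import nltk; nltk.download('wordnet')"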

Running DeepDive:

Running views

Labeling data:

Go to labeling/ and follow the documentation

Evaluate the System

Run ./evaluation.sh $RELATION $VERSION [$CONFIDENCE] [$OPTOUT], where:

The evaluation.sh script will compute the necessary statistics of your current performance using the holdout set and output the following files:

These files are stored under results_log/$USERNAME/$RELATION-$DATE/.
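
For example, evaluating the causation relation against labeling set version 3 might look like the following (the relation name and version number here are assumptions; substitute whatever relation and labeling set you are actually evaluating):

source env_local.sh
./evaluation.sh causation 3
# results appear under results_log/<your username>/causation-<date>/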

Note: only the stats files are shared via GitHub.

CAVEAT: in the current implementation, changing the input data requires manually updating the path of the sentences_input file in compute_causation_stats.sh, compute_association_stats.sh, and compute_gene_stats.sh.
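
To locate the hard-coded path that needs updating, a recursive grep over the stats scripts should do (their exact location in your checkout may vary):

grep -rn "sentences_input" . --include="compute_*_stats.sh"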

More detailed evaluation: go to the util directory and execute the scripts there. Their names hopefully explain what they're for. E.g., execute cd util; ./gp_precision_stats 3 to get precision holdout statistics for labeling set version 3, or cd util; ./gp_precision_stats to get precision statistics for all labeling set versions.

CURRENT EXPECTATION CUTOFFS:

The files results_log/*_cutoff (e.g., g_cutoff and gp_cutoff) contain the current expectation cutoff (e.g., 0.5 and 0.75).
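
To check the current cutoffs, assuming these are plain single-value files as described above:

cat results_log/g_cutoff     # e.g. 0.5
cat results_log/gp_cutoff    # e.g. 0.75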

Error Analysis

The Error Analysis doc is at https://docs.google.com/document/d/1u6fPO55YGR5BpJOJTDypAt8MfQImhS9PF9E_bkCqiz8/edit?usp=sharing .

Raiders 7 notes...

DATA:

[8/8/15]: Current datasets to use (with ROOT=/dfs/scratch0/ajratner):

OLD STUFF...

Notes on Simple Debugging Routines

Basic TSV extractor debugging

One very simple routine is to find some sentences in the database that are suitable for testing; for example, for basic debugging of the pheno_extract_candidates.py UDF, we can execute the following query in psql:

COPY (SELECT 
        doc_id, sent_id, words, lemmas, poses, ners 
      FROM sentences_input 
      WHERE words LIKE '%myeloid%'
      LIMIT 10)
TO '/tmp/pheno_extractor_debugging_myeloid_10.tsv' 
WITH DELIMITER E'\t';

We then just debug using print statements in the code, etc., as we normally would with any standalone Python script:

python code/pheno_extract_candidates.py < /tmp/pheno_extractor_debugging_myeloid_10.tsv
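
Before touching the UDF itself, a tiny standalone check can confirm that the exported TSV rows have the six expected columns (this helper is hypothetical, not part of the repo):

# check_tsv.py -- verify each input row has the 6 expected columns
import sys

for i, line in enumerate(sys.stdin):
    cols = line.rstrip('\n').split('\t')
    if len(cols) != 6:
        sys.stderr.write("row %d has %d columns (expected 6)\n" % (i, len(cols)))

Run it the same way as the extractor: python check_tsv.py < /tmp/pheno_extractor_debugging_myeloid_10.tsv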

Running Dashboard for Reports

Make sure you have run util/update-mindbender.sh at least once. It will download the util/mindbender command, which includes Mindtagger as well as Dashboard.

(On raiders2, to get the correct psql version, do: export PATH=/dfs/scratch1/netj/wrapped/greenplum:$PATH first.)

To produce a set of reports using Dashboard after a GDD run, use the following steps:

(
. env_local.sh
export GDD_PIPELINE=
util/mindbender snapshot
)

This will produce a set of reports under the directory pointed to by snapshot/LATEST, as configured in snapshot-default.conf.

To view the produced reports, you can use the Dashboard GUI by starting it with the following command and opening the URL it prints:

(
. env_local.sh
export GDD_PIPELINE=
PORT=12345 util/mindbender dashboard
)

You may need to change the PORT=12345 value if someone else is already using that port. Once the Dashboard URL is loaded in your web browser, you can navigate to the first snapshot in the top "View Snapshots" dropdown.

Running Mindtagger for Labeling and Evaluation

Another great way to understand the output of the DeepDive system is to inspect a sample of individual examples and perform error analysis. We use a GUI tool called Mindtagger to expedite the labeling tasks necessary for this evaluation. Mindtagger provides a clean interface for inspecting individual mention candidates.

We have created a set of Mindtagger labeling templates for genomics-related tasks. First, create a new task by running

cd labeling
./create-new-task.sh TASK

where TASK is the name of a labeling task (run create-new-task.sh with no arguments for a list of all tasks). Currently, the most useful tasks are:

See eval/ directory for specific scripts for evaluation in other ways (e.g. pheno recall evaluation against MeSH).

Once you've created your task(s), start the Mindtagger GUI by running:

./start-gui.sh

and then open a browser to localhost to view all the created tasks & label data!

Running the Dashboard Snapshot

To run a dashboard snapshot, do

(source ./env_local.sh; ./util/mindbender snapshot gill) 2>&1 | grep -v 'declare'

Then start the dashboard, if it's not already running, with

(source env_local.sh
export GDD_PIPELINE=
PORT=XXXX util/mindbender dashboard
)

PATH

Add the following to your ~/.zshrc on raiders7:

export PATH=/lfs/raiders7/0/USERNAME/local/bin:/usr/local/greenplum-db/bin_wrapped:~/local/bin:/usr/local/jdk1.8.0_66/bin:/lfs/raiders7/0/USERNAME/deepdive/util:$PATH

Getting Data From NCBI

Main changes from application.conf:

- During a run, the driver now fills a temporary table and replaces the corresponding table only at the end of the extractor's run. Therefore, no data is lost if a run is stopped in the middle.

- The column on which a table is distributed on Greenplum is now defined by the annotation @distributed_by in front of the respective column (see the sketch after this list). The driver detects whether the database in use is Postgres or Greenplum (specified in db.url) and adds the "DISTRIBUTED BY ..." clause only if needed, so there is no need to maintain separate app.ddlog files for psql and Greenplum.

- To create views, look at the documentation in /lfs/raiders7/0/tpalo/dd-genomics_for_views (to come soon).

- For more information about the ddlog language, look at http://deepdive.stanford.edu/doc/basics/ddlog.html and https://github.com/HazyResearch/dd-genomics.

- There is a slight bug in the compiler for now: when a view is defined by an extractor in app.ddlog, the command "deepdive compile" returns an error stating that the table is not declared. This can be fixed by adding deepdive.schema.relations { name_of_the_view { "type": "view" } } to deepdive.conf (a combined deepdive.conf sketch appears at the end of these notes).

- There are many temporary tables and views (due to complicated SQL queries that cannot be translated directly into DDlog). This should not affect the speed of the process.
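
An illustrative DDlog schema declaration showing the @distributed_by annotation described above (the relation and columns are made up for the example; the real declarations live in app.ddlog):

some_mentions(
    @distributed_by
    doc_id     text,   # Greenplum distributes the table by this column
    sent_id    text,
    mention_id text
).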

Limitations and remarks:

- Inputs are now loaded with the same "deepdive do ..." command. Therefore, all input and load steps live in the input/ folder, which contains scripts or links to the real data (most of the time in onto/).

- Accessing an element of an array does not work in DDlog. Therefore, in the extractor "non_gene_acronyms_extract_candidates", the SQL query is slightly different (it does not include gm.wordidxs[1] and a.words[a.wordidx] LIKE '-LRB-';) and the UDF non_gene_acronyms_extract_candidates_ddlog.py is slightly changed to make this comparison in the Python script.

- "delete from" doesn't exist in ddlog, therefore, for gene_mentions, we have to create a temporary table gene_mentions_temp_before_non_gene_acronyms_delete_candidates which contains all the rows. Only the rows with the good criteria for non_gene_acronyms are put in gene_mentions.

- In DDlog, we cannot insert into pheno_mentions twice when the second insertion depends on the first. Therefore, the table pheno_mentions_without_acronyms is created first and used to compute pheno_acronyms_aggregate_candidates. Then the result of pheno_acronyms_insert_candidates and the initial extractions in pheno_mentions_without_acronyms are both put into pheno_mentions.

- The shell script ${APP_HOME}/util/serialize_genepheno_pairs_split.sh genomics cannot be expressed in DDlog. Therefore we add it to the deepdive.conf file with the corresponding input and output tables, and it is correctly placed in the overall run.

- The holdout fraction cannot be defined in DDlog. Therefore we add the line "calibration.holdout_fraction: 0.1" to the deepdive.conf file (see the sketch below).

- In DDlog, when we create variables and inference rules, a variable is not just a column of an existing table; it has to be a new table. For instance, for the is_true variable of gene_mentions_filtered, we create the table gene_mentions_filtered_inference. Therefore, don't forget to add to the pipeline the extractor linking the *_inference table and the corresponding mention table! Here, for instance, we have to add the extractor ext_gene_mentions_filtered_inference, otherwise the inference step produces no results.
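
A minimal deepdive.conf sketch combining the two additions mentioned above (name_of_the_view is a placeholder for whatever view your extractor defines; adjust the holdout fraction as needed):

# deepdive.conf additions
deepdive.schema.relations {
  name_of_the_view { "type": "view" }
}
calibration.holdout_fraction: 0.1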