See Milestones/Issues.
Setting up the dd-genomics repo:
Initialize & update submodules: git submodule update --init (Note: you will need an SSH key for this machine registered with GitHub, as well as permission to access the submodule repos.)
Define a db.url file, such as: [postgres|greenplum]://localhost:6432/genomics_tpalo
Copy the template file and modify the copy with your local settings (it's ignored by git, and preferred by the run script). Make sure to set your PATH so that the correct version of psql is on it. In particular, be sure to define the variable APP_HOME as the path to your dd-genomics repo.
Install nltk: sudo pip install nltk. Then, in Python, import nltk; and download the wordnet corpus.
Fetch and process ontology files. In principle, the code to make all the files in onto is included. However, the script is extremely brittle. In practice, delete or rename your own onto directory, then copy Johannes' main onto directory from /lfs/raiders7/0/jbirgmei/onto
Pre-process & load the data: See the Parser README for detailed instructions; then save the output table to input/sentences_input.* (or copy an existing sentences_input table to this location).
Source the environment vars (with source). NOTE that this must be done before any deepdive run or action!
Compile the application: deepdive compile
Run the command deepdive do ... with the name of the table you want to fill. DeepDive will list the operations it has to perform for this table and select a plan that includes all upstream operations. For example, to run the whole pipeline, use do on the last table: deepdive do calibration-plots
To mark a table as done (done), or conversely as yet to be done (todo), use the command deepdive mark ... DeepDive will also mark all downstream operations. For example, to mark every process as not yet done, use mark todo on the first table: deepdive mark todo init/db
Run deepdive plan to see all the possible operations.
You can view the overall flow of the application in ${APP_HOME}/run/dataflow.svg (Chrome works well for this).
Overall, just run deepdive to see all the available commands.
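The upstream/downstream behaviour of deepdive do and deepdive mark described above can be pictured with a toy dependency graph. This is only an illustrative sketch, not DeepDive's implementation: the DEPENDS_ON table is hypothetical, and of its entries only init/db and calibration-plots are names that appear in this README.

```python
from collections import defaultdict

# Hypothetical, simplified dataflow: each target maps to the targets it needs.
# Only "init/db" and "calibration-plots" come from this README; the two middle
# entries are placeholders chosen for illustration.
DEPENDS_ON = {
    "init/db": [],
    "sentences_input": ["init/db"],
    "gene_mentions": ["sentences_input"],
    "calibration-plots": ["gene_mentions"],
}


def upstream(target, deps=DEPENDS_ON):
    """Return target plus everything it transitively depends on, in run order.

    This mimics what a plan for `deepdive do <target>` has to cover.
    """
    order = []
    seen = set()

    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for parent in deps[node]:
            visit(parent)
        order.append(node)

    visit(target)
    return order


def downstream(target, deps=DEPENDS_ON):
    """Return target plus everything that transitively depends on it.

    This mimics what `deepdive mark todo <target>` has to invalidate.
    """
    children = defaultdict(list)
    for node, parents in deps.items():
        for parent in parents:
            children[parent].append(node)
    result, stack = [], [target]
    while stack:
        node = stack.pop()
        if node not in result:
            result.append(node)
            stack.extend(children[node])
    return result


if __name__ == "__main__":
    print(upstream("calibration-plots"))
    print(downstream("init/db"))
```

In this toy graph, asking for the last target pulls in the whole chain, and marking the first target as todo touches every target after it, which is why the two examples above use the last and the first table respectively.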
You can prepare all the data by simply running the script. Then run the commands it displays by saving and quitting (:wq) each of the vim files that open (I will try to add a pipeline for that). After some time, the run ends once all the indexes required for the views have been created. You can then launch the views (very quick) with ES_HEAP_SIZE=25g; PORT=$RANDOM mindbender search gui. These instructions are also displayed at the end of the script. The link for accessing your views will be displayed in the terminal.
A few comments:
Go to labeling/ and follow the documentation.
The script will compute the necessary statistics of your current performance using the holdout set and output the following files:
- the stats file: contains the summary statistics plus a breakdown over the labelers
- TP.tsv: contains the true positives with four columns (relation_id, label, labeler, expectation)
- FP.tsv: contains the false positives with four columns (relation_id, label, labeler, expectation)
- FN.tsv: contains the false negatives with four columns (relation_id, label, labeler, expectation)
- holdout_set.tsv: contains the full holdout set along with the labels and labelers
- input_data: contains a path to the input sentence data and the number of sentences, as a sanity check
These files are stored under results_log/$USERNAME/$RELATION-$DATE/
Note: Only the stats files are shared via Github
CAVEAT: in the current implementation, changing the input data requires a manual modification: updating the path of the sentences_input file.
More detailed evaluation: Go to the util directory and execute the scripts there. Their names hopefully explain what they're for. E.g., execute cd util; ./gp_precision_stats 3 to get precision holdout statistics for labeling set version 3; execute cd util; ./gp_precision_stats to get precision statistics for all labeling set versions.
The files results_log/*_cutoff (e.g., g_cutoff and gp_cutoff) contain the current expectation cutoff (e.g., 0.5 and 0.75).
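To make the relationship between the expectation cutoff and the TP/FP/FN files concrete, here is a minimal sketch of how holdout rows could be split. It is not the evaluation script itself. It assumes that a positive human label is stored as the string "t", and that a candidate counts as predicted positive when its expectation is greater than or equal to the cutoff; both are assumptions, so check them against the real files.

```python
def split_by_outcome(rows, cutoff):
    """Split holdout rows into true positives, false positives, false negatives.

    Each row is (relation_id, label, labeler, expectation), the column layout
    listed for TP.tsv / FP.tsv / FN.tsv. Assumptions: label == "t" means a
    positive human label, and expectation >= cutoff means predicted positive.
    """
    tp, fp, fn = [], [], []
    for relation_id, label, labeler, expectation in rows:
        predicted = float(expectation) >= cutoff
        actual = label == "t"
        if predicted and actual:
            tp.append(relation_id)
        elif predicted:
            fp.append(relation_id)
        elif actual:
            fn.append(relation_id)
        # predicted negative and labeled negative: true negative, not reported
    return tp, fp, fn


def precision(tp, fp):
    """Precision over the holdout set, or None if nothing was predicted positive."""
    total = len(tp) + len(fp)
    return len(tp) / total if total else None


if __name__ == "__main__":
    # Made-up rows purely for illustration.
    sample = [
        ("r1", "t", "alice", "0.90"),
        ("r2", "f", "bob", "0.80"),
        ("r3", "t", "alice", "0.20"),
        ("r4", "f", "bob", "0.10"),
    ]
    tp, fp, fn = split_by_outcome(sample, cutoff=0.5)
    print(tp, fp, fn, precision(tp, fp))
```

Raising the cutoff (for example from 0.5 to 0.75) moves borderline candidates out of the predicted-positive set, which typically trades recall for precision.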
The Error Analysis doc is at .
mosh yourusername@raiders7
[8/8/15]: Current datasets to use (with ROOT=/dfs/scratch0/ajratner):
raiders2:genomics_production.{sentences, doc_metadata}
As a very simple routine, we can find some sentences in the database that would be decent for testing; for example, for basic debugging of the UDF, we can execute the following query in psql:
COPY (SELECT doc_id, sent_id, words, lemmas, poses, ners
FROM sentences_input
WHERE words LIKE '%myeloid%'
LIMIT 10)
TO '/tmp/pheno_extractor_debugging_myeloid_10.tsv';
We then just debug using print statements in the code etc., as we normally would with any standalone Python script:
python code/ < /tmp/pheno_extractor_debugging_myeloid_10.tsv
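For a quick sanity check of the exported file before involving the real UDF, a small stand-alone reader like the sketch below can help. It is not part of the repo: the Sentence tuple and parse_line helper are hypothetical names, the column order is taken from the query above, and the array columns (words, lemmas, poses, ners) are deliberately left as raw strings because their on-disk encoding depends on how the table was exported.

```python
import sys
from collections import namedtuple

# Column order taken from the query above.
Sentence = namedtuple("Sentence", "doc_id sent_id words lemmas poses ners")


def parse_line(line):
    """Parse one tab-separated line into a Sentence; array columns stay raw."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(Sentence._fields):
        raise ValueError(
            "expected %d columns, got %d" % (len(Sentence._fields), len(fields))
        )
    return Sentence(*fields)


def main(stream):
    """Print a short summary of every well-formed line, report the others."""
    for line_number, line in enumerate(stream, start=1):
        try:
            row = parse_line(line)
        except ValueError as err:
            print("line %d skipped: %s" % (line_number, err), file=sys.stderr)
            continue
        print(row.doc_id, row.sent_id, row.words[:60])


# Only read stdin when input is actually piped or redirected in.
if __name__ == "__main__" and not sys.stdin.isatty():
    main(sys.stdin)
```

Saved under a name of your choice, it is invoked the same way as the UDF, with the TSV file redirected to stdin.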
Make sure you have run util/ at least once. It will download the util/mindbender command, which includes Mindtagger as well as Dashboard.
(On raiders2, to get the correct psql version, do: export PATH=/dfs/scratch1/netj/wrapped/greenplum:$PATH)
Run the command "deepdive do ..." with the name of the table you want to fill. DeepDive will suggest all the operations it has to perform for that. For instance, to run the whole pipeline, run "deepdive do model/calibration-plots".
To mark tables as done or todo, use the command "deepdive mark ...". For instance, to mark every process as not yet done, run "deepdive mark todo init/db".
Run "deepdive plan" to see all the possible operations.
Overall, just run "deepdive" to see all the available commands.
You can view the overall flow of the application in ${APP_HOME}/run/dataflow.svg (Chrome, for instance, works well for it).
To produce a set of reports using Dashboard after a GDD run, use the following steps:
util/mindbender snapshot
This will produce a set of reports under the directory pointed to by snapshot/LATEST, as configured in snapshot-default.conf
To view the produced reports, you can use the Dashboard GUI by starting it with the following command and opening the URL it prints:
PORT=12345 util/mindbender dashboard
You may need to change the PORT=12345 value if someone else is already using it.
When the Dashboard URL is loaded in your web browser, you can navigate to the first snapshot in the top "View Snapshots" dropdown.
Another great way to understand the output of the DeepDive system is to inspect a sample of individual examples and perform error analysis. We use a GUI tool called Mindtagger to expedite the labeling tasks necessary for performing this evaluation. Mindtagger provides a clean interface for inspecting individual mention candidates.
We have created a set of Mindtagger labeling templates for genomics-related tasks. First, create a new task by running
cd labeling
where TASK is the name of a labeling task (run with no arguments for a list of all tasks).
Currently, the most useful tasks are:
See the eval/ directory for specific scripts for evaluation in other ways (e.g. pheno recall evaluation against MeSH).
Once you've created your task(s), start the Mindtagger GUI by running:
and then open a browser to localhost to view all the created tasks & label data!
To run a dashboard snapshot, do
(source ./; ./util/mindbender snapshot gill) 2>&1 | grep -v 'declare'
Then start the dashboard, if it's not already running, with
PORT=XXXX util/mindbender dashboard
Add the following to your zshrc on raiders7:
export PATH=/lfs/raiders7/0/USERNAME/local/bin:/usr/local/greenplum-db/bin_wrapped:~/local/bin:/usr/local/jdk1.8.0_66/bin:/lfs/raiders7/0/USERNAME/deepdive/util:/lfs/raiders7/0/USERNAME/deepdive/util:$PATH
ssh to one of the following IP addresses (the silk machines):
Type ftp
Username: anonymous
Go to the directory and start downloading
Main changes from application.conf:
- During a run, the driver now fills out a temporary table and replaces the corresponding table only at the end of an extractor's run. Therefore, data is no longer lost if a run is stopped in the middle.
- The column by which a table is distributed on greenplum is now defined by the annotation @distributed_by in front of the respective column. The driver detects whether the db used is postgres or greenplum (as specified in db.url) and adds the "distributed by ..." statement only if needed, so there is no need to maintain separate app.ddlog files for psql and greenplum.
- To create views, look at the documentation in /lfs/raiders7/0/tpalo/dd-genomics_for_views (to come soon).
- For more information about the ddlog language, look at and
- There is a slight bug in the compiler for now: when a view is defined by an extractor in app.ddlog, the command "deepdive compile" returns an error stating that the table is not declared. This can be fixed by adding: deepdive.schema.relations { name_of_the_view { "type": "view" } } in deepdive.conf.
- There are many temporary tables and views (due to complicated SQL queries that cannot be translated directly into ddlog). This should not affect the speed of the process.
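The @distributed_by behaviour in the list above (add "distributed by ..." only on greenplum, based on db.url) can be sketched as follows. This is an illustration of the idea, not the driver's code: db_kind and create_table_sql are hypothetical helpers, the table and column names in the usage are examples, and the sketch only accepts the two schemes shown in the db.url example earlier in this README.

```python
from urllib.parse import urlparse


def db_kind(db_url):
    """Return the scheme of a db.url value: 'postgres' or 'greenplum'."""
    scheme = urlparse(db_url).scheme
    if scheme not in ("postgres", "greenplum"):
        raise ValueError("unsupported database scheme: %r" % scheme)
    return scheme


def create_table_sql(table, columns, distributed_by, db_url):
    """Build a CREATE TABLE statement, adding DISTRIBUTED BY only for greenplum.

    columns is a list of (name, sql_type) pairs; distributed_by is the name of
    the column carrying the @distributed_by annotation, or None.
    """
    cols = ", ".join("%s %s" % (name, sql_type) for name, sql_type in columns)
    sql = "CREATE TABLE %s (%s)" % (table, cols)
    if distributed_by and db_kind(db_url) == "greenplum":
        sql += " DISTRIBUTED BY (%s)" % distributed_by
    return sql + ";"


if __name__ == "__main__":
    columns = [("doc_id", "text"), ("mention_id", "text")]
    for url in ("postgres://localhost:6432/genomics_tpalo",
                "greenplum://localhost:6432/genomics_tpalo"):
        print(create_table_sql("gene_mentions", columns, "doc_id", url))
```

Because the decision is made from db.url at generation time, the same annotated schema can be used unchanged against either database.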
Limitations and remarks:
- Inputs are now loaded with the same "deepdive do ..." command. Therefore, all the inputs and loads are put in the folder input/, which contains scripts or aliases to the real data (most of the time in onto/).
- Accessing an element of an array doesn't work in ddlog. Therefore, in the extractor "non_gene_acronyms_extract_candidates", the SQL query is slightly different (it doesn't include gm.wordidxs[1] and a.words[a.wordidx] LIKE '-LRB-';) and the UDF is slightly changed to make this comparison in the Python script.
- "delete from" doesn't exist in ddlog. Therefore, for gene_mentions, we have to create a temporary table gene_mentions_temp_before_non_gene_acronyms_delete_candidates which contains all the rows. Only the rows that meet the criteria for non_gene_acronyms are put in gene_mentions.
- In ddlog, we cannot add to pheno_mentions twice when the second addition requires the result of the first. Therefore, the table pheno_mentions_without_acronyms is created first and used to compute pheno_acronyms_aggregate_candidates. Then the result of pheno_acronyms_insert_candidates and the initial extractions in pheno_mentions_without_acronyms are put in pheno_mentions.
- The shell script ${APP_HOME}/util/ genomics cannot be translated into ddlog during the deepdive run. Therefore, we add it to a deepdive.conf file with the corresponding input and output tables, so that it is correctly placed in the whole run.
- The holdout fraction cannot be defined in ddlog. Therefore, we add the line "calibration.holdout_fraction: 0.1" to the deepdive.conf file.
- In ddlog, when we create variables and inference rules, the variable is not just a column of an existing table; it has to be a new table. For instance, for the variable is_true of gene_mentions_filtered, we create the table gene_mentions_filtered_inference. Therefore, don't forget to add to the pipeline the extractor linking the inference table and the corresponding mention table! For instance, here, we have to add the extractor ext_gene_mentions_filtered_inference; otherwise the inference part doesn't produce any result.