saramsey closed this issue 3 years ago
See also instructions in this Google Slides presentation: https://docs.google.com/presentation/d/1ezj-da1jrCshtfN-GuQXVSmwhsMFfRUsmbMhEOsD3eY/edit#slide=id.ge4812a41a4_0_130
Here are the available build-kg2-snakemake.sh options. (Format: flag [slots it works in, starting at 1]: description)
test [1]: This flag initiates a test build, which creates a much smaller graph (which can be used for debugging).
all [1]: This flag initiates a full build, which includes the extraction scripts. (Omitting this flag initiates a partial build, which requires that the output of all of the extraction scripts already exists.)
alltest [1]: This flag initiates a test build that includes extracting SemMedDB's test edges file. Before you can run a build with the test flag, you must run a build on that same instance with the alltest flag. (SemMedDB's conversion requires a test version of the input.)
-n [1-4]: This flag initiates a dry run of the build, outputting to a different file (with the -n flag in the file name). This is good to do before running a real build to make sure that the scripts you want to run will be included and the scripts you don't won't.
nodes [1-4]: This flag generates a version of kg2-simplified.json that is exclusively the nodes (for debugging purposes). It takes extra time, so it should only be included when necessary. Also, if you plan to use it, familiarize yourself with the nodes-related code in build-kg2-snakemake.sh.
-R_* [1-3]: This is our version of Snakemake's -R flag. However, rather than using it in the form -R Rule (ex. -R Merge), we add an underscore between them (-R_Rule) to simplify the command line options decoding process. This forces a rerun of all the rules that provide an input to the rule listed. For example, if you wanted to rerun all of the conversion rules, you might use -R_Merge. This one is trickier to use and I'd recommend both reading up on what Snakemake says about it and doing dry runs until you get the effect you are looking for.
-F [1-3]: This flag forces a rerun of all of the rules that lead up to the first rule in the Snakefile, which is Finish and depends on all of the rules. Thus, this will rebuild everything.
graphic [1-3]: This flag generates the PNG diagram of the Snakemake workflow.
travisci [3-5]: This flag should only be used in the .travis.yml file (for usage on a Travis CI instance). It ensures that the commands are configured to run on a Travis CI instance (where we can't use a virtualenv).
Examples:
bash -x build-kg2-snakemake.sh -n test
(test flag must be in position 1)
bash -x build-kg2-snakemake.sh all -F nodes -n travisci
(every flag is in an allowable position for it)
Below is the infores catalog file, which came from downloading (as TSV) the Infores Catalog Google Sheet:
For step 5 above, I named the file infores-catalog.tsv and copied it into ~/kg2-build/ in the instance:
For step 6 above, I am validating the infores-catalog.tsv file by running (as user ubuntu and in the kg2-build directory):
source ~/kg2-venv/bin/activate
python3 ~/kg2-code/validate_provided_by_to_infores_map_yaml.py \
~/kg2-code/kg2-provided-by-curie-to-infores-curie.yaml \
./infores-catalog.tsv
deactivate
It ran without producing any output (@ericawood confirmed via Zoom that this means that it ran without finding any errors).
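As a rough illustration of what a "silent on success" check of this kind can look like, here is a minimal sketch. It is not the logic of validate_provided_by_to_infores_map_yaml.py, which is not shown in this thread. The function name, the `id` column name, and the sample data are all assumptions made for illustration, and a plain dict stands in for the parsed YAML map.

```python
import csv
import io


def missing_infores_curies(mapping, catalog_tsv_text, id_column="id"):
    """Return infores CURIEs used in `mapping` but absent from the catalog.

    `mapping` is a plain dict (provided-by CURIE -> infores CURIE) standing in
    for the parsed YAML file. `id_column` is an assumed column name.
    """
    reader = csv.DictReader(io.StringIO(catalog_tsv_text), delimiter="\t")
    catalog_ids = {row[id_column].strip() for row in reader}
    return sorted({v for v in mapping.values() if v not in catalog_ids})


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the real files.
    catalog = "id\tname\ninfores:drugcentral\tDrugCentral\ninfores:kegg\tKEGG\n"
    mapping = {"DrugCentral:": "infores:drugcentral", "KEGG_source:": "infores:kegg"}
    # Print only when something is wrong, so no output means no errors found.
    for curie in missing_infores_curies(mapping, catalog):
        print(f"not in catalog: {curie}")
```

With a convention like this, an empty terminal after the run is the expected result when every mapped CURIE is present in the catalog.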
Now resuming the regular build instructions in the issue95 branch under "Option 1" at step (7).
Saving the build-kg2-snakemake-n.log file here: build-kg2-snakemake-n.log.zip
The build-kg2-snakemake-n.log file ends with:
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
+ date
Thu Aug 5 00:00:40 UTC 2021
+ echo '================ script finished ============================'
================ script finished ============================
so looks like normal output (?). Proceeding with step (8) in the Option 1 instructions in the README.md in the issue95 branch....
OK, before running step (8), I forgot to update extract-semmeddb.sh to pull version 43 of SemMedDB. Fixing that now (see 7cb6caa and 2d0e139).
Running this code in the instance, to update the code to the tip of the issue95 branch, before running step (8) of the instructions:
<ctrl>-a d
cd ~/kg2-code && git pull
screen -r
Going forward, I would recommend checking that the expected number of rules ran as well. If you are doing a partial build, you should also verify that the rules you want are there and the rules you don't aren't.
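A minimal sketch of how that check could be automated, assuming the dry-run log contains the "Job counts:" table that Snakemake prints (a `count jobs` header, one row per rule, then a total). The function names and the example rule lists are hypothetical.

```python
def parse_job_counts(log_text):
    """Return ({rule_name: count}, total) from a 'Job counts:' table.

    Returns ({}, None) if no such table is found in the log text.
    """
    counts, total, in_table = {}, None, False
    for line in log_text.splitlines():
        fields = line.split()
        if not in_table:
            in_table = fields == ["Job", "counts:"]
            continue
        if fields == ["count", "jobs"]:
            continue  # header row
        if len(fields) == 2 and fields[0].isdigit():
            counts[fields[1]] = int(fields[0])
        elif len(fields) == 1 and fields[0].isdigit():
            total = int(fields[0])  # last row of the table is the total
            break
        else:
            break  # anything else ends the table
    return counts, total


def check_plan(counts, wanted=(), unwanted=()):
    """List rules that are missing from, or unexpectedly present in, the plan."""
    problems = [f"missing rule: {r}" for r in wanted if r not in counts]
    problems += [f"unexpected rule: {r}" for r in unwanted if r in counts]
    return problems


if __name__ == "__main__":
    # Hypothetical excerpt; in practice read build-kg2-snakemake-n.log instead.
    sample = "Job counts:\n\tcount\tjobs\n\t1\tFinish\n\t1\tMerge\n\t1\tSimplify\n\t3\n"
    counts, total = parse_job_counts(sample)
    assert sum(counts.values()) == total
    print(check_plan(counts, wanted=["Merge"], unwanted=["UniChem"]))
```

For a partial build, `wanted` would hold the rules that must rerun and `unwanted` the extraction rules that should be skipped.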
If this build goes smoothly and you opt to merge the branch into master, please remember to edit these lines:
https://github.com/RTXteam/RTX-KG2/blob/2d0e139c3b8a4736d0e2579e49b281d2377647b6/build-kg2-snakemake.sh#L122-L123
to say origin/master instead. (Essentially, these lines ensure that, if you want the KG2 nodes file to be generated, the potentially sed-ed files (see lines above) aren't used.)
Build died on Thursday 5 Aug 2021 at 0739 UTC
shutting down the instance for now. I will investigate as soon as I get to work.
OK, in /home/ubuntu/kg2-build/build-kg2-snakemake.log, I am seeing the following error message:
Error in rule UniChem:
jobid: 33
output: /home/ubuntu/kg2-build/unichem/unichem-mappings.tsv
log: /home/ubuntu/kg2-build/extract-unichem.log (check log file(s) for error message)
shell:
bash -x /home/ubuntu/kg2-code/extract-unichem.sh /home/ubuntu/kg2-build/unichem/unichem-mappings.tsv > /home/ubuntu/kg2-build/extract-unichem.log 2>&1
(exited with non-zero exit code)
...
[Thu Aug 5 07:29:47 2021]
Finished job 30.
19 of 48 steps (40%) done
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /home/ubuntu/.snakemake/log/2021-08-05T001205.093121.snakemake.log
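Since a long build log can contain several of these blocks, a small helper can pull out each failing rule and its per-rule log file. This is a minimal sketch that assumes the "Error in rule ...:" block layout shown above (jobid, output, optional log, then shell). The function name is hypothetical.

```python
import re

ERROR_RE = re.compile(r"^Error in rule (?P<rule>\w+):$")
LOG_RE = re.compile(r"^log:\s+(?P<path>\S+)")


def failed_rules(log_text):
    """Return [(rule_name, log_path_or_None), ...] in order of appearance."""
    failures = []
    in_error_header = False
    for raw in log_text.splitlines():
        line = raw.strip()
        match = ERROR_RE.match(line)
        if match:
            failures.append([match.group("rule"), None])
            in_error_header = True
        elif line.startswith("shell:"):
            # The shell command follows the header fields; stop looking for "log:".
            in_error_header = False
        elif in_error_header:
            match = LOG_RE.match(line)
            if match:
                failures[-1][1] = match.group("path")
    return [tuple(f) for f in failures]


if __name__ == "__main__":
    excerpt = """Error in rule UniChem:
    jobid: 33
    output: /home/ubuntu/kg2-build/unichem/unichem-mappings.tsv
    log: /home/ubuntu/kg2-build/extract-unichem.log (check log file(s) for error message)
    shell:
        bash -x /home/ubuntu/kg2-code/extract-unichem.sh
"""
    for rule, log_path in failed_rules(excerpt):
        print(rule, log_path)
```

Rules that write no separate log file come back with `None`, which signals that the traceback is in the main build log instead.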
Archiving build-kg2-snakemake.log here:
Checking the last line of /home/ubuntu/kg2-build/extract-unichem.log, I am seeing:
+ curl -s -L -f ftp://ftp.ebi.ac.uk/pub/databases/chembl/UniChem/data/oracleDumps/UDRI371/UC_XREF.txt.gz
The file's modification timestamp is:
-rw-rw-r-- 1 ubuntu ubuntu 2.1K Aug 5 00:12 extract-unichem.log
It appears that the UDRI371 directory is no longer present on the EBI FTP server, as shown here:
OK, thanks, I have noted this in the README.md
see #106
Verified that the bugfix works, using:
/home/ubuntu/kg2-venv/bin/snakemake --snakefile /home/ubuntu/kg2-code/Snakefile -R --until UniChem
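The UniChem failure above came from a hard-coded release directory (UDRI371) that had disappeared from the server. One way to make an extraction step less brittle is to list the parent directory and pick the newest release instead. The sketch below is not necessarily how the actual bugfix works. It assumes that a higher UDRI number means a newer dump and that the oracleDumps directory layout from the failing URL still applies. Both function names are hypothetical.

```python
import re
from ftplib import FTP

UDRI_RE = re.compile(r"^UDRI(\d+)$")


def latest_udri(names):
    """Return the highest-numbered 'UDRI<number>' entry in `names`, or None."""
    best = None
    for name in names:
        match = UDRI_RE.match(name)
        if match and (best is None or int(match.group(1)) > best[0]):
            best = (int(match.group(1)), name)
    return best[1] if best else None


def list_oracle_dumps(host="ftp.ebi.ac.uk",
                      directory="/pub/databases/chembl/UniChem/data/oracleDumps"):
    """List entry names in the dump directory (requires network access)."""
    with FTP(host) as ftp:
        ftp.login()  # anonymous login
        ftp.cwd(directory)
        return ftp.nlst()


if __name__ == "__main__":
    # Offline demonstration with a made-up listing.
    print(latest_udri(["README", "UDRI370", "UDRI372", "UDRI99"]))
```

In a real extraction script, `latest_udri(list_oracle_dumps())` would replace the hard-coded directory name, with a clear error if it returns `None`.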
Resuming the build using:
bash -x ~/kg2-code/build-kg2-snakemake.sh all
Seeing an error in the ~/kg2-build/build-kg2-snakemake.log file:
[Thu Aug 5 17:30:23 2021]
Error in rule KEGG:
jobid: 47
output: /home/ubuntu/kg2-build/kegg.json
log: /home/ubuntu/kg2-build/extract-kegg.log (check log file(s) for error message)
shell:
bash -x /home/ubuntu/kg2-code/extract-kegg.sh /home/ubuntu/kg2-build/kegg.json > /home/ubuntu/kg2-build/extract-kegg.log 2>&1
(exited with non-zero exit code)
tracking the fix for this issue in #107
Seeing an error in ~/kg2-build/build-kg2-snakemake.log:
Traceback (most recent call last):
File "/home/ubuntu/kg2-code/drugcentral_json_to_kg_json.py", line 279, in <module>
version_number = json_data['version'][0]['version']
KeyError: 'version'
[Thu Aug 5 17:30:27 2021]
Error in rule DrugCentral_Conversion:
jobid: 23
output: /home/ubuntu/kg2-build/kg2-drugcentral.json
shell:
/home/ubuntu/kg2-venv/bin/python3 -u /home/ubuntu/kg2-code/drugcentral_json_to_kg_json.py /home/ubuntu/kg2-build/drugcentral/drugcentral_psql_json.json /home/ubuntu/kg2-build/kg2-drugcentral.json
(exited with non-zero exit code)
tracking the fix for this issue in #108
Seeing an error in the DrugCentral_Conversion rule in the ~/kg2-build/build-kg2-snakemake.log file:
Traceback (most recent call last):
File "/home/ubuntu/kg2-code/drugcentral_json_to_kg_json.py", line 279, in <module>
version_number = json_data['version'][0]['version']
KeyError: 'version'
[Thu Aug 5 17:30:27 2021]
Error in rule DrugCentral_Conversion:
jobid: 23
output: /home/ubuntu/kg2-build/kg2-drugcentral.json
shell:
/home/ubuntu/kg2-venv/bin/python3 -u /home/ubuntu/kg2-code/drugcentral_json_to_kg_json.py /home/ubuntu/kg2-build/drugcentral/drugcentral_psql_json.json /home/ubuntu/kg2-build/kg2-drugcentral.json
(exited with non-zero exit code)
tracking this issue as #109
Error in task Reactome_Conversion:
[Thu Aug 5 17:31:29 2021]
Error in rule Reactome_Conversion:
jobid: 20
output: /home/ubuntu/kg2-build/kg2-reactome.json
log: /home/ubuntu/kg2-build/reactome-mysql-to-kg-json.log (check log file(s) for error message)
shell:
/home/ubuntu/kg2-venv/bin/python3 -u /home/ubuntu/kg2-code/reactome_mysql_to_kg_json.py /home/ubuntu/kg2-build/mysql-config.conf reactome /home/ubuntu/kg2-build/kg2-reactome.json > /home/ubuntu/kg2-build/reactome-mysql-to-kg-json.log 2>&1
(exited with non-zero exit code)
tracking this as #110
Error in task Ensembl_Conversion:
Traceback (most recent call last):
File "/home/ubuntu/kg2-code/ensembl_json_to_kg_json.py", line 180, in <module>
graph = make_kg2_graph(input_file_name, test_mode)
File "/home/ubuntu/kg2-code/ensembl_json_to_kg_json.py", line 137, in make_kg2_graph
name = transcript['name']
KeyError: 'name'
[Thu Aug 5 17:32:19 2021]
Error in rule Ensembl_Conversion:
jobid: 11
output: /home/ubuntu/kg2-build/kg2-ensembl.json
shell:
/home/ubuntu/kg2-venv/bin/python3 -u /home/ubuntu/kg2-code/ensembl_json_to_kg_json.py /home/ubuntu/kg2-build/ensembl/ensembl_genes_homo_sapiens.json /home/ubuntu/kg2-build/kg2-ensembl.json
(exited with non-zero exit code)
tracking this as #111
Ran this test on buildkg2.rtx.ai:
bash -x /home/ubuntu/kg2-code/extract-kegg.sh /home/ubuntu/kg2-build/kegg.json > /home/ubuntu/kg2-build/extract-kegg.log 2>&1
and got this error:
+ /home/ubuntu/kg2-venv/bin/python3 -u /home/ubuntu/kg2-code/query_kegg.py /home/ubuntu/kg2-build/kegg.json
Traceback (most recent call last):
File "/home/ubuntu/kg2-venv/lib/python3.7/site-packages/cachecontrol/caches/file_cache.py", line 72, in __init__
from lockfile import LockFile
ModuleNotFoundError: No module named 'lockfile'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ubuntu/kg2-code/query_kegg.py", line 124, in <module>
kg2_util.save_json(run_queries(), args.outputFile, True)
File "/home/ubuntu/kg2-code/query_kegg.py", line 90, in run_queries
for results in send_query(query).split('\n'):
File "/home/ubuntu/kg2-code/query_kegg.py", line 37, in send_query
requests = CacheControlHelper()
File "/home/ubuntu/RTX-KG2/cache_control_helper.py", line 32, in __init__
self.sess = CacheControl(requests.session(), heuristic=CustomHeuristic(days=30), cache=FileCache('.web_cache'))
File "/home/ubuntu/kg2-venv/lib/python3.7/site-packages/cachecontrol/caches/file_cache.py", line 82, in __init__
raise ImportError(notice)
ImportError:
NOTE: In order to use the FileCache you must have
lockfile installed. You can install it via pip:
pip install lockfile
On kg2lindsey.rtx.ai, it looks like the installed version of lockfile is 0.12.2:
(kg2-venv) ubuntu@ip-172-31-59-26:~$ pip3 freeze | grep lockfile
lockfile==0.12.2
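To catch a missing dependency like this before starting a long build, the virtualenv can be probed up front. A minimal sketch, to be run with the virtualenv's own interpreter (for example /home/ubuntu/kg2-venv/bin/python3). The function name and the list of modules checked are assumptions for illustration.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the top-level module names that the import system cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Hypothetical list; extend it with whatever the build scripts import.
    required = ["lockfile", "cachecontrol"]
    print("interpreter:", sys.executable)
    missing = missing_modules(required)
    if missing:
        print("missing modules:", ", ".join(missing))
```

`importlib.util.find_spec` only locates the module without importing it, so the check is cheap. It says nothing about installed versions, which `pip3 freeze` still covers.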
Rerunning the KEGG test on buildkg2.rtx.ai:
bash -x /home/ubuntu/kg2-code/extract-kegg.sh /home/ubuntu/kg2-build/kegg.json > /home/ubuntu/kg2-build/extract-kegg.log 2>&1
Another issue in the Ensembl_Conversion rule:
ubuntu@ip-172-31-63-157:~/kg2-code$ /home/ubuntu/kg2-venv/bin/python3 -u /home/ubuntu/kg2-code/ensembl_json_to_kg_json.py /home/ubuntu/kg2-build/ensembl/ensembl_genes_homo_sapiens.json /home/ubuntu/kg2-build/kg2-ensembl.json
Traceback (most recent call last):
File "/home/ubuntu/kg2-code/ensembl_json_to_kg_json.py", line 180, in <module>
graph = make_kg2_graph(input_file_name, test_mode)
File "/home/ubuntu/kg2-code/ensembl_json_to_kg_json.py", line 99, in make_kg2_graph
go_xrefs = add_prefixes_to_curie_list(gene_dict.get('GO', ''), kg2_util.CURIE_PREFIX_GO)
File "/home/ubuntu/kg2-code/ensembl_json_to_kg_json.py", line 60, in add_prefixes_to_curie_list
curie = curie['term'] + ' ' + curie['evidence'][0]
TypeError: can only concatenate str (not "NoneType") to str
Tracking as #112
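The wording of this TypeError narrows down the cause. "can only concatenate str (not "NoneType") to str" is raised when the left operand is already a str, so it was `curie['evidence'][0]` that was None, not `curie['term']`. Below is a minimal sketch of a None-tolerant version of that line. The function name and the sample records are hypothetical, and dropping the evidence code when it is missing is an assumption, not necessarily what the fix in #112 does.

```python
def format_go_xref(curie):
    """Join a GO term with its first evidence code, tolerating missing evidence."""
    term = curie["term"]
    evidence = curie.get("evidence") or [None]  # covers a missing key and an empty list
    first = evidence[0]
    return term if first is None else term + " " + first


if __name__ == "__main__":
    record = {"term": "GO:0005515", "evidence": [None]}  # made-up record
    try:
        # Same expression as line 60 of the traceback; fails on the None.
        record["term"] + " " + record["evidence"][0]
    except TypeError as err:
        print("original expression:", err)
    print("tolerant version:", format_go_xref(record))
```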
OK, I believe the Ensembl_Conversion bug was fixed in #112. Manually rerunning that rule on buildkg2.rtx.ai:
/home/ubuntu/kg2-venv/bin/python3 -u /home/ubuntu/kg2-code/ensembl_json_to_kg_json.py /home/ubuntu/kg2-build/ensembl/ensembl_genes_homo_sapiens.json /home/ubuntu/kg2-build/kg2-ensembl.json
seemed to generate the file that I want:
ls -alh kg2-ensembl.json
-rw------- 1 ubuntu ubuntu 1022M Aug 5 23:47 kg2-ensembl.json
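When rerunning rules by hand like this, the `ls` check can be scripted so that a missing or suspiciously small output stops the process early. A minimal sketch; the function name and the size threshold are assumptions, and the check does not validate the JSON content.

```python
import os


def check_output(path, min_bytes=1):
    """Return the file size in bytes; raise if the file is missing or too small."""
    size = os.path.getsize(path)  # raises FileNotFoundError if the file is missing
    if size < min_bytes:
        raise ValueError(f"{path} is only {size} bytes (expected >= {min_bytes})")
    return size


if __name__ == "__main__":
    # Hypothetical threshold of 1 MB; pick one that suits the expected output.
    target = os.path.expanduser("~/kg2-build/kg2-ensembl.json")
    try:
        print(target, check_output(target, min_bytes=1_000_000), "bytes")
    except (OSError, ValueError) as err:
        print("output check failed:", err)
```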
OK, I manually ran the DrugCentral_Conversion rule and it produced an 83M file, ~/kg2-build/kg2-drugcentral.json:
ubuntu@ip-172-31-63-157:~/kg2-build$ /home/ubuntu/kg2-venv/bin/python3 -u /home/ubuntu/kg2-code/drugcentral_json_to_kg_json.py /home/ubuntu/kg2-build/drugcentral/drugcentral_psql_json.json /home/ubuntu/kg2-build/kg2-drugcentral.json
ubuntu@ip-172-31-63-157:~/kg2-build$ ls -alh drugcentral/
total 1.1G
drwxrwxr-x 2 ubuntu ubuntu 4.0K Aug 5 00:16 .
drwxrwxr-x 16 ubuntu ubuntu 12K Aug 5 23:54 ..
-rw-rw-r-- 1 ubuntu ubuntu 998M Aug 5 00:13 drugcentral.sql.gz
-rw-rw-r-- 1 ubuntu ubuntu 124M Aug 5 00:17 drugcentral_psql_json.json
-rw-rw-r-- 1 ubuntu ubuntu 27 Aug 5 00:17 psql_dump_file.txt
ubuntu@ip-172-31-63-157:~/kg2-build$ ls -alh kg2-drugcentral.json
-rw------- 1 ubuntu ubuntu 83M Aug 5 23:54 kg2-drugcentral.json
On buildkg2.rtx.ai, I am manually running the Reactome_Conversion task:
/home/ubuntu/kg2-venv/bin/python3 -u /home/ubuntu/kg2-code/reactome_mysql_to_kg_json.py /home/ubuntu/kg2-build/mysql-config.conf reactome /home/ubuntu/kg2-build/kg2-reactome.json > /home/ubuntu/kg2-build/reactome-mysql-to-kg-json.log 2>&1
Current process table on buildkg2.rtx.ai:
Note, I am not presently testing the fix for #108 since the DrugCentral_Conversion rule seemed to complete despite those SQL errors. So issue #108 will remain marked "verify in next build" for now.
Looks like multi_ont_to_json_kg.py exited with an error:
Error in rule Ontologies_and_TTL:
jobid: 7
output: /home/ubuntu/kg2-build/kg2-ont.json
log: /home/ubuntu/kg2-build/build-multi-ont-kg.log (check log file(s) for error message)
shell:
bash -x /home/ubuntu/kg2-code/build-multi-ont-kg.sh /home/ubuntu/kg2-build/umls_cuis.tsv /home/ubuntu/kg2-build/kg2-ont.json > /home/ubuntu/kg2-build/build-multi-ont-kg.log 2>&1
(exited with non-zero exit code)
see issue #113 for details
Here is the complete build-kg2-snakemake.log for the build that resulted in the errors reported above: build-kg2-snakemake.log.zip
Seeing a bunch of warnings about AraPort in multi_ont_to_kg_json.py, recorded in #114
Generating a build plan now, on buildkg2.rtx.ai, by running (as user ubuntu):
cd
bash -x ~/kg2-code/build-kg2-snakemake.sh -n
Just inspected the build plan build-kg2-snakemake-n.log. The plan looks correct:
Job counts:
count jobs
1 Finish
1 KEGG_Conversion
1 Merge
1 Ontologies_and_TTL
1 Simplify
1 Simplify_Stats
1 Slim
1 Stats
1 TSV
9
Here is the "-n" plan file: build-kg2-snakemake-n.log.zip
Now running build-kg2-snakemake.sh, as user ubuntu in /home/ubuntu in a screen session on buildkg2.rtx.ai:
cd
bash -x ~/kg2-code/build-kg2-snakemake.sh
In the build-kg2-snakemake.log file, error in rule Simplify:
Error in rule Simplify:
jobid: 3
output: /home/ubuntu/kg2-build/kg2-simplified.json
log: /home/ubuntu/kg2-build/filter_kg_and_remap_predicates.log (check log file(s) for error message)
shell:
bash -x /home/ubuntu/kg2-code/run-simplify.sh /home/ubuntu/kg2-build/kg2.json /home/ubuntu/kg2-build/kg2-simplified.json /home/ubuntu/kg2-build/kg2-version.txt > /home/ubuntu/kg2-build/filter_kg_and_remap_predicates.log 2>&1
(exited with non-zero exit code)
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /home/ubuntu/.snakemake/log/2021-08-06T195800.503566.snakemake.log
~
Complete logfile is attached here: build-kg2-snakemake.log.zip
See issue #115 for the details of this error message
Restarting the build by running (as user ubuntu in a screen session on system buildkg2.rtx.ai):
cd
bash -x ~/kg2-code/build-kg2-snakemake.sh
KG2.7.2 build (on buildkg2.rtx.ai) terminated on 8/8 at 20:10 UTC; the error (from the build-kg2-snakemake.log file) is shown here:
[Sun Aug 8 20:10:46 2021]
Error in rule Simplify:
jobid: 3
output: /home/ubuntu/kg2-build/kg2-simplified.json
log: /home/ubuntu/kg2-build/filter_kg_and_remap_predicates.log (check log file(s) for error message)
shell:
bash -x /home/ubuntu/kg2-code/run-simplify.sh /home/ubuntu/kg2-build/kg2.json /home/ubuntu/kg2-build/kg2-simplified.json /home/ubuntu/kg2-build/kg2-version.txt > /home/ubuntu/kg2-build/filter_kg_and_remap_predicates.log 2>&1
(exited with non-zero exit code)
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /home/ubuntu/.snakemake/log/2021-08-08T174120.823762.snakemake.log
complete logfile here: build-kg2-snakemake.log.zip
Issue reported in detail in #116
Resuming the KG2.7.2 build on buildkg2.rtx.ai (in a screen session):
cd
bash -x ~/kg2-code/build-kg2-snakemake.sh
Then: ctrl-a d
OK, filter_kg_and_remap_predicates.py has completed on buildkg2.rtx.ai. Now it is running slim_kg2.py.
Confirming the correct KG2 version number in this build:
At 21:58 UTC on 8/9/2021, an error occurred in the rule TSV for the KG2.7.2 build on buildkg2.rtx.ai (see #117 for details).
The build process seems to have hung at approximately 2025 UTC on Aug. 9, 2021. See #118.
Now that #118 is (hopefully) fixed, resuming the build on buildkg2.rtx.ai....
r5a.8xlarge instance kg2build.rtx.ai for the build
issue95 branch
README.md in the issue95 branch to reflect the new flags added for Snakemake
kg2-simplified-report.json; compare against previous kg2-simplified-report.json to identify any major changes
kg2endpoint4.rtx.ai
kg2-7-2.rtx.ai pointing to kg2endpoint4.rtx.ai
kg2-versions.md in the issue95 branch
issue95 branch into master branch (carefully audit any commits upstream to see if they change anything about the build process??)
build-kg2-snakemake.sh (see https://github.com/RTXteam/RTX-KG2/issues/104#issuecomment-893104520)
KG2.7.2
KG2.7.2c
KG2.7.2c
kg2c_lite_2.7.2.json.gz to NCATS repo