RTXteam / RTX-KG2

Build system for the RTX-KG2 biomedical knowledge graph, part of the ARAX reasoning system (https://github.com/RTXTeam/RTX)
MIT License

run a KG2.7.2 build #104

Closed saramsey closed 3 years ago

saramsey commented 3 years ago

See also the instructions in this Google Slides presentation: https://docs.google.com/presentation/d/1ezj-da1jrCshtfN-GuQXVSmwhsMFfRUsmbMhEOsD3eY/edit#slide=id.ge4812a41a4_0_130

ecwood commented 3 years ago

Here are the available build-kg2-snakemake.sh options: (Format: flag [slots it works in, starting at 1])

Examples:

saramsey commented 3 years ago

Below is the infores catalog file which came from downloading (as TSV) the Infores Catalog Google Sheet:

infores-catalog.tsv.zip

saramsey commented 3 years ago

For step 5 above, I named the file infores-catalog.tsv and copied it into ~/kg2-build/ in the instance:

saramsey commented 3 years ago

For step 6 above, I am validating the infores-catalog.tsv by running (as user ubuntu and in the kg2-build directory):

source ~/kg2-venv/bin/activate
python3 ~/kg2-code/validate_provided_by_to_infores_map_yaml.py \
  ~/kg2-code/kg2-provided-by-curie-to-infores-curie.yaml \
  ./infores-catalog.tsv
deactivate

It ran without producing any output (@ericawood confirmed via Zoom that this means that it ran without finding any errors).
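
The "no output means no errors" convention above can be made explicit with a small wrapper. This is only a sketch: the thread does not say whether the validation script reports problems through its exit code, its output, or both, so the hypothetical helper `run_quiet_check` below checks both.

```python
import subprocess
import sys
from typing import List, Tuple


def run_quiet_check(cmd: List[str]) -> Tuple[bool, str]:
    """Run cmd; report success only if it exits with 0 and prints nothing.

    Hypothetical helper (not part of RTX-KG2). stderr is merged into stdout
    so that warnings count as output too.
    """
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                          universal_newlines=True)
    output = proc.stdout.strip()
    return (proc.returncode == 0 and output == ""), output


if __name__ == "__main__":
    # Stand-in command; in the build this would be the validator invocation.
    ok, output = run_quiet_check([sys.executable, "-c", "pass"])
    print("validation passed" if ok else "validation failed:\n" + output)
```

To apply it to the real validator, pass the same command line shown above as a list of arguments.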

saramsey commented 3 years ago

now resuming the regular build instructions in the issue95 branch under "Option 1" at step (7)

saramsey commented 3 years ago

Saving the build-kg2-snakemake-n.log file here: build-kg2-snakemake-n.log.zip

saramsey commented 3 years ago

The build-kg2-snakemake-n.log file ends with:

This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
+ date
Thu Aug  5 00:00:40 UTC 2021
+ echo '================ script finished ============================'
================ script finished ============================

so looks like normal output (?). Proceeding with step (8) in the Option 1 instructions in the README.md in the issue95 branch....

saramsey commented 3 years ago

OK, before running step (8), I forgot to update extract-semmeddb.sh to pull version 43 of SemMedDB. Fixing that now (see 7cb6caa and 2d0e139)

saramsey commented 3 years ago

Running this code in the instance, to update the code to the tip of the issue95 branch, before running step (8) of the instructions:

<ctrl>-a d
cd ~/kg2-code && git pull
screen -r
ecwood commented 3 years ago

In reply to saramsey's comment above that the end of build-kg2-snakemake-n.log looks like normal output:

Going forward, I would recommend also checking that the expected number of rules ran. If you are doing a partial build, you should also verify that the rules you want are listed and the rules you don't want are not.

ecwood commented 3 years ago

If this build goes smoothly and you opt to merge the branch into master, please remember to edit these lines: https://github.com/RTXteam/RTX-KG2/blob/2d0e139c3b8a4736d0e2579e49b281d2377647b6/build-kg2-snakemake.sh#L122-L123 to say origin/master instead. (Essentially, these lines ensure that, if you want the KG2 nodes file to be generated, the potentially sed-ed files (see lines above) aren't used.)

saramsey commented 3 years ago

Build died on Thursday 5 Aug 2021 at 0739 UTC

[screenshot: Screen Shot 2021-08-05 at 7 34 00 AM]

shutting down the instance for now. I will investigate as soon as I get to work.

saramsey commented 3 years ago

OK, in /home/ubuntu/kg2-build/build-kg2-snakemake.log, I am seeing the following error message:

Error in rule UniChem:
    jobid: 33
    output: /home/ubuntu/kg2-build/unichem/unichem-mappings.tsv
    log: /home/ubuntu/kg2-build/extract-unichem.log (check log file(s) for error message)
    shell:
        bash -x /home/ubuntu/kg2-code/extract-unichem.sh /home/ubuntu/kg2-build/unichem/unichem-mappings.tsv > /home/ubuntu/kg2-build/extract-unichem.log 2>&1
        (exited with non-zero exit code)
...
[Thu Aug  5 07:29:47 2021]
Finished job 30.
19 of 48 steps (40%) done
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /home/ubuntu/.snakemake/log/2021-08-05T001205.093121.snakemake.log
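
Since failures like this one are easy to miss in a long log, a short script can list them. The following is a minimal sketch that assumes the log uses the "Error in rule ...:" block layout shown above; `failed_rules` is a hypothetical helper, not part of RTX-KG2 or Snakemake.

```python
import re
from typing import Dict, List

RULE_RE = re.compile(r"^Error in rule (?P<rule>\S+):\s*$")
FIELD_RE = re.compile(r"^\s+(?P<key>jobid|output|log):\s*(?P<value>.+?)\s*$")


def failed_rules(log_text: str) -> List[Dict[str, str]]:
    """Collect the rule name, jobid, output and log path of each failed rule."""
    failures: List[Dict[str, str]] = []
    current = None
    for line in log_text.splitlines():
        rule_match = RULE_RE.match(line)
        if rule_match:
            current = {"rule": rule_match.group("rule")}
            failures.append(current)
            continue
        if current is None:
            continue
        field_match = FIELD_RE.match(line)
        if field_match:
            value = field_match.group("value")
            if field_match.group("key") == "log":
                # Drop the trailing "(check log file(s) for error message)" hint.
                value = value.split(" (check log", 1)[0]
            current[field_match.group("key")] = value
        elif not line.startswith(" "):
            # An unindented (or empty) line ends the error block.
            current = None
    return failures
```

Each returned dictionary points at the per-rule log file (when the rule has one), which is where the underlying error message is found.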
saramsey commented 3 years ago

Archiving build-kg2-snakemake.log here:

build-kg2-snakemake.log.zip

saramsey commented 3 years ago

Checking the last line of /home/ubuntu/kg2-build/extract-unichem.log, I am seeing:

+ curl -s -L -f ftp://ftp.ebi.ac.uk/pub/databases/chembl/UniChem/data/oracleDumps/UDRI371/UC_XREF.txt.gz

The file's modification timestamp is:

-rw-rw-r-- 1 ubuntu ubuntu 2.1K Aug  5 00:12 extract-unichem.log
saramsey commented 3 years ago

It appears that the UDRI371 directory is no longer present on the EBI FTP server, as shown here:

[screenshot: Screen Shot 2021-08-05 at 9 28 49 AM]
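
A download URL that has disappeared can be caught before a long build starts by probing it first. The sketch below is an illustration only and is not the fix tracked for this failure: `url_is_reachable` is a hypothetical helper, and the `opener` parameter exists so the check can be exercised without network access.

```python
import urllib.error
import urllib.request
from typing import Callable


def url_is_reachable(url: str,
                     opener: Callable = urllib.request.urlopen,
                     timeout: float = 30.0) -> bool:
    """Return True if the URL can be opened and at least one read succeeds.

    urllib.request.urlopen handles ftp:// as well as http(s):// URLs.
    """
    try:
        with opener(url, timeout=timeout) as response:
            response.read(1)
        return True
    except (urllib.error.URLError, OSError):
        return False
```

Running such a probe over every source URL before step (8) would turn a failure hours into the build into a failure in the first minute.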
saramsey commented 3 years ago

In reply to ecwood's recommendation above to check that the expected number of rules ran and, for a partial build, that only the wanted rules are listed:

OK, thanks, I have noted this in the README.md

saramsey commented 3 years ago
[screenshot: Screen Shot 2021-08-05 at 10 24 52 AM]

see #106

saramsey commented 3 years ago

Verified that the bugfix works, using:

/home/ubuntu/kg2-venv/bin/snakemake --snakefile /home/ubuntu/kg2-code/Snakefile -R --until UniChem
saramsey commented 3 years ago

Resuming the build using:

bash -x ~/kg2-code/build-kg2-snakemake.sh all
saramsey commented 3 years ago

Seeing an error in the ~/kg2-build/build-kg2-snakemake.log file:

[Thu Aug  5 17:30:23 2021]
Error in rule KEGG:
    jobid: 47
    output: /home/ubuntu/kg2-build/kegg.json
    log: /home/ubuntu/kg2-build/extract-kegg.log (check log file(s) for error message)
    shell:
        bash -x /home/ubuntu/kg2-code/extract-kegg.sh /home/ubuntu/kg2-build/kegg.json > /home/ubuntu/kg2-build/extract-kegg.log 2>&1
        (exited with non-zero exit code)

tracking the fix for this issue in #107

saramsey commented 3 years ago

Seeing an error in ~/kg2-build/build-kg2-snakemake.log:

Traceback (most recent call last):
  File "/home/ubuntu/kg2-code/drugcentral_json_to_kg_json.py", line 279, in <module>
    version_number = json_data['version'][0]['version']
KeyError: 'version'
[Thu Aug  5 17:30:27 2021]
Error in rule DrugCentral_Conversion:
    jobid: 23
    output: /home/ubuntu/kg2-build/kg2-drugcentral.json
    shell:
        /home/ubuntu/kg2-venv/bin/python3 -u /home/ubuntu/kg2-code/drugcentral_json_to_kg_json.py  /home/ubuntu/kg2-build/drugcentral/drugcentral_psql_json.json /home/ubuntu/kg2-build/kg2-drugcentral.json
        (exited with non-zero exit code)

tracking the fix for this issue in #108

saramsey commented 3 years ago

Seeing an error in the DrugCentral_Conversion rule in ~/kg2-build/build-kg2-snakemake.log file:

Traceback (most recent call last):
  File "/home/ubuntu/kg2-code/drugcentral_json_to_kg_json.py", line 279, in <module>
    version_number = json_data['version'][0]['version']
KeyError: 'version'
[Thu Aug  5 17:30:27 2021]
Error in rule DrugCentral_Conversion:
    jobid: 23
    output: /home/ubuntu/kg2-build/kg2-drugcentral.json
    shell:
        /home/ubuntu/kg2-venv/bin/python3 -u /home/ubuntu/kg2-code/drugcentral_json_to_kg_json.py  /home/ubuntu/kg2-build/drugcentral/drugcentral_psql_json.json /home/ubuntu/kg2-build/kg2-drugcentral.json
        (exited with non-zero exit code)

tracking this issue as #109
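
The traceback shows a bare `json_data['version'][0]['version']` lookup failing because the top-level key is absent. One way to make that kind of failure self-explanatory is to check for the key and report what the input does contain. This is a sketch only and not the fix applied in #109; `extract_version` is a hypothetical name, and the input shape is inferred from the traceback.

```python
from typing import Any, Dict


def extract_version(json_data: Dict[str, Any]) -> str:
    """Return the version string, or raise an error naming the keys present."""
    rows = json_data.get("version")
    if not rows:
        raise ValueError("no 'version' table in input; top-level keys: "
                         + ", ".join(sorted(json_data)))
    return str(rows[0]["version"])
```

An error message that lists the available keys makes it quicker to tell an upstream extraction problem from a schema change.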

saramsey commented 3 years ago

Error in task Reactome_Conversion:

[Thu Aug  5 17:31:29 2021]
Error in rule Reactome_Conversion:
    jobid: 20
    output: /home/ubuntu/kg2-build/kg2-reactome.json
    log: /home/ubuntu/kg2-build/reactome-mysql-to-kg-json.log (check log file(s) for error message)
    shell:
        /home/ubuntu/kg2-venv/bin/python3 -u /home/ubuntu/kg2-code/reactome_mysql_to_kg_json.py  /home/ubuntu/kg2-build/mysql-config.conf reactome /home/ubuntu/kg2-build/kg2-reactome.json > /home/ubuntu/kg2-build/reactome-mysql-to-kg-json.log 2>&1
        (exited with non-zero exit code)

tracking this as #110

saramsey commented 3 years ago

Error in task Ensembl_Conversion:

Traceback (most recent call last):
  File "/home/ubuntu/kg2-code/ensembl_json_to_kg_json.py", line 180, in <module>
    graph = make_kg2_graph(input_file_name, test_mode)
  File "/home/ubuntu/kg2-code/ensembl_json_to_kg_json.py", line 137, in make_kg2_graph
    name = transcript['name']
KeyError: 'name'
[Thu Aug  5 17:32:19 2021]
Error in rule Ensembl_Conversion:
    jobid: 11
    output: /home/ubuntu/kg2-build/kg2-ensembl.json
    shell:
        /home/ubuntu/kg2-venv/bin/python3 -u /home/ubuntu/kg2-code/ensembl_json_to_kg_json.py  /home/ubuntu/kg2-build/ensembl/ensembl_genes_homo_sapiens.json /home/ubuntu/kg2-build/kg2-ensembl.json
        (exited with non-zero exit code)

tracking this as #111

saramsey commented 3 years ago

Ran this test on buildkg2.rtx.ai:

bash -x /home/ubuntu/kg2-code/extract-kegg.sh /home/ubuntu/kg2-build/kegg.json > /home/ubuntu/kg2-build/extract-kegg.log 2>&1

and got this error:

+ /home/ubuntu/kg2-venv/bin/python3 -u /home/ubuntu/kg2-code/query_kegg.py /home/ubuntu/kg2-build/kegg.json
Traceback (most recent call last):
  File "/home/ubuntu/kg2-venv/lib/python3.7/site-packages/cachecontrol/caches/file_cache.py", line 72, in __init__
    from lockfile import LockFile
ModuleNotFoundError: No module named 'lockfile'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/kg2-code/query_kegg.py", line 124, in <module>
    kg2_util.save_json(run_queries(), args.outputFile, True)
  File "/home/ubuntu/kg2-code/query_kegg.py", line 90, in run_queries
    for results in send_query(query).split('\n'):
  File "/home/ubuntu/kg2-code/query_kegg.py", line 37, in send_query
    requests = CacheControlHelper()
  File "/home/ubuntu/RTX-KG2/cache_control_helper.py", line 32, in __init__
    self.sess = CacheControl(requests.session(), heuristic=CustomHeuristic(days=30), cache=FileCache('.web_cache'))
  File "/home/ubuntu/kg2-venv/lib/python3.7/site-packages/cachecontrol/caches/file_cache.py", line 82, in __init__
    raise ImportError(notice)
ImportError:
NOTE: In order to use the FileCache you must have
lockfile installed. You can install it via pip:
  pip install lockfile
saramsey commented 3 years ago

On kg2lindsey.rtx.ai, it looks like the installed version of lockfile is 0.12.2:

(kg2-venv) ubuntu@ip-172-31-59-26:~$ pip3 freeze | grep lockfile
lockfile==0.12.2
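
A missing dependency such as lockfile can be detected up front instead of partway through a rule. The sketch below uses `importlib.util.find_spec` from the standard library; `missing_modules` is a hypothetical helper and only handles top-level module names.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # "lockfile" is the module named in the ImportError above.
    print(missing_modules(["lockfile"]))
```

Run inside the activated kg2-venv, an empty list means the named modules are importable there.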
saramsey commented 3 years ago

Rerunning the KEGG test on buildkg2.rtx.ai:

bash -x /home/ubuntu/kg2-code/extract-kegg.sh /home/ubuntu/kg2-build/kegg.json > /home/ubuntu/kg2-build/extract-kegg.log 2>&1
saramsey commented 3 years ago

Another issue in the Ensembl_Conversion rule:

ubuntu@ip-172-31-63-157:~/kg2-code$ /home/ubuntu/kg2-venv/bin/python3 -u /home/ubuntu/kg2-code/ensembl_json_to_kg_json.py  /home/ubuntu/kg2-build/ensembl/ensembl_genes_homo_sapiens.json /home/ubuntu/kg2-build/kg2-ensembl.json
Traceback (most recent call last):
  File "/home/ubuntu/kg2-code/ensembl_json_to_kg_json.py", line 180, in <module>
    graph = make_kg2_graph(input_file_name, test_mode)
  File "/home/ubuntu/kg2-code/ensembl_json_to_kg_json.py", line 99, in make_kg2_graph
    go_xrefs = add_prefixes_to_curie_list(gene_dict.get('GO', ''), kg2_util.CURIE_PREFIX_GO)
  File "/home/ubuntu/kg2-code/ensembl_json_to_kg_json.py", line 60, in add_prefixes_to_curie_list
    curie = curie['term'] + ' ' + curie['evidence'][0]
TypeError: can only concatenate str (not "NoneType") to str

Tracking as #112
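
The TypeError indicates that `curie['evidence'][0]` was None for at least one GO cross-reference. A guard along the following lines avoids the crash by omitting the evidence code when it is missing. This is a sketch and not necessarily what #112 did; `format_go_xref` is a hypothetical name, and the dictionary shape is inferred from the traceback.

```python
from typing import Any, Dict


def format_go_xref(xref: Dict[str, Any]) -> str:
    """Join the GO term and its first evidence code, tolerating a missing code."""
    term = xref["term"]
    evidence = xref.get("evidence") or [None]
    code = evidence[0]
    return term if code is None else term + " " + code
```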

saramsey commented 3 years ago

OK, I believe the Ensembl_Conversion bug was fixed in #112. Manually rerunning that rule on buildkg2.rtx.ai:

/home/ubuntu/kg2-venv/bin/python3 -u /home/ubuntu/kg2-code/ensembl_json_to_kg_json.py  /home/ubuntu/kg2-build/ensembl/ensembl_genes_homo_sapiens.json /home/ubuntu/kg2-build/kg2-ensembl.json

seemed to generate the file that I want:

ls -alh kg2-ensembl.json
-rw------- 1 ubuntu ubuntu 1022M Aug  5 23:47 kg2-ensembl.json
saramsey commented 3 years ago

OK, I manually ran the DrugCentral_Conversion rule and it produced an 83M file ~/kg2-build/kg2-drugcentral.json:

ubuntu@ip-172-31-63-157:~/kg2-build$ /home/ubuntu/kg2-venv/bin/python3 -u /home/ubuntu/kg2-code/drugcentral_json_to_kg_json.py  /home/ubuntu/kg2-build/drugcentral/drugcentral_psql_json.json /home/ubuntu/kg2-build/kg2-drugcentral.json
ubuntu@ip-172-31-63-157:~/kg2-build$ ls -alh drugcentral/
total 1.1G
drwxrwxr-x  2 ubuntu ubuntu 4.0K Aug  5 00:16 .
drwxrwxr-x 16 ubuntu ubuntu  12K Aug  5 23:54 ..
-rw-rw-r--  1 ubuntu ubuntu 998M Aug  5 00:13 drugcentral.sql.gz
-rw-rw-r--  1 ubuntu ubuntu 124M Aug  5 00:17 drugcentral_psql_json.json
-rw-rw-r--  1 ubuntu ubuntu   27 Aug  5 00:17 psql_dump_file.txt
ubuntu@ip-172-31-63-157:~/kg2-build$ ls -alh kg2-drugcentral.json
-rw------- 1 ubuntu ubuntu 83M Aug  5 23:54 kg2-drugcentral.json
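
The manual `ls -alh` checks above can be scripted when several rule outputs need verifying. A minimal sketch, assuming that "exists and is at least some minimum size" is an adequate sanity check; `output_looks_ok` and the threshold are hypothetical.

```python
import os


def output_looks_ok(path: str, min_bytes: int = 1) -> bool:
    """Return True if path is a regular file of at least min_bytes bytes."""
    return os.path.isfile(path) and os.path.getsize(path) >= min_bytes
```

A size threshold only catches empty or truncated outputs; it says nothing about whether the JSON content is valid.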
saramsey commented 3 years ago

On buildkg2.rtx.ai, I am manually running the Reactome_Conversion task:

/home/ubuntu/kg2-venv/bin/python3 -u /home/ubuntu/kg2-code/reactome_mysql_to_kg_json.py  /home/ubuntu/kg2-build/mysql-config.conf reactome /home/ubuntu/kg2-build/kg2-reactome.json > /home/ubuntu/kg2-build/reactome-mysql-to-kg-json.log 2>&1
saramsey commented 3 years ago

Current process table on buildkg2.rtx.ai: [screenshot: Screen Shot 2021-08-05 at 5 07 07 PM]

saramsey commented 3 years ago

Note, I am not presently testing the fix for #108 since the DrugCentral_Conversion rule seemed to complete despite those SQL errors. So issue #108 will remain marked "verify in next build" for now.

saramsey commented 3 years ago

Looks like multi_ont_to_json_kg.py exited with an error:

Error in rule Ontologies_and_TTL:
    jobid: 7
    output: /home/ubuntu/kg2-build/kg2-ont.json
    log: /home/ubuntu/kg2-build/build-multi-ont-kg.log (check log file(s) for error message)
    shell:
        bash -x /home/ubuntu/kg2-code/build-multi-ont-kg.sh /home/ubuntu/kg2-build/umls_cuis.tsv /home/ubuntu/kg2-build/kg2-ont.json  > /home/ubuntu/kg2-build/build-multi-ont-kg.log 2>&1
        (exited with non-zero exit code)

see issue #113 for details

saramsey commented 3 years ago

Here is the complete build-kg2-snakemake.log for the build that resulted in the errors reported above: build-kg2-snakemake.log.zip

saramsey commented 3 years ago

Seeing a bunch of warnings about AraPort in multi_ont_to_json_kg.py, recorded in #114

saramsey commented 3 years ago

Generating a build plan now, on buildkg2.rtx.ai, by running (as user ubuntu):

cd 
bash -x ~/kg2-code/build-kg2-snakemake.sh -n
saramsey commented 3 years ago

Just inspected the build plan build-kg2-snakemake-n.log. The plan looks correct:

Job counts:
        count   jobs
        1       Finish
        1       KEGG_Conversion
        1       Merge
        1       Ontologies_and_TTL
        1       Simplify
        1       Simplify_Stats
        1       Slim
        1       Stats
        1       TSV
        9

Here is the "-n" plan file: build-kg2-snakemake-n.log.zip
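
The check ecwood recommended earlier in this thread (confirm that the wanted rules are in the plan and the unwanted ones are not) can be automated against the dry-run log. The sketch below assumes the "Job counts:" table has exactly the layout shown above, which may differ in other Snakemake versions; `parse_job_counts` and `plan_differences` are hypothetical helpers.

```python
from typing import Dict, List, Set


def parse_job_counts(log_text: str) -> Dict[str, int]:
    """Parse the 'Job counts:' table of a dry-run log into {rule: count}."""
    counts: Dict[str, int] = {}
    in_table = False
    for line in log_text.splitlines():
        stripped = line.strip()
        if stripped == "Job counts:":
            in_table = True
            continue
        if not in_table:
            continue
        parts = stripped.split()
        if parts == ["count", "jobs"]:
            continue  # column header row
        if len(parts) == 2 and parts[0].isdigit():
            counts[parts[1]] = int(parts[0])
        else:
            break  # the bare total line (or any other text) ends the table
    return counts


def plan_differences(counts: Dict[str, int], expected: Set[str]) -> Dict[str, List[str]]:
    """Report expected rules missing from the plan and planned rules not expected."""
    planned = set(counts)
    return {"missing": sorted(expected - planned),
            "unexpected": sorted(planned - expected)}
```

For the partial build above, `expected` would be the nine rule names in the table; both lists coming back empty means the plan matches.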

saramsey commented 3 years ago

Now running build-kg2-snakemake.sh, as user ubuntu in /home/ubuntu in a screen session on buildkg2.rtx.ai:

cd
bash -x ~/kg2-code/build-kg2-snakemake.sh
saramsey commented 3 years ago

In the build-kg2-snakemake.log file, error in rule Simplify:

Error in rule Simplify:
    jobid: 3
    output: /home/ubuntu/kg2-build/kg2-simplified.json
    log: /home/ubuntu/kg2-build/filter_kg_and_remap_predicates.log (check log file(s) for error message)
    shell:
        bash -x /home/ubuntu/kg2-code/run-simplify.sh /home/ubuntu/kg2-build/kg2.json /home/ubuntu/kg2-build/kg2-simplified.json /home/ubuntu/kg2-build/kg2-version.txt  > /home/ubuntu/kg2-build/filter_kg_and_remap_predicates.log 2>&1
        (exited with non-zero exit code)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /home/ubuntu/.snakemake/log/2021-08-06T195800.503566.snakemake.log

Complete logfile is attached here: build-kg2-snakemake.log.zip

See issue #115 for the details of this error message

saramsey commented 3 years ago

restarting the build by running (as user ubuntu in a screen session on system buildkg2.rtx.ai):

cd
bash -x ~/kg2-code/build-kg2-snakemake.sh
saramsey commented 3 years ago

KG2.7.2 build (on buildkg2.rtx.ai) terminated on 8/8 at 20:10 UTC; the error (from the build-kg2-snakemake.log file) is shown here:

[Sun Aug  8 20:10:46 2021]
Error in rule Simplify:
    jobid: 3
    output: /home/ubuntu/kg2-build/kg2-simplified.json
    log: /home/ubuntu/kg2-build/filter_kg_and_remap_predicates.log (check log file(s) for error message)
    shell:
        bash -x /home/ubuntu/kg2-code/run-simplify.sh /home/ubuntu/kg2-build/kg2.json /home/ubuntu/kg2-build/kg2-simplified.json /home/ubuntu/kg2-build/kg2-version.txt  > /home/ubuntu/kg2-build/filter_kg_and_remap_predicates.log 2>&1
        (exited with non-zero exit code)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /home/ubuntu/.snakemake/log/2021-08-08T174120.823762.snakemake.log

complete logfile here: build-kg2-snakemake.log.zip

Issue reported in detail in #116

saramsey commented 3 years ago

resuming the KG2.7.2 build on buildkg2.rtx.ai (in a screen session):

cd
bash -x ~/kg2-code/build-kg2-snakemake.sh

Then: ctrl-a d

saramsey commented 3 years ago

OK, filter_kg_and_remap_predicates.py has completed on buildkg2.rtx.ai. Now it is running slim_kg2.py.

saramsey commented 3 years ago

Confirming the correct KG2 version number in this build:

[screenshot: Screen Shot 2021-08-09 at 1 52 24 PM]

saramsey commented 3 years ago

At 21:58 UTC on 8/9/2021, an error occurred in the rule TSV for the KG2.7.2 build on buildkg2.rtx.ai (see #117 for details).

saramsey commented 3 years ago

The build process seems to have hung at approximately 2025 UTC on Aug. 9, 2021. See #118.

saramsey commented 3 years ago

Now that #118 is (hopefully) fixed, resuming the build on buildkg2.rtx.ai....