RTXteam / RTX-KG2

Build system for the RTX-KG2 biomedical knowledge graph, part of the ARAX reasoning system (https://github.com/RTXTeam/RTX)
MIT License

Error in SemMedDB Extraction Script #294

Closed (ecwood closed 11 months ago)

ecwood commented 1 year ago

As part of testing for https://github.com/RTXteam/RTX-KG2/issues/291, I found that extract-semmeddb.sh fails. Here's the log file:

+ set -o nounset -o pipefail -o errexit
+ [[ '' == \-\-\h\e\l\p ]]
+ [[ '' == \-\h ]]
+ echo '================= starting extract-semmeddb.sh ================='
================= starting extract-semmeddb.sh =================
+ date
Fri Jun 23 16:52:00 UTC 2023
++ dirname extract-semmeddb.sh
+ config_dir=.
+ source ./master-config.shinc
++ '[' -z ']'
++ test_suffix=
++ BUILD_DIR=/home/ubuntu/kg2-build
++ VENV_DIR=/home/ubuntu/kg2-venv
++ CODE_DIR=/home/ubuntu/kg2-code
++ umls_dir=/home/ubuntu/kg2-build/umls
++ umls_dest_dir=/home/ubuntu/kg2-build/umls/META
++ s3_region=us-west-2
++ s3_bucket=rtx-kg2
++ s3_bucket_public=rtx-kg2-public
++ s3_bucket_versioned=rtx-kg2-versioned
++ s3_cp_cmd='aws s3 cp --no-progress --region us-west-2'
++ mysql_conf=/home/ubuntu/kg2-build/mysql-config.conf
++ curl_get='curl -s -L -f'
++ curies_to_categories_file=/home/ubuntu/kg2-code/curies-to-categories.yaml
++ curies_to_urls_file=/home/ubuntu/kg2-code/curies-to-urls-map.yaml
++ predicate_mapping_file=/home/ubuntu/kg2-code/predicate-remap.yaml
++ infores_mapping_file=/home/ubuntu/kg2-code/kg2-provided-by-curie-to-infores-curie.yaml
++ ont_load_inventory_file=/home/ubuntu/kg2-code/ont-load-inventory.yaml
++ umls2rdf_config_master=/home/ubuntu/kg2-code/umls2rdf-umls.conf
++ rtx_config_file=RTXConfiguration-config.json
++ biolink_model_version=3.1.2
+ semmed_output_file=/home/ubuntu/kg2-build/kg2-semmeddb-tuplelist.json
+ build_flag=
+ semmed_ver=VER43
+ semmed_year=2023
+ semmed_dir=/home/ubuntu/kg2-build/semmeddb
++ dirname /home/ubuntu/kg2-build/kg2-semmeddb-tuplelist.json
+ semmed_output_dir=/home/ubuntu/kg2-build
+ semmed_sql_file=semmedVER43_2023_R_WHOLEDB.sql
+ mysql_dbname=semmeddb
+ mkdir -p /home/ubuntu/kg2-build/semmeddb
+ mkdir -p /home/ubuntu/kg2-build
++ /home/ubuntu/kg2-code/get-system-memory-gb.sh
+ mem_gb=249
+ aws s3 cp --no-progress --region us-west-2 s3://rtx-kg2/semmedVER43_2023_R_WHOLEDB.sql.gz /home/ubuntu/kg2-build/semmeddb/
fatal error: An error occurred (404) when calling the HeadObject operation: Key "semmedVER43_2023_R_WHOLEDB.sql.gz" does not exist

ecwood commented 1 year ago

This error occurred because the S3 bucket has no semmedVER43_2023_R_WHOLEDB.sql.gz, only semmedVER43_2021_R_WHOLEDB.sql.gz. We might want to document that updating SemMedDB requires manually downloading a newer copy, since the script can't download it automatically. While investigating this, I discovered that the download (which is here) has been split into several parts. This will be a larger task than I expected, because we will have to load each table separately.
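
A minimal sketch of a pre-flight check that would surface the missing dump before the copy step, assuming the bucket name and file naming shown in the log. The helper names semmed_dump_key and require_s3_object are hypothetical and are not part of extract-semmeddb.sh; the check relies on `aws s3api head-object` returning a non-zero exit status when the key is absent.

```shell
# Hypothetical helper: build the dump's object key from the version and year,
# following the naming visible in the log (semmedVER43_2023_R_WHOLEDB.sql.gz).
semmed_dump_key() {
    local ver="$1" year="$2"
    echo "semmed${ver}_${year}_R_WHOLEDB.sql.gz"
}

# Hypothetical helper: succeed only if the key exists in the bucket.
# `aws s3api head-object` exits non-zero when the object cannot be found.
require_s3_object() {
    local bucket="$1" key="$2"
    if ! aws s3api head-object --bucket "${bucket}" --key "${key}" >/dev/null 2>&1; then
        echo "s3://${bucket}/${key} not found; download the new SemMedDB release and upload it first" >&2
        return 1
    fi
}

# Example (assumed values, taken from the log above):
#   require_s3_object rtx-kg2 "$(semmed_dump_key VER43 2023)"
```

Failing early with an explicit message would point the person running the build at the manual download step, rather than at a bare 404 from the copy command.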

ecwood commented 1 year ago

We can't import the ENTITY table. It is huge (43G compressed, 248G in MySQL), and importing it made the instance run out of disk space, even after I deleted everything we didn't need. As far as @saramsey and I could tell, we don't need it anyway.
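
A rough sketch of loading the per-table dumps while skipping ENTITY. It assumes the per-table file naming that appears later in this thread (semmedVER43_2023_R_CITATIONS.sql.gz) and reuses the zcat-into-mysql pipeline from the build log. The helper names tables_to_load and load_semmed_table are hypothetical, and the list of tables to pass in is left to the caller.

```shell
# Hypothetical helper: print every table name except those in the
# space-separated skip list given as the first argument.
tables_to_load() {
    local skip="$1" table
    shift
    for table in "$@"; do
        case " ${skip} " in
            *" ${table} "*) ;;          # table is in the skip list: omit it
            *) echo "${table}" ;;
        esac
    done
}

# Hypothetical helper: load one per-table dump into MySQL with the same
# zcat | mysql pipeline that appears in the build log.
load_semmed_table() {
    local semmed_dir="$1" ver="$2" year="$3" table="$4" mysql_conf="$5" dbname="$6"
    zcat "${semmed_dir}/semmed${ver}_${year}_R_${table}.sql.gz" \
        | mysql --defaults-extra-file="${mysql_conf}" --database="${dbname}"
}

# Example (the table list here is illustrative, not the full SemMedDB schema):
#   for table in $(tables_to_load "ENTITY" CITATIONS ENTITY PREDICATION); do
#       load_semmed_table "${semmed_dir}" VER43 2023 "${table}" "${mysql_conf}" semmeddb
#   done
```

Keeping the skip list as an argument means ENTITY can be excluded without editing the loop if the set of required tables changes.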

+ /home/ubuntu/kg2-venv/bin/python3 /home/ubuntu/kg2-code/semmeddb_mysql_to_tuple_list_json.py /home/ubuntu/kg2-build/mysql-config.conf semmeddb VER43 2023 /home/ubuntu/kg2-build/kg2-semmeddb-tuplelist.json
/home/ubuntu/kg2-venv/lib/python3.7/site-packages/rdflib_jsonld/__init__.py:12: DeprecationWarning: The rdflib-jsonld package has been integrated into rdflib as of rdflib==6.0.0.  Please remove rdflib-jsonld from your project's dependencies.
  DeprecationWarning,
Traceback (most recent call last):
  File "/home/ubuntu/kg

ecwood commented 11 months ago

During the build (#312), this error occurred in extract-semmeddb.sh:

+ mkdir -p /home/ubuntu/kg2-build/semmeddb
++ /home/ubuntu/kg2-code/get-system-memory-gb.sh
+ mem_gb=374
+ aws s3 cp --no-progress --region us-west-2 s3://rtx-kg2/semmedVER43_2023_R_WHOLEDB.tar.gz /home/ubuntu/kg2-build/semmeddb/
download: s3://rtx-kg2/semmedVER43_2023_R_WHOLEDB.tar.gz to kg2-build/semmeddb/semmedVER43_2023_R_WHOLEDB.tar.gz
+ tar -xf /home/ubuntu/kg2-build/semmeddb/semmedVER43_2023_R_WHOLEDB.tar.gz
+ mysql --defaults-extra-file=/home/ubuntu/kg2-build/mysql-config.conf -e 'DROP DATABASE IF EXISTS semmeddb'
+ mysql --defaults-extra-file=/home/ubuntu/kg2-build/mysql-config.conf -e 'CREATE DATABASE IF NOT EXISTS semmeddb CHARACTER SET utf8 COLLATE utf8_unicode_ci'
+ zcat /home/ubuntu/kg2-build/semmeddb/semmedVER43_2023_R_CITATIONS.sql.gz
+ mysql --defaults-extra-file=/home/ubuntu/kg2-build/mysql-config.conf --database=semmeddb
gzip: /home/ubuntu/kg2-build/semmeddb/semmedVER43_2023_R_CITATIONS.sql.gz: No such file or directory
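
The thread does not say why the file was missing. One possibility, offered as an assumption and not as a diagnosis from this issue, is that `tar -xf` without `-C` unpacked the archive into the working directory, or that the archive's internal layout differed from what the script expected. The sketch below extracts into an explicit directory and checks for the expected dump before anything is piped to mysql; the helper name extract_and_check is hypothetical.

```shell
# Hypothetical helper: unpack the archive into dest_dir (tar's -C option)
# and fail with a clear message if the expected dump is still missing.
extract_and_check() {
    local archive="$1" dest_dir="$2" expected="$3"
    mkdir -p "${dest_dir}"
    tar -xf "${archive}" -C "${dest_dir}"
    if [ ! -f "${dest_dir}/${expected}" ]; then
        echo "expected ${dest_dir}/${expected} after extraction; archive contains:" >&2
        tar -tf "${archive}" >&2        # list the archive members to aid debugging
        return 1
    fi
}

# Example (paths taken from the log above):
#   extract_and_check /home/ubuntu/kg2-build/semmeddb/semmedVER43_2023_R_WHOLEDB.tar.gz \
#       /home/ubuntu/kg2-build/semmeddb semmedVER43_2023_R_CITATIONS.sql.gz
```

Listing the archive members on failure shows immediately whether the dumps sit at the top level or inside a subdirectory.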

ecwood commented 11 months ago

I am closing this issue because the code worked in KG2.8.4pre's build.