RTXteam / RTX-KG2

Build system for the RTX-KG2 biomedical knowledge graph, part of the ARAX reasoning system (https://github.com/RTXTeam/RTX)
MIT License

Error in `rule KEGG` #230

Closed acevedol closed 1 year ago

acevedol commented 2 years ago
Error in rule KEGG:
    jobid: 47
    output: /home/ubuntu/kg2-build/kegg.json
    log: /home/ubuntu/kg2-build/extract-kegg.log (check log file(s) for error message)
    shell:
        bash -x /home/ubuntu/kg2-code/extract-kegg.sh /home/ubuntu/kg2-build/kegg.json > /home/ubuntu/kg2-build/extract-kegg.log 2>&1
        (exited with non-zero exit code)

Originally posted by @acevedol in https://github.com/RTXteam/RTX-KG2/issues/221#issuecomment-1222725111 Build issue

acevedol commented 2 years ago
+ /home/ubuntu/kg2-venv/bin/python3 -u /home/ubuntu/kg2-code/query_kegg.py /home/ubuntu/kg2-build/kegg.json
/home/ubuntu/kg2-venv/lib/python3.7/site-packages/rdflib_jsonld/__init__.py:12: DeprecationWarning: The rdflib-jsonld package has been integrated into rdflib as of rdflib==6.0.0.  Please remove rdflib-jsonld from your project's dependencies.
  DeprecationWarning,
Traceback (most recent call last):
  File "/home/ubuntu/kg2-code/query_kegg.py", line 124, in <module>
    kg2_util.save_json(run_queries(), args.outputFile, True)
  File "/home/ubuntu/kg2-code/query_kegg.py", line 90, in run_queries
    for results in send_query(query).split('\n'):
  File "/home/ubuntu/kg2-code/query_kegg.py", line 38, in send_query
    res = requests.get(query, timeout=120)
  File "/home/ubuntu/RTX-KG2/cache_control_helper.py", line 39, in get
    return self.sess.get(url, params=params, timeout=timeout, headers=headers)
  File "/home/ubuntu/kg2-venv/lib/python3.7/site-packages/requests/sessions.py", line 555, in get
    return self.request('GET', url, **kwargs)
  File "/home/ubuntu/kg2-venv/lib/python3.7/site-packages/requests/sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/ubuntu/kg2-venv/lib/python3.7/site-packages/requests/sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "/home/ubuntu/kg2-venv/lib/python3.7/site-packages/cachecontrol/adapter.py", line 53, in send
    resp = super(CacheControlAdapter, self).send(request, **kw)
  File "/home/ubuntu/kg2-venv/lib/python3.7/site-packages/requests/adapters.py", line 533, in send
    return self.build_response(request, resp)
  File "/home/ubuntu/kg2-venv/lib/python3.7/site-packages/cachecontrol/adapter.py", line 71, in build_response
    response = self.heuristic.apply(response)
  File "/home/ubuntu/kg2-venv/lib/python3.7/site-packages/cachecontrol/heuristics.py", line 43, in apply
    updated_headers = self.update_headers(response)
  File "/home/ubuntu/RTX-KG2/cache_control_helper.py", line 16, in update_headers
    date = parsedate(response.headers['date'])
  File "/home/ubuntu/kg2-venv/lib/python3.7/site-packages/urllib3/_collections.py", line 157, in __getitem__
    val = self._container[key.lower()]
KeyError: 'date'
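The last frames show `update_headers` in cache_control_helper.py indexing `response.headers['date']`, which raises when a response arrives without a Date header. A minimal sketch of a more tolerant lookup follows, assuming the headers object supports `.get()` like a mapping. The helper name `response_date_epoch` and its `fallback` parameter are hypothetical and not part of RTX-KG2 or cachecontrol.

```python
import time
from email.utils import mktime_tz, parsedate_tz


def response_date_epoch(headers, fallback=None):
    """Return the response's Date header as epoch seconds.

    Hypothetical helper: when the header is missing or cannot be parsed,
    return `fallback` (or the current time) instead of raising KeyError.
    """
    raw = headers.get('date')
    # parsedate_tz returns None for strings it cannot parse.
    parsed = parsedate_tz(raw) if raw else None
    if parsed is None:
        return time.time() if fallback is None else fallback
    return mktime_tz(parsed)
```

Whether falling back to the current time is acceptable depends on how the caching heuristic uses the date; skipping caching for such responses may be the safer choice.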
acevedol commented 2 years ago
Traceback (most recent call last):
  File "kg2-code/query_kegg.py", line 129, in <module>
    kg2_util.save_json(run_queries(), args.outputFile, True)
  File "kg2-code/query_kegg.py", line 108, in run_queries
    results_dict[results[0]] = {'name': results[1]}
IndexError: list index out of range
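The IndexError at `results[1]` means at least one line produced fewer than two fields, for example an empty trailing line or an unexpected response body. A sketch of a parser that skips such lines instead of failing, assuming each record is `id<TAB>name` on its own line. The function name `parse_id_name_lines` is hypothetical and not taken from query_kegg.py.

```python
def parse_id_name_lines(text):
    """Parse tab-separated `id<TAB>name` lines into a dict.

    Returns (results_dict, skipped), where `skipped` lists the non-blank
    lines that did not contain at least two fields.
    """
    results_dict = {}
    skipped = []
    for line in text.splitlines():
        fields = line.split('\t')
        if len(fields) < 2 or not fields[0]:
            # Record malformed lines for inspection; ignore blank ones.
            if line.strip():
                skipped.append(line)
            continue
        results_dict[fields[0]] = {'name': fields[1]}
    return results_dict, skipped
```

Returning the skipped lines makes it possible to log them, so a silently truncated response does not go unnoticed.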
acevedol commented 2 years ago
Traceback (most recent call last):
  File "kg2-code/query_kegg.py", line 133, in <module>
    kg2_util.save_json(run_queries(), args.outputFile, True)
  File "kg2-code/query_kegg.py", line 118, in run_queries
    results_dict[results[1]]['eq_id'] = results[0]
KeyError: 'cpd:C00462\\nchebi:17051'
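The key in this KeyError holds two IDs joined by a backslash-n sequence, which suggests the response was not split into records where the code expected, and the lookup also presumes the ID already exists in `results_dict`. The sketch below works under two assumptions: that a literal backslash followed by `n` stands in for a line break, and that each record is `equivalent_id<TAB>kegg_id`. The function name `attach_equivalent_ids` is hypothetical.

```python
def attach_equivalent_ids(results_dict, text):
    """Set 'eq_id' on existing entries from `equivalent_id<TAB>kegg_id` lines.

    Returns the KEGG IDs that were not present in results_dict, rather
    than raising KeyError on the first one.
    """
    # ASSUMPTION: a two-character backslash-n is an escaped line break.
    normalized = text.replace('\\n', '\n')
    missing = []
    for line in normalized.splitlines():
        fields = line.split('\t')
        if len(fields) != 2:
            continue  # not a well-formed record
        eq_id, kegg_id = fields
        entry = results_dict.get(kegg_id)
        if entry is None:
            missing.append(kegg_id)
            continue
        entry['eq_id'] = eq_id
    return missing
```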
acevedol commented 2 years ago

Rule completed successfully...

ecwood commented 1 year ago

extract-kegg.sh is still failing. For example:

ubuntu@ip-172-31-59-112:~/kg2-code$ ./extract-kegg.sh ~/kg2-build/kegg.json
================= starting extract-kegg.sh ==================
Thu Jun 22 21:19:07 UTC 2023
/home/ubuntu/kg2-venv/lib/python3.7/site-packages/rdflib_jsonld/__init__.py:12: DeprecationWarning: The rdflib-jsonld package has been integrated into rdflib as of rdflib==6.0.0.  Please remove rdflib-jsonld from your project's dependencies.
  DeprecationWarning,
Traceback (most recent call last):
  File "/home/ubuntu/kg2-code/query_kegg.py", line 138, in <module>
    kg2_util.save_json(run_queries(), args.outputFile, True)
  File "/home/ubuntu/kg2-code/query_kegg.py", line 109, in run_queries
    result = result.split("\\t")
AttributeError: 'list' object has no attribute 'split'
ecwood commented 1 year ago

With the changes from d850a20 and d08d383, extract-kegg.sh now runs to completion, which takes around 19 hours. The output of extract-kegg.sh is in the rtx-kg2 S3 bucket as kegg-dump-06-2023.json, in case you don't want to rerun the queries on the next build.
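Since the query step is slow, a build could reuse an existing dump when one is available locally. A minimal sketch of that idea, assuming the dump has already been copied from the bucket to `dump_path` (fetching it from S3 is out of scope here). `load_or_query` is a hypothetical helper, and `run_queries` stands in for whatever callable produces the results.

```python
import json
import os


def load_or_query(dump_path, run_queries):
    """Load results from dump_path if it exists; otherwise query and save.

    Hypothetical helper: `run_queries` is any zero-argument callable that
    returns a JSON-serializable object.
    """
    if os.path.isfile(dump_path):
        with open(dump_path) as dump_file:
            return json.load(dump_file)
    results = run_queries()
    # Write to a temporary file first so an interrupted run does not
    # leave a truncated dump behind.
    tmp_path = dump_path + '.tmp'
    with open(tmp_path, 'w') as tmp_file:
        json.dump(results, tmp_file)
    os.replace(tmp_path, dump_path)
    return results
```

Passing an absolute `dump_path` also keeps the file out of whichever directory the script happens to run from.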

ecwood commented 1 year ago

One thing to note, however: the output kegg.json ends up in the code directory rather than the build directory, which isn't ideal.

ecwood commented 1 year ago

This rule was successful in the KG2.8.4pre build, so I am going to close out this issue.