HUBioDataLab / CROssBARv2

Repository for migrating CROssBAR data to the Neo4j database via BioCypher

PyPath download error in try-except block #3

Closed: slobentanzer closed this issue 1 year ago

slobentanzer commented 1 year ago

After a poetry install of the latest CROssBAR adapter version (pypath v14.16, as per the poetry.lock file), I am getting the following error:

Traceback (most recent call last):
  File "/Users/slobentanzer/GitHub/CROssBAR-BioCypher-Migration/scripts/create_crossbar.py", line 12, in <module>
    uniprot_data.uniprot_data_download(cache=True)
  File "/Users/slobentanzer/GitHub/CROssBAR-BioCypher-Migration/bccb/protein.py", line 80, in uniprot_data_download
    self.data[query_key] = uniprot.uniprot_data(
  File "/Users/slobentanzer/GitHub/CROssBAR-BioCypher-Migration/.venv/lib/python3.10/site-packages/pypath/inputs/uniprot.py", line 528, in uniprot_data
    return dict(
ValueError: dictionary update sequence element #126959 has length 3; 2 is required

Since this happens in the except part of the try-except block (line 80) that I already complained about in #1, I am not sure what the actual issue is. ;)

Do you get this error as well? Can you try a new installation of the project and see if it works out-of-the-box for you?

slobentanzer commented 1 year ago

PS: If you want to tune pypath downloading behaviour, you need to modify the parameters of the curl.Curl class, e.g. the timeout and retries params.

You could also try to modify the inputs.uniprot module by adding a multi_field_uniprot_data() method that allows requesting more than one field from the UniProt API, which would probably speed up the download significantly.
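
For illustration, a minimal sketch of tuning those request parameters, assuming the timeout and retries arguments match the curl.Curl signature of your pypath version (the URL is a hypothetical placeholder):

from pypath.share import curl

# Hypothetical download with a longer timeout and more retry attempts;
# parameter names assumed per the curl.Curl signature, adjust if they differ.
c = curl.Curl(
    'https://example.org/some/download',  # hypothetical URL
    timeout = 600,  # seconds to wait for the server
    retries = 5,    # attempts before giving up
)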

slobentanzer commented 1 year ago

Another update: when running multiple times with cache = False, the error is not consistent. The second time around I got:

Traceback (most recent call last):
  File "/Users/slobentanzer/GitHub/CROssBAR-BioCypher-Migration/scripts/create_crossbar.py", line 12, in <module>
    uniprot_data.uniprot_data_download(cache=True)
  File "/Users/slobentanzer/GitHub/CROssBAR-BioCypher-Migration/bccb/protein.py", line 76, in uniprot_data_download
    self.data[query_key] = uniprot.uniprot_data(
  File "/Users/slobentanzer/GitHub/CROssBAR-BioCypher-Migration/.venv/lib/python3.10/site-packages/pypath/inputs/uniprot.py", line 528, in uniprot_data
    return dict(
  File "/Users/slobentanzer/GitHub/CROssBAR-BioCypher-Migration/.venv/lib/python3.10/site-packages/pypath/inputs/uniprot.py", line 528, in <genexpr>
    return dict(
  File "/Users/slobentanzer/GitHub/CROssBAR-BioCypher-Migration/.venv/lib/python3.10/site-packages/pypath/inputs/uniprot.py", line 531, in <genexpr>
    (
  File "/Users/slobentanzer/GitHub/CROssBAR-BioCypher-Migration/.venv/lib/python3.10/site-packages/pypath/share/curl.py", line 777, in iterfile
    for line in fileobj:
  File "/Users/slobentanzer/mambaforge/lib/python3.10/gzip.py", line 314, in read1
    return self._buffer.read1(size)
  File "/Users/slobentanzer/mambaforge/lib/python3.10/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/Users/slobentanzer/mambaforge/lib/python3.10/gzip.py", line 496, in read
    uncompress = self._decompressor.decompress(buf, size)
zlib.error: Error -3 while decompressing data: invalid block type

This seems to be a sort of reproducibility issue (that nevertheless needs fixing if we want a stable CROssBAR migration build). I suppose this is something download-failure-related that pypath probably takes care of, but we don't, because we are using the input module directly, which lacks the general safety mechanisms (whatever those may be). @deeenes can probably advise here.

slobentanzer commented 1 year ago

Third time around with cache = False it worked, confirming my suspicions. We need to look at how to use the download facilities of pypath correctly.

ervau commented 1 year ago

Yes, both errors seem to be caused by a corrupted input file, probably due to a download failure. We should definitely try to increase the number of retries, but I couldn't figure out how we can modify this param without making changes inside the input module. I'm not sure if we can modify it in a way similar to this:

if not cache:
    curl.CACHE = False

slobentanzer commented 1 year ago

@ervau there is no global variable to set retries or other request parameters in the curl module, which will probably be replaced sooner or later anyway. It may be more reasonable to call the download through a higher-level pypath function. Which function that would be, however, is a question for @deeenes, as my knowledge of the pypath build process is still limited, unfortunately.

deeenes commented 1 year ago

Hello, these are little things that can be changed easily: I added a settings key curl_retries and made uniprot_data support multiple fields.

You can use a context for the retries:

from pypath.share import settings
from pypath.inputs import uniprot

with settings.context(curl_retries = 5):
    ec_keywords = uniprot.uniprot_data(['ec', 'keywords'])

However, retries are very rare; if something requires more than 3 retries, there is likely a problem somewhere else, e.g. you are testing on a bad internet connection, or a slow server requires a longer timeout. If the failure is a timeout (see the log), with more retries you can spend a really long time just waiting for timeouts. The log tells you what's happening, and you can even see the full curl debug output:

import pypath
from pypath.share import curl
from pypath.inputs import uniprot

with curl.debug_on():
    ec_keywords = uniprot.uniprot_data(['ec', 'keywords'])

pypath.log()  # or open pypath.session.log.fname

deeenes commented 1 year ago

About the original question: the first error means that some line had 3 fields instead of 2, and that it was not the first line but the 126959th one. That's not normal; I would inspect the query and the response manually. However, this error will manifest in a different way or will hide itself after my changes above. The second error is the gzip library failing to decompress the response (this happens upstream of the first error, in the curl HTTP communication). It suggests a corrupted response from the server. If this error is persistent, a workaround could be to add an Accept-Encoding HTTP header that refuses gzip compression.
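
As a sketch, such a workaround could look like the following, assuming curl.Curl accepts a req_headers argument as used elsewhere in pypath (the URL is a hypothetical placeholder; Accept-Encoding: identity is the standard way to request an uncompressed response):

from pypath.share import curl

# Ask the server for an uncompressed response, so the gzip decompression
# step cannot fail on a corrupted compressed payload (hypothetical URL).
c = curl.Curl(
    'https://example.org/uniprot/query',  # hypothetical endpoint
    req_headers = ['Accept-Encoding: identity'],
)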

deeenes commented 1 year ago

> Yes, both errors seem to be caused by a corrupted input file, probably due to a download failure. We should definitely try to increase the number of retries, but I couldn't figure out how we can modify this param without making changes inside the input module. I'm not sure if we can modify it in a way similar to this:
>
> if not cache:
>     curl.CACHE = False

Hi @ervau, this is not a good way of doing it. To disable the cache at any point, use the contexts:

from pypath.share import curl
from pypath.inputs import uniprot

with curl.cache_off():
    ec = uniprot.uniprot_data('ec')

However, I don't see any way this could address the issues in this thread. If a request keeps failing, at some point you have to handle it: for example, by logging an error and skipping to the next request.
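
A minimal sketch of that log-and-skip strategy; the field list and logger name below are illustrative, not part of pypath or this repo:

import logging

from pypath.inputs import uniprot

log = logging.getLogger('crossbar_download')
data = {}

for field in ('ec', 'keywords'):  # illustrative field list
    try:
        data[field] = uniprot.uniprot_data(field)
    except Exception:
        # Log the full traceback and continue with the next request.
        log.exception('Download of UniProt field `%s` failed, skipping.', field)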

slobentanzer commented 1 year ago

Thanks, @deeenes; good solutions!

Though I am not sure about the last point; it happened to me on the university internet as well, which I assume is good (giving me these two distinct errors), and the third time it succeeded. These tries were within one hour, so a server problem also seems unlikely. Thus, it's not that the requests keep failing; rather, they fail inconsistently. However, the logging/debug context you suggested should clarify that, I hope.

slobentanzer commented 1 year ago

closed by ec35fd546f018a77d8442e681fe5080c4cd28e50