RTXteam / RTX-KG2

Build system for the RTX-KG2 biomedical knowledge graph, part of the ARAX reasoning system (https://github.com/RTXTeam/RTX)
MIT License
34 stars 9 forks source link

Add retry capabilities for `kg2_util.download_file_if_not_exist_locally` to avoid systemic failures #365

Open d33bs opened 5 months ago

d33bs commented 5 months ago

In attempting to recreate and run the RTX-KG2 pipeline I found that kg2_util.download_file_if_not_exist_locally sometimes had a tendency to fail on first attempt to download a resource. When this happens it can sometimes cause a full pipeline failure for singular download errors. I'm unsure of the exact cause (it could be network location based, or file hosting service, etc) but generally found that the network requests were sometimes initially unable to complete and would succeed if tried again. To avoid this, a simple retry loop may provide a programmatic way to retry a download without much time cost when it comes to a full pipeline run. Adding this retry loop I found removed the cases where I saw failures.

Current code (link):

def download_file_if_not_exist_locally(url: str, local_file_name: str):
    if url is not None:
        local_file_path = pathlib.Path(local_file_name)
        if not local_file_path.is_file():
            ctx = ssl.create_default_context()
            ctx.check_hostname = False
            ctx.verify_mode = ssl.CERT_NONE
            # the following code is ugly but necessary because sometimes the TLS
            # certificates of remote sites are broken and some of the PURL'd
            # URLs resolve to HTTPS URLs (would prefer to just use
            # urllib.request.urlretrieve, but it doesn't seem to support
            # specifying an SSL "context" which we need in order to ignore the cert):
            temp_file_name = tempfile.mkstemp(prefix=TEMP_FILE_PREFIX + '-')[1]
            with urllib.request.urlopen(url, context=ctx) as u, open(temp_file_name, 'wb') as f:
                f.write(u.read())
            shutil.move(temp_file_name, local_file_name)
    return local_file_name

Suggested code:

def download_file_if_not_exist_locally(url: str, local_file_name: str):
    # general imports for portability
    import ssl
    import pathlib
    import tempfile
    import urllib
    import urllib.parse
    import urllib.request
    import shutil

    # import http for exception handling
    from http import client

    # set maximum retries possible
    MAX_RETRIES = 3

    if url is not None:
        local_file_path = pathlib.Path(local_file_name)
        if not local_file_path.is_file():
            # create a loop for attempting retries
            for attempt in range(1, MAX_RETRIES + 1):
                try:
                    ctx = ssl.create_default_context()
                    ctx.check_hostname = False
                    ctx.verify_mode = ssl.CERT_NONE
                    # the following code is ugly but necessary because sometimes the TLS
                    # certificates of remote sites are broken and some of the PURL'd
                    # URLs resolve to HTTPS URLs (would prefer to just use
                    # urllib.request.urlretrieve, but it doesn't seem to support
                    # specifying an SSL "context" which we need in order to ignore the cert):
                    temp_file_name = tempfile.mkstemp(prefix=TEMP_FILE_PREFIX + "-")[1]

                    with urllib.request.urlopen(url, context=ctx) as u, open(
                        temp_file_name, "wb"
                    ) as f:
                        f.write(u.read())

                    shutil.move(temp_file_name, local_file_name)

                    # break out of retry loop
                    break

                # catch incomplete http read exception
                except client.IncompleteRead as e:
                    if attempt < MAX_RETRIES:
                        # if we haven't reached maximum retries, try again
                        print(f"Attempt {attempt} failed. Retrying...")
                    else:
                        # If max retries reached, raise the exception
                        raise

    return local_file_name
saramsey commented 4 months ago

Thank you for the patch to implement retry capability, @d33bs ! We will get this into the main branch.

d33bs commented 4 months ago

Sounds good, thank you as well @saramsey ! If it'd be helpful, may I submit a PR towards this cause? I can also understand if you wouldn't suggest it.