boto / botocore

The low-level, core functionality of boto3 and the AWS CLI.
Apache License 2.0

Upload cancellation causes long retry loop #3268

Open andy-slac opened 3 weeks ago

andy-slac commented 3 weeks ago

Describe the bug

I'm not exactly sure whether this is a botocore or s3transfer issue; more probably it's some inconsistency between the two. What I observe is that when an asynchronous exception (Control-C, or a similar "random" exception that we use to terminate the application) occurs during an S3 upload, the upload does not stop immediately; instead it hangs for longer than a minute. This happens with multithreaded uploads; single-threaded uploads do not show the same issue.

Regression Issue

Expected Behavior

I would expect Control-C (or another asynchronous exception) to stop the upload almost immediately.

Current Behavior

Running with debug logging enabled, I think I see what happens:

Here is an example of a traceback dumped by boto from one of the threads when it goes into the retry loop:

2024-09-30 19:21:01,572 ThreadPoolExecutor-0_7 DEBUG botocore.httpsession (<...>/lib/python3.11/site-packages/botocore/httpsession.py:508) :: Exception received when sending urllib3 HTTP request
Traceback (most recent call last):
  File "<...>/lib/python3.11/site-packages/botocore/httpsession.py", line 464, in send
    urllib_response = conn.urlopen(
                      ^^^^^^^^^^^^^
  File "<...>/lib/python3.11/site-packages/urllib3/connectionpool.py", line 715, in urlopen
    httplib_response = self._make_request(
                       ^^^^^^^^^^^^^^^^^^^
  File "<...>/lib/python3.11/site-packages/urllib3/connectionpool.py", line 416, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "<...>/lib/python3.11/site-packages/botocore/awsrequest.py", line 96, in request
    rval = super().request(method, url, body, headers, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<...>/lib/python3.11/site-packages/urllib3/connection.py", line 244, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "<...>/lib/python3.11/http/client.py", line 1303, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "<...>/lib/python3.11/http/client.py", line 1349, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "<...>/lib/python3.11/http/client.py", line 1298, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "<...>/lib/python3.11/site-packages/botocore/awsrequest.py", line 130, in _send_output
    self._handle_expect_response(message_body)
  File "<...>/lib/python3.11/site-packages/botocore/awsrequest.py", line 176, in _handle_expect_response
    self._send_message_body(message_body)
  File "<...>/lib/python3.11/site-packages/botocore/awsrequest.py", line 209, in _send_message_body
    self.send(message_body)
  File "<...>/lib/python3.11/site-packages/botocore/awsrequest.py", line 223, in send
    return super().send(str)
           ^^^^^^^^^^^^^^^^^
  File "<...>/lib/python3.11/http/client.py", line 1009, in send
    datablock = data.read(self.blocksize)
                ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<...>/lib/python3.11/site-packages/s3transfer/utils.py", line 520, in read
    data = self._fileobj.read(amount_to_read)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<...>/lib/python3.11/site-packages/s3transfer/upload.py", line 90, in read
    raise self._transfer_coordinator.exception
  File "<...>/lib/python3.11/site-packages/urllib3/connectionpool.py", line 715, in urlopen
    httplib_response = self._make_request(
                       ^^^^^^^^^^^^^^^^^^^
  File "<...>/lib/python3.11/site-packages/urllib3/connectionpool.py", line 416, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "<...>/lib/python3.11/site-packages/botocore/awsrequest.py", line 96, in request
    rval = super().request(method, url, body, headers, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<...>/lib/python3.11/site-packages/urllib3/connection.py", line 244, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "<...>/lib/python3.11/http/client.py", line 1303, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "<...>/lib/python3.11/http/client.py", line 1349, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "<...>/lib/python3.11/http/client.py", line 1298, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "<...>/lib/python3.11/site-packages/botocore/awsrequest.py", line 130, in _send_output
    self._handle_expect_response(message_body)
  File "<...>/lib/python3.11/site-packages/botocore/awsrequest.py", line 176, in _handle_expect_response
    self._send_message_body(message_body)
  File "<...>/lib/python3.11/site-packages/botocore/awsrequest.py", line 209, in _send_message_body
    self.send(message_body)
  File "<...>/lib/python3.11/site-packages/botocore/awsrequest.py", line 223, in send
    return super().send(str)
           ^^^^^^^^^^^^^^^^^
  File "<...>/lib/python3.11/http/client.py", line 1009, in send
    datablock = data.read(self.blocksize)
                ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<...>/lib/python3.11/site-packages/s3transfer/utils.py", line 520, in read
    data = self._fileobj.read(amount_to_read)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<...>/lib/python3.11/site-packages/s3transfer/upload.py", line 90, in read
    raise self._transfer_coordinator.exception
  File "<...>/lib/python3.11/site-packages/urllib3/connectionpool.py", line 715, in urlopen
    httplib_response = self._make_request(
                       ^^^^^^^^^^^^^^^^^^^
  File "<...>/lib/python3.11/site-packages/urllib3/connectionpool.py", line 416, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "<...>/lib/python3.11/site-packages/botocore/awsrequest.py", line 96, in request
    rval = super().request(method, url, body, headers, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<...>/lib/python3.11/site-packages/urllib3/connection.py", line 244, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "<...>/lib/python3.11/http/client.py", line 1303, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "<...>/lib/python3.11/http/client.py", line 1349, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "<...>/lib/python3.11/http/client.py", line 1298, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "<...>/lib/python3.11/site-packages/botocore/awsrequest.py", line 130, in _send_output
    self._handle_expect_response(message_body)
  File "<...>/lib/python3.11/site-packages/botocore/awsrequest.py", line 176, in _handle_expect_response
    self._send_message_body(message_body)
  File "<...>/lib/python3.11/site-packages/botocore/awsrequest.py", line 209, in _send_message_body
    self.send(message_body)
  File "<...>/lib/python3.11/site-packages/botocore/awsrequest.py", line 223, in send
    return super().send(str)
           ^^^^^^^^^^^^^^^^^
  File "<...>/lib/python3.11/http/client.py", line 1009, in send
    datablock = data.read(self.blocksize)
                ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<...>/lib/python3.11/site-packages/s3transfer/utils.py", line 520, in read
    data = self._fileobj.read(amount_to_read)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<...>/lib/python3.11/site-packages/s3transfer/upload.py", line 90, in read
    raise self._transfer_coordinator.exception
s3transfer.exceptions.FatalError: Received signal SIGINT.

Reproduction Steps

Here is the code that I used to debug the issue; the code is essentially a trivial call to the upload_file method. The trick is to trigger upload cancellation by hitting Ctrl-C at the right moment. In our setup I used a ~100MB file, which gave me a reasonable time window during the upload. If you are lucky, after hitting Ctrl-C you should see an enormous number of tracebacks like the one I included above.

#!/usr/bin/env python

import argparse
import logging
import signal
import traceback

import boto3
import botocore.config

# Configure logging to use DEBUG; this shows tracebacks from exceptions
# handled by boto internally.
fmt = "%(asctime)s %(threadName)s %(levelname)s %(name)s (%(pathname)s:%(lineno)d) :: %(message)s"
logging.basicConfig(level=logging.DEBUG, format=fmt)

_log = logging.getLogger()

# In our case we try to gracefully terminate transfers when a signal is sent
# by the scheduler which executes all processing jobs. This is done by raising
# a special exception which inherits from BaseException so that it is not
# swallowed by a regular `except Exception`, i.e. it behaves more like
# KeyboardInterrupt.
class GracefulShutdownInterrupt(BaseException):
    pass

def _signal_handler(signum, stack_frame):
    signame = signal.Signals(signum).name
    _log.info(f"************* Signal {signame} detected, cleaning up and shutting down. ************\n")
    traceback.print_stack(stack_frame)
    raise GracefulShutdownInterrupt(f"Received signal {signame}.")

def main():

    parser = argparse.ArgumentParser()
    parser.add_argument("file", help="File to upload to S3.")
    parser.add_argument("bucket", help="Name of the S3 bucket.")
    parser.add_argument("path", help="Name of the object in the bucket.")
    parser.add_argument("-p", "--profile", default=None, help="AWS profile name.")
    parser.add_argument("--endpoint-url", default=None, help="Endpoint URL.")
    parser.add_argument("--access-key-id", default=None, help="Access key ID.")
    parser.add_argument("--secret-access-key", default=None, help="Secret access key.")
    args = parser.parse_args()

    session = boto3.Session(profile_name=args.profile)

    # This configuration is what we use by default in our setup.
    config = botocore.config.Config(read_timeout=180, retries={"mode": "adaptive", "max_attempts": 10})
    client = session.client(
        "s3",
        endpoint_url=args.endpoint_url,
        aws_access_key_id=args.access_key_id,
        aws_secret_access_key=args.secret_access_key,
        config=config,
    )

    # Installing the handlers is optional; if KeyboardInterrupt is raised
    # instead of GracefulShutdownInterrupt, then the exception raised by the
    # coordinators will be different (CancelledError vs. FatalError).
    signal.signal(signal.SIGINT, _signal_handler)
    signal.signal(signal.SIGTERM, _signal_handler)
    client.upload_file(args.file, args.bucket, args.path)

if __name__ == "__main__":
    main()

Possible Solution

Maybe the retry handler should understand that certain kinds of exceptions should result in immediate failure?
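To illustrate the idea: a sketch of what such a check could look like (this is a hypothetical predicate, not botocore's actual retry API; the name should_retry and the set of "transient" exception types are assumptions for the example). Exceptions that derive from BaseException but not from Exception, such as KeyboardInterrupt or the custom shutdown exception from the repro script, signal deliberate cancellation and should never be retried.

```python
def should_retry(exc: BaseException) -> bool:
    """Return True only for exceptions that are plausibly transient.

    BaseException subclasses that do not derive from Exception
    (KeyboardInterrupt, SystemExit, custom shutdown interrupts) indicate
    deliberate cancellation, so fail immediately instead of retrying.
    """
    if not isinstance(exc, Exception):
        return False  # Ctrl-C / shutdown signals: never retry
    # Example transient set; a real handler would consult its retry policy.
    return isinstance(exc, (ConnectionError, TimeoutError))


# A custom shutdown exception like the one in the repro script.
class GracefulShutdownInterrupt(BaseException):
    pass
```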

Additional Information/Context

Package versions:

boto3                         1.34.131
botocore                      1.34.131
s3transfer                    0.10.2
urllib3                       1.26.19

SDK version used

1.34.131

Environment details (OS name and version, etc.)

Red Hat Enterprise Linux release 8.6 (Ootpa), kernel 4.18.0-372.32.1.el8_6.x86_64

andy-slac commented 3 weeks ago

I can attach full logs from an actual execution of that script, with and without the signal handler, each about 10MB in size.

tim-finnigan commented 3 weeks ago

Thanks for reaching out. Can you share the full logs (with any sensitive info redacted) by adding boto3.set_stream_logger('') to your script? When you say "The trick is to trigger upload cancellation by hitting Ctrl-C at a correct moment", can you elaborate on that a bit more? Is this issue only reproducible when cancelling the upload at a certain point in the process?

I'll link the Boto3 docs on retries and timeout for Config for reference, although I see you already have those applied in your snippet. I'm wondering if specifying a lower timeout and max_attempts would reduce the delay you're seeing.

andy-slac commented 3 weeks ago

Attaching the log with debug output: log-exc5-handler.log.gz

Reducing the delay is not my goal; the retries are configured reasonably in our case and there is no reason to change them. What I want is for the transfer to stop instantly, without any delay, after the signal is received and the exception is raised; there is no reason to retry in that case.

Hitting Ctrl-C at the right moment means it has to happen after the upload has started but before it has finished, i.e. while the threads are actively pushing data. That window may be short, depending on the size of the file. I used a ~100MB file in my case, which gave me a relatively good chance.

tim-finnigan commented 2 weeks ago

Thanks for following up. It looks like https://github.com/boto/s3transfer/issues/249 may be related. We can plan to bring this issue up with the team for further review.