boto / boto3

AWS SDK for Python
https://aws.amazon.com/sdk-for-python/
Apache License 2.0

upload_fileobj hangs indefinitely without returning error #4148

Closed fitnycdigitalinitiatives closed 1 month ago

fitnycdigitalinitiatives commented 4 months ago

Describe the bug

Hello,

We've recently run into a problem that seems similar to #3657. Occasionally, and seemingly at random, our application stalls while uploading objects to our bucket with upload_fileobj, without returning an error. We've run this application for years without experiencing this issue, but recent changes include upgrading boto3 to 1.33.2 and RHEL to 9 (not sure if either is relevant). As suggested in the related issue, we turned on the debug logs; here's an example where the upload just completely hangs for hours on end:

DEBUG     2024-05-28 19:33:11  s3transfer.utils:utils:acquire:632:  Acquiring 0
DEBUG     2024-05-28 19:33:11  s3transfer.tasks:tasks:_wait_until_all_complete:205:  UploadSubmissionTask(transfer_id=0, {'transfer_future': <s3transfer.futures.TransferFuture object at 0x7f94df942b20>}) about to wait for the following futures []
DEBUG     2024-05-28 19:33:11  s3transfer.tasks:tasks:_wait_until_all_complete:216:  UploadSubmissionTask(transfer_id=0, {'transfer_future': <s3transfer.futures.TransferFuture object at 0x7f94df942b20>}) done waiting for dependent futures
DEBUG     2024-05-28 19:33:11  s3transfer.tasks:tasks:_execute_main:160:  Executing task UploadSubmissionTask(transfer_id=0, {'transfer_future': <s3transfer.futures.TransferFuture object at 0x7f94df942b20>}) with kwargs {'client': <botocore.client.S3 object at 0x7f94e0189490>, 'config': <boto3.s3.transfer.TransferConfig object at 0x7f94df860a60>, 'osutil': <s3transfer.utils.OSUtils object at 0x7f94e02bcd90>, 'request_executor': <s3transfer.futures.BoundedExecutor object at 0x7f94e02bc3d0>, 'transfer_future': <s3transfer.futures.TransferFuture object at 0x7f94df942b20>}
DEBUG     2024-05-28 19:33:11  s3transfer.futures:futures:submit:318:  Submitting task CreateMultipartUploadTask(transfer_id=0, {'bucket': 'BUCKET', 'key': 'packages/d388/83aa/aa70/4879/944a/20b3/a601/3537/lb_ds_chf_000227-d38883aa-aa70-4879-944a-20b3a6013537.7z', 'extra_args': {}}) to executor <s3transfer.futures.BoundedExecutor object at 0x7f94e02bc3d0> for transfer request: 0.
DEBUG     2024-05-28 19:33:11  s3transfer.utils:utils:acquire:632:  Acquiring 0
DEBUG     2024-05-28 19:33:11  s3transfer.tasks:tasks:_wait_until_all_complete:205:  CreateMultipartUploadTask(transfer_id=0, {'bucket': 'BUCKET', 'key': 'packages/d388/83aa/aa70/4879/944a/20b3/a601/3537/lb_ds_chf_000227-d38883aa-aa70-4879-944a-20b3a6013537.7z', 'extra_args': {}}) about to wait for the following futures []
DEBUG     2024-05-28 19:33:11  s3transfer.tasks:tasks:_wait_until_all_complete:216:  CreateMultipartUploadTask(transfer_id=0, {'bucket': 'BUCKET', 'key': 'packages/d388/83aa/aa70/4879/944a/20b3/a601/3537/lb_ds_chf_000227-d38883aa-aa70-4879-944a-20b3a6013537.7z', 'extra_args': {}}) done waiting for dependent futures
DEBUG     2024-05-28 19:33:11  s3transfer.tasks:tasks:_execute_main:160:  Executing task CreateMultipartUploadTask(transfer_id=0, {'bucket': 'BUCKET', 'key': 'packages/d388/83aa/aa70/4879/944a/20b3/a601/3537/lb_ds_chf_000227-d38883aa-aa70-4879-944a-20b3a6013537.7z', 'extra_args': {}}) with kwargs {'client': <botocore.client.S3 object at 0x7f94e0189490>, 'bucket': 'BUCKET', 'key': 'packages/d388/83aa/aa70/4879/944a/20b3/a601/3537/lb_ds_chf_000227-d38883aa-aa70-4879-944a-20b3a6013537.7z', 'extra_args': {}}
DEBUG     2024-05-28 19:33:11  s3transfer.futures:futures:submit:318:  Submitting task UploadPartTask(transfer_id=0, {'bucket': 'BUCKET', 'key': 'packages/d388/83aa/aa70/4879/944a/20b3/a601/3537/lb_ds_chf_000227-d38883aa-aa70-4879-944a-20b3a6013537.7z', 'part_number': 1, 'extra_args': {}}) to executor <s3transfer.futures.BoundedExecutor object at 0x7f94e02bc3d0> for transfer request: 0.
DEBUG     2024-05-28 19:33:11  s3transfer.utils:utils:acquire:632:  Acquiring 0
DEBUG     2024-05-28 19:33:11  urllib3.connectionpool:connectionpool:_make_request:474:  https://s3.us-east-1.amazonaws.com:443 "POST /BUCKET/packages/d388/83aa/aa70/4879/944a/20b3/a601/3537/lb_ds_chf_000227-d38883aa-aa70-4879-944a-20b3a6013537.7z?uploads HTTP/1.1" 200 None
DEBUG     2024-05-28 19:33:11  s3transfer.utils:utils:release:645:  Releasing acquire 0/None
DEBUG     2024-05-28 19:33:11  s3transfer.tasks:tasks:_wait_until_all_complete:205:  UploadPartTask(transfer_id=0, {'bucket': 'BUCKET', 'key': 'packages/d388/83aa/aa70/4879/944a/20b3/a601/3537/lb_ds_chf_000227-d38883aa-aa70-4879-944a-20b3a6013537.7z', 'part_number': 1, 'extra_args': {}}) about to wait for the following futures [<s3transfer.futures.ExecutorFuture object at 0x7f94e02d68b0>]
DEBUG     2024-05-28 19:33:11  s3transfer.tasks:tasks:_wait_until_all_complete:210:  UploadPartTask(transfer_id=0, {'bucket': 'BUCKET', 'key': 'packages/d388/83aa/aa70/4879/944a/20b3/a601/3537/lb_ds_chf_000227-d38883aa-aa70-4879-944a-20b3a6013537.7z', 'part_number': 1, 'extra_args': {}}) about to wait for <s3transfer.futures.ExecutorFuture object at 0x7f94e02d68b0>
DEBUG     2024-05-28 19:33:11  s3transfer.tasks:tasks:_wait_until_all_complete:216:  UploadPartTask(transfer_id=0, {'bucket': 'BUCKET', 'key': 'packages/d388/83aa/aa70/4879/944a/20b3/a601/3537/lb_ds_chf_000227-d38883aa-aa70-4879-944a-20b3a6013537.7z', 'part_number': 1, 'extra_args': {}}) done waiting for dependent futures
DEBUG     2024-05-28 19:33:11  s3transfer.tasks:tasks:_execute_main:160:  Executing task UploadPartTask(transfer_id=0, {'bucket': 'BUCKET', 'key': 'packages/d388/83aa/aa70/4879/944a/20b3/a601/3537/lb_ds_chf_000227-d38883aa-aa70-4879-944a-20b3a6013537.7z', 'part_number': 1, 'extra_args': {}}) with kwargs {'client': <botocore.client.S3 object at 0x7f94e0189490>, 'fileobj': <s3transfer.utils.ReadFileChunk object at 0x7f94df85bb50>, 'bucket': 'BUCKET', 'key': 'packages/d388/83aa/aa70/4879/944a/20b3/a601/3537/lb_ds_chf_000227-d38883aa-aa70-4879-944a-20b3a6013537.7z', 'part_number': 1, 'extra_args': {}, 'upload_id': '2SeyzpFlJu4q6sWC9ma6r_.qparAJoyiPsNNcFW_Y_R3zbVs_G9bdPIVD_fLSy8gBhGYrawLwYUHZLunxLkDABbJL4DFTfhBY_Fj1RzR7prL6IHtDPO1UvpvBefVagKq'}
DEBUG     2024-05-28 19:33:11  urllib3.connectionpool:connectionpool:_make_request:474:  https://s3.us-east-1.amazonaws.com:443 "PUT /BUCKET/packages/d388/83aa/aa70/4879/944a/20b3/a601/3537/lb_ds_chf_000227-d38883aa-aa70-4879-944a-20b3a6013537.7z?uploadId=2SeyzpFlJu4q6sWC9ma6r_.qparAJoyiPsNNcFW_Y_R3zbVs_G9bdPIVD_fLSy8gBhGYrawLwYUHZLunxLkDABbJL4DFTfhBY_Fj1RzR7prL6IHtDPO1UvpvBefVagKq&partNumber=1 HTTP/1.1" 200 0

As you can see, it just hangs after the initial PUT. In successful uploads, the first PUT is followed by a line like DEBUG 2024-05-28 19:33:05 s3transfer.utils:utils:release:645: Releasing acquire 0/None

There's nothing different about the files it fails on vs. the files it uploads without issue. The vast majority of uploads succeed; just these occasional uploads stall without error.

Expected Behavior

Uploads should fail with an error.

Current Behavior

See above.

Reproduction Steps

Not able to reproduce reliably; it happens randomly when using upload_fileobj. The application uses this approach:

import boto3
from botocore.config import Config

config = Config()  # application-specific botocore client config

boto_args = {
    "service_name": "s3",
    "region_name": "REGION",
    "config": config,
}
resource = boto3.resource(**boto_args)
bucket = resource.Bucket("BUCKET")
with open('filename', 'rb') as data:
    bucket.upload_fileobj(data, 'mykey')

Possible Solution

No response

Additional Information/Context

No response

SDK version used

boto3 1.33.2

Environment details (OS name and version, etc.)

RHEL 9

tim-finnigan commented 3 months ago

Thanks for reaching out. In the issue you linked, someone mentioned that changing the base Docker image fixed this. Have you tried reproducing the issue in environments other than RHEL? Also, the latest boto3 is 1.34.118 per the CHANGELOG; have you tried updating to that version? And if you're using the CRT, have you tried uninstalling/reinstalling it?
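For reference, a minimal sketch of one way to check whether the CRT extension (the awscrt package) is installed in your environment; this is just an illustration, not a boto3 API:

import importlib.metadata

try:
    # Report the installed awscrt version, if any
    print("awscrt version:", importlib.metadata.version("awscrt"))
except importlib.metadata.PackageNotFoundError:
    print("awscrt (the CRT extension) is not installed")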

You could also try testing with the various S3 transfer configuration options documented here, for example:

import boto3
from boto3.s3.transfer import TransferConfig

# Create an S3 client
s3 = boto3.client('s3')

# Define the transfer config
transfer_config = TransferConfig(
    multipart_threshold=25 * 1024 * 1024,  # 25 MB
    max_concurrency=10,
    multipart_chunksize=1024 * 1024,  # 1 MB
    num_download_attempts=5,
    max_io_queue=100,
    io_chunksize=1024 * 1024,  # 1 MB
    use_threads=True
)

# Open the local file
with open('file.txt', 'rb') as f:
    # Use the transfer config in an S3 upload operation
    s3.upload_fileobj(
        Fileobj=f,
        Bucket='test-bucket',
        Key='remote_file.txt',
        Config=transfer_config
    )

Otherwise I recommend following the S3 performance optimization guidelines documented here: https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance-guidelines.html

If you're able to narrow down the conditions in which this issue occurs, please let us know.

fitnycdigitalinitiatives commented 3 months ago

Hello,

Not sure if this is related, but we recently got this error:

botocore.exceptions.HTTPClientError: An HTTP Client raised an unhandled exception: AWSHTTPSConnectionPool(host='s3.us-east-1.amazonaws.com', port=443): Pool is closed.

fitnycdigitalinitiatives commented 3 months ago

More detailed log:

Traceback (most recent call last):
  File "/usr/share/archivematica/virtualenvs/archivematica-storage-service/lib/python3.9/site-packages/urllib3/connectionpool.py", line 276, in _get_conn
    conn = self.pool.get(block=self.block, timeout=timeout)
  File "/usr/lib64/python3.9/queue.py", line 165, in get
    with self.not_empty:
  File "/usr/lib64/python3.9/threading.py", line 257, in __enter__
    return self._lock.__enter__()
  File "src/gevent/_semaphore.py", line 282, in gevent._gevent_c_semaphore.Semaphore.__enter__
  File "src/gevent/_semaphore.py", line 283, in gevent._gevent_c_semaphore.Semaphore.__enter__
  File "src/gevent/_semaphore.py", line 184, in gevent._gevent_c_semaphore.Semaphore.acquire
  File "/usr/share/archivematica/virtualenvs/archivematica-storage-service/lib64/python3.9/site-packages/gevent/thread.py", line 112, in acquire
    acquired = BoundedSemaphore.acquire(self, blocking, timeout)
  File "src/gevent/_semaphore.py", line 184, in gevent._gevent_c_semaphore.Semaphore.acquire
  File "src/gevent/_semaphore.py", line 253, in gevent._gevent_c_semaphore.Semaphore.acquire
  File "src/gevent/_abstract_linkable.py", line 521, in gevent._gevent_c_abstract_linkable.AbstractLinkable._wait
  File "src/gevent/_abstract_linkable.py", line 487, in gevent._gevent_c_abstract_linkable.AbstractLinkable._wait_core
  File "src/gevent/_abstract_linkable.py", line 490, in gevent._gevent_c_abstract_linkable.AbstractLinkable._wait_core
  File "src/gevent/_abstract_linkable.py", line 442, in gevent._gevent_c_abstract_linkable.AbstractLinkable._AbstractLinkable__wait_to_be_notified
  File "src/gevent/_abstract_linkable.py", line 451, in gevent._gevent_c_abstract_linkable.AbstractLinkable._switch_to_hub
AttributeError: 'NoneType' object has no attribute 'switch'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/share/archivematica/virtualenvs/archivematica-storage-service/lib/python3.9/site-packages/botocore/httpsession.py", line 464, in send
    urllib_response = conn.urlopen(
  File "/usr/share/archivematica/virtualenvs/archivematica-storage-service/lib/python3.9/site-packages/urllib3/connectionpool.py", line 704, in urlopen
    conn = self._get_conn(timeout=pool_timeout)
  File "/usr/share/archivematica/virtualenvs/archivematica-storage-service/lib/python3.9/site-packages/urllib3/connectionpool.py", line 279, in _get_conn
    raise ClosedPoolError(self, "Pool is closed.")
urllib3.exceptions.ClosedPoolError: AWSHTTPSConnectionPool(host='s3.us-east-1.amazonaws.com', port=443): Pool is closed.

fitnycdigitalinitiatives commented 3 months ago

@tim-finnigan Hello, I did some more digging into the system logs and saw this message popping up around the time of these issues: systemstats_linux: unexpected header length in /proc/net/snmp. 237 != 224, which is related to this net-snmp bug.

Do you think that could be causing the problem?

tim-finnigan commented 3 months ago

Thanks for following up; I'm not sure whether that could be causing the problem. Could you provide the full snippet you're using and complete debug logs (with sensitive info redacted) so we can investigate further? Adding boto3.set_stream_logger('') to your script will generate the full logs.
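A minimal sketch of what that looks like (the bucket and key names here are placeholders):

import boto3

# Attach a DEBUG-level stream handler to the root logger, so boto3,
# botocore, and s3transfer all emit full debug logs to stderr
boto3.set_stream_logger('')

s3 = boto3.client('s3')
with open('file.txt', 'rb') as f:
    s3.upload_fileobj(f, 'test-bucket', 'remote_file.txt')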

fitnycdigitalinitiatives commented 3 months ago

Hi Tim,

I don't want to speak too soon, but one thing I recently tried was turning off threads in the TransferConfig, and I haven't run into the error in the last several days. Perhaps something is interrupting the threading, either in our application or in RHEL 9.
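Roughly, the change looks like this (a minimal sketch; the file, bucket, and key names are placeholders):

import boto3
from boto3.s3.transfer import TransferConfig

# Disable the threaded transfer executor; the upload runs in the calling thread instead
transfer_config = TransferConfig(use_threads=False)

s3 = boto3.client('s3')
with open('file.txt', 'rb') as f:
    s3.upload_fileobj(f, 'test-bucket', 'remote_file.txt', Config=transfer_config)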

github-actions[bot] commented 2 months ago

Greetings! It looks like this issue hasn’t been active in longer than five days. We encourage you to check if this is still an issue in the latest release. In the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please feel free to provide a comment or upvote with a reaction on the initial post to prevent automatic closure. If the issue is already closed, please feel free to open a new one.

fitnycdigitalinitiatives commented 2 months ago

Verifying that turning off threading fixed the issue. I did see online that RHEL 9 changed something with its threading approach, but it's way beyond my understanding to know whether that could be related to what I'm experiencing.

Susuvnkl commented 2 months ago

I experienced the same issue: no error log, the app just freezes.

s3_client.upload_fileobj(
    file, bucket_name, filename, ExtraArgs={"ContentType": file.content_type}
)

For now I'm using the s3_client.put_object method instead.
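For reference, a minimal self-contained sketch of that fallback (the bucket, key, and content type here are placeholders):

import boto3

s3_client = boto3.client('s3')
with open('file.txt', 'rb') as f:
    s3_client.put_object(
        Bucket='test-bucket',
        Key='remote_file.txt',
        Body=f,  # sent in a single request; no multipart upload or transfer threads involved
        ContentType='application/octet-stream',
    )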

tim-finnigan commented 1 month ago

Thanks for following up, I think we're still waiting on the full debug logs for further investigation of this issue.

tim-finnigan commented 1 month ago

Checking in — I was made aware of this issue involving gevent: https://github.com/gevent/gevent/issues/1826, which is present in the logs shared earlier here. This issue appears to be specific to RHEL-based systems. Please note that we do not provide or officially support gevent with our networking setup. Any issues related to gevent will need to be addressed by the gevent team.
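Not a boto3 recommendation, but for readers running under gevent: the usual gevent guidance is to apply monkey patching before any other imports, so that the locks and threads used by s3transfer are consistently patched. A minimal sketch of that ordering (bucket and key names are placeholders):

from gevent import monkey
monkey.patch_all()  # must run before boto3/botocore are imported

import boto3

s3 = boto3.client('s3')
with open('file.txt', 'rb') as f:
    s3.upload_fileobj(f, 'test-bucket', 'remote_file.txt')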

fitnycdigitalinitiatives commented 1 month ago

Cool, good catch. Thanks for your help!

github-actions[bot] commented 1 month ago

This issue is now closed. Comments on closed issues are hard for our team to see. If you need more assistance, please open a new issue that references this one.