boto / boto3

AWS SDK for Python
https://aws.amazon.com/sdk-for-python/
Apache License 2.0

S3 download_fileobj call 3-5x slower when file already exists on disk #3625

Open mattBrzezinski opened 1 year ago

mattBrzezinski commented 1 year ago

Describe the bug

I was doing some benchmark testing of boto3 and came across what I consider an oddity when calling download_fileobj(). When a file with the same name already exists on disk, the call takes roughly 3-5x longer to execute.

Expected Behavior

I expected the runtime to be unaffected by whether the file is already present on disk.

Current Behavior

When the file already exists on disk, the call takes significantly longer.

Reproduction Steps

import boto3
import os

from time import time

sess = boto3.session.Session()
client = sess.client("s3")

BUCKET = "redacted-bucket"
FILENAME = "redacted-filename"

def download_from_s3():
    with open(FILENAME, 'wb') as data:
        client.download_fileobj(BUCKET, FILENAME, data)

# S3 seems to have some form of caching, let's warm it up first
def warm_s3_cache():
    for i in range(1,4):
        download_from_s3()
        os.remove(FILENAME)

def example():
    start = time()
    download_from_s3()
    end = time()
    print(f"Time taken: {end - start}")

warm_s3_cache()
example()  # Time taken: 8.71937370300293
example()  # Time taken: 45.56892156600952
os.remove(FILENAME)
example()  # Time taken: 7.929774761199951
example()  # Time taken: 45.05253601074219
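As an aside on the measurement itself: time.time() works, but time.perf_counter() is the higher-resolution clock intended for benchmarking. A reusable timing helper (a sketch, independent of the repro above) might look like:

```python
from contextlib import contextmanager
from time import perf_counter

results = []

@contextmanager
def timed(label):
    """Record wall-clock time around a block using perf_counter,
    which has better resolution than time.time() for benchmarking."""
    start = perf_counter()
    try:
        yield
    finally:
        elapsed = perf_counter() - start
        results.append((label, elapsed))
        print(f"{label}: {elapsed:.3f}s")

# Example usage with some arbitrary busy work:
with timed("busy work"):
    sum(range(1_000_000))
```

In the repro, `with timed("download"): download_from_s3()` would then replace the manual start/end bookkeeping.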

Possible Solution

No response

Additional Information/Context

I'm not 100% familiar with Python's intricacies, so I'm unsure if this is an expected effect; however, I wasn't able to find this question or concern posted beforehand, so I wanted to reach out about it.

SDK version used

boto3==1.26.68

Environment details (OS name and version, etc.)

S3 bucket resides in us-east-2, the code running here is on an m7g.8xlarge EC2 instance in the same region.

aBurmeseDev commented 1 year ago

Hi @mattBrzezinski - thanks for reaching out. I attempted to replicate this with the code example provided in the docs using ~20 files and wasn't able to reproduce this behavior on my end. A couple of questions for you:

* How many files are you working with, and what's the average size of the files?

* Does it occur intermittently?

* Can you share debug logs by adding `boto3.set_stream_logger('')` to your code?

mattBrzezinski commented 1 year ago


How many files are you working with and what's the average size of the files?

I've run this code with a variety of files: twenty 256MB files, five 1GB files, and a single 5GB file. This happens across the board.

Does it occur intermittently?

This happens consistently for me.

Can you share debug logs by adding boto3.set_stream_logger('') to your code?

I'm unable to upload the logs to GitHub as they're too big (83MB); here is a link to them in my Google Drive instead. The slowdown unfortunately isn't obvious from the timestamps alone, but in areas like this one:

2023-03-14 15:04:50,803 s3transfer.tasks [DEBUG] Executing task IOWriteTask(transfer_id=0, {'offset': 4987551744}) with kwargs {'fileobj': <_io.BufferedWriter name='file-5gb'>, 'offset': 4987551744}
2023-03-14 15:04:50,803 s3transfer.utils [DEBUG] Acquiring 0
2023-03-14 15:04:50,804 s3transfer.futures [DEBUG] Submitting task CompleteDownloadNOOPTask(transfer_id=0, {}) to executor <s3transfer.futures.BoundedExecutor object at 0xffff95e469a0> for transfer request: 0.
2023-03-14 15:04:50,804 s3transfer.utils [DEBUG] Releasing acquire 0/None
2023-03-14 15:04:50,804 s3transfer.utils [DEBUG] Acquiring 0

There are long periods of time, specifically after the Releasing acquire lines, where nothing seems to be happening. This looks to me more like a generic Python IO issue than a boto3 one.
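That theory could be tested independently of boto3 by timing raw seek-and-write operations (the pattern s3transfer's IOWriteTask performs, writing chunks at explicit offsets) against a fresh file versus a pre-existing one. A rough sketch, with the sizes chosen arbitrarily:

```python
import os
from time import perf_counter

SIZE = 64 * 1024 * 1024   # 64 MB total, arbitrary for illustration
CHUNK = 8 * 1024 * 1024   # 8 MB writes at explicit offsets, mimicking IOWriteTask

def timed_offset_writes(path):
    """Time seek+write at increasing offsets, as s3transfer's IO tasks do."""
    buf = b"\0" * CHUNK
    start = perf_counter()
    with open(path, "wb") as f:
        for offset in range(0, SIZE, CHUNK):
            f.seek(offset)
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())  # force data to disk so we time actual IO
    return perf_counter() - start

path = "io-test.bin"
if os.path.exists(path):
    os.remove(path)
fresh = timed_offset_writes(path)     # target file did not exist
existing = timed_offset_writes(path)  # target file already on disk
os.remove(path)
print(f"fresh: {fresh:.3f}s, existing: {existing:.3f}s")
```

If `existing` came out consistently slower here too, that would support the generic-IO explanation; if not, the cause is more likely somewhere in the transfer path.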

aBurmeseDev commented 1 year ago

@mattBrzezinski - thanks for sharing the additional info. I'll have to agree with you that the logs weren't very helpful in this case; I couldn't find anything specific to your situation.

However, I'd suggest giving the code example from the docs a shot, since that's how I attempted to reproduce this and didn't see the slowness on my end. Could you also elaborate on your use case in more detail? I'd like to bring it to our team for discussion. Do you see the slowness only for files that already exist on your instance?

mattBrzezinski commented 1 year ago

However, I'd suggest giving the code example from the docs a shot, since that's how I attempted to reproduce this and didn't see the slowness on my end.

That code example is actually what I based mine on 😅

Could you also elaborate on your use case in more detail? I'd like to bring it to our team for discussion.

I was just testing how different TransferConfig parameters affect throughput when downloading files from S3. I was able to work around this issue by deleting the files between test runs, but figured I'd log an issue since this could affect others in more realistic scenarios.

Do you see the slowness only for files that already exist on your instance?

Correct. If the files do not exist on the instance, I get the download times I'd expect. When the file already exists and I attempt to download it again, it's much slower. I don't think I mentioned it earlier, but this was run on Python 3.8.10.
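The delete-before-download workaround mentioned earlier in the thread can be sketched generically; the `download` callable here is a hypothetical stand-in for the repro's client.download_fileobj call:

```python
import os

def fresh_download(filename, download):
    """Delete any stale local copy, then stream the object into a new file.

    `download` is a stand-in callable for something like
    client.download_fileobj(BUCKET, filename, data); removing the old
    file first sidesteps the slowdown reported for pre-existing files.
    """
    if os.path.exists(filename):
        os.remove(filename)
    with open(filename, "wb") as data:
        download(data)
```

With the repro's client it would be called as `fresh_download(FILENAME, lambda data: client.download_fileobj(BUCKET, FILENAME, data))`.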