boto / boto3

AWS SDK for Python
https://aws.amazon.com/sdk-for-python/
Apache License 2.0

Botocore.exceptions.IncompleteReadError after read from S3 file doesn't return all required bytes #3781

Open ioanastellar opened 1 year ago

ioanastellar commented 1 year ago

Describe the bug

We're seeing botocore.exceptions.IncompleteReadError when calling read multiple times on a file that is larger than the amount of data requested per read, e.g. the file is 1.1 GB and we call read three times, asking for 1 GB each time.

Expected Behavior

I'd expect either the second call to return all the remaining bytes, since the requested amount is larger than what is left, or the third call to do so. Instead we hit the exception.

Current Behavior

Traceback (most recent call last):
  File "/Users/ip/tests/call_s3_read.py", line 46, in <module>
    c = file_stream.read(1073741824)
  File "/Users/ip/.pyenv/versions/pv/lib/python3.10/site-packages/botocore/response.py", line 90, in read
    self._verify_content_length()
  File "/Users/ip/.pyenv/versions/pv/lib/python3.10/site-packages/botocore/response.py", line 139, in _verify_content_length
    raise IncompleteReadError(
botocore.exceptions.IncompleteReadError: 1077927735 read, but total bytes expected is 1179535354.

Reproduction Steps

from boto3 import client
import time

s3 = client('s3')
file = s3.get_object(Bucket="<your bucket>", Key="<json file larger than 1GB>")
file_stream = file['Body'] if file else None

a = file_stream.read(1073741824)  # Ask for 1 GB

time.sleep(361)  # Simulate the delay introduced by our processing

b = file_stream.read(1073741824)  # Ask again for 1 GB
print(f"b: read {len(b)} bytes")  # Received only 4185911 bytes, less than the remainder of the file

c = file_stream.read(1073741824)  # Ask again for 1 GB. At this point we hit botocore.exceptions.IncompleteReadError: 1077927735 read, but total bytes expected is 1179535354.

Possible Solution

No response

Additional Information/Context

No response

SDK version used

1.14.17

Environment details (OS name and version, etc.)

Mac Ventura 13.4.1

aBurmeseDev commented 1 year ago

Hi @ioanastellar - thank you for reaching out and for your patience.

In this scenario, given the file size you're working with, I would suggest trying multipart transfers via the download_fileobj method, which switches to a multipart download once the object size crosses a certain threshold. Using the method is as simple as the code below:

import boto3
s3 = boto3.client('s3')

with open('filename', 'wb') as data:
    s3.download_fileobj(
         Bucket='mybucket', 
         Key='mykey', 
         Fileobj=data)
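
For reference, the threshold and part size that download_fileobj uses can be tuned through boto3.s3.transfer.TransferConfig; a minimal sketch with illustrative values (the bucket, key, and sizes are placeholders):

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client('s3')

# Illustrative tuning; the library defaults are 8 MB for both the threshold and the part size
config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,  # switch to multipart downloads above 64 MB
    multipart_chunksize=16 * 1024 * 1024,  # fetch the object in 16 MB parts
    max_concurrency=10,                    # download parts in parallel
)

with open('filename', 'wb') as data:
    s3.download_fileobj(
        Bucket='mybucket',
        Key='mykey',
        Fileobj=data,
        Config=config,
    )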

If you're still having trouble, please share your debug logs by adding boto3.set_stream_logger('') to your code as that would give us more insight into investigating this behavior.
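
A sketch of where that logging call would go, with the log level passed explicitly:

import logging
import boto3

# Emit wire-level debug logs for all boto3/botocore calls made after this point
boto3.set_stream_logger('', logging.DEBUG)

s3 = boto3.client('s3')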

Best, John

ioanastellar commented 1 year ago

Thanks for looking at this @aBurmeseDev! Our app is actually using file_stream.iter_lines, so switching to a different API wouldn't work for us. I only traced the issue down to the read API for debugging purposes.
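
For context, the consumption pattern described is presumably along these lines (a sketch; the bucket, key, and the per-line handler handle_line are placeholders, not from the original report):

import boto3

s3 = boto3.client('s3')
body = s3.get_object(Bucket="<your bucket>", Key="<json file larger than 1GB>")["Body"]

# Iterate the streaming body line by line instead of calling read() directly
for line in body.iter_lines():
    handle_line(line)  # hypothetical per-line processing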

ioanastellar commented 1 year ago

Attaching the logs from a repro run: botocore_logs.txt

codexceed commented 1 year ago

I'm having the same issue when using the iter_lines or iter_chunks methods. I need to use these instead of the methods suggested above because I want to download data in fixed chunk sizes and only up to a specified line limit; in other words, I do not want to download the entire file. A sketch of that kind of usage is below.
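
This is roughly what that looks like with iter_chunks, assuming a fixed chunk size and a hypothetical line limit (the bucket, key, and limit are placeholders):

import boto3

s3 = boto3.client("s3")
body = s3.get_object(Bucket="<your bucket>", Key="<your key>")["Body"]

line_limit = 10_000  # hypothetical cap on the number of lines to read
lines_seen = 0

# Pull fixed-size chunks from the streaming body rather than the whole object
for chunk in body.iter_chunks(chunk_size=1024 * 1024):
    lines_seen += chunk.count(b"\n")
    if lines_seen >= line_limit:
        break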

sanchitram1 commented 11 months ago

is the workaround here just to download the whole file? we're using iter_lines, since we're downloading a very large csv too.

codexceed commented 11 months ago

> is the workaround here just to download the whole file? we're using iter_lines, since we're downloading a very large csv too.

The real workaround is to accept that boto3 streaming reads time out when you try to pull byte chunks over a single long-lived request, and to instead make multiple requests at successive byte ranges to avoid the timeouts. Something like this:

import boto3

client = boto3.client("s3")
bucket, key = "<your bucket>", "<your key>"  # placeholders
chunk_size = 64 * 1024 * 1024  # e.g. 64 MB per ranged request

# Total object size, used to know when the last range has been fetched
object_size = client.get_object_attributes(
    Bucket=bucket, Key=key, ObjectAttributes=["ObjectSize"]
).get("ObjectSize")

chunk_start = 0
chunk_end = chunk_start + chunk_size - 1

while chunk_start < object_size:
    # Read a specific byte range from the file as a chunk. We do this because the AWS server times out
    # and sends empty chunks when streaming the entire file over a single long-lived connection.
    if body := client.get_object(
        Bucket=bucket, Key=key, Range=f"bytes={chunk_start}-{chunk_end}"
    ).get("Body"):
        chunk = body.read()

        # Write your chunk to file here

        chunk_start += chunk_size
        chunk_end += chunk_size

sanchitram1 commented 11 months ago

cool, thanks! I had something like this previously, but was missing the logic that ensures the start & end of each line is complete.

we have another lever to pull right now, which is to reduce the file size that's written to s3 and split the data across multiple smaller files; that seems to be working fine with the iter_lines approach. however, I think the true workaround is what you mentioned.

codexceed commented 11 months ago

> cool, thanks! I had something like this previously, but was missing the logic that ensures the start & end of each line is complete.
>
> we have another lever to pull right now, which is to reduce the file size that's written to s3 and split the data across multiple smaller files; that seems to be working fine with the iter_lines approach. however, I think the true workaround is what you mentioned.

You can fetch data while tracking lines by adding this to the while loop, without using iter_lines:

# Assumes buffer = b"" and line_count = 0 are initialized before the while loop,
# and that n_rows is the row fetch limit (-1 for no limit)
lines = (buffer + chunk).split(b"\n")

# Pop the last element from the split as it might be an incomplete line
buffer = lines.pop()
line_count += len(lines)

# Trim any lines past the row fetch limit
limit_reached = False
if n_rows >= 0 and (line_overflow := line_count - n_rows) > 0:
    lines = lines[:-line_overflow]
    limit_reached = True

# Write the complete lines to file
if lines:
    with open("file", "ab") as f:
        f.write(b"\n".join(lines) + b"\n")

if limit_reached:
    break

pcolmer commented 11 months ago

I've come across this GitHub issue because I've recently been experiencing the same exception, possibly with the same root cause. I wrote the following code to open an AWS CUR file and process it:

import csv
from gzip import GzipFile
from io import TextIOWrapper

response = s3.get_object(Bucket=s3_bucket, Key=s3_key)
gzipped = GzipFile(None, 'rb', fileobj=response['Body'])
data = TextIOWrapper(gzipped)
reader = csv.DictReader(data)
for row in reader:
    process_cur_row(row, match_account)

It is the iterator (for row in reader) that is blowing up with errors like 55831137 read, but total bytes expected is 145096346.

So do I need to rewrite this code to download the file, gunzip it, and then iterate over it? A shame if so, but that is what the workarounds in the previous comments seem to be suggesting ...
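
If it does come to that, here is a minimal sketch of the download-then-iterate fallback, assuming a local temporary file is acceptable and reusing s3_bucket, s3_key, process_cur_row and match_account from the snippet above:

import csv
import gzip
import tempfile

import boto3

s3 = boto3.client("s3")

# Download the gzipped CUR file to local disk first, then iterate it offline
with tempfile.NamedTemporaryFile(suffix=".csv.gz") as tmp:
    s3.download_fileobj(Bucket=s3_bucket, Key=s3_key, Fileobj=tmp)
    tmp.flush()
    with gzip.open(tmp.name, "rt", newline="") as data:
        for row in csv.DictReader(data):
            process_cur_row(row, match_account)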

Austin-Tan commented 8 months ago

I've implemented @codexceed 's solution and it works great, thank you!

But now I'm wondering: since we're all arriving at this issue from using boto3's iter_lines() method, wouldn't we also resolve our issue by increasing its chunk_size parameter? Doesn't that similarly grab larger chunks from S3?

Or is the key difference that iter_lines still keeps the raw stream open for the entire operation, whereas our solution makes separate requests at longer intervals for larger chunks?

codexceed commented 8 months ago

@Austin-Tan, larger chunks mean longer connection sessions, which in turn means the connection is more likely to time out before the read completes. Hence the multiple small chunk requests.

rokj commented 6 months ago

I am also experiencing this problem, but only on a Windows Python client.