ioanastellar opened this issue 1 year ago
Hi @ioanastellar - thank you for reaching out and for your patience.
In this scenario, given the file size you're working with, I would suggest trying multipart transfers with the download_fileobj method, which automatically switches to a multipart download once the file exceeds a certain size threshold. Using the method is as simple as the code below:
import boto3

s3 = boto3.client('s3')
with open('filename', 'wb') as data:
    s3.download_fileobj(
        Bucket='mybucket',
        Key='mykey',
        Fileobj=data)
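If the default threshold doesn't fit your workload, the transfer behavior can be tuned by passing a TransferConfig; a minimal sketch, where the threshold and concurrency values are purely illustrative:

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client('s3')
# Illustrative values: switch to multipart at 64 MB, use 8 download threads
config = TransferConfig(multipart_threshold=64 * 1024 * 1024,
                        max_concurrency=8)
with open('filename', 'wb') as data:
    s3.download_fileobj('mybucket', 'mykey', data, Config=config)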
If you're still having trouble, please share your debug logs by adding boto3.set_stream_logger('') to your code, as that would give us more insight into investigating this behavior.
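For anyone collecting those logs, a minimal sketch of where the call goes:

import boto3

# Passing '' attaches the handler at the root logger, so all boto3/botocore
# debug output (including wire-level request/response logs) is emitted
boto3.set_stream_logger('')
s3 = boto3.client('s3')
# ... reproduce the failing download here ...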
Best, John
Thanks for looking at this @aBurmeseDev! Our app is actually using file_stream.iter_lines, so changing which API we use wouldn't work. I traced the issue back to the read API just for debugging.
Attaching the logs from a repro run: botocore_logs.txt
I'm having the same issue when using the iter_lines or iter_chunks methods. I need to use these instead of the methods suggested above because I want to be able to download data in fixed chunk sizes and only up to a specified line limit. In other words, I do not want to download the entire file.
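For reference, botocore's StreamingBody.iter_lines does accept a chunk_size argument, so the fixed-chunk, line-limited pattern looks roughly like this when the stream holds up (bucket, key, and the limit below are placeholders):

import boto3

s3 = boto3.client("s3")
body = s3.get_object(Bucket="mybucket", Key="mykey")["Body"]
max_lines = 10_000  # placeholder line limit
for i, line in enumerate(body.iter_lines(chunk_size=1024 * 1024)):
    if i >= max_lines:
        break
    # process the line here (bytes, without the trailing newline)
body.close()  # stop streaming the rest of the object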
is the workaround here just to download the whole file? we're using iter_lines, since we're downloading a very large csv too.
The real workaround is to accept the fact that boto3 APIs time out when you try to stream byte chunks in a single request, and to make multiple requests at successive byte ranges instead to avoid the timeouts. Something like this:
import boto3

# bucket, key, and chunk_size are assumed to be defined by the caller
client = boto3.client("s3")
object_size = client.get_object_attributes(
    Bucket=bucket, Key=key, ObjectAttributes=["ObjectSize"]
).get("ObjectSize")
chunk_start = 0
chunk_end = chunk_start + chunk_size - 1
while chunk_start <= object_size:
    # Read a specific byte range from the file as a chunk. We do this because
    # the AWS server times out and sends empty chunks when streaming the
    # entire file in one request.
    if body := client.get_object(
        Bucket=bucket, Key=key, Range=f"bytes={chunk_start}-{chunk_end}"
    ).get("Body"):
        chunk = body.read()
        # Write your chunk to file here
    # S3 clamps a Range end that runs past the object size, so the final
    # (short) chunk needs no special handling.
    chunk_start += chunk_size
    chunk_end += chunk_size
cool, thanks! I had something like this previously, but was missing logic that ensured the start & end of each line is complete.
we have another lever to pull right now, which is to reduce the file size that's written to s3 and split the data across multiple smaller files; that seems to be working fine with the iter_lines approach. however, I think the true workaround is what you mentioned.
You can fetch data while tracking lines by adding this to the while loop, without using iter_lines:
# Assumes buffer = b"" and line_count = 0 are initialized before the loop,
# and that n_rows is the line limit (negative for no limit)
lines = (buffer + chunk).split(b"\n")
# Pop the last line from the split as it might be incomplete
buffer = lines.pop()
line_count += len(lines)
# Row fetch limit reached: trim the overflow, flush what's left, and stop
if n_rows >= 0 and (line_overflow := line_count - n_rows) > 0:
    lines = lines[:-line_overflow]
    if lines:
        with open("file", "ab") as f:
            f.write(b"\n".join(lines) + b"\n")
    break
lines_dump = b"\n".join(lines) + b"\n"
# Write lines to file here
with open("file", "ab") as f:
    f.write(lines_dump)
I've come across this GitHub issue because I've recently been experiencing the same exception, possibly with the same root cause. I wrote the following code to open an AWS CUR file and process it:
import csv
from gzip import GzipFile
from io import TextIOWrapper

response = s3.get_object(Bucket=s3_bucket, Key=s3_key)
gzipped = GzipFile(None, 'rb', fileobj=response['Body'])
data = TextIOWrapper(gzipped)
reader = csv.DictReader(data)
for row in reader:
    process_cur_row(row, match_account)
It is the iterator (for row in reader) that is blowing up with errors like 55831137 read, but total bytes expected is 145096346.
So do I need to rewrite this code to download the file, gunzip it, and then iterate over it? A shame if so, but that is what the workaround suggestions in previous comments seem to imply ...
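One way to avoid iterating over the live response body, assuming local disk space is available: download to a local file first (download_file retries failed parts under the hood), then gunzip and parse locally. A sketch, with a hypothetical local path and the same s3_bucket/s3_key as above:

import csv
import gzip

local_path = "/tmp/cur.csv.gz"  # hypothetical scratch location
s3.download_file(s3_bucket, s3_key, local_path)
# Stream the local gzip as text and parse it row by row
with gzip.open(local_path, "rt", newline="") as data:
    for row in csv.DictReader(data):
        process_cur_row(row, match_account)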
I've implemented @codexceed's solution and it works great, thank you!
But now I'm wondering: since we're all arriving at this issue from using boto3's iter_lines() method, wouldn't we also resolve it by increasing the chunk_size parameter? Doesn't that similarly grab larger chunks from S3?
Or is the key difference that it still keeps the raw stream open for the entire operation, whereas our solution makes separate requests at longer intervals for larger chunks?
@Austin-Tan, larger chunks mean longer connection sessions, which in turn means that boto3 is more likely to time out before completion. Hence the multiple small chunk requests.
I am also experiencing this problem, but only on a Windows Python client.
Describe the bug
We're seeing botocore.exceptions.IncompleteReadError when calling read multiple times on a file larger than the amount of data requested in the read, e.g. the file is 1.1GB and we call read 3 times for 1GB.
Expected Behavior
I'd expect either that the second call returns all the remaining bytes, because the requested amount is larger than the remaining bytes, or that the third call does. Instead we hit the exception.
Current Behavior
Reproduction Steps
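A minimal sketch consistent with the description above (bucket and key are hypothetical):

import boto3

s3 = boto3.client("s3")
# Hypothetical object of ~1.1GB
body = s3.get_object(Bucket="mybucket", Key="large-file")["Body"]
GB = 1024 ** 3
first = body.read(GB)   # returns the first 1GB
second = body.read(GB)  # should return the remaining ~0.1GB
third = body.read(GB)   # botocore.exceptions.IncompleteReadError is raised
                        # on one of these later calls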
Possible Solution
No response
Additional Information/Context
No response
SDK version used
1.14.17
Environment details (OS name and version, etc.)
Mac Ventura 13.4.1