mike-roberts-healx opened this issue 2 years ago
Hi @mike-roberts-healx, thanks for reaching out. 10 minutes for a 500 MB file does seem like a very long time. What is your network connection like? Is this something that is consistently reproducible? How fast was it with `chunk_size` set to 1 MB? Also, I'll share the documentation on this here just for reference: https://botocore.amazonaws.com/v1/documentation/api/latest/reference/response.html.
Hi Tim, thanks for getting back on this. In order:

- The script uses `moto` to mock out the S3 bucket, so it is only talking to local memory, not going over the Internet. It definitely seems like an algorithmic issue within botocore rather than a connection problem.

Thanks @mike-roberts-healx for following up. I wonder if running this inside an ECS task could be a factor - did you happen to test this independently of ECS? Also, have you tried file sizes between 1 MB and 500 MB, and if so, was there much variability on that spectrum of <10 seconds to ~10 minutes?
I saw this related post on Stack Overflow: https://stackoverflow.com/questions/73352049/problem-with-streaming-large-files-with-boto3-to-s3-on-ec2. There are a few suggestions there that may help, but please let me know whether any of it overlaps with your use case.
The script I posted runs locally and demonstrates the issue, so I don't think it's anything to do with ECS. When downloading a file with very long lines, `read` and `iter_chunks` both run quickly, but `iter_lines` runs very slowly.
Describe the bug

`StreamingBody.iter_lines` seems to be extremely slow when dealing with a file with very long lines. We were using this in ECS to parse a ~500 MB JSON Lines file containing fairly large objects, and it was taking upwards of 10 minutes to run.

Expected Behavior

Performance should be comparable to `read` or `iter_chunks`, regardless of how long the lines are.

Current Behavior

It is far slower than reading the file in other ways.
Reproduction Steps

Minimal repro (requires `moto`) that reads a 10 MB file with no line breaks:
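A minimal sketch of such a repro, assuming moto 4.x's `mock_s3` decorator (moto 5 renamed it `mock_aws`) and an illustrative bucket and key rather than the author's exact script:

```python
import time

import boto3
from moto import mock_s3  # moto 4.x; moto 5 renamed this to mock_aws


@mock_s3
def main():
    # Everything below talks to moto's in-memory S3, not the network.
    s3 = boto3.client("s3", region_name="us-east-1")
    s3.create_bucket(Bucket="test-bucket")
    # A 10 MB object containing no line breaks at all.
    s3.put_object(Bucket="test-bucket", Key="big", Body=b"x" * (10 * 1024 * 1024))

    for method in ("read", "iter_chunks", "iter_lines"):
        body = s3.get_object(Bucket="test-bucket", Key="big")["Body"]
        start = time.perf_counter()
        if method == "read":
            body.read()
        else:
            # Drain the iterator returned by iter_chunks / iter_lines.
            for _ in getattr(body, method)():
                pass
        print(f"{method}: {time.perf_counter() - start:.2f}s")


if __name__ == "__main__":
    main()
```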
Output on `botocore==1.27.87`:

Possible Solution
The implementation of `iter_lines` looks to be quadratic in the length of the lines:
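For reference, here is the loop in question, paraphrased from `StreamingBody.iter_lines` in botocore 1.27.x (see `botocore/response.py` for the exact code):

```python
# Paraphrase of StreamingBody.iter_lines (botocore 1.27.x), not verbatim:
def iter_lines(self, chunk_size=1024, keepends=False):
    pending = b''
    for chunk in self.iter_chunks(chunk_size):
        # (pending + chunk) reallocates and copies the whole pending buffer,
        # and splitlines() then rescans all of it looking for line breaks.
        lines = (pending + chunk).splitlines(True)
        for line in lines[:-1]:
            yield line.splitlines(keepends)[0]
        pending = lines[-1]
    if pending:
        yield pending.splitlines(keepends)[0]
```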
If there are no line breaks in any of the chunks, then every time it goes round in this loop it is doing:

- `(pending + chunk)`, which requires allocating and copying into a new buffer
- `splitlines`, which requires iterating through the whole buffer again looking for line breaks

So `pending` keeps growing longer, and every time around we have to copy it and iterate through it again, so it gets quadratically slower until a line break is reached.
A better implementation would probably be to maintain a list of pending chunks and concatenate them only when a line break is reached.
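A minimal sketch of that idea, as a hypothetical free function over a `StreamingBody` rather than an actual botocore patch: chunks with no line break are simply appended to a list, and the join plus `splitlines` happens only when a break actually arrives, so each byte is copied a bounded number of times instead of once per chunk.

```python
def iter_lines_buffered(body, chunk_size=1024, keepends=False):
    # Hypothetical alternative to StreamingBody.iter_lines: keep pending
    # chunks in a list instead of one ever-growing bytes object.
    pending = []
    for chunk in body.iter_chunks(chunk_size):
        if b'\n' not in chunk and b'\r' not in chunk:
            # No line break yet: remember the chunk, copy nothing.
            pending.append(chunk)
            continue
        lines = (b''.join(pending) + chunk).splitlines(True)
        pending = [lines.pop()]  # the tail may be an incomplete line
        for line in lines:
            yield line.splitlines(keepends)[0]
    tail = b''.join(pending)
    if tail:
        yield tail.splitlines(keepends)[0]
```

The per-chunk membership check is linear in the chunk, so the total scanning work stays linear in the file size even when a single line spans thousands of chunks.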
Additional Information/Context

Increasing `chunk_size` to 1 MB fixed the immediate problem we were having (larger chunks mean far fewer loop iterations, so `pending` is recopied and rescanned much less often).
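Concretely, that workaround is just a matter of passing a larger `chunk_size` to `iter_lines`, a parameter it already accepts; continuing from the hypothetical repro above:

```python
# Workaround: 1 MB chunks instead of the 1 KB default means roughly
# 1000x fewer (pending + chunk) copies, sidestepping the quadratic cost.
body = s3.get_object(Bucket="test-bucket", Key="big")["Body"]
for line in body.iter_lines(chunk_size=1024 * 1024):
    pass  # process the line
```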
SDK version used

1.27.87
Environment details (OS name and version, etc.)
Ubuntu 22.04, Python 3.10.4