boto / botocore

The low-level, core functionality of boto3 and the AWS CLI.

Very slow performance when using iter_lines on an s3 object with long lines #2774

mike-roberts-healx commented 2 years ago

Describe the bug

StreamingBody.iter_lines seems to be extremely slow when dealing with a file with very long lines. We were using this in ECS to parse a ~500 MB JSON Lines file containing fairly large objects, and it was taking upwards of 10 minutes to run.

Expected Behavior

Performance should be comparable to read or iter_chunks, regardless of how long the lines are.

Current Behavior

It is way slower than reading the file in other ways.

Reproduction Steps

Minimal repro (requires moto) that reads a 10MB file with no line breaks:

from moto import mock_s3
import boto3
import time

BUCKET = "test_bucket"
KEY = "long_lines.txt"
KILOBYTES = 1024
MEGABYTES = 1024 * KILOBYTES

@mock_s3
def slow_iter_lines():
    s3 = boto3.resource("s3")
    s3.create_bucket(Bucket=BUCKET)
    obj = s3.Object(BUCKET, KEY)
    obj.put(Body=b"a" * (10 * MEGABYTES))  # 10 MB of data containing no line breaks

    start1 = time.perf_counter()
    obj.get()["Body"].read()
    end1 = time.perf_counter()
    print(f"Normal read took {end1-start1}s")

    start2 = time.perf_counter()
    list(obj.get()["Body"].iter_chunks())
    end2 = time.perf_counter()
    print(f"Chunk iterator took {end2-start2}s")

    start3 = time.perf_counter()
    list(obj.get()["Body"].iter_lines())
    end3 = time.perf_counter()
    print(f"Line iterator took {end3-start3}s")

slow_iter_lines()

Output on botocore==1.27.87:

Normal read took 0.003736798000318231s
Chunk iterator took 0.008655662000819575s
Line iterator took 26.232641663998947s

Possible Solution

The implementation of iter_lines looks to be quadratic in the length of the lines:

    def iter_lines(self, chunk_size=_DEFAULT_CHUNK_SIZE, keepends=False):
        pending = b''
        for chunk in self.iter_chunks(chunk_size):
            lines = (pending + chunk).splitlines(True)
            for line in lines[:-1]:
                yield line.splitlines(keepends)[0]
            pending = lines[-1]
        if pending:
            yield pending.splitlines(keepends)[0]

If there are no line breaks in any of the chunks, then every time it goes round this loop it is doing:

- pending + chunk, which copies every byte received so far into a new bytes object, and
- .splitlines(True) on that result, which scans every byte received so far again.

So pending keeps growing longer, and every iteration has to re-copy and re-scan it, so it gets slower quadratically until there is a line break.
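
The effect is easy to see with plain bytes, independent of boto3 and S3. A hypothetical micro-benchmark (concat_loop is not botocore code, it just mimics the pending handling above):

from time import perf_counter

def concat_loop(n_chunks, chunk=b"a" * 1024):
    # Mimics what iter_lines does with pending when no newline ever appears:
    # each iteration copies and re-scans everything received so far.
    pending = b""
    start = perf_counter()
    for _ in range(n_chunks):
        pending = (pending + chunk).splitlines(True)[-1]
    return perf_counter() - start

# Doubling the number of chunks roughly quadruples the runtime:
print(concat_loop(2_000))
print(concat_loop(4_000))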

A better implementation would probably be to maintain a list of pending chunks and concatenate them only when a line break is reached.
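
A rough sketch of that idea, using only the iter_chunks API shown above (iter_lines_buffered is a hypothetical helper, not botocore code):

def iter_lines_buffered(body, chunk_size=1024, keepends=False):
    # Buffer newline-free chunks in a list and join them only when a line
    # break finally arrives, so each byte is copied a bounded number of
    # times rather than once per subsequent chunk.
    pending = []
    for chunk in body.iter_chunks(chunk_size):
        if b"\n" not in chunk and b"\r" not in chunk:
            pending.append(chunk)  # O(1) bookkeeping, no copy of earlier data
            continue
        lines = (b"".join(pending) + chunk).splitlines(True)
        for line in lines[:-1]:
            yield line.splitlines(keepends)[0]
        pending = [lines[-1]]  # at most one chunk's worth of trailing data
    if pending:
        tail = b"".join(pending)
        if tail:
            yield tail.splitlines(keepends)[0]

With this structure the expensive join only happens on chunks that actually contain a line break, so the total work stays roughly linear in the size of the stream.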

Additional Information/Context

Increasing chunk_size to 1MB fixed the immediate problem we were having.
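
Concretely, the workaround is just passing a larger chunk_size to iter_lines; handle_record below is a placeholder for whatever processes each line:

body = obj.get()["Body"]
# Larger chunks mean far fewer trips around the quadratic pending loop.
for line in body.iter_lines(chunk_size=1024 * 1024):  # 1 MiB chunks instead of the small default
    handle_record(line)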

SDK version used

1.27.87

Environment details (OS name and version, etc.)

Ubuntu 22.04, Python 3.10.4

tim-finnigan commented 2 years ago

Hi @mike-roberts-healx, thanks for reaching out. 10 minutes for a 500 MB file does seem like a very long time. What is your network connection like? Is this something that is consistently reproducible? How fast was it with chunk_size set to 1 MB? Also, I'll share the documentation on this here for reference: https://botocore.amazonaws.com/v1/documentation/api/latest/reference/response.html.

mike-roberts-healx commented 2 years ago

Hi Tim, thanks for getting back to me. In order:

tim-finnigan commented 1 year ago

Thanks @mike-roberts-healx for following up. I wonder if running this inside an ECS task could be a factor. Did you happen to test this independently of ECS? Also, have you tried file sizes between 1 MB and 500 MB, and if so, was there much variability across that range, from under 10 seconds up to around 10 minutes?

I saw this related post on Stack Overflow: https://stackoverflow.com/questions/73352049/problem-with-streaming-large-files-with-boto3-to-s3-on-ec2. There are a few suggestions there that may help but please let me know if it overlaps with your use case.

mike-roberts-healx commented 1 year ago

The script I posted runs locally and demonstrates the issue, so I don't think it's anything to do with ECS. When downloading a file with very long lines, read and iter_chunks both run quickly but iter_lines runs very slowly.