boto / botocore

The low-level, core functionality of boto3 and the AWS CLI.

Very slow performance when using iter_lines on an s3 object with long lines #2774

mike-roberts-healx commented 2 years ago

Describe the bug

StreamingBody.iter_lines seems to be extremely slow when dealing with a file with very long lines. We were using this in ECS to parse a ~500 MB JSON Lines file containing fairly large objects, and it was taking upwards of 10 minutes to run.

Expected Behavior

Performance should be comparable to read or iter_chunks, regardless of how long the lines are.

Current Behavior

It is way slower than reading the file in other ways.

Reproduction Steps

Minimal repro (requires moto) that reads a 10MB file with no line breaks:

from moto import mock_s3
import boto3
import time

BUCKET = "test_bucket"
KEY = "long_lines.txt"
KILOBYTES = 1024
MEGABYTES = 1024 * KILOBYTES

@mock_s3
def slow_iter_lines():
    s3 = boto3.resource("s3")
    s3.create_bucket(Bucket=BUCKET)
    obj = s3.Object(BUCKET, KEY)
    obj.put(Body=b"a" * (10 * MEGABYTES))  # 10 MB of data containing no line breaks

    start1 = time.perf_counter()
    obj.get()["Body"].read()
    end1 = time.perf_counter()
    print(f"Normal read took {end1-start1}s")

    start2 = time.perf_counter()
    list(obj.get()["Body"].iter_chunks())
    end2 = time.perf_counter()
    print(f"Chunk iterator took {end2-start2}s")

    start3 = time.perf_counter()
    list(obj.get()["Body"].iter_lines())
    end3 = time.perf_counter()
    print(f"Line iterator took {end3-start3}s")

slow_iter_lines()

Output on botocore==1.27.87:

Normal read took 0.003736798000318231s
Chunk iterator took 0.008655662000819575s
Line iterator took 26.232641663998947s

Possible Solution

The implementation of iter_lines looks to be quadratic in the length of the lines:

    def iter_lines(self, chunk_size=_DEFAULT_CHUNK_SIZE, keepends=False):
        pending = b''
        for chunk in self.iter_chunks(chunk_size):
            lines = (pending + chunk).splitlines(True)
            for line in lines[:-1]:
                yield line.splitlines(keepends)[0]
            pending = lines[-1]
        if pending:
            yield pending.splitlines(keepends)[0]

If there are no line breaks in any of the chunks, then every time it goes round this loop it is doing:

- pending + chunk, which copies every byte received so far into a new bytes object, and
- .splitlines(True) on that result, which scans every byte received so far again.

So pending keeps growing longer, and every iteration has to re-copy and re-scan it, so it gets slower quadratically until there is a line break.
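
The effect is easy to see with plain bytes, independent of boto3 and S3. A hypothetical micro-benchmark (concat_loop is not botocore code, it just mimics the pending handling above):

from time import perf_counter

def concat_loop(n_chunks, chunk=b"a" * 1024):
    # Mimics what iter_lines does with pending when no newline ever appears:
    # each iteration copies and re-scans everything received so far.
    pending = b""
    start = perf_counter()
    for _ in range(n_chunks):
        pending = (pending + chunk).splitlines(True)[-1]
    return perf_counter() - start

# Doubling the number of chunks roughly quadruples the runtime:
print(concat_loop(2_000))
print(concat_loop(4_000))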

A better implementation would probably be to maintain a list of pending chunks and concatenate them only when a line break is reached.
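
A rough sketch of that idea, using only the iter_chunks API shown above (iter_lines_buffered is a hypothetical helper, not botocore code):

def iter_lines_buffered(body, chunk_size=1024, keepends=False):
    # Buffer newline-free chunks in a list and join them only when a line
    # break finally arrives, so each byte is copied a bounded number of
    # times rather than once per subsequent chunk.
    pending = []
    for chunk in body.iter_chunks(chunk_size):
        if b"\n" not in chunk and b"\r" not in chunk:
            pending.append(chunk)  # O(1) bookkeeping, no copy of earlier data
            continue
        lines = (b"".join(pending) + chunk).splitlines(True)
        for line in lines[:-1]:
            yield line.splitlines(keepends)[0]
        pending = [lines[-1]]  # at most one chunk's worth of trailing data
    if pending:
        tail = b"".join(pending)
        if tail:
            yield tail.splitlines(keepends)[0]

With this structure the expensive join only happens on chunks that actually contain a line break, so the total work stays roughly linear in the size of the stream.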

Additional Information/Context

Increasing chunk_size to 1MB fixed the immediate problem we were having.
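
Concretely, the workaround is just passing a larger chunk_size to iter_lines; handle_record below is a placeholder for whatever processes each line:

body = obj.get()["Body"]
# Larger chunks mean far fewer trips around the quadratic pending loop.
for line in body.iter_lines(chunk_size=1024 * 1024):  # 1 MiB chunks instead of the small default
    handle_record(line)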

SDK version used

1.27.87

Environment details (OS name and version, etc.)

Ubuntu 22.04, Python 3.10.4

tim-finnigan commented 2 years ago

Hi @mike-roberts-healx, thanks for reaching out. 10 minutes for a 500 MB file does seem like a very long time. What is your network connection like? Is this something that is consistently reproducible? How fast was it with chunk_size set to 1 MB? Also, I'll share the documentation on this here for reference: https://botocore.amazonaws.com/v1/documentation/api/latest/reference/response.html.

mike-roberts-healx commented 2 years ago

Hi Tim, thanks for getting back to me. In order:

tim-finnigan commented 1 year ago

Thanks @mike-roberts-healx for following up. I wonder if running this inside an ECS task could be a factor. Did you happen to test this independently of ECS? Also, have you tried file sizes between 1 MB and 500 MB, and if so, was there much variability across that range, from under 10 seconds up to around 10 minutes?

I saw this related post on Stack Overflow: https://stackoverflow.com/questions/73352049/problem-with-streaming-large-files-with-boto3-to-s3-on-ec2. There are a few suggestions there that may help but please let me know if it overlaps with your use case.

mike-roberts-healx commented 1 year ago

The script I posted runs locally and demonstrates the issue, so I don't think it's anything to do with ECS. When downloading a file with very long lines, read and iter_chunks both run quickly but iter_lines runs very slowly.