aws / aws-sdk-pandas

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretsManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
https://aws-sdk-pandas.readthedocs.io
Apache License 2.0

wr.s3.download fits the whole file into memory, with 2x memory allocation #2831


roykoand commented 4 months ago

Describe the bug

I was using wr.s3.download on a 2 GiB memory VM and noticed that downloading a 1006 MiB GZIP file from S3 allocates ~2295 MiB, both with and without the use_threads parameter. This was measured using this memory profiler.

Unsurprisingly, my script fails with an OOM error on the 2 GiB memory machine with 2 CPUs. dmesg gives a slightly different memory estimate:

$ dmesg  | tail -1
Out of memory: Killed process 10020 (python3) total-vm:2573584kB, anon-rss:1644684kB, file-rss:4kB, shmem-rss:0kB, UID:1000 pgtables:3844kB oom_score_adj:0

It turns out that wr.s3.download by default uses botocore's s3.get_object and reads the whole response into memory:

https://github.com/aws/aws-sdk-pandas/blob/7e83b89e96af33ff6eb91f6801d8b66dcd98d4f2/awswrangler/s3/_fs.py#L65-L75
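For illustration, the pattern in question boils down to a single unbounded read of the response body (a simplified sketch, not the exact library code; bucket and key are placeholders):

import boto3

s3 = boto3.client("s3")
# .read() with no size argument buffers the entire object in memory at once
body = s3.get_object(Bucket="my-bucket", Key="big-file.gz")["Body"].read()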

Would it be possible for awswrangler to read the botocore response in chunks, to be more memory efficient?

For instance, using the following snippet I downloaded the file without any issues on the same machine:

import boto3

s3 = boto3.client("s3")
# kwargs holds the usual get_object arguments, e.g. {"Bucket": ..., "Key": ...}
raw_stream = s3.get_object(**kwargs)["Body"]

# Read the body in fixed 64 KiB chunks until EOF (read() returns b"")
with open("test_botocore_iter_chunks.gz", "wb") as f:
    for chunk in iter(lambda: raw_stream.read(64 * 1024), b""):
        f.write(chunk)
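With a fixed 64 KiB buffer, peak memory stays roughly constant regardless of object size, which is what I would expect from a streaming download.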

I tried the wr.config.s3_block_size parameter, expecting it to chunk the response, but it does not help. After setting s3_block_size to a value smaller than the file size, you fall into this if condition:

https://github.com/aws/aws-sdk-pandas/blob/7e83b89e96af33ff6eb91f6801d8b66dcd98d4f2/awswrangler/s3/_fs.py#L326

which still reads the whole response into memory.
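In the meantime, one workaround (my own suggestion, not awswrangler's API) is boto3's managed transfer, which streams the object to disk in bounded parts; bucket and key below are placeholders:

import boto3

s3 = boto3.client("s3")
# download_fileobj uses the S3 Transfer Manager, which downloads the object
# in bounded parts instead of buffering the whole body in memory
with open("test_download_fileobj.gz", "wb") as f:
    s3.download_fileobj("my-bucket", "path/to/key.gz", f)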

How to Reproduce

Run a memory profiler on:

wr.s3.download(path, local_file)
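For example, a minimal sketch assuming the memory_profiler package (the S3 path and local file name are placeholders):

import awswrangler as wr
from memory_profiler import memory_usage

def download():
    wr.s3.download(path="s3://my-bucket/big-file.gz", local_file="big-file.gz")

# Sample the process memory every 0.1 s while the download runs
peak = max(memory_usage(download, interval=0.1))
print(f"peak memory: {peak:.0f} MiB")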

Expected behavior

Please let me know if it's already possible to read the response in chunks.

Your project

No response

Screenshots

No response

OS

Linux

Python version

3.6.9 (this is old, but I can double-check on newer versions)

AWS SDK for pandas version

2.14.0

Additional context

No response