aws / amazon-redshift-python-driver

Redshift Python Connector. It supports the Python Database API Specification v2.0.
Apache License 2.0

Does `cursor.fetchmany()` load entire result set into memory? #213

Closed · yauhen-sobaleu closed 4 months ago

yauhen-sobaleu commented 4 months ago

Question

Does cursor.fetchmany() load entire result set into memory?

I have a table of ~20 GB that I need to process in my Python application.

# get cursor
cursor = self.conn.cursor()

# execute query
cursor.execute(self.query)

# column names to use in pandas dataframe
columns = self.columns_to_select
# downcasted data types for each column to be as memory-efficient as possible
types = self.column_data_types

df = pd.DataFrame(columns=columns).astype(types)

i = 0  # chunk counter used in the log message below
while True:

    # get chunk of data
    fetched_data: tuple = cursor.fetchmany(num=batch_size)

    if not fetched_data:
        break

    # each fetched record is already a row; normalize to tuples
    rows: list[tuple] = [tuple(row) for row in fetched_data]

    # create pandas dataframe from the rows
    chunk_df = pd.DataFrame(rows, columns=columns).astype(types)

    # append new batch
    df = pd.concat([df, chunk_df], ignore_index=True)

    i += 1
    logger.debug(
        f"Fetched chunk {i} of data from the database",
        df_size_in_mb=df.memory_usage(deep=True).sum() / (1024**2),
    )

logger.debug(f"Result dataframe dtypes:\n{df.dtypes}")

Before completing a single iteration, my application consumes up to 10 GB of RAM and is then OOM-killed (exit code 137):

root@ml-item-to-user-recsys-job-cron-wf-c5jgk:/app# cat /sys/fs/cgroup/memory/memory.usage_in_bytes
1347395584
root@ml-item-to-user-recsys-job-cron-wf-c5jgk:/app# cat /sys/fs/cgroup/memory/memory.usage_in_bytes
4818194432
root@ml-item-to-user-recsys-job-cron-wf-c5jgk:/app# cat /sys/fs/cgroup/memory/memory.usage_in_bytes
4851089408
root@ml-item-to-user-recsys-job-cron-wf-c5jgk:/app# cat /sys/fs/cgroup/memory/memory.usage_in_bytes
9816948736
root@ml-item-to-user-recsys-job-cron-wf-c5jgk:/app# cat /sys/fs/cgroup/memory/memory.usage_in_bytes
9838665728
root@ml-item-to-user-recsys-job-cron-wf-c5jgk:/app# command terminated with exit code 137

How is this possible? As far as I understand, fetchmany() should pull only batch_size rows at a time and keep the rest of the result set server-side.

Is that not the case for Amazon Redshift?
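If the driver really does buffer the entire result set on execute() (which is what the memory profile suggests), is a server-side cursor via DECLARE / FETCH the recommended workaround, so that only one batch crosses the wire per round trip? A rough sketch of what I mean (the cursor name big_result is a placeholder; Redshift requires cursors to run inside a transaction, which should hold here since autocommit is off by default):

cursor = self.conn.cursor()

# declare a server-side cursor; the result set stays on the cluster
cursor.execute(f"DECLARE big_result CURSOR FOR {self.query}")

while True:
    # transfer only one batch per round trip
    cursor.execute(f"FETCH FORWARD {batch_size} FROM big_result")
    batch = cursor.fetchall()
    if not batch:
        break
    # ... process batch ...

cursor.execute("CLOSE big_result")
self.conn.commit()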

Driver version

2.1.0

Redshift version

PostgreSQL 8.0.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.4.2 20041017 (Red Hat 3.4.2-6.fc3), Redshift 1.0.63205

Client Operating System

Linux

Python version

3.11