duckdb / duckdb

DuckDB is an analytical in-process SQL database management system
http://www.duckdb.org
MIT License

Unexpectedly high memory use from seemingly simple query. #9979

Open danking opened 9 months ago

danking commented 9 months ago

What happens?

Hey there! Thanks for building duckdb, it's a really cool piece of software and I'm excited to experiment with it.

I'll start with the problem but follow with some motivation/background. I have a parquet file foo2.parquet. It is 154 MiB. I run this query (nb: there is a literal tab character in the string_split):

import duckdb
duckdb.sql('''
copy (
  select
    "#CHROM",
    "POS",
    "ID",
    "REF",
    "ALT",
    "QUAL",
    "FILTER",
    "INFO",
    "FORMAT",
    unnest(range(len(string_split(entries, '    ')))) as col_idx,
    string_split(entries, '    ')[col_idx + 1] as entry
  from 'foo2.parquet'
) to 'foo3.parquet'
''')

I get this error:

(base) dking@wm28c-761 blah % python3 query.py                 

Traceback (most recent call last):
  File "/Users/dking/projects/blah/query.py", line 2, in <module>
    duckdb.sql('''
duckdb.duckdb.OutOfMemoryException: Out of Memory Error: failed to allocate data of size 17.1GB (17.6GB/27.4GB used)

I interpret this error as meaning that duckdb needed another 17.1 GB (on top of the 17.6 GB already in use?) to process this query.

I know that this file, stored as a CSV, occupies about 500 MiB. I expected duckdb to use, in the worst case, a small integer multiple of the memory necessary to densely represent the entire dataset in ASCII (ergo, 5 GiB would be large but reasonable). 17 GiB, on the other hand, seems much too large.

I suspect I'm doing something silly in my SQL that's causing extreme unnecessary memory allocation; if so, I'm sorry and also thank you for pointing that out!


Why am I doing this? This is a ~4000 sample public genome sequencing dataset. It was originally stored in a format called VCF. I massaged that a bit with pandas until I could get a parquet file with a few columns representing metadata and one column representing the rows of the sequencing matrix where each entry in that row was separated by a literal tab character.

What I'm trying to do with unnest is called "melt" in the dataframe world. I want to go from a "wide" format (where I have all the entries of the matrix for a given row in one parquet row) to a "long" format (where I have one parquet row per entry of the matrix).
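
For a toy illustration of that reshape, using the same unnest pattern as the query above (the data here is made up, and a comma stands in for the tab separator):

import duckdb

# one "wide" row: a single string holding all matrix entries for that row
duckdb.sql('''
  with wide as (
    select 'chr1' as chrom, '0/0,0/1,1/1' as entries
  )
  select
    chrom,
    unnest(range(len(string_split(entries, ',')))) as col_idx,
    string_split(entries, ',')[col_idx + 1] as entry
  from wide
''').show()
# produces three "long" rows: (chr1, 0, 0/0), (chr1, 1, 0/1), (chr1, 2, 1/1)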

I'm researching duckdb as an alternative foundation for an open-source dataframe library I maintain: https://GitHub.com/hail-is/hail.


Also, FWIW, there do not appear to be any pre-release builds available. I'm happy to build and try from the tip commit if that's helpful.

(base) dking@wm28c-761 blah % pip install duckdb --upgrade --pre
Requirement already satisfied: duckdb in /Users/dking/miniconda3/lib/python3.10/site-packages (0.9.2)

Thank you kindly for your time!

To Reproduce

I'm sorry the example is so large! I can't upload it here as a result. I've created a GitHub repo with the query and the parquet file split into three chunks. cat them together and run the query (which is included in a python file in the repo).

https://github.com/danking/duckdb-experiments

OS:

Darwin wm28c-761 22.6.0 Darwin Kernel Version 22.6.0: Wed Jul 5 22:22:05 PDT 2023; root:xnu-8796.141.3~6/RELEASE_ARM64_T6000 arm64

DuckDB Version:

0.9.2

DuckDB Client:

Python

Full Name:

Daniel King

Affiliation:

Broad Institute (non-profit biomedical research institution https://broadinstitute.org)

Have you tried this on the latest main branch?

I have tested with a main build


szarnyasg commented 9 months ago

Hi @danking, thanks for the kind words! Two tips:

  1. First create a table from your inner query (CREATE TABLE t AS select "#CHROM", ...;), then copy that table to Parquet.
  2. Use the SET preserve_insertion_order = false; option to save memory: https://duckdb.org/docs/guides/performance/how-to-tune-workloads#the-preserve_insertion_order-option

danking commented 9 months ago

Hey @szarnyasg !

Thanks for the tips! These indeed seem to avoid exceeding the RAM available on my machine. Resident set size appears to peak around 15GiB (I have 32 GiB of RAM) and then settle down around 5GiB.

However, with those changes, the disk space on my machine mysteriously goes from ~32GiB free to ~1GiB free over the course of executing that code. When I kill duckdb the free disk space jumps back up to ~32GiB. I'm monitoring the disk space with df -h.

I tried explicitly setting the temporary directory and monitoring that path. Indeed, /tmp/duckdb/duckdb_temp_storage-0.tmp grew to at least 22GiB before I killed the process.
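
For reference, a rough way to watch that temp directory from a second terminal while the query runs (just a monitoring sketch; the path matches the temp_directory setting below):

import os
import time

def dir_size_gib(path):
    # total size of all files currently under the duckdb temp directory
    total = 0
    for root, _, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except FileNotFoundError:
                pass  # duckdb may delete temp files while we walk
    return total / 2**30

while True:
    print(f"temp dir: {dir_size_gib('/tmp/duckdb'):.1f} GiB")
    time.sleep(5)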

I appear to have traded RAM use for disk use, but something about this query still seems to have a "memory fork bomb" kind of effect. Somehow this tiny parquet file is causing duckdb to use tons of memory whether durable or volatile.


For reference, the updated query is:

import duckdb
duckdb.sql('''
SET preserve_insertion_order = false;
SET temp_directory = '/tmp/duckdb/';

CREATE TABLE t AS
  select
    "#CHROM",
    "POS",
    "ID",
    "REF",
    "ALT",
    "QUAL",
    "FILTER",
    "INFO",
    "FORMAT",
    unnest(range(len(string_split(entries, '    ')))) as col_idx,
    string_split(entries, '    ')[col_idx + 1] as entry
  from 'foo2.parquet';

copy t to 'foo3.parquet';
'''
)
danking commented 8 months ago

I totally understand if this is a WONTFIX / insufficient bandwidth kind of issue. Is there any chance this gets triaged in the next two months? Again, no worries either way, I just want to adjust my expectations appropriately.

taniabogatsch commented 7 months ago

Hi @danking, have a look at my comment here.

dkioroglou commented 4 months ago

I would like to add my observations on this matter.

OS NAME="Rocky Linux"
VERSION="8.9 (Green Obsidian)"
Machine specs Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Thread(s) per core: 2
Core(s) per socket: 4
Model name: Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz
RAM: 32GB

Duckdb version

0.10.2

Data

I have the following CSV file:

Aim

Load the CSV into pandas and then import the pandas DataFrame into DuckDB using the Python API. The table should have 60001 columns, that is, 1 "id" column for the observations and 60000 columns named after the features.
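
Since the CSV itself isn't attached here, a synthetic stand-in with the same shape can be used to follow along (column names, row count, and values below are made up):

import numpy as np
import pandas as pd

# ~30 observation rows x 60000 feature columns, plus one "id" column -> 60001 columns total
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((30, 60000)),
                  columns=[f"feature_{i}" for i in range(60000)])
df.insert(0, "id", [f"obs_{i}" for i in range(30)])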

1st implementation

I did what the documentation suggests (link):

with duckdb.connect(db) as con:
    con.sql("CREATE TABLE test AS SELECT * FROM df")

time: 425 seconds; duckdb size after import: 189 MB !!!

2nd implementation

Convert pandas df to duckdb data types.

def convert_to_duckdb_dtypes(df):
    duckdbDtypes = []
    for col in df.columns:
        if df[col].dtype == "int":
            duckdbDtypes.append("INTEGER")
        elif df[col].dtype == "float":
            duckdbDtypes.append("DOUBLE")
        else:
            duckdbDtypes.append("VARCHAR")
    return duckdbDtypes

coltypes = convert_to_duckdb_dtypes(df)
cols_n_types = ", ".join([f"{x} {y}" for x,y in zip(df.columns, coltypes)])

Create table and insert data from df

with duckdb.connect(db) as con:
    query = "CREATE TABLE test ({cols_n_types})"
    con.execute(query)

    placeholders = str(tuple(list('?')*df.shape[1])).replace("'", "")
    query = "INSERT INTO test VALUES " + placeholders
    con.executemany(query, df.values.tolist())

error: TransactionContext Error: Failed to commit: could not allocate block of size 256.0 KiB

I followed the suggestions in this and other posts (e.g. SET threads TO 3, SET memory_limit = '16GB', SET preserve_insertion_order = false), but the error persisted.

Note: the above implementation worked when I transposed the dataframe (i.e. 31 columns and 60000 rows). The transaction finished in 120 seconds and the DuckDB file occupied 8.8 MB after the import.
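
A minimal sketch of that transposed variant, assuming the synthetic df from the snippet above:

import duckdb

# transpose: features become rows, observations become columns (~60000 rows x 31 columns)
df_t = df.set_index("id").T.rename_axis("feature").reset_index()

with duckdb.connect(db) as con:  # db as in the snippets above
    con.sql("CREATE TABLE test_t AS SELECT * FROM df_t")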

3rd implementation

Export pandas to parquet file with pyarrow

import pyarrow as pa
import pyarrow.parquet as pq
table = pa.Table.from_pandas(df, nthreads=1)
pq.write_table(table, "testing.parquet", compression='GZIP')

time: a couple of seconds (didn't bother to measure); disk space occupied: 41 MB

Import parquet file to duckdb

with duckdb.connect(db) as con:
    query = "CREATE TABLE test AS SELECT * FROM 'testing.parquet';"
    con.execute(query)

time: 220 seconds; duckdb size after import: 186 MB !!!

Questions

  1. Why did the 2nd implementation fail with 60001 columns?
  2. Why does DuckDB occupy much more disk space after importing from Parquet?
  3. Why is there such a big difference in disk space (8.8 MB vs 189 MB) when the same data are structured differently?

xuke-hat commented 3 months ago

@danking Hello, I've investigated your SQL and found that the problem here is that the split returns too many rows. I tried this SQL:

select sum(len(string_split_regex(entries, '\t'))) from 'foo2.parquet';

This returns 36458233, which means unnest(range(len(string_split(entries, '    ')))) will generate 36458233 rows. The INFO column is about 18k characters long on average, so the result will be more than 600 GB! So it seems this is not a bug in DuckDB.
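
A back-of-the-envelope version of that estimate, measured directly on the same foo2.parquet:

import duckdb

# rows after unnest = total number of tab-separated entries across all rows;
# the INFO value would then be repeated once per unnested row
duckdb.sql(r'''
  select
    sum(len(string_split_regex(entries, '\t'))) as unnested_rows,
    avg(len("INFO")) as avg_info_len,
    round(sum(len(string_split_regex(entries, '\t'))) * avg(len("INFO")) / 1e9, 1) as approx_info_gb
  from 'foo2.parquet'
''').show()
# roughly 36,458,233 rows x ~18,000 bytes of INFO per row, i.e. on the order of 650 GB uncompressed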

danking commented 3 months ago

Hey @xuke-hat ! Thanks for investigating!

Indeed, this is a ~4,000 by ~8,000 matrix of human genome sequencing data:

In [4]: import duckdb
   ...: duckdb.sql('''
   ...:   select len(string_split(entries, '^I')) as n_columns, count(*)
   ...:   from 'foo2.parquet'
   ...:   group by n_columns
   ...: '''
   ...: )
Out[4]: 
┌───────────┬──────────────┐
│ n_columns │ count_star() │
│   int64   │    int64     │
├───────────┼──────────────┤
│      4151 │         8783 │
└───────────┴──────────────┘

Moreover, you're correct that the INFO column contains a relatively large amount of uncompressed data.

However, the appeal, to me at least, of a columnar system like duckdb is the ability to both store and operate on denormalized, highly repetitive data in a compressed format. To wit, one system currently widely used in human genome sequencing, Hail, stores the "INFO" column once for each row of this matrix.

I hope that duckdb can recognize that the INFO column is constant for each element of the unnest. Consequentially, I hope that duckdb can use a run-length encoding (or other sparse representation) for the INFO column both in-memory and on-disk.

In this case, I've asked duckdb to write a Parquet file. Ideally, duckdb would use Parquet's RLE_DICTIONARY when writing these values to the Parquet file.
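
One way to check what the writer actually chose is to inspect the output file's metadata afterwards, e.g. with pyarrow (a diagnostic sketch; foo3.parquet is the output from the query above):

import pyarrow.parquet as pq

# print the encodings used for the INFO column in each row group of the written file
md = pq.ParquetFile("foo3.parquet").metadata
for rg in range(md.num_row_groups):
    for ci in range(md.num_columns):
        col = md.row_group(rg).column(ci)
        if col.path_in_schema == "INFO":
            print(rg, col.encodings, col.total_compressed_size)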

Moreover, even if duckdb's in-memory representation duplicated the INFO column for each unnest, I thought duckdb would stream across the values, and therefore only need some small multiple of ~200 MiB (= size of the constant columns times 4151) of working memory. I understand that choosing a good on-disk encoding may require seeing many values, but I hoped that, when under memory pressure, duckdb would only inspect as many values as it had available memory.

github-actions[bot] commented 2 weeks ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.