delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0

Round-tripping a data frame with >1000 columns is unexpectedly slow or even fails #1582

Closed: aberres closed this issue 5 months ago

aberres commented 1 year ago

Environment

Delta-rs version: 0.10.1

Binding: Python

Environment:


Bug

What happened:

When evaluating delta-rs as a replacement for hand-managed Parquet files, I noticed that reading tables with a few thousand columns is unbearably slow. The number of rows does not matter; the issue can be reproduced with a single row.

The runtime grows much faster than linearly: doubling the column count from 500 to 1000 multiplies the read time by roughly seven (0.62s to 4.57s). See the listing below.

Reading the created Parquet file directly does not have any performance issues.
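
For reference, the underlying data file can be read directly with pyarrow, which stays fast at any column count (a minimal sketch; delta-rs generates the data file name, so it is located with a glob rather than hard-coded):

import glob

import pyarrow.parquet as pq

# The table directory contains an ordinary Parquet data file with a
# generated name; find it instead of hard-coding the UUID-based file name.
parquet_path = glob.glob("delta.table/*.parquet")[0]
df_direct = pq.read_table(parquet_path).to_pandas()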

What you expected to happen:

The reading performance should not massively degrade when the number of columns increases.

How to reproduce it:

A deliberately naive script that writes a frame with a single row and the requested number of columns, then reads it back.

import sys
import time

import numpy as np
import pandas as pd
from deltalake import DeltaTable
from deltalake.writer import write_deltalake

# The column count is the only command-line argument.
column_count = int(sys.argv[1])

# A single row of zeros; only the number of columns varies.
df = pd.DataFrame(np.zeros((1, column_count)), columns=[str(i) for i in range(column_count)])

write_deltalake("delta.table", df)

dt = DeltaTable("delta.table")

# Time only the read back into pandas.
start_time = time.perf_counter()

df_in = dt.to_pandas()

print(f"Needed {round(time.perf_counter() - start_time, 2)}s to read {column_count} columns.")

More details:

Needed 0.01s to read 100 columns.
Needed 0.06s to read 200 columns.
Needed 0.16s to read 300 columns.
Needed 0.34s to read 400 columns.
Needed 0.62s to read 500 columns.
Needed 1.04s to read 600 columns.
Needed 1.71s to read 700 columns.
Needed 2.41s to read 800 columns.
Needed 3.35s to read 900 columns.
Needed 4.57s to read 1000 columns.
Needed 8.74s to read 1250 columns.
Needed 15.05s to read 1500 columns.

~2000 columns did not finish within several minutes.~ It does not succeed at all; the process crashed with SIGBUS.

Snakeviz profile:

[Snakeviz profile screenshot]
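
As a possible mitigation on the reading side, DeltaTable.to_pandas accepts a columns argument, so a consumer that needs only a subset of the columns can avoid materializing the full schema (a sketch; whether column pruning actually sidesteps the per-column overhead seen here has not been verified in this thread):

# Continuing from the script above: read back only the first ten columns.
needed_columns = [str(i) for i in range(10)]
df_subset = dt.to_pandas(columns=needed_columns)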

ylashin commented 6 months ago

I tried the reproduction script and performance is OK:

Needed 0.01s to read 100 columns.
Needed 0.02s to read 200 columns.
Needed 0.02s to read 300 columns.
Needed 0.03s to read 400 columns.
Needed 0.03s to read 500 columns.
Needed 0.05s to read 600 columns.
Needed 0.05s to read 700 columns.
Needed 0.05s to read 800 columns.
Needed 0.06s to read 900 columns.
Needed 0.08s to read 1000 columns.
Needed 0.1s to read 1250 columns.
Needed 0.11s to read 1500 columns.
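
Since the original report was against 0.10.1, it is worth confirming which binding version produced each set of numbers before comparing them (a minimal check; it assumes the deltalake package exposes __version__, as recent releases do):

import deltalake

# Print the installed binding version to compare against the 0.10.1 report.
print(deltalake.__version__)
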
ion-elgreco commented 5 months ago

@ylashin thanks! Closing this as resolved, then.