delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0

Slow add_actions.to_pydict for tables with large number of columns, impacting read performance #2733

xbrianh closed 3 months ago

xbrianh commented 3 months ago

Environment

Delta-rs version:

pip show deltalake
Name: deltalake
Version: 0.18.1
Summary: Native Delta Lake Python binding based on delta-rs with Pandas integration

Binding: Python


Bug

What happened: Slow add_actions.to_pydict() for large numbers of columns.

What you expected to happen: The same information, returned faster.

How to reproduce it:

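import time

import numpy as np
import pandas as pd
import deltalake

df = pd.DataFrame(np.random.random(size=(4000, 40000)))
deltalake.write_deltalake("table", df)
add_actions = deltalake.DeltaTable("table").get_add_actions()

# Time only the conversion of the add actions to a Python dict.
start = time.time()
add_actions.to_pydict()
print("duration", time.time() - start)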

On some Azure instances I see ~27 seconds. On my M2 Mac it is faster at ~9 seconds, but that still seems slow.

More details: This seems unusually slow, and also impacts deltalake read operations here.

ion-elgreco commented 3 months ago

@xbrianh it's mainly slow because .to_pydict copies every value of the RecordBatch into a Python dict. Because the table has so many columns, you also get a lot of unnecessary entries for empty stats.
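To illustrate the cost described here, a minimal sketch using pyarrow alone. The batch is a hypothetical stand-in for the add actions, not the real delta-rs schema: the column names ("path", "null_count.N") and the sizes are assumptions made for illustration.

import pyarrow as pa

# Hypothetical stand-in for the add-actions batch: one row per data file,
# a "path" column plus many per-column stats fields that are all null.
num_files = 3
num_stat_columns = 1_000  # the table in this issue had 40,000 columns

columns = {"path": pa.array([f"part-{i}.parquet" for i in range(num_files)])}
for i in range(num_stat_columns):
    columns[f"null_count.{i}"] = pa.array([None] * num_files, type=pa.int64())

batch = pa.RecordBatch.from_pydict(columns)

# to_pydict() turns every value of every column into a Python object,
# so the work grows with rows * columns, empty stats included.
as_dict = batch.to_pydict()

# Converting only the column that is needed touches far fewer values.
path_index = batch.schema.get_field_index("path")
paths = batch.column(path_index).to_pylist()

Here as_dict holds 1,001 Python lists while paths holds just the file names, which is why leaving the data in Arrow and converting only what is needed scales better for wide tables.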

I've pushed a PR to fix this.