delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python

https://delta-io.github.io/delta-rs/

Apache License 2.0


perf: batch json decode checkpoint actions when writing to parquet #2983

Closed alexwilcoxson-rel closed 2 weeks ago

alexwilcoxson-rel commented 2 weeks ago

Description

This change pushes more serialized JSON actions into the decoder before each flush. For a log with tens of thousands of actions, the current implementation took ~18 seconds; this change dropped it to 3 seconds.
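To illustrate the pattern, here is a minimal stdlib-only sketch of buffering a whole chunk of actions before flushing once. `ToyDecoder` and `write_batched` are hypothetical stand-ins for illustration; they are not the `arrow_json` `Decoder` API and not the code in this PR. Each flushed batch would correspond to one record batch handed to the parquet writer.

```rust
/// Toy stand-in for a buffering row decoder (NOT the arrow_json API).
struct ToyDecoder {
    buffered: Vec<String>,
}

impl ToyDecoder {
    fn new() -> Self {
        ToyDecoder { buffered: Vec::new() }
    }

    /// Buffer one serialized JSON action without emitting anything.
    fn decode(&mut self, row: &str) {
        self.buffered.push(row.to_string());
    }

    /// Emit everything buffered so far as one batch, or None if empty.
    fn flush(&mut self) -> Option<Vec<String>> {
        if self.buffered.is_empty() {
            return None;
        }
        Some(std::mem::take(&mut self.buffered))
    }
}

/// Feed `actions` to the decoder in chunks of `batch_size`, flushing once
/// per chunk. Returns the row count of each emitted batch.
fn write_batched(actions: &[String], batch_size: usize) -> Vec<usize> {
    assert!(batch_size > 0, "batch_size must be non-zero");
    let mut decoder = ToyDecoder::new();
    let mut batch_lens = Vec::new();
    for chunk in actions.chunks(batch_size) {
        // Push the whole chunk before flushing, instead of one row per flush.
        for action in chunk {
            decoder.decode(action);
        }
        if let Some(batch) = decoder.flush() {
            // A real implementation would write the batch to parquet here.
            batch_lens.push(batch.len());
        }
    }
    batch_lens
}
```

With `batch_size = 1` this degenerates to one flush per action, which is the shape of the slow path described above; a larger batch size amortizes the per-flush cost across many rows.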

Related Issue(s)

n/a

Documentation

https://docs.rs/arrow-json/53.2.0/arrow_json/reader/struct.Decoder.html#method.decode

codecov[bot] commented 2 weeks ago

| Files with missing lines | Patch % | Lines |
|---|---|---|
| crates/core/src/protocol/checkpoints.rs | 66.66% | 0 Missing and 3 partials :warning: |

```diff
@@           Coverage Diff           @@
##             main    #2983   +/-   ##
=======================================
  Coverage   72.26%   72.27%
=======================================
  Files         128      128
  Lines       40329    40334    +5
  Branches    40329    40334    +5
=======================================
+ Hits        29143    29150    +7
+ Misses       9334     9331    -3
- Partials     1852     1853    +1
```


hntd187 commented 2 weeks ago

Does it make sense to tie it to the record batch size? Could we instead pull that into its own configuration, or at least a constant?