Eventual-Inc / Daft

Distributed DataFrame for Python designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0

[PERF]: local json reader #2264

Closed: universalmind303 closed this 4 weeks ago

universalmind303 commented 1 month ago

closes https://github.com/Eventual-Inc/Daft/issues/2196

universalmind303 commented 1 month ago

Some benchmarks using the "customer" table from TPC-H at scale factor 5.

Included polars to give a point of reference.

import daft
import polars as pl

# polars with projection
pl.scan_ndjson('./customer.json').select("c_mktsegment").collect()
# daft with projection
daft.read_json('./customer.json').select("c_mktsegment").collect()

# polars without projection
pl.scan_ndjson('./customer.json').collect()
# daft without projection
daft.read_json('./customer.json').collect()
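
The timings below read like IPython `%timeit` output. For anyone wanting to repeat the measurement outside IPython, here is a minimal sketch using only the standard library. The helper name `bench` and its defaults (7 runs of 10 loops) are assumptions chosen to mirror the reported format, not part of this PR, and the numbers it produces will not match `%timeit` exactly.

```python
import statistics
import timeit


def bench(fn, runs=7, loops=10):
    """Time fn() and return (mean_ms, std_ms) per loop.

    Hypothetical helper: `runs` and `loops` mirror the
    "7 runs, 10 loops each" wording of the reported results.
    """
    # timeit.repeat returns one total (in seconds) per run,
    # each total covering `loops` calls of fn.
    totals = timeit.repeat(fn, repeat=runs, number=loops)
    per_loop_ms = [total / loops * 1000 for total in totals]
    return statistics.mean(per_loop_ms), statistics.pstdev(per_loop_ms)


# Stand-in workload so the sketch runs without daft, polars, or customer.json.
# To time a real query, pass e.g.
#   lambda: daft.read_json('./customer.json').collect()
mean_ms, std_ms = bench(lambda: sum(range(1000)))
print(f"{mean_ms:.3f} ms ± {std_ms:.3f} ms per loop")
```

Passing each query as a zero-argument callable keeps the timing loop identical across libraries, so only the reader under test changes between measurements.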

# polars (with projection)
# 76.8 ms ± 1.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# daft (with projection)
# 116 ms ± 1.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# daft main (with projection)
# 181 ms ± 5.84 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# polars (without projection)
# 89 ms ± 2.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# daft (without projection)
# 169 ms ± 2.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# daft main (without projection)
# 247 ms ± 6.92 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
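
To make the comparison easier to read, the ratios can be worked out from the mean timings above. This is a small sketch that only does arithmetic on the reported numbers. It assumes the rows labelled "daft" refer to this PR's branch and "daft main" to the main branch; the function name `summarize` is hypothetical.

```python
# Mean timings in milliseconds, copied from the results above.
timings_ms = {
    "with projection": {"polars": 76.8, "daft": 116.0, "daft main": 181.0},
    "without projection": {"polars": 89.0, "daft": 169.0, "daft main": 247.0},
}


def summarize(timings):
    """Return per-case ratios derived from mean timings."""
    rows = {}
    for case, t in timings.items():
        rows[case] = {
            # How many times faster "daft" is than "daft main".
            "speedup_vs_main": t["daft main"] / t["daft"],
            # How many times longer "daft" takes than polars.
            "time_vs_polars": t["daft"] / t["polars"],
        }
    return rows


summary = summarize(timings_ms)
for case, row in summary.items():
    print(
        f"{case}: {row['speedup_vs_main']:.2f}x faster than main, "
        f"{row['time_vs_polars']:.2f}x the polars time"
    )
```

These ratios ignore the reported standard deviations, so they are a rough guide rather than a statistical comparison.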

samster25 commented 1 month ago

assigning @clarkzinzow to take a look!