Eventual-Inc / Daft

Distributed DataFrame for Python designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0

[PERF]: swap out json_deserializer for simd_json #2228

Closed universalmind303 closed 1 week ago

universalmind303 commented 1 week ago

This swaps out json_deserializer for simd_json. It shows noticeable performance gains across the board (~10-20%). Both the local reader and the object store readers should benefit, since they share the same deserialization path.

Some perf tests I ran locally:

%%timeit
daft.read_json('./stackexchange_sample.jsonl').collect()

# branch: main
# 64.4 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# branch: simdjson
# 52.5 ms ± 763 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
# tpch sf 5 'customer' table
daft.read_json('./customer.json').collect()

# branch: main
# 289 ms ± 2.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# branch: simdjson 
# 244 ms ± 5.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
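
To reproduce these measurements outside a notebook, something like the following could work. It is a minimal sketch built on the standard-library `timeit` module; the `bench` helper and its parameters are hypothetical names for illustration, not part of Daft, and its output only roughly mirrors what `%%timeit` reports.

```python
import statistics
import timeit


def bench(fn, repeat=7, number=10):
    """Time `fn` roughly the way %%timeit summarizes it.

    Runs `fn` `number` times per run, for `repeat` runs, and returns
    (mean, std) of the per-call time in seconds across the runs.
    """
    totals = timeit.repeat(fn, repeat=repeat, number=number)
    per_call = [total / number for total in totals]
    return statistics.mean(per_call), statistics.pstdev(per_call)


# Example usage (assumes daft is installed and the sample file exists locally):
#
#   import daft
#   mean, std = bench(
#       lambda: daft.read_json("./stackexchange_sample.jsonl").collect()
#   )
#   print(f"{mean * 1e3:.1f} ms ± {std * 1e3:.2f} ms per loop")
```

Run it once on each branch with the same `repeat` and `number` values so the results stay comparable.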

Edit:

Added support for preserving order. It isn't quite as performant as the unordered version, but it is still noticeably faster than using json_deserializer.

# stackexchange_sample.jsonl
# 56.9 ms ± 833 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# tpch sf 5 'customer' table
# 256 ms ± 2.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
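
For reference, the relative improvements implied by the mean timings above can be worked out as follows. This is just arithmetic on the reported numbers; the dictionary keys and the `reduction_pct` helper are names made up for this sketch.

```python
# Mean timings in milliseconds, copied from the results reported above.
timings_ms = {
    "stackexchange_sample.jsonl": {"main": 64.4, "simdjson": 52.5, "simdjson_ordered": 56.9},
    "tpch sf 5 customer": {"main": 289.0, "simdjson": 244.0, "simdjson_ordered": 256.0},
}


def reduction_pct(baseline, candidate):
    """Percent reduction in runtime relative to the baseline."""
    return (baseline - candidate) / baseline * 100.0


for name, t in timings_ms.items():
    unordered = reduction_pct(t["main"], t["simdjson"])
    ordered = reduction_pct(t["main"], t["simdjson_ordered"])
    print(f"{name}: unordered {unordered:.1f}% less time, ordered {ordered:.1f}% less time")

# stackexchange_sample.jsonl: unordered 18.5% less time, ordered 11.6% less time
# tpch sf 5 customer: unordered 15.6% less time, ordered 11.4% less time
```

Both variants land inside the ~10-20% range quoted in the description, with order preservation costing a few percentage points of the gain.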