Eventual-Inc / Daft

Distributed data engine for Python/SQL designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0
2.34k stars 164 forks source link

[BUG] Implement deserialize for Python objects serialized as sequences #3339

Closed kevinzwang closed 1 day ago

kevinzwang commented 2 days ago

visit_seq is used when using serde_json to serialize/deserialize Rust objects, since the byte buffer is just stored as a list of numbers in JSON.

codspeed-hq[bot] commented 2 days ago

CodSpeed Performance Report

Merging #3339 will degrade performances by 52.79%

Comparing kevin/pyobject-deser-seq (31ed37c) with main (7922d2d)

Summary

❌ 2 regressions ✅ 15 untouched benchmarks

:warning: Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark main kevin/pyobject-deser-seq Change
test_iter_rows_first_row[100 Small Files] 188.3 ms 229.2 ms -17.87%
test_show[100 Small Files] 14.8 ms 31.4 ms -52.79%
codecov[bot] commented 2 days ago

Codecov Report

Attention: Patch coverage is 0% with 11 lines in your changes missing coverage. Please review.

Project coverage is 77.17%. Comparing base (a9bf7c0) to head (31ed37c). Report is 5 commits behind head on main.

Files with missing lines Patch % Lines
src/common/py-serde/src/python.rs 0.00% 11 Missing :warning:
Additional details and impacted files [![Impacted file tree graph](https://app.codecov.io/gh/Eventual-Inc/Daft/pull/3339/graphs/tree.svg?width=650&height=150&src=pr&token=J430QVFE89&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Eventual-Inc)](https://app.codecov.io/gh/Eventual-Inc/Daft/pull/3339?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Eventual-Inc) ```diff @@ Coverage Diff @@ ## main #3339 +/- ## ========================================== - Coverage 77.37% 77.17% -0.20% ========================================== Files 677 678 +1 Lines 82864 83228 +364 ========================================== + Hits 64113 64235 +122 - Misses 18751 18993 +242 ``` | [Files with missing lines](https://app.codecov.io/gh/Eventual-Inc/Daft/pull/3339?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Eventual-Inc) | Coverage Δ | | |---|---|---| | [src/common/py-serde/src/python.rs](https://app.codecov.io/gh/Eventual-Inc/Daft/pull/3339?src=pr&el=tree&filepath=src%2Fcommon%2Fpy-serde%2Fsrc%2Fpython.rs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Eventual-Inc#diff-c3JjL2NvbW1vbi9weS1zZXJkZS9zcmMvcHl0aG9uLnJz) | `72.05% <0.00%> (-5.72%)` | :arrow_down: | ... and [41 files with indirect coverage changes](https://app.codecov.io/gh/Eventual-Inc/Daft/pull/3339/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Eventual-Inc)

🚨 Try these New Features:

kevinzwang commented 1 day ago

Seems fine, but should we really be using JSON as the serde mechanism instead of bytes?

I had this question too, but we can address that later. @samster25 wrote the IO Config pickling so maybe he should know the rationale.

samster25 commented 1 day ago

I had this question too, but we can address that later. @samster25 wrote the IO Config pickling so maybe he should know the rationale.

I did serde_json at the start since we all had were ints and strings and it was easy to read as a human and debug. However given that we now have bytes, pickles and whatnot inside of the payload, I think it makes sense to just encode it as bytes instead.

jaychia commented 2 hours ago

@kevinzwang maybe good as a quick follow-up to switch us to using bytes

kevinzwang commented 2 hours ago

@kevinzwang maybe good as a quick follow-up to switch us to using bytes

https://github.com/Eventual-Inc/Daft/pull/3400 PTAL!