Eventual-Inc / Daft

Distributed data engine for Python/SQL designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0

[FEAT] Adds a `read_generator` method that reads tables from a generator #3258

Closed by colin-ho 1 week ago

colin-ho commented 1 week ago

`read_generator` takes a generator function that yields Tables, plus an optional `num_partitions` parameter that sets the number of scan tasks that call the function.

Each call receives the partition number as its first argument, followed by any user-supplied arguments.

Useful for testing shuffles.
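
To make the calling convention above concrete, here is a minimal sketch in plain Python. It does not import Daft and is not the implementation in this PR: `simulate_read_generator` is a hypothetical stand-in for the scan tasks, `generate_rows` is an illustrative user generator, and a dict of column lists stands in for a Daft Table.

```python
from typing import Any, Callable, Dict, Iterator, List

# Assumption: a dict of column name -> values stands in for a Daft Table.
Table = Dict[str, List[Any]]


def generate_rows(partition_idx: int, rows_per_table: int, num_tables: int) -> Iterator[Table]:
    """Illustrative user generator: partition number first, then user args."""
    for table_idx in range(num_tables):
        start = (partition_idx * num_tables + table_idx) * rows_per_table
        yield {
            "id": list(range(start, start + rows_per_table)),
            "partition": [partition_idx] * rows_per_table,
        }


def simulate_read_generator(
    generator: Callable[..., Iterator[Table]],
    num_partitions: int,
    *generator_args: Any,
) -> List[List[Table]]:
    """Hypothetical stand-in for the scan tasks: one generator call per partition."""
    return [list(generator(i, *generator_args)) for i in range(num_partitions)]


# Three "scan tasks", each yielding two tables of two rows.
partitions = simulate_read_generator(generate_rows, 3, 2, 2)
```

Because every partition derives its rows from its own index, the output is deterministic and non-overlapping across partitions, which is the property that makes a generated source convenient for shuffle tests.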

codspeed-hq[bot] commented 1 week ago

CodSpeed Performance Report

Merging #3258 will degrade performance by 34.5%

Comparing colin/read-generated (70050af) with main (7e89850)

Summary

⚡ 1 improvement ❌ 1 regression ✅ 15 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

| Benchmark | main | colin/read-generated | Change |
|---|---|---|---|
| test_iter_rows_first_row[100 Small Files] | 226.7 ms | 346 ms | -34.5% |
| test_show[100 Small Files] | 41.9 ms | 23.6 ms | +77.77% |

codecov[bot] commented 1 week ago

Codecov Report

Attention: Patch coverage is 0% with 37 lines in your changes missing coverage. Please review.

Project coverage is 77.58%. Comparing base (6e28b3f) to head (70050af). Report is 11 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| daft/io/_generator.py | 0.00% | 37 Missing ⚠️ |

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #3258      +/-   ##
==========================================
- Coverage   79.12%   77.58%   -1.54%
==========================================
  Files         641      659      +18
  Lines       78151    80562    +2411
==========================================
+ Hits        61837    62505     +668
- Misses      16314    18057    +1743
```

| Files with missing lines | Coverage Δ |
|---|---|
| daft/io/_generator.py | `0.00% <0.00%> (ø)` |

... and 64 files with indirect coverage changes

samster25 commented 1 week ago

I guess this would be the purview of the generator being passed in?

I think so; that should likely be a parameter of the generator args or of the function itself.