Closed colin-ho closed 5 days ago
Comparing colin/swordfish-mono-id (509e645) with main (84db665)
⚡ 2 improvements
✅ 15 untouched benchmarks
| | Benchmark | main | colin/swordfish-mono-id | Change |
|---|---|---|---|---|
| ⚡ | test_iter_rows_first_row[100 Small Files] | 421.4 ms | 375 ms | +12.36% |
| ⚡ | test_show[100 Small Files] | 32.7 ms | 14.9 ms | ×2.2 |
Attention: Patch coverage is 84.84848% with 20 lines in your changes missing coverage. Please review.
Project coverage is 74.96%. Comparing base (84db665) to head (509e645). Report is 17 commits behind head on main.
Files with missing lines | Patch % | Lines |
---|---|---|
...execution/src/sinks/monotonically_increasing_id.rs | 77.01% | 20 Missing :warning: |
Implements monotonically increasing id as a streaming sink with `max_concurrency = 1`.

I tested multithreaded and single-threaded implementations and found no performance gain from multithreading. This is because monotonically increasing id is a memory-bound operator: all it does is allocate an array of ints for the ids, so multiple threads doing this in parallel are bottlenecked by memory bandwidth.
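To make the operator concrete, here is a minimal Python sketch of what a single-threaded monotonically increasing id amounts to: for each morsel, allocate a contiguous block of integers offset by the number of rows already seen. It is illustrative only and does not reflect the Rust code in this PR; `assign_monotonic_ids` and the list-based morsels are hypothetical stand-ins.

```python
from typing import Iterable, Iterator, List


def assign_monotonic_ids(morsels: Iterable[List[str]]) -> Iterator[List[int]]:
    """Yield one id column per morsel, numbering rows consecutively across morsels."""
    rows_seen = 0  # running count of rows in all morsels processed so far
    for morsel in morsels:
        # The only real work: allocate a contiguous block of integers.
        ids = list(range(rows_seen, rows_seen + len(morsel)))
        rows_seen += len(morsel)
        yield ids


# Hypothetical morsels of sizes 2, 1 and 3.
id_columns = list(assign_monotonic_ids([["a", "b"], ["c"], ["d", "e", "f"]]))
print(id_columns)  # [[0, 1], [2], [3, 4, 5]]
```

Because each morsel's ids depend on the total length of every morsel before it, the state is a single counter, which is why processing morsels one at a time keeps the logic simple.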
It is also much simpler to implement this as a single-threaded operation, since we only need to keep a running count of the lengths of the morsels seen so far. This is effectively just `row_number`.

Note: there was an issue in the `pyfunc_into_table_iter` function, which consumes Python iterators in scan tasks (used in `read_lance` and `read_generator`), where the consumer only calls `next()` on the iterator once. This PR fixes that.
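As a rough illustration of why calling `next()` only once matters, the plain-Python sketch below contrasts a consumer that pulls a single batch with one that drains the iterator. It is not the actual `pyfunc_into_table_iter` code or its fix; `batches`, `consume_once`, and `consume_all` are hypothetical names used only for this example.

```python
from typing import Iterator, List


def batches() -> Iterator[List[int]]:
    """Hypothetical stand-in for a Python iterator backing a scan task."""
    yield [1, 2]
    yield [3]
    yield [4, 5, 6]


def consume_once(it: Iterator[List[int]]) -> List[List[int]]:
    # Problematic pattern: a single next() call returns only the first batch.
    return [next(it)]


def consume_all(it: Iterator[List[int]]) -> List[List[int]]:
    # Intended pattern: keep pulling until the iterator is exhausted.
    return list(it)


first_only = consume_once(batches())
everything = consume_all(batches())
print(len(first_only), len(everything))  # 1 3
```

With a single `next()` call, every batch after the first is silently dropped, so a source that yields several batches would appear to contain only the rows of its first batch.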