I'm trying to delete 18 rows from a delta lake table that has 100M rows in it. This table is stored in Amazon S3 as 38 parquet files, which are all about 100MB each. To accomplish this, I've built a predicate that uses an IN on the "rideid", which is the column that this table is sorted by.
(overlap_result_batch is is just a RecordBatch that contains the rideids that I'd like to delete, I don't think it's important for understanding here.)
I don't expect this to be fast, since it does need to rewrite half of the data. What I'm observing though is that it only uses 1 core for 50 seconds or so, as it rewrites all 18 files one at a time.
What you expected to happen:
I would expect for delete to concurrently rewrite all of the files that need rewriting instead of doing them one at a time. I'd expect this concurrency to be exposed in configuration somewhere.
How to reproduce it:
Perform a delete of a table that contains multiple parquet files.
Environment
Delta-rs version: 0.17.3
Binding: Rust
Environment:
Bug
What happened:
I'm trying to delete 18 rows from a delta lake table that has 100M rows in it. This table is stored in Amazon S3 as 38 parquet files, which are all about 100MB each. To accomplish this, I've built a predicate that uses an
IN
on the "rideid", which is the column that this table is sorted by.(
overlap_result_batch
is is just aRecordBatch
that contains the rideids that I'd like to delete, I don't think it's important for understanding here.)I don't expect this to be fast, since it does need to rewrite half of the data. What I'm observing though is that it only uses 1 core for 50 seconds or so, as it rewrites all 18 files one at a time.
What you expected to happen: I would expect for delete to concurrently rewrite all of the files that need rewriting instead of doing them one at a time. I'd expect this concurrency to be exposed in configuration somewhere.
How to reproduce it: Perform a delete of a table that contains multiple parquet files.
More details: