Open alamb opened 2 weeks ago
FYI
@alamb just to clarify your idea is to modify the existing `take` & `filter` kernels. Not create a new one right?
> @alamb just to clarify your idea is to modify the existing `take` & `filter` kernels. Not create a new one right?

If creating a new one helps, no reason not to do it.
I think the idea is sound in principle, but needs a concrete API proposal.
I'm not sure the proposed builder API makes sense, as the typing for nested types like ListBuilder and DictionaryBuilder is not what we want here, and they can't easily be type erased. We also ideally want to avoid overly bloating the arrow-array crate with kernel logic. This isn't even touching on the fact these kernels don't use the builders for performance reasons.
I think we'd need to introduce a new type-erased MutableArray abstraction or something, potentially replacing the rather problematic MutableArrayData.
The only remaining challenge concerns dictionaries, as the output dictionary needs to be computed up front. Simply not supporting dictionaries could potentially be a valid workaround though.
Not sure about other types, but for StringView I can only think of iterating all the filtered rows and calling `append_value` one by one. If there is no further optimization we can do, I think we can implement the append logic in datafusion.
> Not sure about other types, but for StringView I can only think of iterating all the filtered rows and calling `append_value` one by one. If there is no further optimization we can do, I think we can implement the append logic in datafusion.

`append_value` does a copy though. Wouldn't that effectively still be a large amount of copies?
> `append_value` does a copy though. Wouldn't that effectively still be a large amount of copies?

With this approach, we reduce the copying to a single step. Compared to the current approach, where copying happens in multiple stages (filtering, garbage collection, and coalescing), my proposal combines these steps into one. While benchmarks are needed to confirm any performance gains, this method should, at the very least, not perform worse than the existing one.
> With this approach, we reduce the copying to a single step. Compared to the current approach, where copying happens in multiple stages (filtering, garbage collection, and coalescing), my proposal combines these steps into one.

That makes sense. This would be a change downstream within Datafusion then correct?
> That makes sense. This would be a change downstream within Datafusion then correct?

Yes, you can work on it if you want to.
So it sounds like the consensus is to work out how this might look downstream in DataFusion (maybe starting with StringView as that is what is giving us the most trouble now) and then use some of that knowledge to propose something upstream in Arrow -- sounds like a good idea to me
> Not sure about other types, but for StringView I can only think of iterating all the filtered rows and calling `append_value` one by one. If there is no further optimization we can do, I think we can implement the append logic in datafusion.

@jayzhan211 yes I think this is effectively what would happen -- however the actual iteration over filtered values is quite optimized in the `filter` kernel (check out what the `FilterBuilder` does) based on how many values are filtered and other aspects. The fact that `filter` is so fast in arrow means it is quite hard to get as good / faster :)
@tustvold

> I'm not sure the proposed builder API makes sense, as the typing for nested types like ListBuilder and DictionaryBuilder is not what we want here, and they can't easily be type erased. We also ideally want to avoid overly bloating the arrow-array crate with kernel logic.

This is reasonable -- though I could imagine adding type erased builders like `DynListBuilder` for this usecase

> This isn't even touching on the fact these kernels don't use the builders for performance reasons.

Is there some fundamental reason the builders can't be made faster? If we could make the builders fast enough to use for `filter` that would seem to be valuable in its own right. But I am likely just dreaming here

> The only remaining challenge concerns dictionaries, as the output dictionary needs to be computed up front. Simply not supporting dictionaries could potentially be a valid workaround though.

A builder based approach could help (e.g. optimize for the case where the input batches had the same dictionary and handle the case where they didn't -- either via deferred computation or on the fly or something else)
> Not sure about other types, but for StringView I can only think of iterating all the filtered rows and calling `append_value` one by one. If there is no further optimization we can do, I think we can implement the append logic in datafusion.

Does the approach look like this for filter? Iterate the input array and the input predicate together, and wherever the predicate is true, `append_value` to the builder. And at least we can avoid generating multiple small batches and concat-ing them (a ton of copies) once big enough.
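As a rough sketch of that idea, here is a toy model in plain Rust: a `Vec<String>` stands in for the real `StringViewBuilder`, and `push` stands in for `append_value` (neither is the actual arrow-rs API):

```rust
/// Toy model of the approach described above: walk the input array and the
/// predicate together and append each selected value to one in-progress
/// builder. `Vec<String>` stands in for a real builder such as
/// StringViewBuilder, and `push` for `append_value`.
fn filter_append(builder: &mut Vec<String>, values: &[&str], predicate: &[bool]) {
    for (value, keep) in values.iter().zip(predicate) {
        if *keep {
            // One copy, straight into the single output buffer -- no
            // intermediate filtered array and no later concat.
            builder.push(value.to_string());
        }
    }
}
```

Because every incoming batch feeds the same builder, the many-small-batches-then-`concat` step disappears.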
> This is reasonable -- though I could imagine adding type erased builders like DynListBuilder for this usecase

This sort of partially type-erased API seems like the worst of both worlds; you either want something that is completely type-erased (e.g. MutableArrayData), or fully typed (e.g. ListBuilder).

I could see us adding some sort of `MutableArray` abstraction to `arrow-select` that allows appending values from arrays based on a mask and/or selection. This would be useful not just for this use-case, but potentially as a MutableBuffer abstraction for databases, etc... However, it would be very complex to implement, especially for dictionaries.
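To make the shape of that idea concrete, here is a hypothetical sketch. The trait name `MutableArray` and the methods `append_filtered`/`append_take` are illustrative only (no such trait exists in arrow-rs today), and a plain `Vec<i64>` stands in for a real primitive array:

```rust
use std::any::Any;

/// Hypothetical type-erased mutable array, as floated in this thread.
/// All names here are assumptions, not existing arrow-rs API.
trait MutableArray {
    /// Append the rows of `source` selected by `mask` (true = keep).
    fn append_filtered(&mut self, source: &dyn Any, mask: &[bool]);
    /// Append the rows of `source` at the given `indices` (a `take`).
    fn append_take(&mut self, source: &dyn Any, indices: &[usize]);
    fn len(&self) -> usize;
}

/// Toy implementation backed by Vec<i64>, standing in for a primitive builder.
struct MutableInt64 {
    values: Vec<i64>,
}

impl MutableArray for MutableInt64 {
    fn append_filtered(&mut self, source: &dyn Any, mask: &[bool]) {
        // Type erasure pushes the downcast inside each implementation.
        let src = source.downcast_ref::<Vec<i64>>().expect("type mismatch");
        for (v, keep) in src.iter().zip(mask) {
            if *keep {
                self.values.push(*v);
            }
        }
    }
    fn append_take(&mut self, source: &dyn Any, indices: &[usize]) {
        let src = source.downcast_ref::<Vec<i64>>().expect("type mismatch");
        self.values.extend(indices.iter().map(|&i| src[i]));
    }
    fn len(&self) -> usize {
        self.values.len()
    }
}
```

The callers (e.g. a coalescing operator) could then hold a `Box<dyn MutableArray>` per column without knowing the concrete type, which is the part the fully typed builders cannot offer.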
> Is there some fundamental reason the builders can't be made faster

Not without changing their APIs :sweat_smile:. For the primitive builders one could simply move the current kernel implementations into the builders, but this doesn't really achieve much IMO.
> A builder based approach could help (e.g. optimize for the case where the input batches had the same dictionary and handle the case where they didn't -- either via deferred computation or on the fly or something else)

Yeah, it gets very complicated and fiddly. A similar challenge likely exists for StringView, although I'm not sure what level of sophistication we've reached w.r.t automatic GC.
> Does the approach look like this for filter?

That would be a very naive way to implement the filter kernel, I would encourage looking at what the selection kernels actually do.
> That would be a very naive way to implement the filter kernel, I would encourage looking at what the selection kernels actually do.

I agree it seems like a naive version of filter. Is it possible to make public something like `filter_native` but returning an iterator, so we can reduce some copies downstream and reuse the well-optimized filter in arrow:
```
// current
filter --> intermediate buffer in array --> final buffer
// optimized
filter --> final buffer
```
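A minimal sketch of what such an iterator-returning API could look like. `filter_indices` is an illustrative name (nothing like it is public in arrow today), and a plain `&[bool]` stands in for a real `BooleanArray` predicate:

```rust
/// Hypothetical `filter_native`-style API: instead of materializing an
/// intermediate filtered array, expose the selected row indices as an
/// iterator the caller can drain straight into its final buffer.
fn filter_indices<'a>(predicate: &'a [bool]) -> impl Iterator<Item = usize> + 'a {
    predicate
        .iter()
        .enumerate()
        // keep only the positions where the predicate is true
        .filter_map(|(i, &keep)| keep.then_some(i))
}
```

A downstream consumer would then copy each selected value exactly once, directly into the final buffer, e.g. `filter_indices(&predicate).map(|i| values[i])`. The real kernel is much more sophisticated (it special-cases selectivity, as noted above), so this only illustrates the shape of the API.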
This line of code extends the buffer (bytes copied) regardless of the filtered result; I think that is the reason we need gc. If we do `append_value` here, we have an additional hash lookup and insert, but less buffer copying, especially at low selectivity, and no gc is required later on.
> This line of code extends the buffer (bytes copied) regardless of the filtered result; I think that is the reason we need gc.

To be clear -- I think the copy is of a `Vec<Buffer>` (which is a Vec of pointers to the data)
> If we do `append_value` here, we have an additional hash lookup and insert, but less buffer copying, especially at low selectivity, and no gc is required later on.

I would try to implement a builder like this in datafusion
> I would try to implement a builder like this in datafusion

Didn't see improvement on this approach: https://github.com/apache/datafusion/pull/13450
I didn't have a chance to fully look at https://github.com/apache/datafusion/pull/13450 -- can you summarize what approach it implemented?
The idea is to append to the string view array as early as possible, with optimal memory usage, to eliminate the need for garbage collection and to aggregate the filtered rows into a single large batch instead of many small batches.
We first do `Filter` then `Coalesce` in two different operators. The coalescer's `push_batch` tries to do gc for the string view type, and then concats those small batches again. I tried to combine `Filter` and `Coalesce` in the `Filter` operator, appending the string view type to the coalescer until the batch size (8192) is reached or there are no more incoming batches, then sending it to `Coalesce`, which ideally should do nothing and pass it to the next operator.

This approach has the additional cost of computing each view and looking up the view's hash again, but eliminates the need for gc. The result matches my assumption that there is not much difference, and currently I have no further improvement in mind either.
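The combined Filter + Coalesce operator described above can be modeled with a toy sketch. `FilterCoalescer` and its `push_batch` are illustrative names, and a plain `Vec<i64>` stands in for the real array and builder types:

```rust
/// Toy model of combining Filter and Coalesce in one operator: filtered rows
/// accumulate in a single in-progress buffer, and a full batch is emitted
/// only once `target` rows (8192 in the discussion above) have accumulated.
struct FilterCoalescer {
    target: usize,
    in_progress: Vec<i64>,
}

impl FilterCoalescer {
    /// Filter one incoming batch straight into the in-progress buffer;
    /// return a completed output batch once the target size is reached.
    fn push_batch(&mut self, batch: &[i64], predicate: &[bool]) -> Option<Vec<i64>> {
        for (v, keep) in batch.iter().zip(predicate) {
            if *keep {
                self.in_progress.push(*v);
            }
        }
        if self.in_progress.len() >= self.target {
            // Emit one large batch; downstream Coalesce has nothing to do.
            Some(std::mem::take(&mut self.in_progress))
        } else {
            None
        }
    }
}
```

The point of the sketch is that no small intermediate batches ever exist, so there is nothing to gc or `concat` later.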
Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Upstream in DataFusion, there is a common pattern where we have multiple input `RecordBatch`es and want to produce an output `RecordBatch` with some subset of the rows from the input batches. This happens in:

- `FilterExec` --> `CoalesceBatchesExec` when filtering
- `RepartitionExec` --> `CoalesceBatchesExec`

The kernels used here are:

- `FilterExec` uses `filter`, which takes a single input `Array` and produces a single output `Array`
- `RepartitionExec` uses `take`, which also takes a single input `Array` and produces a single output `Array`; `FilterExec` and `RepartitionExec` each take a single input batch and produce a single output `Array`
- `CoalesceBatchesExec` calls `concat`, which takes multiple `Array`s and produces a single `Array` as output

The use of these kernels and patterns has two downsides:

1. `filter`/`take` immediately copies the data, which is copied again in `CoalesceBatches` (see illustration below)
2. `RecordBatch`es with StringView may consume significant amounts of memory for mostly filtered rows, which requires us to run gc periodically, which actually slows some things down (see https://github.com/apache/datafusion/issues/11628)

Here is an ascii art picture (from https://github.com/apache/datafusion/issues/7957) that shows the extra copy in action
Describe the solution you'd like

I would like to apply `filter`/`take` to each incoming `RecordBatch` as it arrives, copying the data to an in-progress output array, in a way that is as fast as the `filter` and `take` operations. This would reduce the extra copy that is currently required.

Note this is somewhat like the `interleave` kernel, except that the `usize` batch index is not needed.

Describe alternatives you've considered
One thing I have thought about is extending the builders so they can append more than one row at a time. For example:

- `Builder::append_filtered`
- `Builder::append_take`

So for example, to filter a stream of StringViewArrays I might do something like:
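A hypothetical sketch of what that could look like. `append_filtered` does not exist today, and a toy builder over `Vec<String>` stands in for `StringViewBuilder`:

```rust
/// Toy stand-in for StringViewBuilder, with the *proposed* (non-existent)
/// `append_filtered` method.
struct ToyStringBuilder {
    values: Vec<String>,
}

impl ToyStringBuilder {
    /// Proposed API: append all rows of `batch` where `predicate` is true,
    /// in a single call, with no intermediate filtered array.
    fn append_filtered(&mut self, batch: &[&str], predicate: &[bool]) {
        for (value, keep) in batch.iter().zip(predicate) {
            if *keep {
                self.values.push(value.to_string());
            }
        }
    }

    fn finish(self) -> Vec<String> {
        self.values
    }
}

/// Filtering a stream of incoming batches then becomes a single builder
/// that every batch appends into.
fn filter_stream(batches: &[(&[&str], &[bool])]) -> Vec<String> {
    let mut builder = ToyStringBuilder { values: Vec::new() };
    for (batch, predicate) in batches {
        builder.append_filtered(batch, predicate);
    }
    builder.finish()
}
```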
And also add an equivalent for `append_take`.

I think if we did this right, it wouldn't be a lot of new code; we could just refactor the existing filter/take implementations. For example, I would expect that the `filter` kernel would then devolve into a loop that calls `append_filtered` on the appropriate builder.

Additional context