I know you just gave it as an example, but `take` would be significantly slower, since it would require materializing an index array, which can't be trivially computed from the run ends (unlike for dictionaries, where the keys already serve as the take indices).
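For concreteness, a minimal sketch (mine, not code from the PR) of the index materialization a `take`-based approach would need:

```rust
// Hypothetical helper: expand run ends into the logical take indices.
// For run_ends = [3, 5, 9] this yields [0, 0, 0, 1, 1, 2, 2, 2, 2] --
// an O(logical_len) allocation that dictionaries avoid entirely.
fn take_indices(run_ends: &[i32]) -> Vec<i32> {
    let logical_len = *run_ends.last().unwrap_or(&0) as usize;
    let mut indices = Vec::with_capacity(logical_len);
    let mut prev = 0;
    for (physical_idx, &end) in run_ends.iter().enumerate() {
        // The value at `physical_idx` covers logical rows prev..end.
        for _ in prev..end {
            indices.push(physical_idx as i32);
        }
        prev = end;
    }
    indices
}
```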
I took a brief look at the other existing kernels and didn't see anything that looked promising. If you can point me to something specific I'm happy to take a look!
I'm not currently familiar with `MutableArrayData`, let me take a look and get back to you. Edit: Just looked at `MutableArrayData`, which seems like a perfect candidate for `Copy` types. Let me try to change my implementation to that; I agree it has the potential to cut down a lot on codegen.
One mitigating factor to the extra codegen is the fact that this is limited to the `cast` subcrate. We could also add an `arrow-cast-ree` subcrate or similar, if that's a concern.
As you suspected @tustvold, `MutableArrayData` enabled an implementation that was much smaller! It should also reuse all the existing generics codegen from `transform`.
In order to make this efficient, we need an `extend_n` function on `MutableArrayData` that copies the same bits many times. Here is a dummy (working) implementation:
```rust
/// Naive version: just repeat the normal `extend` call `n` times, paying the
/// per-call dispatch overhead on every repetition.
pub fn extend_n(&mut self, index: usize, start: usize, end: usize, n: usize) {
    for _ in 0..n {
        self.extend(index, start, end);
    }
}
```
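For context, here is a rough sketch of how a `MutableArrayData`-based cast could drive such a function over the run ends (my own illustration, assuming `i32` run ends; `run_ends`, `values_data`, and `logical_len` are placeholders, not names from the PR):

```rust
// Hypothetical driver loop: for each run, copy the single physical value
// `run_length` times into the output. `values_data` is the ArrayData of the
// REE values child array.
let mut mutable = MutableArrayData::new(vec![&values_data], false, logical_len);
let mut prev_end = 0i32;
for (physical_idx, &run_end) in run_ends.iter().enumerate() {
    let run_length = (run_end - prev_end) as usize;
    // Copy values_data[physical_idx .. physical_idx + 1] `run_length` times.
    mutable.extend_n(0, physical_idx, physical_idx + 1, run_length);
    prev_end = run_end;
}
let flat = mutable.freeze(); // ArrayData for the flattened array
```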
I can see that `MutableArrayData` uses some kind of dynamic dispatch magic for each call to `extend`, which I don't fully understand, but I understand enough to know that my current `extend_n` is no good. I'll mark this PR as a draft until I figure it out.
I've added a more efficient `extend_n` function that works by passing the count parameter `n` to each closure. But I realized that in the worst-case scenario, where each run length is 1, this approach will still do interpretation (one dynamic dispatch) for every value.
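To make the trade-off concrete, here is a purely illustrative sketch (not `arrow-data`'s actual internals) of a type-specialized extend closure that receives the repeat count:

```rust
// Hypothetical: because the closure takes `n`, the boxed (dynamic) call
// happens once per run, while the inner copy loop is monomorphized.
// When every run has length 1, that is still one dynamic call per logical
// value -- the worst case measured below.
type ExtendFn = Box<dyn Fn(&mut Vec<i32>, usize, usize, usize)>;

let values: Vec<i32> = vec![10, 20, 30];
let extend_values_n: ExtendFn = Box::new(move |out, start, end, n| {
    for _ in 0..n {
        out.extend_from_slice(&values[start..end]);
    }
});

// Run lengths of 1 everywhere: one dynamic call per logical value.
let mut out = Vec::new();
for physical_idx in 0..3 {
    extend_values_n(&mut out, physical_idx, physical_idx + 1, 1);
}
assert_eq!(out, vec![10, 20, 30]);
```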
A simple benchmark converting an REE array of i32s to a primitive array:
With `PrimitiveBuilder`:
cast run end to flat time: [34.395 ms 34.453 ms 34.519 ms]
With `MutableArrayData`:
cast run end to flat time: [48.710 ms 48.869 ms 49.067 ms]
... so avoiding the interpretation overhead seems to give roughly a 30% reduction in runtime. Does a 30% performance improvement in the worst case justify the extra codegen, @tustvold?
Perhaps a reasonable middle ground would be to use the `PrimitiveBuilder` for primitive types, but use `MutableArrayData` for non-primitive types?
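For reference, a minimal sketch of what such a primitive-specialized path could look like (my own illustration, assuming `i32` run ends and values; not the PR's exact code):

```rust
use arrow_array::{Array, Int32Array, PrimitiveArray};
use arrow_array::builder::PrimitiveBuilder;
use arrow_array::types::Int32Type;

// Hypothetical specialized kernel: expand one run at a time with a
// monomorphized PrimitiveBuilder, avoiding per-value dynamic dispatch.
fn run_ends_to_flat(run_ends: &[i32], values: &Int32Array) -> PrimitiveArray<Int32Type> {
    let logical_len = *run_ends.last().unwrap_or(&0) as usize;
    let mut builder = PrimitiveBuilder::<Int32Type>::with_capacity(logical_len);
    let mut prev_end = 0;
    for (physical_idx, &run_end) in run_ends.iter().enumerate() {
        let run_length = (run_end - prev_end) as usize;
        if values.is_null(physical_idx) {
            builder.append_nulls(run_length);
        } else {
            let v = values.value(physical_idx);
            for _ in 0..run_length {
                builder.append_value(v);
            }
        }
        prev_end = run_end;
    }
    builder.finish()
}
```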
What is the average run length in that case?
We could specialise primitives, and there are ways we can reduce codegen - e.g. only specialise `i32` run ends, and only specialise on `ArrowNativeType` rather than `ArrowPrimitiveType`, like we do for dictionaries.
However, given how few people use `RunArray`, there are relatively few scenarios where it makes sense, so we need to err on the side of keeping codegen manageable.
My benchmark is a worst-case scenario, so every run length is 1, and thus the average is 1 as well. Not a realistic scenario, but illustrative of the worst case.
If we increase all run lengths to 10 (which is the average in my application, at least) while keeping the logical data size the same, the results are:
With `PrimitiveBuilder`:
cast run end to flat time: [11.740 ms 11.778 ms 11.818 ms]
With `MutableArrayData`:
cast run end to flat time: [21.837 ms 21.917 ms 22.000 ms]
Both approaches get faster, but the relative gap widens (roughly 1.9x here vs. 1.4x in the worst case above).
The current version hits the compromise you mentioned: the specialized kernel is used for {i/u/f}{8/16/32/64}, and the interpretation-based one is used for the rest.
I am not fully confident I correctly modified all of `MutableArrayData` to handle the `extend_n` function, but the tests at least pass. If you want, I can swap back to the "dummy" version I proposed earlier, since the specialized kernel should hit the most common cases.
> If you want, I can swap back to the "dummy" version I proposed earlier, since the specialized kernel should hit the most common cases.
Let's swap it back
See https://github.com/apache/arrow-rs/pull/6752#discussion_r1851025127 -- I don't think this is worthwhile if minimizing codegen is such a high priority. If someone else is looking at this PR, feel free to use my kernels for your REE arrays!
Which issue does this PR close?
Helps address part of "Support REE in cast kernels" in #3520.
Rationale for this change
Adds support for (limited) casting/conversion of REE arrays.
What changes are included in this PR?
Adds casts from REE arrays with primitive value types to flat arrays (using `PrimitiveArray` under the hood). We've added these specific casts because we needed them for our own project, and figured we could contribute them back.
Are there any user-facing changes?
Currently, `can_cast_types` basically returns false for anything involving an REE array. This PR implements a subset of important casts, but not all of them. Users may be surprised that, for example, an REE array of Int32 can be converted to an array of Int32, but an REE array of Utf8 can't be converted to a Utf8 array (yet).
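For illustration, roughly how the supported and unsupported cases would look from a user's perspective with this PR applied (a sketch of my own; the run-end type and construction details are assumptions, not code from the PR):

```rust
use arrow_array::{Array, Int32Array, RunArray, StringArray};
use arrow_array::types::Int32Type;
use arrow_cast::{can_cast_types, cast};
use arrow_schema::DataType;

// An REE array of Int32: logical values [7, 7, 7, 9].
let run_ends = Int32Array::from(vec![3, 4]);
let values = Int32Array::from(vec![7, 9]);
let ree: RunArray<Int32Type> = RunArray::try_new(&run_ends, &values).unwrap();

// Covered by this PR: REE of Int32 -> flat Int32.
assert!(can_cast_types(ree.data_type(), &DataType::Int32));
let flat = cast(&ree, &DataType::Int32).unwrap();
assert_eq!(flat.len(), 4);

// Not covered (yet): REE of Utf8 -> flat Utf8.
let utf8_values = StringArray::from(vec!["a", "b"]);
let ree_utf8: RunArray<Int32Type> = RunArray::try_new(&run_ends, &utf8_values).unwrap();
assert!(!can_cast_types(ree_utf8.data_type(), &DataType::Utf8));
```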