apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0

Adds partial `cast` support for run-end encoded arrays #6752

Closed. RyanMarcus closed this 1 day ago

RyanMarcus commented 3 days ago

Which issue does this PR close?

Helps address part of "Support REE in cast kernels" in #3520

Rationale for this change

Adds support for a limited set of casts/conversions for run-end encoded (REE) arrays.

What changes are included in this PR?

We've added these specific casts because we needed them for our own project, and figured we could contribute them back.

Are there any user-facing changes?

Currently, `can_cast_types` returns false for essentially anything involving a REE array. This PR implements a subset of important casts, but not all of them. Users may be surprised that, for example, a REE array of Int32 can be cast to an Int32 array, but a REE array of Utf8 cannot yet be cast to a Utf8 array.
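For concreteness, a minimal sketch of the user-facing behaviour described above, assuming the Int32/Utf8 split from this PR (the RunEndEncoded field layout here is just illustrative):

    use std::sync::Arc;
    use arrow::compute::can_cast_types;
    use arrow::datatypes::{DataType, Field};

    fn main() {
        // A REE type is described by a run-ends field and a values field.
        let ree = |values: DataType| {
            DataType::RunEndEncoded(
                Arc::new(Field::new("run_ends", DataType::Int32, false)),
                Arc::new(Field::new("values", values, true)),
            )
        };

        // Per the PR description: supported after this PR.
        assert!(can_cast_types(&ree(DataType::Int32), &DataType::Int32));
        // Per the PR description: not yet supported.
        assert!(!can_cast_types(&ree(DataType::Utf8), &DataType::Utf8));
    }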

RyanMarcus commented 3 days ago

I know you just gave it as an example, but take would be significantly slower since it would require materializing an index array, which can't be trivially computed from the run ends (unlike for dictionaries, where the keys already serve as the take indices).
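To illustrate the cost, a take-based cast would first need one index per logical row, expanded from the cumulative run ends. A rough sketch of that expansion (hypothetical helper, not part of the PR):

    /// Expand cumulative run ends into per-row take indices, e.g.
    /// run_ends [3, 5] -> indices [0, 0, 0, 1, 1].
    fn indices_from_run_ends(run_ends: &[i32]) -> Vec<i32> {
        let mut indices = Vec::with_capacity(*run_ends.last().unwrap_or(&0) as usize);
        let mut prev = 0;
        for (value_idx, &end) in run_ends.iter().enumerate() {
            for _ in prev..end {
                indices.push(value_idx as i32);
            }
            prev = end;
        }
        indices
    }

Building (and then consuming) that O(logical length) index array is exactly the extra work a direct expansion avoids.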

I took a brief look at the other existing kernels and didn't see anything that looked promising. If you can point me to something specific I'm happy to take a look!

I'm not currently familiar with MutableArrayData; let me take a look and get back to you. Edit: I just looked at MutableArrayData, and it seems like a perfect candidate for Copy types. Let me try to change my implementation to use it; I agree it has the potential to cut down a lot on codegen.

One mitigating factor for the extra codegen is that it is limited to the cast subcrate. We could also add an arrow-cast-ree crate or similar if that's a concern.

RyanMarcus commented 3 days ago

As you suspected, @tustvold, MutableArrayData enabled a much smaller implementation, and it should reuse all the existing generic codegen from transform.

In order to make this efficient, we need an extend_n function on MutableArrayData that copies the same bits many times. Here is a dummy (working) implementation:

    /// Appends the slice `[start, end)` from the source array at `index`
    /// to this builder `n` times.
    pub fn extend_n(&mut self, index: usize, start: usize, end: usize, n: usize) {
        for _ in 0..n {
            self.extend(index, start, end);
        }
    }
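For context, a sketch of how the cast path could drive this (assumed shape only, not the PR's actual code; it presumes an extend_n like the one above):

    use arrow_array::{make_array, Array, ArrayRef};
    use arrow_data::transform::MutableArrayData;

    /// Expand a REE layout into a flat array by copying each value
    /// run-length times. `run_ends` are the cumulative logical ends.
    fn expand_with_mutable(values: &ArrayRef, run_ends: &[i64]) -> ArrayRef {
        let values_data = values.to_data();
        let capacity = *run_ends.last().unwrap_or(&0) as usize;
        let mut mutable = MutableArrayData::new(vec![&values_data], false, capacity);
        let mut prev = 0i64;
        for (i, &end) in run_ends.iter().enumerate() {
            // Copy the single value at position `i`, (end - prev) times.
            mutable.extend_n(0, i, i + 1, (end - prev) as usize);
            prev = end;
        }
        make_array(mutable.freeze())
    }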

I can see that MutableArrayData uses some kind of dynamic dispatch for each call to extend, which I don't fully understand, but I understand enough to know that my current extend_n is no good. I'll mark this PR as a draft until I figure it out.

RyanMarcus commented 2 days ago

I've added a more efficient extend_n function that works by passing the count parameter n to each closure. But I realized that in the worst-case scenario, where every run has length 1, this approach still does interpretation for every value.

A simple benchmark converting a REE array of i32s to a primitive array:

With PrimitiveBuilder:
cast run end to flat    time:   [34.395 ms 34.453 ms 34.519 ms]

With MutableArrayData:
cast run end to flat    time:   [48.710 ms 48.869 ms 49.067 ms]

... so avoiding the interpretation overhead yields roughly a 30% speedup. Does a 30% performance improvement in the worst case justify the extra codegen, @tustvold?

Perhaps a reasonable middle ground would be to use the PrimitiveBuilder for primitive types, but use MutableArrayData for non-primitive types?
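For reference, the PrimitiveBuilder-style path in the benchmark above amounts to something like the following. This is a hypothetical, simplified sketch for i32 values with i64 run ends, ignoring slicing/offsets; it is not the PR's actual kernel:

    use arrow_array::builder::PrimitiveBuilder;
    use arrow_array::cast::AsArray;
    use arrow_array::types::{Int32Type, Int64Type};
    use arrow_array::{Array, Int32Array, RunArray};

    fn ree_to_flat(run_array: &RunArray<Int64Type>) -> Int32Array {
        let values: &Int32Array = run_array.values().as_primitive::<Int32Type>();
        let run_ends = run_array.run_ends().values();
        let mut builder = PrimitiveBuilder::<Int32Type>::with_capacity(run_array.len());
        let mut prev: i64 = 0;
        for (i, &end) in run_ends.iter().enumerate() {
            let run_len = (end - prev) as usize;
            if values.is_null(i) {
                for _ in 0..run_len {
                    builder.append_null();
                }
            } else {
                // Append the run's single value run_len times.
                let v = values.value(i);
                for _ in 0..run_len {
                    builder.append_value(v);
                }
            }
            prev = end;
        }
        builder.finish()
    }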

tustvold commented 2 days ago

What is the average run length in that case?

We could specialise primitives, and there are ways we can reduce codegen: e.g. only specialise i32 run ends, and only specialise on ArrowNativeType rather than ArrowPrimitiveType, like we do for dictionaries.

However, given how few people use RunArray, there are relatively few scenarios where it makes sense; we need to err on the side of keeping codegen manageable.
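To illustrate the codegen point: specialising on ArrowNativeType means the copy loop is monomorphised once per underlying native type (i32, f64, ...) rather than once per logical ArrowPrimitiveType, so Int32Type, Date32Type, Time32SecondType, etc. would all share a single i32 instantiation. A hypothetical sketch of such a loop:

    use arrow_buffer::ArrowNativeType;

    /// Expand runs over any native type; all logical types backed by the
    /// same native type reuse one instantiation of this function.
    fn expand_runs<T: ArrowNativeType>(run_ends: &[i32], values: &[T], out: &mut Vec<T>) {
        let mut prev = 0i32;
        for (&end, &v) in run_ends.iter().zip(values) {
            out.extend(std::iter::repeat(v).take((end - prev) as usize));
            prev = end;
        }
    }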

RyanMarcus commented 2 days ago

My benchmark is a worst-case scenario, so every run length is 1, and thus the average is 1 as well. Not a realistic scenario, but illustrative of the worst case.

If we increase all run lengths to 10 (which is the average in my application, at least) while keeping the logical data size the same, the results are:

With PrimitiveBuilder:
cast run end to flat    time:   [11.740 ms 11.778 ms 11.818 ms]

With MutableArrayData:
cast run end to flat    time:   [21.837 ms 21.917 ms 22.000 ms]

Both approaches get faster, but the relative gap is larger.

The current version implements the compromise you mentioned: the specialized kernel is used for {i/u/f}{8/16/32/64}, and the interpretation-powered one is used for everything else.

I am not fully confident that I correctly modified all of the MutableArrayData internals to handle the extend_n function, but the tests at least pass. If you want, I can swap back to the "dummy" version I proposed earlier, since the specialized kernel should hit the most common cases.

tustvold commented 2 days ago

If you want, I can swap back to the "dummy" version I proposed earlier, since the specialized kernel should hit the most common cases.

Let's swap it back

RyanMarcus commented 1 day ago

See https://github.com/apache/arrow-rs/pull/6752#discussion_r1851025127 -- I don't think this is worthwhile if minimizing codegen is such a high priority. If someone else is looking at this PR, feel free to use my kernels for your REE arrays!