apache / arrow-go

Official Go implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0
50 stars 10 forks source link

unsupported cast to string_view from utf8 in v18 #184

Open fwojciec opened 3 weeks ago

fwojciec commented 3 weeks ago

Describe the bug, including details regarding any error messages, version, and platform.

Casting between string <-> string_view types has been added in C++ and Python implementations of arrow in v18: https://github.com/apache/arrow/pull/43302

I tested this in Python and it works with the latest (18.0.0) pyarrow release:

[ins] In [1]: import pyarrow as pa

[ins] In [2]: import pyarrow.compute as pc

[ins] In [3]: a = pa.array(["one", "two", "three"])

[ins] In [4]: a
Out[4]: 
<pyarrow.lib.StringArray object at 0x111e3b340>
[
  "one",
  "two",
  "three"
]

[ins] In [5]: a.cast(pa.string_view())
Out[5]: 
<pyarrow.lib.StringViewArray object at 0x111a3be80>
[
  "one",
  "two",
  "three"
]

It doesn't work with the v18 release of arrow-go:

package main

import (
    "context"
    "fmt"

    "github.com/apache/arrow-go/v18/arrow"
    "github.com/apache/arrow-go/v18/arrow/array"
    "github.com/apache/arrow-go/v18/arrow/compute"
    "github.com/apache/arrow-go/v18/arrow/memory"
)

func main() {
    alloc := memory.NewGoAllocator()
    b := array.NewStringBuilder(alloc)
    defer b.Release()
    b.AppendValues([]string{"a", "b", "c"}, nil)
    arr := b.NewArray()
    fmt.Println(arr)

    newArr, err := compute.CastArray(context.Background(), arr, &compute.CastOptions{
        ToType: arrow.BinaryTypes.StringView,
    })
    if err != nil {
        panic(err)
    }
    fmt.Println(newArr)
}

// Output:
// ["a" "b" "c"]
// panic: not implemented: unsupported cast to string_view from utf8
// ...
// exit status 2

I'd be happy to work on the PR to add it if someone can give me a pointer with description of what would it take - high-level - to add support for this in Go.

Component(s)

Release

zeroshade commented 3 weeks ago

Hey @fwojciec I'd love to get this cast in. Here's a quick overview of what would be necessary to do so, I'll happily review any PR you make:

  1. The top-level of the casts for strings can be found here: https://github.com/apache/arrow-go/blob/main/arrow/compute/cast.go#L519. The calls to addFn are for adding casts based on the source type and associating compute kernels for casting to them.
  2. The kernels set up here based on the output type, by calling addToBinaryKernels to add the appropriate casting kernels for String/LargeString/Binary/LargeBinary by using generics. 3. You'll need to create a new function for casting to StringView/BinaryView since they don't have "Large" variants like String and Binary (for example, notice that there are separate Cast*Fsb* functions for handling Fixed Size Binary types). You could then add it to the kernels appended by addToBinaryKernels in the same way as we do for fixed-size-binary.
  3. Finally, unit tests can be added here: https://github.com/apache/arrow-go/blob/main/arrow/compute/cast_test.go#L1489 when you are done.

If you run into any issues or difficulties, feel free to put up a draft PR and ping me and I'll help out!

Thanks again for looking into this!

fwojciec commented 2 weeks ago

Thank you @zeroshade - that's super helpful! I'll try to have a PR in the next few days.