apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Go] [arrow/ipc/writer] voffsets are not updated when Binary-like arrays get truncated #41993

Open notfilippo opened 1 month ago

notfilippo commented 1 month ago

Describe the bug, including details regarding any error messages, version, and platform.

I stumbled upon this bug while playing around with the results returned from DataFusion. In certain scenarios, multiple arrays from different record batches, all representing the same column, may share a single value buffer. The problem arises when writing those records with an ipc.Writer: if one of the columns sharing a value buffer happens to be of a binary-like type, the written values are invalid.

This happens because each array's value buffer gets truncated to only the part referred to by its offsets, but the offsets themselves are never updated, so they can end up pointing to memory outside the truncated values buffer. When the binary-like array is then read back via an ipc.Reader, an error is returned: string offsets out of bounds of data buffer.
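To illustrate the failure mode with plain byte slices (a stdlib-only sketch, not Arrow code): the values buffer is truncated to the referenced range, but the offsets still describe positions in the original, longer buffer.

```go
package main

import "fmt"

func main() {
	// Full values buffer shared across batches: "apple" + "pear" + "banana".
	values := []byte("applepearbanana")

	// Offsets for an array holding only "pear" and "banana":
	// element i spans values[offsets[i]:offsets[i+1]].
	offsets := []int32{5, 9, 15}

	// The writer truncates the buffer to just the referenced range...
	truncated := values[offsets[0]:offsets[len(offsets)-1]]
	fmt.Println(string(truncated), len(truncated)) // pearbanana 10

	// ...but the offsets are left untouched, so the last offset (15) now
	// points past the end of the 10-byte truncated buffer.
	last := offsets[len(offsets)-1]
	fmt.Println(int(last) > len(truncated)) // true: offsets out of bounds
}
```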

How to reproduce:

package main

import (
	"bytes"
	"fmt"
	"log"

	"github.com/apache/arrow/go/v16/arrow"
	"github.com/apache/arrow/go/v16/arrow/array"
	"github.com/apache/arrow/go/v16/arrow/bitutil"
	"github.com/apache/arrow/go/v16/arrow/ipc"
	"github.com/apache/arrow/go/v16/arrow/memory"
)

func main() {
    var buf bytes.Buffer
    buf.WriteString("apple")
    buf.WriteString("pear")
    buf.WriteString("banana")
    values := buf.Bytes()

    offsets := []int32{5, 9, 15} // <-- only "pear" and "banana"
    voffsets := arrow.Int32Traits.CastToBytes(offsets)

    validity := []byte{0}
    bitutil.SetBit(validity, 0)
    bitutil.SetBit(validity, 1)

    data := array.NewData(
        arrow.BinaryTypes.String,
        2, // <-- only "pear" and "banana"
        []*memory.Buffer{
            memory.NewBufferBytes(validity),
            memory.NewBufferBytes(voffsets),
            memory.NewBufferBytes(values),
        },
        nil,
        0,
        0,
    )

    str := array.NewStringData(data)
    fmt.Println(str) // outputs: ["pear" "banana"]

    schema := arrow.NewSchema([]arrow.Field{
        {
            Name:     "string",
            Type:     arrow.BinaryTypes.String,
            Nullable: true,
        },
    }, nil)
    record := array.NewRecord(schema, []arrow.Array{str}, 2)

    var output bytes.Buffer
    writer := ipc.NewWriter(&output, ipc.WithSchema(schema))

    err := writer.Write(record)
    if err != nil {
        log.Fatal(err)
    }

    err = writer.Close()
    if err != nil {
        log.Fatal(err) 
    }

    reader, err := ipc.NewReader(bytes.NewReader(output.Bytes()), ipc.WithSchema(schema))
    if err != nil {
        log.Fatal(err)
    }

    reader.Next()
    if reader.Err() != nil {
        log.Fatal(reader.Err()) // string offsets out of bounds of data buffer
    }
}

Component(s)

Go

zeroshade commented 1 week ago

Sorry for the long delay here, I was at a conference and then things have been a bit busy. I can't promise when i'll be able to get to this, but I'll add it to my todo list. If you feel up to trying to fix it yourself, I'd be happy to review any PRs if you end up getting to it before me.

notfilippo commented 1 week ago

I would be happy to open a PR with a fix, but I first wanted to make sure this wasn't expected behaviour. The Arrow spec says:

When serializing this layout, we recommend normalizing the offsets to start at 0.

Since this is only a recommendation, offsets that do not start at 0 are still valid, which is how this kind of issue can arise.

Also, comparing the Go and C++ implementations, the IPC writer's behaviour appears to match 1:1 between the two languages, so I would assume the issue exists there as well.

zeroshade commented 1 week ago

You're correct that this isn't expected behavior. It has likely gone unnoticed because the scenario is uncommon: nearly all cases I'm aware of normalize the offsets to start at 0. But your code should work in the general case, and it would make sense to update the C++ implementation to fix this too.
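A fix along the lines the spec recommends would rebase the offsets to start at 0 whenever the values buffer is truncated, keeping the two consistent. A stdlib-only sketch of the idea (normalizeOffsets is a hypothetical helper for illustration, not an Arrow API):

```go
package main

import "fmt"

// normalizeOffsets is a hypothetical helper illustrating the fix: rebase the
// offsets to start at 0 and slice the values buffer to the referenced range,
// so the offset/buffer pair stays consistent after truncation.
func normalizeOffsets(offsets []int32, values []byte) ([]int32, []byte) {
	if len(offsets) == 0 {
		return offsets, values
	}
	base := offsets[0]
	out := make([]int32, len(offsets))
	for i, o := range offsets {
		out[i] = o - base
	}
	return out, values[base:offsets[len(offsets)-1]]
}

func main() {
	values := []byte("applepearbanana")
	offsets := []int32{5, 9, 15} // only "pear" and "banana"

	newOffsets, newValues := normalizeOffsets(offsets, values)
	fmt.Println(newOffsets, string(newValues)) // [0 4 10] pearbanana

	// The last offset now equals the buffer length, so a reader's bounds
	// check on the serialized data would pass.
	fmt.Println(int(newOffsets[len(newOffsets)-1]) == len(newValues)) // true
}
```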