Describe the bug, including details regarding any error messages, version, and platform.
LargeBinary and LargeString use int64 offsets, but the Binary and String types use int32 offsets, which makes them susceptible to slice-index-out-of-bounds errors when a column/array holds more than ~2 GB (2^31 bytes) of data.
To reproduce, try deserializing a Parquet file that is larger than 2.2 GB.
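For intuition, here is a minimal standalone sketch (not the library's actual code path) of the arithmetic: once an array's value buffer passes 2^31 bytes, the next int32 offset wraps negative, and using it as a slice index panics.

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// Binary/String arrays address their value buffer with int32 offsets,
	// so a single array can span at most 2^31-1 bytes (~2 GiB) of values.
	var offset int32 = math.MaxInt32 // largest representable offset
	offset++                         // one more byte wraps the offset negative
	fmt.Println(offset)              // -2147483648; used as a slice index,
	                                 // this panics with "index out of range"
}
```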
A workaround is to force the Go library to deserialize the field/column as LargeBinary instead of Binary, by writing the file with the Arrow schema stored in the Parquet metadata.
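A minimal sketch of that workaround, assuming arrow-go's pqarrow package (module path shown for v15; adjust to your version — the schema, file name, and small payload here are purely illustrative). `pqarrow.WithStoreSchema()` embeds the serialized Arrow schema in the file metadata, so the reader reconstructs the column as LargeBinary (int64 offsets) instead of defaulting to Binary (int32 offsets):

```go
package main

import (
	"context"
	"fmt"
	"log"
	"os"

	"github.com/apache/arrow/go/v15/arrow"
	"github.com/apache/arrow/go/v15/arrow/array"
	"github.com/apache/arrow/go/v15/arrow/memory"
	"github.com/apache/arrow/go/v15/parquet"
	"github.com/apache/arrow/go/v15/parquet/pqarrow"
)

// writeWithStoredSchema writes tbl with the Arrow schema embedded in the
// Parquet metadata, so a LargeBinary field round-trips as LargeBinary
// rather than being read back as Binary.
func writeWithStoredSchema(tbl arrow.Table, path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	arrProps := pqarrow.NewArrowWriterProperties(pqarrow.WithStoreSchema())
	return pqarrow.WriteTable(tbl, f, tbl.NumRows(),
		parquet.NewWriterProperties(), arrProps)
}

func main() {
	mem := memory.DefaultAllocator

	// Declare the column as LargeBinary up front; storing the schema only
	// helps if the schema being stored uses the large type.
	schema := arrow.NewSchema(
		[]arrow.Field{{Name: "payload", Type: arrow.BinaryTypes.LargeBinary}},
		nil)

	bldr := array.NewBinaryBuilder(mem, arrow.BinaryTypes.LargeBinary)
	defer bldr.Release()
	bldr.Append([]byte("hello"))
	arr := bldr.NewArray()
	defer arr.Release()

	col := arrow.NewColumnFromArr(schema.Field(0), arr)
	defer col.Release()
	tbl := array.NewTable(schema, []arrow.Column{col}, 1)
	defer tbl.Release()

	if err := writeWithStoredSchema(tbl, "large.parquet"); err != nil {
		log.Fatal(err)
	}

	// Reading back: the stored Arrow schema makes pqarrow map the column
	// to LargeBinary instead of Binary.
	f, err := os.Open("large.parquet")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	out, err := pqarrow.ReadTable(context.Background(), f,
		parquet.NewReaderProperties(mem), pqarrow.ArrowReadProperties{}, mem)
	if err != nil {
		log.Fatal(err)
	}
	defer out.Release()
	fmt.Println(out.Schema()) // payload: type=large_binary
}
```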
This relies on the `store_schema` option described in the Arrow docs: https://arrow.apache.org/docs/cpp/parquet.html#roundtripping-arrow-types-and-schema

Error looks like:
Version and platform:
Component(s)
Go