apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[GO] array.Binary and array.String should use int64 offsets. #44806

Closed: tosinva-stripe closed this issue 11 hours ago

tosinva-stripe commented 13 hours ago

Describe the bug, including details regarding any error messages, version, and platform.

LargeBinary and LargeString use int64 offsets, but Binary and String use int32 offsets. This makes Binary and String susceptible to "slice bounds out of range" panics when the column/array holds more than about 2 GiB (2^31 bytes) of data.
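To illustrate the failure mode, here is a minimal standard-library sketch. It does not use the Arrow API; `buildOffsets` and `fitsInt32` are hypothetical helpers that only mimic how cumulative end offsets behave when stored as int32 versus int64. Only lengths are tracked, so no large buffer is allocated.

```go
package main

import (
	"fmt"
	"math"
)

// buildOffsets returns the cumulative end offsets for the given value
// lengths, stored as int32 (the Binary/String width) and as int64
// (the LargeBinary/LargeString width). Hypothetical helper, not Arrow code.
func buildOffsets(lengths []int64) (o32 []int32, o64 []int64) {
	o32 = append(o32, 0)
	o64 = append(o64, 0)
	var total int64
	for _, n := range lengths {
		total += n
		// The conversion truncates silently once total exceeds math.MaxInt32.
		o32 = append(o32, int32(total))
		o64 = append(o64, total)
	}
	return o32, o64
}

// fitsInt32 reports whether the total byte count is addressable
// with int32 offsets.
func fitsInt32(lengths []int64) bool {
	var total int64
	for _, n := range lengths {
		total += n
	}
	return total <= math.MaxInt32
}

func main() {
	// Three values of 1 GiB each (lengths only).
	lengths := []int64{1 << 30, 1 << 30, 1 << 30}
	o32, o64 := buildOffsets(lengths)
	fmt.Println("int32 offsets:", o32) // later offsets go negative
	fmt.Println("int64 offsets:", o64)
	fmt.Println("fits in int32:", fitsInt32(lengths))
}
```

Slicing a data buffer with one of the negative int32 offsets would produce the kind of panic shown in this report.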

To reproduce, try deserializing a Parquet file that is larger than 2.2 GB.

A workaround is to force the Go library to deserialize the field/column as LargeBinary instead of Binary.

The error looks like:

panic: runtime error: slice bounds out of range [:-2147483014]

goroutine 95 [running]:
github.com/apache/arrow/go/v17/arrow/array.(*Binary).Value(...)
    /go/pkg/mod/github.com/apache/arrow/go/v17@v17.0.0/arrow/array/binary.go:59
github.com/apache/arrow/go/v17/arrow/array.(*Binary).ValueStr(0xc000178d20?, 0xc091402a00?)
    /go/pkg/mod/github.com/apache/arrow/go/v17@v17.0.0/arrow/array/binary.go:67 +0xfa
extractorvalidator/data.BootstrapRecordsFromParquet({0x1de1a40, 0xcc6a9775f0}, 0x0)
    /.../data/records.go:78 +0x582
main.validationWorker({0x1dccd90, 0x2c31840}, 0x0?, {0x0?}, 0xc0000315e0, 0xc000001de0, 0xc0000fe9c0)
    /.../command.go:428 +0x125
created by main.RunValidateCmd in goroutine 1
    /.../command.go:174 +0xb90
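The negative bound in the panic is consistent with an int32 offset that wrapped around. As a rough check, assuming the offset wrapped exactly once, the intended value can be recovered by reinterpreting the bits as unsigned. `unwrapInt32` is a hypothetical helper for this arithmetic, not part of Arrow.

```go
package main

import "fmt"

// unwrapInt32 recovers the non-negative value an int32 would have held
// had it not overflowed, assuming it wrapped around exactly once.
func unwrapInt32(wrapped int32) int64 {
	return int64(uint32(wrapped))
}

func main() {
	const bound = -2147483014 // slice bound reported in the panic above
	fmt.Println(unwrapInt32(bound)) // prints 2147484282, i.e. 2^31 + 634
}
```

Under that assumption, the value being read ended 634 bytes past the 2^31 boundary.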

version and platform

Arrow Version: github.com/apache/arrow/go/v17 v17.0.0
Platform: Linux 20.04.1-Ubuntu  x86_64 x86_64 x86_64 GNU/Linux

Component(s)

Go

zeroshade commented 12 hours ago

The Go implementation has moved to the apache/arrow-go repository. Can you please move this issue to that repo? I can comment and address it there.

Thanks!

tosinva-stripe commented 11 hours ago

Moved to https://github.com/apache/arrow-go/issues/195