apache / arrow-go

Official Go implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0

[Go] Allow adding existing arrays into structs #35

Open sfc-gh-pfus opened 5 months ago

sfc-gh-pfus commented 5 months ago

Describe the enhancement requested

Hi! We have a use case in which we have existing Arrow arrays and we want to compose them into a struct. Unfortunately, StructBuilder currently does not support this, so we have to copy all values from each source Arrow array into the corresponding array in the struct, one by one. It would be great if we could just reuse the existing arrays without manual copying.

What we do now is something like this:

for j := 0; j < int(numRows); j++ {
  sb.Append(true)
  for i := 0; i < structCol.NumField(); i++ {
    newInternalCol := internalCols[i]
    if newInternalCol.DataType().ID() == arrow.STRING {
      sb.FieldBuilder(i).(*array.StringBuilder).Append(newInternalCol.(*array.String).Value(j))
    } else if newInternalCol.DataType().ID() == arrow.INT8 {
      sb.FieldBuilder(i).(*array.Int8Builder).Append(newInternalCol.(*array.Int8).Value(j))
    } // ... and so on for every supported type
  }
}
What we want:

sb.SetColumn(i, internalCols[i])

The main problem here is not the few lines of code we have to write, but the manual memory copying. Is it doable?

Component(s)

Go

zeroshade commented 5 months ago

Could you just use the existing interface https://pkg.go.dev/github.com/apache/arrow/go/v15@v15.0.2/arrow/array#NewStructArray to create your struct array?

newStruct, err := array.NewStructArray(internalCols, []string{"a", "b", "c", "d"....})

sfc-gh-pfus commented 5 months ago

Actually no, because what you suggest works on standard Go arrays. I have Arrow arrays (for instance *array.String).

zeroshade commented 5 months ago

The function I'm referring to takes a slice of arrow.Array interfaces. You would be able to do something like:

var cols []arrow.Array

sb := array.NewStringBuilder(memory.DefaultAllocator)
defer sb.Release()

sb.AppendValues([]string{"a", "b", "c"}, nil)
cols = append(cols, sb.NewStringArray())
sb.AppendValues([]string{"d", "e", "f"}, nil)
cols = append(cols, sb.NewStringArray())

newStruct, err := array.NewStructArray(cols, []string{"col1", "col2"})
if err != nil {
    // handle the error before releasing anything
}
defer newStruct.Release()

and newStruct would be a Struct array with two string columns. If this doesn't work, could you provide a more complete code example of your scenario?

sfc-gh-pfus commented 5 months ago

@zeroshade Sorry, I misunderstood your example! This is great. Are there similar functions for creating lists and maps? You probably know the topic from the Snowflake driver :) We have a use case like this:

  1. Snowflake returns data in Arrow arrays (strings, ints, etc.). We have an Arrow batches API that returns this data to the client.
  2. In many cases we return these arrays as-is, but sometimes we have to make conversions, for instance pruning decimal128 to int64, or converting a two-field struct representing a timestamp into an Arrow Timestamp.
  3. Now we are introducing structured objects, lists, and maps. So instead of arrays of "primitives" we have Arrow arrays of complex structures: structs (of various types), lists of specific types, or maps. But we still need to apply the same conversion scheme (like pruning decimal to int64) to each "internal" field.

The API you presented is great for structs. Is there anything similar for lists and maps? So for each field in the struct:

newArr := convertFunction(origArr)

And I can apply it to all arrays in the struct and build a struct from them using array.NewStructArray. Is there something like this that I can use in:

newListArr := convertFunction(listArr.ListValues())
newKeyArr := convertFunction(mapArr.Keys())
newValueArr := convertFunction(mapArr.Items())

And build new list/map?
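
For illustration, one such convertFunction might look like this minimal sketch, which narrows a decimal128 column (scale 0) down to int64. The decimal128ToInt64 name and the fits-in-64-bits assumption are mine, not the driver's actual code:

func decimal128ToInt64(decCol *array.Decimal128, mem memory.Allocator) arrow.Array {
    bld := array.NewInt64Builder(mem)
    defer bld.Release()
    for i := 0; i < decCol.Len(); i++ {
        if decCol.IsNull(i) {
            bld.AppendNull()
            continue
        }
        // assumes scale 0 and values that fit into 64 bits
        bld.Append(int64(decCol.Value(i).LowBits()))
    }
    return bld.NewArray()
}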

zeroshade commented 5 months ago

There isn't currently a fully convenient one like we have for structs, but you can do it by using the array.Data object directly:

newValues := convertFunction(listArr.ListValues())
defer newValues.Release()

newListData := array.NewData(listArr.DataType(), listArr.Len(), listArr.Data().Buffers(), []arrow.ArrayData{newValues.Data()}, listArr.NullN(), /* offset=*/0)
defer newListData.Release()

newListArr := array.NewListData(newListData)

Similarly you can construct map arrays, which are physically laid out as the equivalent of a list<element: struct<key: K, value: V>>.

That said, I think it would be a reasonable ask to add new convenience functions for this to make it easier for consumers to construct new list arrays and map arrays without having to manipulate the Data object themselves directly.

sfc-gh-pfus commented 5 months ago

@zeroshade it works! But why is this so complicated? I would never have figured it out myself :(

I only had to make one tiny change: when creating newListData we should use arrow.ListOf(newValues.DataType()) instead of listArr.DataType().
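
In other words, the corrected line from the snippet above becomes:

newListData := array.NewData(arrow.ListOf(newValues.DataType()), listArr.Len(), listArr.Data().Buffers(), []arrow.ArrayData{newValues.Data()}, listArr.NullN(), /* offset=*/0)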

Let me check if I understand how to do it for maps.

sfc-gh-pfus commented 5 months ago

That worked:

structArr, err := array.NewStructArray([]arrow.Array{keyCol, valueCol}, []string{"k", "v"})
if err != nil {
    return nil, err
}
defer structArr.Release()
newData := array.NewData(arrow.MapOf(keyCol.DataType(), valueCol.DataType()), mapCol.Len(), mapCol.Data().Buffers(), []arrow.ArrayData{structArr.Data()}, mapCol.NullN(), 0)
defer newData.Release()
return array.NewMapData(newData), nil

Thanks a lot! It would be nice to have some helper functions though :)

zeroshade commented 5 months ago

Because of the need for the offsets buffer and so on, it's tough to think of what the interface for such helper functions for lists and maps would be. What do you think would make sense as helper constructors for those?
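
For instance, one possible shape (purely a hypothetical sketch; NewListArrayFromValues is not an existing arrow-go function) would keep the original list's validity and offsets buffers and swap in the converted child:

func NewListArrayFromValues(listArr *array.List, newValues arrow.Array) *array.List {
    // assumes newValues has the same length as listArr.ListValues()
    data := array.NewData(arrow.ListOf(newValues.DataType()), listArr.Len(),
        listArr.Data().Buffers(), []arrow.ArrayData{newValues.Data()},
        listArr.NullN(), listArr.Data().Offset())
    defer data.Release()
    return array.NewListData(data)
}

Whether to propagate listArr.Data().Offset() here is exactly the sliced-array question discussed below.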

candiduslynx commented 4 months ago

newListData := array.NewData(listArr.DataType(), listArr.Len(), listArr.Data().Buffers(), []arrow.ArrayData{newValues.Data()}, listArr.NullN(), /* offset=*/0)
defer newListData.Release()

newListArr := array.NewListData(newListData)

@zeroshade I think there's one vital thing to actually address here: the offset should also be used (as the passed-in record/array may be sliced), but not for struct arrays (they are special & I don't know why). I actually implemented the bulk of the conversion in our filetypes package: https://github.com/cloudquery/filetypes/blob/7fe7b10f029d7a853bf118eed6cf156eff375807/parquet/read.go#L77-L102

zeroshade commented 4 months ago

@candiduslynx I'm very confused by that file; it looks like you're using a builder to make a deep copy of the arrays, which doesn't make any sense to me.

You have this function:

func slice(r arrow.Record) []arrow.Record {
    res := make([]arrow.Record, r.NumRows())
    for i := int64(0); i < r.NumRows(); i++ {
        res[i] = r.NewSlice(i, i+1)
    }
    return res
}

If I'm reading this right, you're slicing the record into a slice of records of exactly 1 row each? Why?

But I'm more confused by this reverseTransformArray function. reverseTransformRecord appears to loop through the columns and call reverseTransformArray with each column's type and the column itself, which means reverseTransformArray always receives a datatype identical to the datatype of the array being passed in. So something like reverseTransformFromString is always called with a string array, and you're creating a builder and building up a new array identical to the one that was passed in, rather than just calling .Retain() on the array and returning it, or using the .Copy method on the ArrayData object and then using MakeFromData. If you really wanted to make a deep copy (but why??), you could do it in a generic way by explicitly copying the buffers, regardless of the type, to create new ArrayData objects.
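
For reference, the zero-copy alternatives mentioned here look roughly like this (a sketch, with arr standing for the incoming column):

// zero-copy: take another reference and return the same array
arr.Retain()

// or: build a fresh Array wrapper over the same underlying ArrayData
dup := array.MakeFromData(arr.Data())
defer dup.Release()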

offset should also be used (as the passed in record/array may be sliced), but not for struct arrays (they are special & I don't know why).

What do you mean "special"? The offset handling for struct arrays should work precisely the same as any other type. Can you elaborate on what the issue there is?

candiduslynx commented 4 months ago

If I'm reading this right, you're slicing the record int a slice of records of exactly 1 row each? Why?

A test requirement; it'll be removed soon (we need to sort rows in tests, so we slice into single-row records).

But I'm more confused by this reverseTransformArray function. reverseTransformRecord appears to loop through the columns and call reverseTransformArray with each column's type and the column itself.

The schema passed to the reverseTransformRecord function doesn't necessarily match the schema of the record read. We have to convert some columns to other types to be better represented in Parquet (maybe we should revisit this, actually). We have similar code in our DuckDB plugin, where we use the Parquet formatting to put the data into the tables. Unfortunately, DuckDB doesn't support all of the types 1-to-1, so we convert some values. That also means that to reconstruct the record read for tests we need to perform the reverse transformation.

I'll revisit the code in the filetypes package as there seems to be some discrepancy, but overall it is what it is.

offset should also be used (as the passed in record/array may be sliced), but not for struct arrays (they are special & I don't know why).

What do you mean "special"? The offset handling for struct arrays should work precisely the same as any other type. Can you elaborate on what the issue there is?

https://github.com/cloudquery/filetypes/pull/279 I noticed that when working with sliced struct arrays, arr.Data().Offset() would return unusable info. That's because the underlying child arrays are sliced too, so constructing the struct array this way fails (you have to use 0).
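
For what it's worth, a minimal sketch of that behavior (my construction, not the filetypes code): slice a struct array and note that its children come back already sliced.

bld := array.NewStructBuilder(memory.DefaultAllocator, arrow.StructOf(
    arrow.Field{Name: "a", Type: arrow.BinaryTypes.String},
))
defer bld.Release()
fb := bld.FieldBuilder(0).(*array.StringBuilder)
for _, s := range []string{"w", "x", "y", "z"} {
    bld.Append(true)
    fb.Append(s)
}
structArr := bld.NewArray().(*array.Struct)
defer structArr.Release()

sliced := array.NewSlice(structArr, 1, 3).(*array.Struct)
defer sliced.Release()

// sliced.Data().Offset() is 1, but sliced.Field(0) already comes back as a
// 2-element slice, so passing both that offset and the already-sliced child
// to array.NewData would apply the slice twice.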

zeroshade commented 4 months ago

Tests requirement, it'll be removed soon (we need to sort rows in tests, so we slice to single-row records).

I've been wanting to add sort functions to the compute package (which already has Take & Filter). If you want to try your hand at contributing sort functions, I'd be happy to review them, and it would solve your issue of making single-row slices :)

I noticed that when working with sliced struct arrays the arr.Data().Offset() would return unusable info. That's because the underlying arrays are sliced, too, so constructing the struct array this way fails (you have to use 0).

I'd be curious what a minimal reproducer of this might look like. In theory, for a sliced array, only the top level needs the offset. So if your child arrays are already sliced, it makes sense that you don't need an offset on the top struct, as offsets only relate to the Array they are attached to.

So I can think of three possible situations here:

candiduslynx commented 4 months ago

  • There's a mistake somewhere in the code when constructing the arrays, regarding what should or shouldn't be sliced and where
func (a *Struct) setData(data *Data) {
    a.array.setData(data)
    a.fields = make([]arrow.Array, len(data.childData))
    for i, child := range data.childData {
        if data.offset != 0 || child.Len() != data.length {
            sub := NewSliceData(child, int64(data.offset), int64(data.offset+data.length))
            a.fields[i] = MakeFromData(sub)
            sub.Release()
        } else {
            a.fields[i] = MakeFromData(child)
        }
    }
}

I think this is the culprit.

zeroshade commented 4 months ago

I'll take a look later this week and see what I can figure out.

candiduslynx commented 2 weeks ago

cc: @yevgenypats @erezrokah (seems like the repo moved)