Open sfc-gh-pfus opened 5 months ago
Could you just use the existing interface https://pkg.go.dev/github.com/apache/arrow/go/v15@v15.0.2/arrow/array#NewStructArray to create your struct array?
```go
newStruct, err := array.NewStructArray(internalCols, []string{"a", "b", "c", "d", ...})
```
Actually no, because what you suggest works on standard Go slices. I have an Arrow array (for instance `*array.String`).
The function I'm referring to takes a slice of `arrow.Array` interfaces. You would be able to do something like:
```go
var cols []arrow.Array

sb := array.NewStringBuilder(memory.DefaultAllocator)
defer sb.Release()

sb.AppendValues([]string{"a", "b", "c"}, nil)
cols = append(cols, sb.NewStringArray())

sb.AppendValues([]string{"d", "e", "f"}, nil)
cols = append(cols, sb.NewStringArray())

newStruct, err := array.NewStructArray(cols, []string{"col1", "col2"})
defer newStruct.Release()
```
and `newStruct` would be a `Struct` array with two string columns. If this doesn't work, could you provide a more complete code example of your scenario?
@zeroshade Sorry, I misunderstood your example! This is great. You probably know the topic from the Snowflake driver :) We have a use case like this: the API you presented works well for structs, where for each field I can do

```go
newArr := convertFunction(origArr)
```

apply it to all arrays in the struct, and build a new struct from them using `array.NewStructArray`. Is there anything similar for lists and maps, so I can use it like this:

```go
newListArr := convertFunction(listArr.ListValues())
newKeyArr := convertFunction(mapArr.Keys())
newValueArr := convertFunction(mapArr.Items())
```

and build a new list/map from the results?
There isn't currently a convenience function like the one we have for structs, but you can easily do so by using the `array.Data` object directly:
```go
newValues := convertFunction(listArr.ListValues())
defer newValues.Release()

newListData := array.NewData(listArr.DataType(), listArr.Len(),
	listArr.Data().Buffers(), []arrow.ArrayData{newValues.Data()},
	listArr.NullN(), /*offset=*/ 0)
defer newListData.Release()

newListArr := array.NewListData(newListData)
```
Similarly you can construct map arrays, which are physically laid out as equivalent to a `list<element: struct<key: K, value: V>>`.
That said, I think it would be a reasonable ask to add new convenience functions for this to make it easier for consumers to construct new list arrays and map arrays without having to manipulate the Data object themselves directly.
@zeroshade it works! But why is this so complicated? I would never have figured it out myself :(
I only had to make one tiny change: when creating `newListData` we should use `arrow.ListOf(newValues.DataType())` instead of `listArr.DataType()`.
Let me check if I understand how to do it for maps.
That worked:
```go
structArr, err := array.NewStructArray([]arrow.Array{keyCol, valueCol}, []string{"k", "v"})
if err != nil {
	return nil, err
}
defer structArr.Release()

newData := array.NewData(arrow.MapOf(keyCol.DataType(), valueCol.DataType()),
	mapCol.Len(), mapCol.Data().Buffers(), []arrow.ArrayData{structArr.Data()},
	mapCol.NullN(), 0)
defer newData.Release()

return array.NewMapData(newData), nil
```
Thanks a lot! It would be nice to have some helper functions though :)
Because of the need for the offsets buffer and so on, it's tough to think of what the interface for such helper functions for Lists and Maps would be. What do you think would make sense for helper constructors for those?
> ```go
> newListData := array.NewData(listArr.DataType(), listArr.Len(), listArr.Data().Buffers(), []arrow.ArrayData{newValues.Data()}, listArr.NullN(), /*offset=*/ 0)
> defer newListData.Release()
> newListArr := array.NewListData(newListData)
> ```
@zeroshade I think there's one vital thing to actually address here: offset should also be used (as the passed in record/array may be sliced), but not for struct arrays (they are special & I don't know why). I actually implemented the bulk of the conversion in our filetypes package: https://github.com/cloudquery/filetypes/blob/7fe7b10f029d7a853bf118eed6cf156eff375807/parquet/read.go#L77-L102
@candiduslynx I'm very confused by that file; it looks like you're using a builder to make a deep copy of the arrays, which doesn't make any sense to me.
You have this function:
```go
func slice(r arrow.Record) []arrow.Record {
	res := make([]arrow.Record, r.NumRows())
	for i := int64(0); i < r.NumRows(); i++ {
		res[i] = r.NewSlice(i, i+1)
	}
	return res
}
```
If I'm reading this right, you're slicing the record into a slice of records of exactly 1 row each? Why?
But I'm more confused by this `reverseTransformArray` function. `reverseTransformRecord` appears to loop through the columns and call `reverseTransformArray` with each column's type and the column itself. That means `reverseTransformArray` receives a datatype which is always identical to the datatype of the array being passed in, so something like `reverseTransformFromString` is always called with a string array. You're creating a builder and building up a new array identical to the one that was passed in, rather than just calling `.Retain()` on the array and returning it, or using the `.Copy` method on the `ArrayData` object and then calling `MakeFromData`. If you really wanted to make a deep copy (but why??), you could do it in a generic way by explicitly copying the buffers, regardless of the type, to create new ArrayData objects.
> offset should also be used (as the passed in record/array may be sliced), but not for struct arrays (they are special & I don't know why).
What do you mean "special"? The offset handling for struct arrays should work precisely the same as any other type. Can you elaborate on what the issue there is?
> If I'm reading this right, you're slicing the record into a slice of records of exactly 1 row each? Why?
Tests requirement, it'll be removed soon (we need to sort rows in tests, so we slice to single-row records).
> But I'm more confused by this `reverseTransformArray` function. `reverseTransformRecord` appears to loop through the columns and call `reverseTransformArray` with each column's type and the column itself.
The schema passed to the `reverseTransformRecord` function doesn't necessarily match the schema in the record read.
We have to convert some columns to other types so they are better represented in Parquet (maybe we should revisit this, actually). We have similar code in our duckdb plugin, where we use the Parquet formatting to put the data into the tables. Unfortunately, DuckDB doesn't support all of the types 1-to-1, so we convert some values. That also means that to reconstruct the record read for tests we need to perform the reverse transformation.
I'll revisit the code in the `filetypes` package, as there seems to be some discrepancy, but overall it is what it is.
> offset should also be used (as the passed in record/array may be sliced), but not for struct arrays (they are special & I don't know why).
>
> What do you mean "special"? The offset handling for struct arrays should work precisely the same as any other type. Can you elaborate on what the issue there is?
https://github.com/cloudquery/filetypes/pull/279
I noticed that when working with sliced struct arrays, `arr.Data().Offset()` would return unusable info. That's because the underlying arrays are sliced too, so constructing the struct array this way fails (you have to use 0).
> Tests requirement, it'll be removed soon (we need to sort rows in tests, so we slice to single-row records).
I've been wanting to add sort functions to the compute package (which already has Take and Filter). If you want to try your hand at contributing sort functions, I'd be happy to review them, and it would solve your issue of making single-row slices :)
> I noticed that when working with sliced struct arrays `arr.Data().Offset()` would return unusable info. That's because the underlying arrays are sliced, too, so constructing the struct array this way fails (you have to use 0).
I'd be curious what a minimal reproducer of this might look like. In theory for a sliced array, only the top-level needs the offset. So if your child arrays are already sliced, then it makes sense you don't need to have an offset in the top struct as offsets are only related to the Array they are attached to.
So I can think of three possible situations here:
- There's a mistake somewhere in the code constructing the arrays, regarding what should or shouldn't be sliced and where
```go
func (a *Struct) setData(data *Data) {
	a.array.setData(data)
	a.fields = make([]arrow.Array, len(data.childData))
	for i, child := range data.childData {
		if data.offset != 0 || child.Len() != data.length {
			sub := NewSliceData(child, int64(data.offset), int64(data.offset+data.length))
			a.fields[i] = MakeFromData(sub)
			sub.Release()
		} else {
			a.fields[i] = MakeFromData(child)
		}
	}
}
```
I think this is the culprit. I'll take a look later this week and see what I can figure out.
cc: @yevgenypats @erezrokah (seems like the repo moved)
Describe the enhancement requested
Hi! We have a use case in which we have existing Arrow arrays and we want to compose them into a struct. Unfortunately, `StructBuilder` currently does not support this, so we have to copy all values from the specific Arrow array to the corresponding array in the struct one by one. It would be great if we could just reuse the existing array without manual copying. What we do now? Something like this:
What we want:
The main problem here is not the few lines of code we need to write, but the manual memory copying. Is it doable?
Component(s)
Go