apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Go][Parquet] Writing a Parquet file from a slice of structs #37807

Open tschaub opened 9 months ago

tschaub commented 9 months ago

Describe the usage question you have. Please include as many useful details as possible.

I'm hoping to get suggestions on the best way to use the library to write a Parquet file given a slice of structs (Golang structs instead of Arrow's array.Struct).

The schema.NewSchemaFromStruct() function looks like a useful starting point for generating a Parquet schema from a struct.

The pqarrow.NewFileWriter() function is helpful for creating a writer. And I can see how to convert a Parquet schema to an Arrow schema with the pqarrow.FromParquet() function.

The writer.WriteBuffered() method looks like a convenient way to write an Arrow record. So the remaining gap is getting from a slice of structs to an Arrow record.

I was looking for something like array.RecordFromSlice(). The array.RecordFromStructArray() looks useful, but I think I would have to do a fair bit of reflection to work with the struct builder. It looks like array.RecordFromJSON() does the same sort of reflection that I would have to do to use the struct builder.

I know it is not efficient, but I see that I can encode my struct slice as JSON and then generate a record from that. Here is a working test that uses the pqarrow.FileWriter to write a slice of structs as Parquet:

package pqarrow_test

import (
    "bytes"
    "encoding/json"
    "strings"
    "testing"

    "github.com/apache/arrow/go/v14/arrow/array"
    "github.com/apache/arrow/go/v14/arrow/memory"
    "github.com/apache/arrow/go/v14/parquet"
    "github.com/apache/arrow/go/v14/parquet/pqarrow"
    "github.com/apache/arrow/go/v14/parquet/schema"
    "github.com/stretchr/testify/require"
)

func TestFileWriterFromStructSlice(t *testing.T) {
    type Row struct {
        Name  string `parquet:"name=name, logical=String" json:"name"`
        Count int    `parquet:"name=count" json:"count"`
    }

    rows := []*Row{
        {
            Name:  "row-1",
            Count: 42,
        },
        {
            Name:  "row-2",
            Count: 100,
        },
    }

    data, err := json.Marshal(rows)
    require.NoError(t, err)

    parquetSchema, err := schema.NewSchemaFromStruct(rows[0])
    require.NoError(t, err)

    arrowSchema, err := pqarrow.FromParquet(parquetSchema, nil, nil)
    require.NoError(t, err)

    rec, _, err := array.RecordFromJSON(memory.DefaultAllocator, arrowSchema, strings.NewReader(string(data)))
    require.NoError(t, err)

    output := &bytes.Buffer{}

    writer, err := pqarrow.NewFileWriter(arrowSchema, output, parquet.NewWriterProperties(), pqarrow.DefaultWriterProps())
    require.NoError(t, err)

    require.NoError(t, writer.WriteBuffered(rec))
    require.NoError(t, writer.Close())
}

Again, I know there are more efficient ways to go from a slice of structs to a Parquet file. I'm just looking for advice on the most "ergonomic" way to use this library to do that. Am I missing a way to construct an Arrow record from a slice of structs? Or should I not be using the pqarrow package at all to do this?

Component(s)

Go, Parquet

zeroshade commented 9 months ago

So, first and foremost: you're completely right, there isn't currently a good / efficient way to convert a slice of structs to an Arrow record / struct array. My initial reaction would be to suggest converting the structs to JSON and then using RecordFromJSON, but you'd still have to write the reflection to generate the Arrow schema (all of the FromJSON methods require you to provide an existing Arrow schema; I never implemented the reflection to generate one). But you're right that this would be even less efficient.

The most "ergonomic" way to do this would likely to bypass the conversion to arrow in the first place and just use the column chunk writers directly from the file package.

That all said, it would probably be pretty useful if we did implement a full reflection-based way of converting a struct to an Arrow schema (like what already exists for converting a struct to a Parquet schema), or to instantiate a RecordBuilder or StructBuilder from a struct and then allow appending a slice of that struct to the builder. See the sketch below for what that has to look like by hand today.
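Today this kind of appending has to be written per struct against the builders, along the lines of the following sketch (the schema and the appends are spelled out manually; a reflection-based helper would generate them from the struct):

package main

import (
    "fmt"

    "github.com/apache/arrow/go/v14/arrow"
    "github.com/apache/arrow/go/v14/arrow/array"
    "github.com/apache/arrow/go/v14/arrow/memory"
)

type Row struct {
    Name  string
    Count int64
}

func recordFromRows(rows []*Row) arrow.Record {
    // Hand-written schema; the proposed helper would derive this from the struct via reflection.
    arrowSchema := arrow.NewSchema([]arrow.Field{
        {Name: "name", Type: arrow.BinaryTypes.String},
        {Name: "count", Type: arrow.PrimitiveTypes.Int64},
    }, nil)

    bldr := array.NewRecordBuilder(memory.DefaultAllocator, arrowSchema)
    defer bldr.Release()

    // Hand-written appends; these would likewise be generated generically.
    for _, r := range rows {
        bldr.Field(0).(*array.StringBuilder).Append(r.Name)
        bldr.Field(1).(*array.Int64Builder).Append(r.Count)
    }
    return bldr.NewRecord()
}

func main() {
    rec := recordFromRows([]*Row{{Name: "row-1", Count: 42}, {Name: "row-2", Count: 100}})
    defer rec.Release()
    fmt.Println(rec.NumRows(), "rows")
}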

chelseajonesr commented 9 months ago

@tschaub I have an initial version of this using reflection here, in case this is helpful: https://github.com/chelseajonesr/rfarrow

I'm using this for a specific use case so some conversions may not have been tested; feel free to let me know if anything doesn't work.

tschaub commented 9 months ago

Looks useful, @chelseajonesr.

My only real current use case has been creating Parquet data for tests. I've written a test.ParquetFromJSON() function for this purpose. Maybe also specific to my use case, but it relies on incrementally building up a schema from a configurable number of input (JSON) rows, to allow for cases where nulls are present in early rows and the appropriate field type isn't known until more data is read. So I have an Arrow schema builder for this. It doesn't yet cover all the types you might encounter with an arbitrary struct; I'm just adding support for the cases I need to handle.

So while I think it could be useful to have something in this library to generate Arrow data from a slice of structs (to complement the current schema.NewSchemaFromStruct() function), I just wanted to say that I don't have an urgent need for this now. I'll close this unless someone else thinks it is a worthwhile issue to keep open.

eest commented 9 months ago

I found this issue while trying to figure out how to write out a Parquet file based on a Go struct as well. In my case I already have some data in an Arrow data structure, which I have been able to write out via the pqarrow package, but I also need to write out some other data where I am not using Arrow data structures for storage. Given the separation between the pqarrow convenience package and the more general file package, I was not sure if the latter aims to be a "general" Parquet writer for use even when Arrow is not used for the input data.

As a comparison, the https://github.com/xitongsys/parquet-go package basically allows you to do:

pw, err := writer.NewParquetWriterFromWriter(..., new(myStruct), ...)
pw.Write(myStructInstance)

I was trying to see if there was something like that in the file package but wasn't able to find it. Of course, it would make sense if what I'm looking for is out of scope for this project and I should just use another package for that, like the one above.

zeroshade commented 9 months ago

@eest You're correct that the file package is intended to be a "general" Parquet writer for use even when Arrow is not used for the input data. The idea was that all Arrow-specific things would be contained in the pqarrow package, while the rest are general Parquet packages.

Issues I had encountered with https://github.com/xitongsys/parquet-go were actually the motivating factor that led to me creating the parquet package here, and are why I created the initial methods that let you create a Parquet schema from a struct and vice versa. Having a writer that lets you write instances of a struct was out of scope when I was originally building this package, but given the nature of Go it does seem like a reasonable addition. I just never did it because I didn't see any interest in it until now.
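For reference, the struct-to-schema direction lives in the parquet/schema package; a minimal sketch (untested, just showing struct tags going in and a printed Parquet schema coming out):

package main

import (
    "os"

    "github.com/apache/arrow/go/v14/parquet/schema"
)

type Row struct {
    Name  string `parquet:"name=name, logical=String"`
    Count int64  `parquet:"name=count"`
}

func main() {
    // Build a Parquet schema from the struct tags.
    sc, err := schema.NewSchemaFromStruct(Row{})
    if err != nil {
        panic(err)
    }
    // Print the derived schema to stdout.
    schema.PrintSchema(sc.Root(), os.Stdout, 2)
}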

If either of you would be willing to put together a PR adding this functionality, I'd happily help iterate on it with you. Unfortunately I don't currently have the bandwidth to work on adding it myself.

zeroshade commented 9 months ago

I could also see @chelseajonesr's repo being a good starting point for this kind of functionality, and I'd happily accept a refined version into the Arrow library itself if a PR is made.

chelseajonesr commented 9 months ago

@zeroshade Sure, I'd be happy to. I'll also need to fill in a few data types I left out and expand the tests; I should be able to start on that next week.