apache / arrow-go

Official Go implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0

[Go] Schema inference on `RecordFromJSON` and `TableFromJSON` functions #30

Open agchang opened 4 months ago

agchang commented 4 months ago

Describe the enhancement requested

I would like the RecordFromJSON and TableFromJSON functions to support schema inference, since both currently require an arrow.Schema up front. I can try to contribute this if people think it makes sense. I noticed that for CSV there is NewInferringReader, which simply infers the column types from the first row.

Component(s)

Go

joellubi commented 4 months ago

Hi @agchang, thanks for opening this issue. I think this feature may well make sense to implement, and we would welcome your contribution if you decide to take it on!

I'll write down a few of my thoughts because something like this will generally involve some tradeoffs:

Unlike CSV, where changing the number of columns between rows is invalid, JSON allows the "schema" to change from element to element. This can mean adding or removing a field between rows, or even having entirely disjoint sets of fields.

If we want to go with the latter approach, my recommendation would be to focus on a dedicated implementation of the "first-pass" which infers an Arrow schema from JSON. We can then just use the output of this function as input to the existing ones:

func InferSchemaFromJSON(r io.Reader) (*arrow.Schema, error) { ... } // This needs to be implemented

func main() {
  jsonBlob := `{ ... }`

  schema, err := InferSchemaFromJSON(strings.NewReader(jsonBlob))
  if err != nil {
    log.Fatal(err)
  }

  table, err := array.TableFromJSON(memory.DefaultAllocator, schema, []string{jsonBlob})
  if err != nil {
    log.Fatal(err)
  }

  // do table stuff
}

loicalleyne commented 1 week ago

@agchang I made bodkin to address the schema generation issue.