[Go] Schema inference on `RecordFromJSON` and `TableFromJSON` functions

agchang commented 4 months ago

Describe the enhancement requested

I would like the RecordFromJSON and TableFromJSON functions to support schema inference, since both currently require an arrow.Schema up front. I can try to contribute this if people think it makes sense. I noticed that for CSV there is NewInferringReader, which just takes each column's type from the first row.

Component(s)

Go

joellubi commented 4 months ago

Hi @agchang, thanks for opening this issue. I think this feature very well may make sense to implement, and we would welcome your contribution if you decide to do so!

I'll write down a few of my thoughts because something like this will generally involve some tradeoffs:

Unlike in CSV where changing the number of columns between rows is invalid, JSON allows changes to the "schema" element-by-element. This can mean adding/removing a field between rows or even having entirely disjoint sets of fields.

- A simple approach may be to set the schema using the fields from the first row. Fields that are missing in subsequent rows can be set to null, and fields that are added can be ignored.
- A more robust but more complicated approach would be to grow the schema row-by-row as new fields are encountered. The resulting schema would be the union of all fields encountered across the rows.
  - This may be impractical with a single pass over the JSON, as it would require instantiating and backfilling arrays every time a new field is encountered, and it wouldn't work at all if writing batches.
  - Alternatively, two passes can be taken. The first builds up a list of all fields present across all rows, with their inferred types. The second pass can use this to set a fixed schema and simply reuse the existing *FromJSON() functions.
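The simple first-row approach can be sketched with the standard library alone. This is only an illustration under stated assumptions: the input is taken to be a stream of top-level JSON objects, and projectOntoFirstRow and projectString are hypothetical helper names, not part of the Arrow Go API. Values are kept as decoded Go values rather than written into Arrow builders.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"sort"
	"strings"
)

// projectOntoFirstRow (hypothetical helper) reads a stream of JSON objects and
// fixes the field set from the first object. Later rows get nil for fields
// they are missing, and fields not present in the first row are dropped.
func projectOntoFirstRow(r io.Reader) ([]string, [][]any, error) {
	var fields []string
	var rows [][]any
	dec := json.NewDecoder(r)
	for {
		var obj map[string]any
		err := dec.Decode(&obj)
		if err == io.EOF {
			return fields, rows, nil
		}
		if err != nil {
			return nil, nil, err
		}
		if fields == nil {
			// The first row decides the schema; sort for a stable column order.
			fields = make([]string, 0, len(obj))
			for name := range obj {
				fields = append(fields, name)
			}
			sort.Strings(fields)
		}
		row := make([]any, len(fields))
		for i, name := range fields {
			row[i] = obj[name] // nil when the field is absent from this row
		}
		rows = append(rows, row)
	}
}

// projectString is a convenience wrapper for string input.
func projectString(s string) ([]string, [][]any, error) {
	return projectOntoFirstRow(strings.NewReader(s))
}

func main() {
	// The second row lacks "b" and adds "c"; "c" is ignored.
	fields, rows, err := projectString(`{"a": 1, "b": "x"}
{"a": 2, "c": true}`)
	if err != nil {
		panic(err)
	}
	fmt.Println(fields)
	for _, row := range rows {
		fmt.Println(row)
	}
}
```

The tradeoff is visible in the example input: the field "c" never reaches the output because it first appears in the second row.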

If we want to go with the two-pass approach, my recommendation would be to focus on a dedicated implementation of the "first pass", which infers an Arrow schema from JSON. We can then just use the output of this function as input to the existing ones:

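// Assumes the arrow, arrow/array and arrow/memory packages are imported,
// along with io, log and strings.

// InferSchemaFromJSON is the proposed first pass. It does not exist yet and needs to be implemented.
func InferSchemaFromJSON(r io.Reader) (*arrow.Schema, error) { ... }

func main() {
  jsonBlob := `{ ... }`

  // First pass: infer the schema from the JSON itself.
  schema, err := InferSchemaFromJSON(strings.NewReader(jsonBlob))
  if err != nil {
    log.Fatal(err)
  }

  // Second pass: reuse the existing function with the inferred schema.
  table, err := array.TableFromJSON(memory.DefaultAllocator, schema, []string{jsonBlob})
  if err != nil {
    log.Fatal(err)
  }
  defer table.Release()

  // do table stuff
}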
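As a rough illustration of what the first pass might do, the sketch below unions field names across rows and records a coarse kind for each. It uses only the standard library and makes several assumptions: the input is a stream of top-level JSON objects, only top-level fields are inspected, and the kind names ("float64", "list", "mixed", and so on) are placeholders that a real implementation would map to arrow.DataType values. inferFields, inferFieldsString, kindOf and mergeKinds are hypothetical names, not existing Arrow functions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"sort"
	"strings"
)

// kindOf names the JSON kind of a value decoded by encoding/json.
func kindOf(v any) string {
	switch v.(type) {
	case nil:
		return "null"
	case bool:
		return "bool"
	case float64:
		return "float64"
	case string:
		return "string"
	case []any:
		return "list"
	case map[string]any:
		return "struct"
	}
	return "unknown"
}

// mergeKinds combines the kinds seen for one field: null yields to any other
// kind, equal kinds are kept, and anything else is marked "mixed".
func mergeKinds(a, b string) string {
	switch {
	case a == b, b == "null":
		return a
	case a == "null":
		return b
	}
	return "mixed"
}

// inferFields makes one pass over a stream of JSON objects and returns the
// union of all field names with their merged kinds.
func inferFields(r io.Reader) (map[string]string, error) {
	kinds := map[string]string{}
	dec := json.NewDecoder(r)
	for {
		var obj map[string]any
		err := dec.Decode(&obj)
		if err == io.EOF {
			return kinds, nil
		}
		if err != nil {
			return nil, err
		}
		for name, v := range obj {
			k := kindOf(v)
			if prev, ok := kinds[name]; ok {
				k = mergeKinds(prev, k)
			}
			kinds[name] = k
		}
	}
}

// inferFieldsString is a convenience wrapper for string input.
func inferFieldsString(s string) (map[string]string, error) {
	return inferFields(strings.NewReader(s))
}

func main() {
	kinds, err := inferFieldsString(`{"id": 1, "name": null}
{"id": 2, "name": "a", "tags": ["x"]}`)
	if err != nil {
		panic(err)
	}
	names := make([]string, 0, len(kinds))
	for name := range kinds {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		fmt.Printf("%s: %s\n", name, kinds[name])
	}
}
```

How to resolve a "mixed" field (widen to string, fail, or pick the first kind seen) is a design decision this sketch leaves open, as is inference for nested structs and list element types.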

loicalleyne commented 1 week ago

@agchang I made bodkin to address the schema generation issue.

apache / arrow-go

[Go] Schema inference on `RecordFromJSON` and `TableFromJSON` functions #30
