MaterializeInc / materialize

The Cloud Operational Data Store: use SQL to transform, deliver, and act on fast-changing data.
https://materialize.com
Other
5.72k stars 466 forks source link

Avro unions are not supported in arrays (and probably not in maps either) #8917

Open umanwizard opened 2 years ago

umanwizard commented 2 years ago

Materialize can't ingest a topic with this schema:

{
    "type": "record",
    "name": "r1",
    "fields": [
        {
            "name": "f1",
            "type": {
                "type": "array",
                "items": ["null", "string"]
            }
        }
    ]
}

The error message:

ERROR:  validating avro value schema: Unsupported type in schema: Union(UnionSchema { schemas: [Piece(Null), Piece(String)], anon_variant_index: {Null: 0, String: 1}, named_variant_index: {} })

What's happening here is that in validate_schema_2, we force the Avro type to be a type that can be interpreted by Materialize, which unions can't (Materialize has no sum types). In other contexts, such as record fields, we hack in support as follows:

  1. If the union is ["null", "something"], we interpret it as a nullable something (the one special case of sum type that SQL actually supports)
  2. Otherwise, we expand an n-branch union to N columns, each of which is null if and only if the corresponding union branch is uninhabited.

But we're not going into that code path for the type of array elements.

We could fix that, or (my preference) we could finally add real sum types to Materialize, but I'm not sure how much support from the SQL team that needs. (Also, it'd be a pretty serious BC break)

justjake commented 2 years ago

The bummer here is that Union(UnionSchema { schemas: [Piece(Null), Piece(String)] is the default representation of Postgres's string[] / uuid[] native data type in Debezium w/ the Avro converter. There's no way in Postgres DDL to specify "array of non-nullable values"; the best we can do is a check constraint on the array column that verifies no array element is null.

morsapaes commented 2 years ago

@justjake, an update ICYMI: the fix for nullability-pattern unions was released in v0.26.4. This should make PostgreSQL array types tractable downstream!

justjake commented 2 years ago

Fantastic!

benesch commented 6 months ago

Thanks for sending this our way, @chaas! Agree that this is a storage issue and not an adapter issue. I'm moving this off the storage team board since we've already got it triaged in our mega tracker issue: https://github.com/MaterializeInc/materialize/issues/24755.

benesch commented 2 months ago

Linking in an option about how to do this without proper sum types: https://github.com/MaterializeInc/materialize/issues/28265#issuecomment-2234988861