beam-community / avro_ex

An Avro Library that emphasizes testability and ease of use.
https://hexdocs.pm/avro_ex/AvroEx.html
67 stars 27 forks source link

AvroEx.Schema.Parser - Hand-rolled schema parser #62

Closed davydog187 closed 2 years ago

davydog187 commented 2 years ago

The goal of this PR is to

  1. Directly support parsing Elixir terms
  2. Seamlessly parse Elixir terms or json
  3. Detailed error messages for invalid Avro schemas
  4. Add better support for all schema types defined in the Avro 1.11.0 Specification.

Closes #40 Closes #60
Closes #61 Closes #55

Schema features

davydog187 commented 2 years ago

I'm having some questions about the spec with respect to how full names are resolved, and what is and isn't allowed in terms of duplicates.

Testing with avro-js, I get the following results

Duplicate field names in a Record

❌ Not allowed

> avro.parse({type: "record", name: "a", fields: [{name: "a", type: "string"}, {name: "a", type: "string"}]})
Uncaught Error: duplicate a field name
    at new RecordType (/Users/dave/Development/avro_ex/javascript/node_modules/avro-js/lib/schemas.js:1429:11)
    at /Users/dave/Development/avro_ex/javascript/node_modules/avro-js/lib/schemas.js:136:14
    at Object.createType (/Users/dave/Development/avro_ex/javascript/node_modules/avro-js/lib/schemas.js:137:7)
    at Object.parse (/Users/dave/Development/avro_ex/javascript/node_modules/avro-js/lib/files.js:74:13)

Duplicate field names in separate Records

✅ Allowed

> avro.parse({type: "record", name: "a", fields: [{name: "a", type: "string"}, {name: "b", type: {type: "record", name: "inner", fields: [{name: "a", type: "string"}]}}]})
RecordType {
  _name: 'a',
  _aliases: [],
  _type: 'record',
  _fields: [
    Field { _name: 'a', _type: StringType {}, _aliases: [], _order: 1 },
    Field { _name: 'b', _type: [RecordType], _aliases: [], _order: 1 }
  ],
  _constructor: [Function: a] { getType: [Function (anonymous)] },
  _read: [Function: reada],
  _skip: [Function: skipa],
  _write: [Function: writea],
  _check: [Function: checka]
}

However, it does not seem to respect namespaces on Record fields, which tells me I'm either misunderstanding the spec, or this implementation has issues.

The spec reads

In record, enum and fixed definitions, the fullname is determined in one of the following ways:

A name and namespace are both specified. For example, one might use "name": "X", "namespace": "org.foo" to indicate the fullname org.foo.X. A fullname is specified. If the name specified contains a dot, then it is assumed to be a fullname, and any namespace also specified is ignored. For example, use "name": "org.foo.X" to indicate the fullname org.foo.X. A name only is specified, i.e., a name that contains no dots. In this case the namespace is taken from the most tightly enclosing schema or protocol. For example, if "name": "X" is specified, and this occurs within a field of the record definition of org.foo.Y, then the fullname is org.foo.X. If there is no enclosing namespace then the null namespace is used. References to previously defined names are as in the latter two cases above: if they contain a dot they are a fullname, if they do not contain a dot, the namespace is the namespace of the enclosing definition.

Primitive type names have no namespace and their names may not be defined in any namespace.

A schema or protocol may not contain multiple definitions of a fullname. Further, a name must be defined before it is used ("before" in the depth-first, left-to-right traversal of the JSON parse tree, where the types attribute of a protocol is always deemed to come "before" the messages attribute.)

This leads me to believe the following assertions:

❌ Record fields should not contain duplicates ✅ Fields in different records can have the same name, only if they have a different namespace, and thus different fullname ✅ Namespaces are inherited from immediately enclosing definition ❌ Namespaces are not inherited from ancestors beyond parent

davydog187 commented 2 years ago

It seems that https://github.com/apache/avro/pull/1439 and https://github.com/apache/avro/pull/1573 seem to clarify the rules around namespace resolution. I'll dig into that more

davydog187 commented 2 years ago

Asking for feedback here https://github.com/apache/avro/pull/1439/files#r818203665

davydog187 commented 2 years ago

Opening this for review while I get over the final hurdle, namespace propagation. @doomspork if you want to give some review please do. Note that once the tests are passing and credo is happy, I will likely merge so I can move onto removing ecto and the rest of the cleanup, which will be very noisy