apache / avro-rs

Rust SDK for Apache Avro - a data serialization system.
https://avro.apache.org/
Apache License 2.0

Schema representation - 'importing' types #64

Open chupaty opened 2 days ago

chupaty commented 2 days ago

The specification states the following:

A schema or protocol may not contain multiple definitions of a fullname. Further, a name must be defined before it is used (“before” in the depth-first, left-to-right traversal of the JSON parse tree, where the types attribute of a protocol is always deemed to come “before” the messages attribute.)

This means that the following schema is valid (the A type is defined in b_field_one and is only referenced by name, not redefined, in b_field_two):

{
  "name":"B",
  "type":"record",
  "fields":[
    {
      "name":"b_field_one",
      "type":{"name":"A","type":"record","fields":[]}
    },
    {
      "name":"b_field_two",
      "type":"A"
    }
  ]
}
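To make the "defined before use" rule concrete, here is a minimal sketch of the check in plain Rust. The `Node` type and `check_names` function are hypothetical and only model records, references and primitives; they are not the avro-rs `Schema` type or its parser.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified schema model (NOT the avro-rs `Schema` type).
#[derive(Debug, Clone, PartialEq)]
enum Node {
    /// A record definition with its fields in declaration order.
    Record { name: String, fields: Vec<(String, Node)> },
    /// A reference by name to a type defined elsewhere.
    Ref(String),
    /// A primitive such as "string" or "long".
    Primitive(String),
}

/// Walks the tree depth-first, left to right, and returns the first
/// violation found: a duplicate definition, or a name used before it
/// has been defined.
fn check_names(node: &Node, defined: &mut HashSet<String>) -> Result<(), String> {
    match node {
        Node::Record { name, fields } => {
            // `insert` returns false if the name was already present.
            if !defined.insert(name.clone()) {
                return Err(format!("duplicate definition of {}", name));
            }
            // The name is registered before the fields are visited, so a
            // record may refer to itself.
            for (_, ty) in fields {
                check_names(ty, defined)?;
            }
            Ok(())
        }
        Node::Ref(name) => {
            if defined.contains(name) {
                Ok(())
            } else {
                Err(format!("{} is used before it is defined", name))
            }
        }
        Node::Primitive(_) => Ok(()),
    }
}
```

Under this model, B with A defined in b_field_one and referenced in b_field_two passes, while defining A in both fields, or only referencing it in both, is rejected.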

This crate makes it easy to load schemata as a sequence of schemas (Schema::parse_list).

However, I can't find a way to output an individual schema from the schemata that is 'complete' (i.e. does not depend on other schemas, and otherwise complies with the rules). For example:

    let schema_str_1 = r#"{
        "name": "A",
        "doc": "A's schema",
        "type": "record",
        "fields": [
        ]
    }"#;
    let schema_str_2 = r#"{
        "name": "B",
        "doc": "B's schema",
        "type": "record",
        "fields": [
            {"name": "b_field_one", "type": "A"},
            {"name": "b_field_two", "type": "A"}
        ]
    }"#;
    let schema_strs = [schema_str_1, schema_str_2];
    let schemata = Schema::parse_list(&schema_strs)?;

    for d in schemata {
        println!("{}",d.canonical_form());
    }

This gives the following canonical forms (A looks fine to me, but B is problematic because it contains no definition for A):

{"name":"A","type":"record","fields":[]}
{"name":"B","type":"record","fields":[{"name":"b_field_one","type":"A"},{"name":"b_field_two","type":"A"}]}

I believe the canonical form for B should actually be:

{"name":"B","type":"record","fields":[{"name":"b_field_one","type":{"name":"A","type":"record","fields":[]}},{"name":"b_field_two","type":"A"}]}
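One way to produce such a 'complete' form is to inline a referenced type's definition the first time its name is encountered and to emit only the name afterwards. The sketch below illustrates that idea on a hypothetical, simplified model. `Node` and `complete_form` are made up for illustration and are not part of avro-rs; names are assumed to need no JSON escaping.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical, simplified schema model (NOT the avro-rs `Schema` type).
#[derive(Debug, Clone)]
enum Node {
    Record { name: String, fields: Vec<(String, Node)> },
    Ref(String),
    Primitive(String),
}

/// Renders `node` as JSON. The first reference to a name found in `names`
/// is replaced by its full definition; later references stay as bare names.
/// `emitted` tracks which names have already been defined in the output.
fn complete_form(
    node: &Node,
    names: &HashMap<String, Node>,
    emitted: &mut HashSet<String>,
) -> String {
    match node {
        Node::Record { name, fields } => {
            // Mark the record as defined before visiting its fields so that
            // self-references are emitted as names, not expanded forever.
            emitted.insert(name.clone());
            let fields_json: Vec<String> = fields
                .iter()
                .map(|(fname, ty)| {
                    format!(
                        "{{\"name\":\"{}\",\"type\":{}}}",
                        fname,
                        complete_form(ty, names, emitted)
                    )
                })
                .collect();
            format!(
                "{{\"name\":\"{}\",\"type\":\"record\",\"fields\":[{}]}}",
                name,
                fields_json.join(",")
            )
        }
        Node::Ref(name) => match names.get(name) {
            // First use of a known name: inline the definition. This
            // recursion also covers nested references (C -> B -> A).
            Some(def) if !emitted.contains(name) => complete_form(def, names, emitted),
            // Already defined earlier in the output, or unknown: keep the name.
            _ => format!("\"{}\"", name),
        },
        Node::Primitive(p) => format!("\"{}\"", p),
    }
}
```

Calling `complete_form` on B with a fresh `HashSet` and a `names` map that contains A yields the form proposed above: A is defined inside b_field_one and referenced by name in b_field_two.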

Do we have a way of producing this 'correct' form? This is necessary for fingerprint calculation and some schema registry interactions.
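To show why the exact canonical text matters for fingerprints, here is a standalone sketch of the 64-bit Rabin fingerprint (CRC-64-AVRO) as described in the Avro specification, written against the standard library only. It is an illustration, not the code avro-rs uses; the function names are my own.

```rust
/// Seed constant from the Avro specification's CRC-64-AVRO definition
/// (also the fingerprint of the empty input).
const EMPTY: u64 = 0xc15d213aa4d7a795;

/// Builds the 256-entry lookup table used by the byte-wise update step.
fn fp_table() -> [u64; 256] {
    let mut table = [0u64; 256];
    for i in 0..256usize {
        let mut fp = i as u64;
        for _ in 0..8 {
            // (fp & 1).wrapping_neg() is all ones if the low bit is set,
            // else zero, so EMPTY is XORed in only when the low bit is 1.
            fp = (fp >> 1) ^ (EMPTY & (fp & 1).wrapping_neg());
        }
        table[i] = fp;
    }
    table
}

/// Computes the 64-bit fingerprint of `buf`, e.g. the UTF-8 bytes of a
/// schema's canonical form.
fn fingerprint64(buf: &[u8]) -> u64 {
    // The table is rebuilt on every call to keep the sketch short;
    // real code would compute it once.
    let table = fp_table();
    let mut fp = EMPTY;
    for &b in buf {
        fp = (fp >> 8) ^ table[((fp ^ b as u64) & 0xff) as usize];
    }
    fp
}
```

Because the fingerprint is computed over the bytes of the canonical form, the two renderings of B above are different inputs, so a registry that stores B on its own would be handed a different fingerprint depending on which form is used.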

chupaty commented 1 day ago

I've raised a PR to provide this here: https://github.com/apache/avro-rs/pull/66

chupaty commented 5 hours ago

Sorry about the noise - I've added a fix for nested Refs (where a schema depends on another that depends on another), and re-opened the PR. Added a test for this as well.

The goal here is ultimately to support interop with schema registries where each schema is stored independently of other schemata in the registry.

I haven't changed the current functionality (Schema::canonical_form()), except for a change in two test cases.

In both cases, the 'expected' canonical form was (I believe) incorrect, because each included two (duplicate) definitions of the same type.