deercreeklabs / lancaster

Apache Avro library for Clojure and ClojureScript

Comparing schemas in Lancaster when recurring records can be processed in different order #24

Closed bartkl closed 1 year ago

bartkl commented 1 year ago

Hey Chad,

TL;DR: How do I compare two schemas that should be equivalent, but might have been processed slightly differently due to the order of the input varying each run?

I'm writing an integration test for our application, where I'm basically doing this:

  1. Generate an Avro schema using Lancaster and write it to a JSON file
  2. Copy this file to my testdata directory, to use it as data to expect (this is to quickly test the round-trip)
  3. Generate an Avro schema again, the same way, using the same input, and check whether it is equal to the first one
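
In code, the flow is roughly this (a simplified sketch; the schema vars and the file path are placeholders, and clojure.data.json is used only for illustration):

(require '[deercreeklabs.lancaster :as l]
         '[clojure.data.json :as json])

;; Steps 1 and 2: write the generated schema to the testdata directory.
(spit "testdata/expected-schema.json" (l/json generated-schema))

;; Step 3: generate the schema again from the same input,
;; parse both JSON documents as maps and compare them.
(= (json/read-str (l/json regenerated-schema))
   (json/read-str (slurp "testdata/expected-schema.json")))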

To check for equality I first tried parsing the JSON files as hash maps, but there the check failed. I noticed it's because the order in which some recurring records occur differs between runs. (This might have to do with the graph database from which the input comes, I'm not really sure.)

Anyway, before delving in too deeply, I basically wanted to ask you how you would go about safely comparing two schemas, given that they can be processed differently in terms of where recurring records first occur, and therefore where references to them are defined, as well as perhaps other things.

I hoped I could use json->schema and compare the two schemas that way, but that still didn't give me equality. I'm not sure why, and that's probably my question for you 🙂. Finally, I tried taking fingerprints of those schema objects, still to no avail.
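
Concretely, I tried roughly this (a sketch; json-1 and json-2 stand for the two generated JSON schema strings):

(def schema-1 (l/json->schema json-1))
(def schema-2 (l/json->schema json-2))

;; Neither of these gave me equality:
(= schema-1 schema-2)
(= (l/fingerprint64 schema-1) (l/fingerprint64 schema-2))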

I will be on holiday the upcoming week, and I'll be back at work on Oct 31st. So there's no need to rush 😉.

Thanks, Bart

chadharrington commented 1 year ago

Hi Bart, I hope you had a nice holiday. I'd like to understand your question better. I don't understand this part:

it's because the order in which some recurring records occur differs between runs

Can you provide an example?

bartkl commented 1 year ago

Hey Chad,

My holiday was great, thanks! Today is my first working day, so I will have to catch up with a few things. Once I've done that I'll try to come up with an example that clarifies my question.

In the meantime, could you perhaps answer more generally how you would compare two Avro schemas for equivalence (context: unit test)?

bartkl commented 1 year ago

Here's an example.

First time I generate the schema:

{"name" "B",
 "type" "record",
 "fields"
 [{"name" "fromBtoDButSomehowDifferent",
   "type"
   {"name" "D",
    "type" "record",
    "fields"
    [{"name" "def", "type" ["null" {"type" "array", "items" "float"}], "default" nil, "doc" ""}
     {"name" "id", "type" "string", "default" "", "doc" "Identificatie"}],
    "doc" "This is yet another class"},
   "default" {"def" nil, "id" ""},
   "doc" "Association from B to D but somehow different"}
  {"name" "fromAtoC",
   "type"
   {"name" "C",
    "type" "enum",
    "symbols" ["INDIVIDUAL_3" "INDIVIDUAL_2" "INDIVIDUAL_1"],
    "default" "INDIVIDUAL_3",
    "doc" "This is a class with named elements"},
   "default" "INDIVIDUAL_3",
   "doc" ""}
  {"name" "fromBtoD", "type" ["null" "D"], "default" nil, "doc" "Association from B to D"}
  {"name" "abc", "type" ["null" {"type" "array", "items" "string"}], "default" nil, "doc" ""}
  {"name" "bcd", "type" {"type" "array", "items" "double"}, "default" [], "doc" ""}
  {"name" "id", "type" "string", "default" "", "doc" "Identificatie"}],
 "doc" "This is a sub-class"}

Second time:

{"name" "B",
 "type" "record",
 "fields"
 [{"name" "fromAtoC",
   "type"
   {"name" "C",
    "type" "enum",
    "symbols" ["INDIVIDUAL_3" "INDIVIDUAL_2" "INDIVIDUAL_1"],
    "default" "INDIVIDUAL_3",
    "doc" "This is a class with named elements"},
   "default" "INDIVIDUAL_3",
   "doc" ""}
  {"name" "fromBtoD",
   "type"
   ["null"
    {"name" "D",
     "type" "record",
     "fields"
     [{"name" "def", "type" ["null" {"type" "array", "items" "float"}], "default" nil, "doc" ""}
      {"name" "id", "type" "string", "default" "", "doc" "Identificatie"}],
     "doc" "This is yet another class"}],
   "default" nil,
   "doc" "Association from B to D"}
  {"name" "bcd", "type" {"type" "array", "items" "double"}, "default" [], "doc" ""}
  {"name" "fromBtoDButSomehowDifferent",
   "type" "D",
   "default" {"def" nil, "id" ""},
   "doc" "Association from B to D but somehow different"}
  {"name" "id", "type" "string", "default" "", "doc" "Identificatie"}
  {"name" "abc", "type" ["null" {"type" "array", "items" "string"}], "default" nil, "doc" ""}],
 "doc" "This is a sub-class"}

Note how the order of the fields is different. I'm not sure yet why this happens. It could have to do with retrieval from the graph store (Asami), or perhaps with processing by Lancaster (?).

Of course both of these schemas are semantically equivalent, but I don't know how to test for that given this ordering issue.

bartkl commented 1 year ago

From the spec:

A record is encoded by encoding the values of its fields in the order that they are declared. In other words, a record is encoded as just the concatenation of the encodings of its fields. Field values are encoded per their schema.

That's probably not news to you, but it is to me. It seems the problem, then, is that I have to guarantee a fixed order when writing the fields. Would you agree?

bartkl commented 1 year ago

Alright, I've managed to fix the order in which I traverse the input data so that I guarantee the order in which the record fields are built.

Using the Apache Avro Java library, this works, i.e. generating the schema multiple times yields equivalent schemas (I parse them with Schema.Parser.parse() and then compare them using (= schema1 schema2)).
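
Roughly (a sketch; the file paths are placeholders):

(require '[clojure.java.io :as io])
(import 'org.apache.avro.Schema$Parser)

;; A fresh Parser per file, since a single Parser instance won't redefine
;; the same named types twice.
(let [schema-1 (.parse (Schema$Parser.) (io/file "run-1/schema.avsc"))
      schema-2 (.parse (Schema$Parser.) (io/file "run-2/schema.avsc"))]
  (= schema-1 schema-2))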

However, using l/json->schema and then comparing the objects does not work. I think the only difference is:

   :serializer -#<deercreeklabs.lancaster.utils$fn__8578$serialize__8584@263c1d0f>
               +#<deercreeklabs.lancaster.utils$fn__8578$serialize__8584@3254342>}

Is using = the way to compare Lancaster schemas? Or am I misunderstanding/abusing something?

Edit: I can of course simply compare the JSON strings, assuming Lancaster outputs them deterministically. That's probably simple, but it feels flaky. What do you think?

chadharrington commented 1 year ago

I apologize for the delay in responding, but you figured out the issue in the meantime. ;-) Schemas with the same fields in a different order are not the same.

To compare two schemas, you can compare either the Parsing Canonical Form (which is JSON) or the fingerprint. Comparing the return value of l/pcf (which returns the Parsing Canonical Form) is always correct and safe. Comparing the raw JSON may or may not work, depending on whether that JSON is in Parsing Canonical Form.
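
In code that looks roughly like this (schema-1 and schema-2 stand for your two Lancaster schema objects):

(require '[deercreeklabs.lancaster :as l])

;; Parsing Canonical Form is a normalized JSON string; textual equality of
;; the PCFs means the two schemas are equivalent for parsing purposes.
(= (l/pcf schema-1) (l/pcf schema-2))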

For some situations, it may be more convenient to compare using the fingerprint. l/fingerprint64 returns a long, which is easily compared (at least in Clojure; cljs is another matter). However, the 64-bit Rabin fingerprint may not be sufficiently unique if you have millions of schemas; l/fingerprint128 or l/fingerprint256 have much lower chances of collision. See the spec for more information about that. In Lancaster, l/fingerprint128 and l/fingerprint256 return byte arrays, which are not directly comparable. You can use the baracus library to work with byte arrays: ba/equivalent-byte-arrays? can compare two byte arrays directly, or you can call ba/byte-array->b64 to get a base-64-encoded string of the byte array and then compare that. The base-64 method is handy if you need to store the fingerprints as keys in a map or a set (byte arrays should never be used as keys).
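
For example (a sketch; this assumes baracus is required as deercreeklabs.baracus and aliased to ba):

(require '[deercreeklabs.lancaster :as l]
         '[deercreeklabs.baracus :as ba])

;; 64-bit Rabin fingerprint: a long, directly comparable with =.
(= (l/fingerprint64 schema-1) (l/fingerprint64 schema-2))

;; 256-bit fingerprint: a byte array, compared with baracus...
(ba/equivalent-byte-arrays? (l/fingerprint256 schema-1)
                            (l/fingerprint256 schema-2))

;; ...or via its base-64 encoding, which can also serve as a map or set key.
(= (ba/byte-array->b64 (l/fingerprint256 schema-1))
   (ba/byte-array->b64 (l/fingerprint256 schema-2)))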

Let me know if you have any other questions; I am always happy to help.

Edit: I updated this comment to use l/pcf instead of l/json.

bartkl commented 1 year ago

No need to apologize, I'm very happy with your help.

We work with small schemas and certainly don't have millions, so perhaps the l/pcf way is the easiest here. Saves me a dependency ;).

Thanks again! Bart