beam-community / avro_ex

An Avro Library that emphasizes testability and ease of use.
https://hexdocs.pm/avro_ex/AvroEx.html

Difference between encoding/decoding of avro_ex and avrora/erlavro #82

Closed LostKobrakai closed 1 year ago

LostKobrakai commented 1 year ago
Mix.install([:avrora, :avro_ex])
import ExUnit.Assertions

template = """
{
  "type": "record",
  "name": "format",
  "fields": [
    {
      "name": "amounts",
      "type": {
        "type": "array",
        "items": {
          "type": "record",
          "name": "amount",
          "fields": [
            {
              "name": "amount",
              "type": "long"
            },
            {
              "name": "type",
              "type": {
                "type": "enum",
                "name": "amount_type",
                "symbols": [
                  "amount_type_a",
                  "amount_type_b"
                ]
              }
            }
          ]
        }
      }
    }
  ]
}
"""

Avrora.start_link()

{:ok, schema_avrora} = Avrora.Schema.Encoder.from_json(template)
{:ok, schema_avro_ex} = AvroEx.decode_schema(template)

data = %{
  amounts: [
    %{type: "amount_type_a", amount: 10_000_000_000},
    %{type: "amount_type_b", amount: 12_000_000_000}
  ]
}

{:ok, payload_avrora} = Avrora.Codec.Plain.encode(data, schema: schema_avrora)
{:ok, payload_avro_ex} = AvroEx.encode(schema_avro_ex, data)

assert payload_avrora == payload_avro_ex
** (ExUnit.AssertionError) 

Assertion with == failed
code:  assert payload_avrora == payload_avro_ex
left:  <<3, 24, 128, 144, 223, 192, 74, 0, 128, 224, 139, 180, 89, 2, 0>>
right: <<4, 128, 144, 223, 192, 74, 0, 128, 224, 139, 180, 89, 2, 0>>

The payloads differ only at the start: avrora emits the two bytes 3, 24 where avro_ex emits the single byte 4. Everything after that is identical.

I want to call out that this might also be an issue with avrora, but we're considering switching and I ran a chunk of our existing data against avro_ex to see what breaks.

LostKobrakai commented 1 year ago

Reading up on this, it seems there are two ways to encode a block of an array. A block can start with a positive long-encoded integer stating the number of items in the block, or it can start with a negative long-encoded integer (the negated item count) followed by another long-encoded integer stating the size in bytes of the items in that block. Both forms are valid per the spec, so a conforming decoder should accept either. erlavro defaults to the latter option: https://github.com/klarna/erlavro/blob/master/src/avro_binary_encoder.erl#L150-L151

https://avro.apache.org/docs/1.11.1/specification/#arrays-1
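
To check that the two payloads above really are the two block forms, here is a minimal sketch that decodes only the array block header by hand. The module `ArrayBlockHeader` and its functions are hypothetical helpers written for this illustration. They are not part of avro_ex, avrora or erlavro. The sketch assumes the standard Avro zigzag varint encoding for longs.

```elixir
defmodule ArrayBlockHeader do
  # Hypothetical helper for illustration only, not a library API.
  import Bitwise

  # Decode one zigzag varint-encoded Avro long from the front of a binary.
  # Returns {value, remaining_bytes}.
  def decode_long(binary), do: decode_varint(binary, 0, 0)

  # Continuation bit set: accumulate the low 7 bits and keep reading.
  defp decode_varint(<<1::1, low::7, rest::binary>>, shift, acc),
    do: decode_varint(rest, shift + 7, acc ||| (low <<< shift))

  # Continuation bit clear: last byte, then undo the zigzag mapping.
  defp decode_varint(<<0::1, low::7, rest::binary>>, shift, acc) do
    zigzag = acc ||| (low <<< shift)
    {bxor(zigzag >>> 1, -(zigzag &&& 1)), rest}
  end

  # Read the header of the first array block.
  # A negative count means a byte size follows the count.
  def read_header(binary) do
    case decode_long(binary) do
      {count, rest} when count < 0 ->
        {size, rest} = decode_long(rest)
        %{count: -count, size: size, rest: rest}

      {count, rest} ->
        %{count: count, size: nil, rest: rest}
    end
  end
end

# The two payloads from the failing assertion above.
avrora = <<3, 24, 128, 144, 223, 192, 74, 0, 128, 224, 139, 180, 89, 2, 0>>
avro_ex = <<4, 128, 144, 223, 192, 74, 0, 128, 224, 139, 180, 89, 2, 0>>

a = ArrayBlockHeader.read_header(avrora)
b = ArrayBlockHeader.read_header(avro_ex)

IO.inspect({a.count, a.size})  # {2, 12}: negated count -2, then byte size 12
IO.inspect({b.count, b.size})  # {2, nil}: plain count, no byte size
IO.inspect(a.rest == b.rest)   # true: the item bytes and terminator match
```

Byte 3 is the zigzag encoding of -2 and byte 24 is the zigzag encoding of 12, which is the length of the two encoded records (the final 0 is the end-of-array marker and is not counted). So both libraries encode the same array, just with different block headers.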