Support unsigned number as logical types in codegen from Avro

rchoffardet commented 6 months ago

The avro specification allows the presence of the logicalType property which can be used to represent a derived type from the original type. It's already used for Guid which is one of the possible logical types of the type string.

I would love to see unsigned numbers supported in such fashion

For example, the following field:

{
  "name": "myField",
  "type": [
    {
      "type": "int",
      "logicalType": "uint"
    }
  ],
  "default": 0
}

would produce this code:

 public uint MyField { get; set; }

I've already forked this repo to do this for my job and would gladly submit a PR. (I've only done it for float and integer, but it could be generalized)

Please tell me what you think :)

dstelljes commented 6 months ago

See https://github.com/ch-robinson/dotnet-avro/discussions/173 for a related discussion—we probably don’t want to introduce logical types that aren’t defined by the spec, but we’d be open to a custom property that hints the .NET type to use.

rchoffardet commented 6 months ago

Doesn't the spec allow to specify custom logical types?
Also, as this library is designed for dotnet, I find that it would make perfect sense to implement base types as logical types. Would it not? If needed it could be System.UInt32 instead of uint.

dstelljes commented 6 months ago

The spec does allow for custom logical types (and indeed, “Language implementations must ignore unknown logical types when reading”), but we don’t know how the spec might evolve. If we introduce a type like uint, we risk conflicting semantics if the spec ever chose to introduce a logical type with the same name.

rchoffardet commented 6 months ago

Would it, though? If the specification introduces an uint logical type, would be different in java than in dotnet ? C/C++ ? I think we can safely presume that uint is universal.

dstelljes commented 6 months ago

How should negative values be handled? 31 bits or 32? Does int even make sense as an underlying type since Avro integers are zig-zag encoded? It’s impossible to predict how the spec would answer those types of questions. If we stick to something like an avro.dotnet.type property, we know we’ll never conflict with the spec.

rchoffardet commented 6 months ago

The spec says that:

A logical type is always serialized using its underlying Avro type so that values are encoded in exactly the same way as the equivalent Avro type that does not have a logicalType attribute. Language implementations may choose to represent logical types with an appropriate native type, although this is not required.

So, an uint with a value superior to the signed int32 max value will be serialized as a negative number but deserialized into an uint again if the deserializer supports it. If the deserializer does not support it, it would be deserialized as a signed int. Also, to answer your question: int is precisely defined by the spec to be a 32-bit signed integer, so yes, it makes sense to me. Even though zig-zag encoding could theoretically produce infinite integers, the spec limits them to 32 bits (int) and 64 bits (long).

A new property avro.dotnet.type is a good idea to reduce future possible conflicts with the spec. But it breaks future interoperability between deserializers. (As does System.UInt32 break interoperability between the deserializers of different languages).

I get that logical types for strings can be a difficult topic, but I don't see how it applies to numbers, especially integers.

dstelljes commented 6 months ago

yes, it makes sense to me

Same, but what I’m getting at is that we don’t know how the spec will evolve. Maybe Chr.Avro introduces its own uint logical type on top of "int" tomorrow, but next month the spec introduces a uint type that decorates "fixed" schemas. Then we’d be sort of stuck; we could implement whatever the spec decides, but we couldn’t back out what we invented in the interim without breaking people.

But it breaks future interoperability between deserializers

Could you explain this more? avro.dotnet.type would just be a hint to codegen.

rchoffardet commented 6 months ago

Then we’d be sort of stuck

I'm sorry, I honestly don't understand why we would be stuck. We would have two ways to serdes uint: one based on the underlying fixed type and the other one based on the int type. Even if the spec introduces a new avro type uint, it wouldn't matter much (except the fact to have two different thing to achieve the same goal).

The only issue I see, is if the spec introduces a logical type uint based on int that wouldn't do the same thing as ours. But given the restriction on the logical type, I don't see how they could.

Could you explain this more? avro.dotnet.type would just be a hint to codegen.

What I meant is that using the logical property enables us to communicate intention within the spec's scope (In a context where the deserialization libray isn't the same as the serialization's one). If we use the avro.dotnet.type we ensures that nobody will support this.

dstelljes commented 6 months ago

We would have two ways to serdes uint: one based on the underlying fixed type and the other one based on the int type.

If the spec says that uint can only apply to fixed, we're not spec-compliant.

If we use the avro.dotnet.type we ensures that nobody will support this.

Fundamentally, I think this is where the disagreement is—if a behavior should be supported by multiple Avro implementations, it should be introduced to the Avro spec and then implemented, not the other way around.

ch-robinson / dotnet-avro

Support unsigned number as logical types in codegen from Avro #297