confluentinc / schema-registry

Confluent Schema Registry for Kafka
https://docs.confluent.io/current/schema-registry/docs/index.html

[Avro Deserializer] ClassCastException when reading field of array of strings contained in a top-level union schema #2702

Open xc-cre opened 1 year ago

xc-cre commented 1 year ago

When reading events using the following schema, one gets a ClassCastException because the elements of someStrings are of type Utf8, not String. We're using SpecificRecords and the Avro Java generator Maven plugin with the option

<configuration>
    <stringType>String</stringType>
</configuration>

This adds "avro.java.string": "String" properties to all string-typed Avro fields in the schema included in the generated class. In the producer we set the property avro.remove.java.properties = true, which strips these properties at runtime before calling the registry, so we can keep the schemas on the registry Java-agnostic and don't have to include this language-specific behaviour there. The problem arises on the consumer side: because the schema type is UNION, the deserializer takes the writer schema as its reader schema, and that schema does not include these properties. The deserializer therefore uses Avro's default string class, Utf8. The specific record RecordContainingArrayOfStrings (which is returned by the deserializer) declares the field as String, not Utf8, though, so we get a ClassCastException when accessing these fields as List<String>.

[
  {
    "namespace": "test",
    "type": "record",
    "name": "RecordContainingArrayOfStrings",
    "fields": [
      {
        "name": "someStrings",
        "type": {
          "type": "array",
          "items": "string"
        }
      }
    ]
  }
]

(We're using top-level unions, as suggested in Putting Several Event Types in the Same Topic – Revisited, to handle multiple event types in a single topic.)
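The access failure can be reproduced without Avro at all; this is a minimal sketch (class and method names are made up for illustration) in which StringBuilder stands in for Utf8, since both implement CharSequence without being a String:

```java
import java.util.List;

// Avro-free sketch of the failure mode: generics are erased, so a
// List<String>-typed reference can hold non-String CharSequences (Utf8 in
// the real case; StringBuilder stands in here). The cast only fails on access.
public class Utf8CastSketch {

    // Simulates the decoder handing back a "List<String>" whose element
    // is really some other CharSequence implementation.
    @SuppressWarnings("unchecked")
    static List<String> unsafeStringList(Object element) {
        return (List<String>) (List<?>) List.of(element);
    }

    // True if reading element 0 through the String-typed view throws.
    static boolean accessThrows(List<String> someStrings) {
        try {
            String s = someStrings.get(0); // compiler-inserted checkcast
            return s == null;              // not reached for non-String elements
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(accessThrows(unsafeStringList(new StringBuilder("hello")))); // true
        System.out.println(accessThrows(unsafeStringList("hello")));                    // false
    }
}
```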

I currently see three possible solutions/workarounds for this:

  1. Add these Java-specific properties to the schemas registered on the registry. That wouldn't be ideal, as we'd like to keep them language-agnostic (it would also create hundreds of new schema versions), and it would potentially break producers, since they would no longer find the schema on the registry as long as they have avro.remove.java.properties set to true.
  2. KafkaAvroDeserializer gains the opposite operation of avro.remove.java.properties, which adds these Java properties to string types before using the schema as reader schema.
  3. (Also mentioned in issue 2704) The deserializer somehow uses the actual SpecificRecord instance used in the union (not the global SpecificData.get() instance), as this one contains a schema with these properties as added by the Avro generator Maven plugin.

Any ideas how to fix this?

m8719-github commented 11 months ago

There are two KafkaAvroDeserializerConfig properties you need to set on the consumer:

KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG = true
KafkaAvroDeserializerConfig.SPECIFIC_AVRO_VALUE_TYPE_CONFIG = <FQCN of the avro generated POJO>

These instruct the deserializer to use the schema embedded in the Avro-generated POJO, which includes the "avro.java.string": "String" annotations, and to deserialize Avro string types as java.lang.String.
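As a sketch, the two settings above would look like this in consumer configuration (the literal key strings are what I understand the `KafkaAvroDeserializerConfig` constants to resolve to; verify them against your client version, and the value type class here is the record from this issue):

```java
import java.util.Properties;

// Hypothetical consumer-config sketch for the two deserializer settings.
public class ConsumerConfigSketch {
    static Properties consumerProps() {
        Properties props = new Properties();
        // KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG
        props.put("specific.avro.reader", "true");
        // KafkaAvroDeserializerConfig.SPECIFIC_AVRO_VALUE_TYPE_CONFIG
        // (FQCN of the Avro-generated POJO; here the record from this issue)
        props.put("specific.avro.value.type", "test.RecordContainingArrayOfStrings");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(consumerProps());
    }
}
```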

S1M0NM commented 7 months ago

We observe the same behavior when we update the io.confluent.kafka-avro-serializer dependency from version 7.3.1 to a higher one (tested with 7.3.3, 7.4.2, 7.5.1 and 7.5.3).

Have you found a sensible solution for this? At the moment we are only left with the older dependency, which also works.

Muenze commented 5 months ago

Does anyone know if this has already been solved? We have communicated in our company that upgrading to 7.5.1 is very dangerous, and nearly all the teams that did it nonetheless had huge incidents afterwards. My team now wants to try 7.6.+. We halted the merge request and will investigate this on QA branches a bit longer before going to prod.

S1M0NM commented 5 months ago

@Muenze since we were also affected by the problem, I was able to reproduce it with an example.

Cause: the problem occurs because, since version 7.3.3, the AvroDeserializer respects the use.latest.version configuration instead of ignoring it as in previous versions. Previously, the schema embedded in the classes generated from Avro schemas was used, and those usually had <stringType>String</stringType> defined, so the schemas also included "avro.java.string": "String", which ensured that we got java.lang.String in collections instead of org.apache.avro.util.Utf8. With use.latest.version taken into account, the schema registry is always queried and the schema existing there is used. If that schema contains no "avro.java.string": "String", then the elements are returned as org.apache.avro.util.Utf8.

From this I was able to derive three solutions:

  1. Insert "avro.java.string": "String" into all schemas in the registry (not practical for us).
  2. In the avro-maven-plugin, switch from <stringType>String</stringType> to <stringType>CharSequence</stringType>; both java.lang.String and org.apache.avro.util.Utf8 implement the CharSequence interface, so the ClassCastException no longer occurs.
  3. Don't use use.latest.version=true if it's not required.
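Option 2 can be sketched without an Avro dependency (names here are illustrative; StringBuilder stands in for org.apache.avro.util.Utf8, since both implement CharSequence without being a String): a getter typed over CharSequence works for either concrete element type.

```java
import java.util.List;

// Sketch of why <stringType>CharSequence</stringType> avoids the exception:
// String and Utf8 both implement CharSequence, so code that touches elements
// only through that interface never needs a cast to String.
public class CharSequenceSketch {

    // Joins elements using the CharSequence interface only.
    static String joinAll(List<? extends CharSequence> someStrings) {
        StringBuilder out = new StringBuilder();
        for (CharSequence cs : someStrings) {
            out.append(cs); // no cast to String anywhere
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Mixed concrete types, as a decoder might produce:
        List<CharSequence> decoded = List.of("a", new StringBuilder("b"));
        System.out.println(joinAll(decoded)); // prints "ab"
    }
}
```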

I hope these suggested solutions are helpful to you.