apache / pulsar-client-python

Apache Pulsar Python client library
https://pulsar.apache.org/
Apache License 2.0

Fetch writer schema to decode Avro messages #119

Closed BewareMyPower closed 1 year ago

BewareMyPower commented 1 year ago

Fixes https://github.com/apache/pulsar-client-python/issues/108

Motivation

Currently the Python client decodes Avro messages with the reader schema, i.e. the schema of the consumer. When the writer schema differs from the reader schema, decoding fails.

Modifications

Add an attach_client method to Schema and call it when creating consumers and readers. This method stores a reference to a _pulsar.Client instance, which uses the C++ APIs added in https://github.com/apache/pulsar-client-cpp/pull/257 to fetch schema info. The AvroSchema class fetches the writer schema and caches it if it is not already cached, then uses both the writer schema and the reader schema to decode messages.

Add test_schema_evolve to verify that consumers and readers can decode any message whose writer schema differs from the reader schema.
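The fetch-and-cache step described above can be sketched roughly as follows. This is a toy model, not the code in this PR: FakeClient, get_schema_fields, and CachingDecoder are hypothetical names, schemas are reduced to plain field lists, and no real Avro encoding is involved. It only shows one lookup per schema version and how writer fields map onto the reader's fields.

```python
from typing import Any, Dict, List, Optional, Tuple


class FakeClient:
    """Hypothetical stand-in for the client; counts schema lookups."""

    def __init__(self, schemas: Dict[Tuple[str, int], List[str]]):
        self._schemas = schemas
        self.lookups = 0

    def get_schema_fields(self, topic: str, version: int) -> List[str]:
        # Hypothetical call: returns the writer's field names for a version.
        self.lookups += 1
        return self._schemas[(topic, version)]


class CachingDecoder:
    """Toy reader-side decoder: not real Avro, only the resolution idea."""

    def __init__(self, reader_defaults: Dict[str, Any]):
        # Reader schema modelled as field name -> default value.
        self._reader_defaults = reader_defaults
        self._client: Optional[FakeClient] = None
        self._writer_cache: Dict[Tuple[str, int], List[str]] = {}

    def attach_client(self, client: FakeClient) -> None:
        # Called when the consumer or reader is created.
        self._client = client

    def decode(self, topic: str, version: int, values: List[Any]) -> Dict[str, Any]:
        key = (topic, version)
        if key not in self._writer_cache:
            if self._client is None:
                raise RuntimeError("no client attached")
            # Fetch the writer schema once per (topic, version).
            self._writer_cache[key] = self._client.get_schema_fields(topic, version)
        # Interpret the payload with the writer schema...
        written = dict(zip(self._writer_cache[key], values))
        # ...then project it onto the reader schema, filling defaults.
        return {
            name: written.get(name, default)
            for name, default in self._reader_defaults.items()
        }


client = FakeClient({("users", 0): ["name"], ("users", 1): ["name", "age"]})
decoder = CachingDecoder({"name": "", "age": -1})
decoder.attach_client(client)
old_message = decoder.decode("users", 0, ["alice"])
new_message = decoder.decode("users", 1, ["bob", 30])
```

Keying the cache by schema version means a consumer that sees many messages from the same producer pays for one lookup only.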

shibd commented 1 year ago

With this patch, the following definitions will still create two schemas, but that's okay, right? The consumer will use the writer schema of each message to deserialize its data.

class User(Record):
    name = String()
    age = Integer()

@AllArgsConstructor
@Getter
static class User {
    private final String name;
    private final int age;
}

Do we need to continue to solve this problem? https://github.com/apache/pulsar-client-python/issues/108#issuecomment-1488657932

BewareMyPower commented 1 year ago

With this patch, the following definitions will still create two schemas, but that's okay, right? The consumer will use the writer schema of each message to deserialize its data.

Yes, it will create two schemas. But modifying the _sorted_fields and _required fields would be a breaking change. If there is a way to avoid the breaking change, maybe we don't need to modify them at all. Otherwise, we can make the change in the next release, after discussing it on the mailing list.
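As a rough illustration of how two definitions of the same record can end up as two schemas: Avro encodes record fields in the order the schema lists them, so field order is part of the schema. The sketch below assumes, purely for illustration and not confirmed by this thread, that one definition lists fields in declaration order and the other alphabetically. record_schema is a hypothetical helper, not part of any client API.

```python
from typing import Dict, List, Tuple


def record_schema(name: str, fields: List[Tuple[str, str]]) -> Dict:
    """Build a minimal Avro-style record schema as a plain dict."""
    return {
        "type": "record",
        "name": name,
        "fields": [{"name": n, "type": t} for n, t in fields],
    }


user_fields = [("name", "string"), ("age", "int")]

# Same fields, two orderings (the alphabetical ordering is an assumption).
declared = record_schema("User", user_fields)
alphabetical = record_schema("User", sorted(user_fields))

same_field_names = (
    {f["name"] for f in declared["fields"]}
    == {f["name"] for f in alphabetical["fields"]}
)
same_definition = declared == alphabetical
```

Here same_field_names is true while same_definition is false, which is why a consumer needs the writer schema of each message rather than only its own.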