apache / beam

Apache Beam is a unified programming model for Batch and Streaming data processing.
https://beam.apache.org/
Apache License 2.0

[Bug]: Python SDK avro to beam schema conversion defaults to Any type for nullable atomic union fields #30750

Open benkonz opened 7 months ago

benkonz commented 7 months ago

What happened?

The Python SDK's avro_type_to_beam_type function maps every Avro union type, including simple nullable unions, to the Any logical type:

type {
  nullable: true
  logical_type {
    urn: "beam:logical:pythonsdk_any:v1"
  }
}

which results in this exception:

java.lang.IllegalArgumentException: Unexpected type_info: TYPEINFO_NOT_SET
INFO:   at org.apache.beam.sdk.schemas.SchemaTranslation.fieldTypeFromProtoWithoutNullable(SchemaTranslation.java:479)
...

when that schema is loaded into Beam by a SqlTransform. The conversion should encode nullable atomic Avro types, such as 'type': ['null', 'string'], as the corresponding Beam type, and convert them back again:

type {
  nullable: true
  atomic_type: STRING
}

If no such nullable type conversion is possible, we can fall back to the Any type until a proper union coder is added to the Beam Python SDK.
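
A minimal sketch of the proposed behavior, not the SDK's actual code. The helper name union_to_field_type and the lookup table are hypothetical, and plain dicts stand in for the Beam FieldType proto so the example runs without apache_beam installed. The Avro-to-atomic mapping is an assumption based on the Beam schema atomic type names.

```python
# Hypothetical mapping from Avro primitive names to Beam atomic type names.
# The real table in the Python SDK may differ.
AVRO_PRIMITIVE_TO_BEAM_ATOMIC = {
    "boolean": "BOOLEAN",
    "int": "INT32",
    "long": "INT64",
    "float": "FLOAT",
    "double": "DOUBLE",
    "string": "STRING",
    "bytes": "BYTES",
}

ANY_URN = "beam:logical:pythonsdk_any:v1"


def union_to_field_type(union):
    """Convert an Avro union (a list of branch types) to a dict that
    mimics a Beam FieldType.

    A union of exactly 'null' plus one primitive becomes a nullable
    atomic type. Every other union falls back to the Any logical type.
    """
    non_null = [branch for branch in union if branch != "null"]
    has_null = len(non_null) != len(union)

    if has_null and len(non_null) == 1:
        branch = non_null[0]
        # Complex branches (records, arrays, ...) are dicts, so check for
        # str before the lookup to avoid hashing an unhashable value.
        if isinstance(branch, str) and branch in AVRO_PRIMITIVE_TO_BEAM_ATOMIC:
            return {
                "nullable": True,
                "atomic_type": AVRO_PRIMITIVE_TO_BEAM_ATOMIC[branch],
            }

    # Fallback: keep the current behavior until a union coder exists.
    return {"nullable": True, "logical_type": {"urn": ANY_URN}}
```

With this sketch, ['null', 'string'] yields the nullable STRING type shown above, while a union such as ['string', 'int'] still falls back to the Any type.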

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

benkonz commented 7 months ago

.take-issue

benkonz commented 6 months ago

The PR has been merged! I will close this issue once the fix is included in the next Beam release.