This may cause issues with the serializability of the converter function. The config map used until now contains only serializable types, which guarantees that everything is serializable in the end. With a function, that responsibility shifts to the user.
Testing this is harder because Spark running locally doesn't try to serialize the objects, but Spark running on a cluster does.
It seems functions can be serializable in Scala, but only when certain conditions are met: https://stackoverflow.com/questions/47806670/how-to-serialize-functions-in-scala
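For illustration, a minimal sketch (not ABRiS code; all names below are made up) of why this only shows up on a cluster: Spark Java-serializes the converter when shipping it to executors, so a function that captures a non-serializable object fails there even though it works fine in local mode.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Hypothetical check that mimics what happens when Spark ships a closure to executors.
object SerializabilityCheck {
  def ensureSerializable(f: AnyRef): Unit = {
    val oos = new ObjectOutputStream(new ByteArrayOutputStream())
    oos.writeObject(f) // throws NotSerializableException if f or anything it captures isn't serializable
    oos.close()
  }
}

class NonSerializableHolder { val suffix = "!" } // does not extend Serializable

object Example {
  def main(args: Array[String]): Unit = {
    val ok: String => String = _.toUpperCase               // captures nothing: serializes fine
    SerializabilityCheck.ensureSerializable(ok)

    val holder = new NonSerializableHolder
    val broken: String => String = s => s + holder.suffix  // captures holder
    SerializabilityCheck.ensureSerializable(broken)        // throws java.io.NotSerializableException
  }
}
```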
Ok, I understand. In that case, I can create a trait around the function and only pass the fully qualified name of the custom implementation to the config map. Abris would then load it using the ServiceLoader. That's the same principle as how Spark loads different data sources: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L640
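A minimal sketch of that principle, assuming a hypothetical trait name and lookup by fully qualified class name (the actual names in the PR may differ). Implementations would be listed in a `META-INF/services/<trait FQN>` file and discovered via `java.util.ServiceLoader`:

```scala
import java.util.ServiceLoader

import scala.collection.JavaConverters._

import org.apache.avro.Schema
import org.apache.spark.sql.types.DataType

// Hypothetical SPI trait wrapping the conversion function.
trait SchemaConverter {
  def toSqlType(avroSchema: Schema): DataType
}

object SchemaConverterLoader {
  // The config map only needs to carry the class name of the desired implementation;
  // the implementation itself is registered in META-INF/services/<fully.qualified.SchemaConverter>.
  def load(className: String): SchemaConverter =
    ServiceLoader.load(classOf[SchemaConverter]).asScala
      .find(_.getClass.getName == className)
      .getOrElse(throw new ClassNotFoundException(
        s"No SchemaConverter implementation registered for $className"))
}
```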
That could work. Or use an object, but in that case the serializability requirement should be well documented. It's up to you.
Hi @cerveada I refactored the PR completely. Now I'm using the Service Provider Interface (SPI), as commented earlier. I tested it on our use-case on the cluster and it worked as expected. I added a section in the readme as well.
Background
Currently, `SchemaConverters` is used in `AvroDataToCatalyst` to convert an Avro schema to a Spark data type.
I have a workload that reads from a Kafka topic, adds / transforms some columns, and then writes that data back to another Kafka topic. The schema for the target Kafka topic is auto-generated (based on the Spark schema) and is registered outside ABRiS. The source schema contains default values that should be kept for the target schema, but with the `SchemaConverters` implementation, these default values are lost. Furthermore, some type conversions from Avro to Spark are not bijective, e.g. the Avro types `BYTES` and `FIXED` are both converted to `DecimalType` (if their Avro logical type is `Decimal`) or both to `BinaryType` (if they have no Avro logical type). Without the original Avro type, it is impossible to decide whether `DecimalType` should be converted back to `BYTES` or `FIXED`. The `SchemaConverters` implementation chooses `FIXED`. This is undesirable if the source schema needs to be preserved for whatever reason. The same problem exists for the Spark type `TimestampType`, which is always converted to the Avro type `Long` / `TimestampMicros`, but it could just as well be `Long` / `TimestampMillis`.

One solution to these problems is to store the extra information from the Avro schema in column metadata in Spark (`StructField` metadata, to be precise). However, to do this, a custom schema conversion is required.
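For illustration, a rough sketch of what that could look like; the metadata keys are made up and not an ABRiS API:

```scala
import org.apache.spark.sql.types._

// Keep the original Avro type information next to the Spark type so that a later
// Spark-to-Avro conversion can reproduce BYTES vs. FIXED, or millis vs. micros.
val amountMetadata = new MetadataBuilder()
  .putString("avro.type", "bytes")          // hypothetical key: original Avro type
  .putString("avro.logicalType", "decimal") // hypothetical key: original logical type
  .putString("avro.default", "0")           // hypothetical key: default value from the source schema
  .build()

val schema = StructType(Seq(
  StructField("amount", DecimalType(20, 4), nullable = false, amountMetadata)
))
```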
Feature
The `SchemaConverters` implementation may work for most use-cases, but there are scenarios in which a user wishes to convert an Avro schema in a custom way, e.g. writing default values into Spark metadata. Thus, I suggest making the schema conversion configurable, where a user can optionally pass a function `Schema => DataType` to the From Avro Config. The default is `(schema: Schema) => SchemaConverters.toSqlType(schema).dataType`.
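To make the proposal concrete, a rough sketch of a custom conversion function that additionally records each record field's original Avro type in `StructField` metadata; the metadata key and the exact traversal are illustrative only, assuming Spark's `spark-avro` `SchemaConverters`:

```scala
import org.apache.avro.Schema
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types._

import scala.collection.JavaConverters._

// Default conversion, as proposed for the config:
val defaultConversion: Schema => DataType =
  (schema: Schema) => SchemaConverters.toSqlType(schema).dataType

// Custom conversion that keeps each field's original Avro type in StructField metadata,
// so BYTES vs. FIXED (or micros vs. millis) can be reproduced when converting back.
val metadataPreservingConversion: Schema => DataType = { schema =>
  defaultConversion(schema) match {
    case struct: StructType if schema.getType == Schema.Type.RECORD =>
      val enriched = struct.fields.zip(schema.getFields.asScala).map {
        case (sparkField, avroField) =>
          val meta = new MetadataBuilder()
            .withMetadata(sparkField.metadata)
            .putString("avro.type", avroField.schema().getType.getName) // hypothetical key
            .build()
          sparkField.copy(metadata = meta)
      }
      StructType(enriched)
    case other => other
  }
}
```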
Other
This functionality should be backported to ABRiS v5 as well.