AbsaOSS / ABRiS

Avro SerDe for Apache Spark structured APIs.
Apache License 2.0

java.io.NotSerializableException when using JavaSerializer in v5.1.0, v6.1.0 #274

Closed kevinwallimann closed 2 years ago

kevinwallimann commented 2 years ago

Description

With the new configurable schema converter feature (#268, #269), the class DefaultSchemaConverter is instantiated by default as the member variable schemaConverter in AvroDataToCatalyst. Although AvroDataToCatalyst, being a case class, is serializable by default, serialization fails when the JavaSerializer is used, with the following error message:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: za.co.absa.abris.avro.sql.DefaultSchemaConverter
Serialization stack:
    - object not serializable (class: za.co.absa.abris.avro.sql.DefaultSchemaConverter, value: za.co.absa.abris.avro.sql.DefaultSchemaConverter@1ce2ce83)
    - field (class: za.co.absa.abris.avro.sql.AvroDataToCatalyst, name: schemaConverter, type: interface za.co.absa.abris.avro.sql.SchemaConverter)
    - object (class za.co.absa.abris.avro.sql.AvroDataToCatalyst, from_avro(value#647, (readerSchema,{"type":"record","name":"e2etest","fields":[{"name":"field1","type":"string"},{"name":"field2","type":"int"}]})))
    - field (class: org.apache.spark.sql.catalyst.expressions.IsNotNull, name: child, type: class org.apache.spark.sql.catalyst.expressions.Expression)
    - object (class org.apache.spark.sql.catalyst.expressions.IsNotNull, isnotnull(from_avro(value#647, (readerSchema,{"type":"record","name":"e2etest","fields":[{"name":"field1","type":"string"},{"name":"field2","type":"int"}]}))))
    - field (class: org.apache.spark.sql.execution.FilterExec, name: condition, type: class org.apache.spark.sql.catalyst.expressions.Expression)
    - object (class org.apache.spark.sql.execution.FilterExec, Filter isnotnull(from_avro(value#647, (readerSchema,{"type":"record","name":"e2etest","fields":[{"name":"field1","type":"string"},{"name":"field2","type":"int"}]})))
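The failure mode can be reproduced outside Spark: Java serialization recurses into every non-transient field, so a serializable case class fails as soon as one of its members does not implement java.io.Serializable. A minimal sketch, with illustrative class names standing in for the ABRiS ones:

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Stand-in for DefaultSchemaConverter: does not extend Serializable.
class PlainConverter {
  def convert(s: String): String = s.toUpperCase
}

// Stand-in for AvroDataToCatalyst: case classes are Serializable by default,
// but Java serialization still walks every non-transient member field.
case class ExprWithConverter(name: String) {
  val converter = new PlainConverter // eagerly instantiated member
}

val out = new ObjectOutputStream(new ByteArrayOutputStream())
try {
  out.writeObject(ExprWithConverter("from_avro"))
  println("serialized OK")
} catch {
  case e: NotSerializableException =>
    // Same exception class that Spark reports in the stack trace above.
    println(s"NotSerializableException: ${e.getMessage}")
}
```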

How to fix

~~Make the SchemaConverter trait extend Serializable.~~ Make schemaConverter a lazy val.
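A sketch of why the lazy approach helps, using illustrative names rather than the actual ABRiS change. In Spark code the lazy val is typically also marked @transient, so the field is skipped during serialization and recomputed on first access on the executor:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Stand-in for DefaultSchemaConverter: does not extend Serializable.
class PlainConverter {
  def convert(s: String): String = s.toUpperCase
}

case class ExprWithLazyConverter(name: String) {
  // @transient: the field is never written during serialization.
  // lazy: the converter is recreated on first access after deserialization.
  @transient lazy val converter = new PlainConverter
}

// Round-trip an object through Java serialization.
def roundTrip[T](obj: T): T = {
  val buf = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(buf)
  oos.writeObject(obj)
  oos.close()
  new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
    .readObject()
    .asInstanceOf[T]
}

val restored = roundTrip(ExprWithLazyConverter("from_avro"))
println(restored.name)                       // ordinary field survives the round trip
println(restored.converter.convert("avro"))  // converter is rebuilt lazily, no exception
```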