AbsaOSS / ABRiS

Avro SerDe for Apache Spark structured APIs.
Apache License 2.0

How to handle deserialization issues in from_avro? #182

Closed · nsanglar closed this issue 3 years ago

nsanglar commented 3 years ago

Hello!

I am currently facing the following issue:

  1. We read Avro records from a Kafka topic with Spark Structured Streaming (2.4.x).
  2. One of the Avro records contains a malformed byte array (the field type is bytes with logical type decimal).
  3. This makes deserialization fail, and because the job aborts, it never commits the processed offsets.
  4. Upon restart, the job re-reads the faulty data and cannot progress any further.

I would like to be able to ignore such cases where deserialization fails, but I am struggling to find a clean solution. Do you have any ideas?
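
For reference, the setup looks roughly like the sketch below. This is a minimal sketch assuming the ABRiS 4.x-style config API; the topic name, Schema Registry URL, and broker address are placeholder values.

```scala
// Minimal sketch of the streaming read; topic, registry, and broker are placeholders.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import za.co.absa.abris.avro.functions.from_avro
import za.co.absa.abris.config.AbrisConfig

val spark = SparkSession.builder().appName("abris-example").getOrCreate()

// Reader schema is fetched from Schema Registry (ABRiS 4.x-style config API).
val abrisConfig = AbrisConfig
  .fromConfluentAvro
  .downloadReaderSchemaByLatestVersion
  .andTopicNameStrategy("some-topic")
  .usingSchemaRegistry("http://schema-registry:8081")

val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "some-topic")
  .load()

// When one record's decimal bytes are malformed, from_avro throws, the
// micro-batch aborts before committing offsets, and the job re-reads the
// same record after every restart.
val decoded = input.select(from_avro(col("value"), abrisConfig).as("data"))
```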

cerveada commented 3 years ago

Hello, sorry, right now there is no option in Abris to handle that. I created a ticket for it: #183. For now, the only option is to detect and replace/fix the faulty row before Abris is called.
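
For illustration, one way to do that detection is sketched below. This is not an Abris feature; the reader schema, the field name `amount`, its precision/scale, and the 5-byte Confluent wire-format header offset are all assumptions for the sketch, and `input` is the raw Kafka DataFrame from the example above.

```scala
// Sketch only: pre-filter payloads that will not survive deserialization,
// before handing the stream to Abris. "input" is the raw Kafka DataFrame
// with a binary "value" column.
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory
import org.apache.spark.sql.functions.{col, udf}

// Illustrative reader schema with the problematic decimal-bytes field.
val readerSchemaJson =
  """{"type": "record", "name": "Payment", "fields": [
    |  {"name": "amount", "type": {"type": "bytes", "logicalType": "decimal", "precision": 10, "scale": 2}}
    |]}""".stripMargin

val isValid = udf { (payload: Array[Byte]) =>
  try {
    // Parsing the schema per call keeps the closure trivially serializable;
    // a real implementation would cache it.
    val schema = new Schema.Parser().parse(readerSchemaJson)
    val reader = new GenericDatumReader[GenericRecord](schema)
    // Confluent wire format: 1 magic byte + 4-byte schema id precede the Avro body.
    val decoder = DecoderFactory.get().binaryDecoder(payload, 5, payload.length - 5, null)
    val record = reader.read(null, decoder)
    // Malformed decimal bytes may still decode as raw bytes, so also check that
    // the unscaled value fits the declared precision before Spark converts it.
    val buf = record.get("amount").asInstanceOf[java.nio.ByteBuffer]
    val bytes = new Array[Byte](buf.remaining())
    buf.get(bytes)
    new java.math.BigDecimal(new java.math.BigInteger(bytes), 2).precision() <= 10
  } catch {
    case _: Exception => false
  }
}

// Keep only rows that pass the check (or route the rest to a dead-letter sink),
// then call from_avro on the clean stream as usual.
val clean = input.filter(isValid(col("value")))
```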

moyphilip commented 3 years ago

@nsanglar hey do you have a solution for your problem? I ran into a similar issue.

nsanglar commented 3 years ago

@moyphilip I currently have a fork of the project in which I apply different logic here: https://github.com/AbsaOSS/ABRiS/blob/6520a91962708189b340c87b9dfa1c482f0d7480/src/main/scala/za/co/absa/abris/avro/sql/AvroDataToCatalyst.scala#L82

Instead of throwing an exception, I return an empty row and log an error. I guess this is quite specific to my use case, so I am not sure it would be appropriate to incorporate upstream.
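
Roughly, the change has the shape sketched below, inside the expression linked above; the member names (`nullSafeEval`, `decode`) are approximations of that file rather than anything upstream Abris exposes.

```scala
// Rough sketch of the shape of the change in the fork, inside AvroDataToCatalyst;
// member names (nullSafeEval, decode) are approximations of the linked file.
import scala.util.control.NonFatal

override def nullSafeEval(input: Any): Any = {
  try {
    decode(input.asInstanceOf[Array[Byte]])   // the existing Abris decoding path
  } catch {
    case NonFatal(e) =>
      // Log the failure and return null so the record surfaces as a null/empty
      // result instead of aborting the whole micro-batch.
      System.err.println(s"Skipping undeserializable record: ${e.getMessage}")
      null
  }
}
```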

ScaddingJ commented 2 years ago

@nsanglar how did you manage to return an empty row in your use case? I have a reason to do the same in mine.