AbsaOSS / ABRiS

Avro SerDe for Apache Spark structured APIs.
Apache License 2.0

Handle malformed records by providing a PermissiveRecordExceptionHandler #328

Closed willianmrs closed 1 year ago

willianmrs commented 1 year ago

Title: Handle malformed records by providing a PermissiveRecordExceptionHandler

Problem: During data ingestion, we noticed that the application fails when it encounters malformed Avro records. These failures interrupted our data pipeline, which is particularly problematic with large datasets.

Solution: I've implemented a 'PermissiveRecordExceptionHandler' that substitutes any malformed records with fully null records, instead of letting the application crash. This approach allows the pipeline to continue processing the remaining data.

Changes: Created a new PermissiveRecordExceptionHandler class. The handle method in this class catches any exceptions during deserialization, logs a warning message, and replaces the malformed record with a fully null record.

Tests: A new unit test PermissiveRecordExceptionHandlerSpec has been added. This test checks if the exception handler correctly replaces a problematic record with a fully null record when an exception occurs during deserialization.

The test first constructs an expected SpecificInternalRow with null values, then triggers the exception handler by passing a new Exception, and checks that the output matches the expected null record.
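For readers who want the idea without the Spark and Avro dependencies, the handler and the check described above can be sketched roughly as follows. This is not the code from this PR and not the ABRiS API: `Row`, `ExceptionHandler`, `FailFastHandler`, `PermissiveHandler`, `decode`, and `decodeAll` are hypothetical names, and a toy comma-separated decoder stands in for Avro deserialization.

```scala
import scala.util.control.NonFatal

// Dependency-free sketch; every name here is hypothetical, not the ABRiS API.
object PermissiveHandlerSketch {
  // A row is modelled as a fixed-width vector of nullable field values.
  final case class Row(values: Vector[Any])

  trait ExceptionHandler {
    // Called when decoding fails; returns the row to emit for the bad record.
    def handle(exception: Throwable, fieldCount: Int): Row
  }

  // Fail-fast behaviour: rethrow, which stops the whole pipeline.
  object FailFastHandler extends ExceptionHandler {
    def handle(exception: Throwable, fieldCount: Int): Row = throw exception
  }

  // Permissive behaviour: log a warning and emit a row whose fields are all null.
  object PermissiveHandler extends ExceptionHandler {
    def handle(exception: Throwable, fieldCount: Int): Row = {
      Console.err.println(s"WARN: malformed record replaced with nulls: ${exception.getMessage}")
      Row(Vector.fill[Any](fieldCount)(null))
    }
  }

  // Toy decoder standing in for Avro deserialization: expects "name,age".
  def decode(input: String): Row = {
    val parts = input.split(",", -1)
    require(parts.length == 2, s"expected 2 fields, got ${parts.length}")
    Row(Vector[Any](parts(0), parts(1).toInt))
  }

  // Decodes every input, delegating any non-fatal failure to the handler.
  def decodeAll(inputs: Seq[String], handler: ExceptionHandler): Seq[Row] =
    inputs.map { in =>
      try decode(in)
      catch { case NonFatal(e) => handler.handle(e, fieldCount = 2) }
    }

  def main(args: Array[String]): Unit = {
    // Mirrors the test described above: build the expected null row,
    // trigger the handler with a new Exception, and compare.
    val expectedNullRow = Row(Vector[Any](null, null))
    assert(PermissiveHandler.handle(new Exception("boom"), 2) == expectedNullRow)

    // Malformed inputs become null rows; well-formed inputs are unaffected.
    val rows = decodeAll(Seq("alice,30", "garbage", "bob,x"), PermissiveHandler)
    assert(rows == Seq(Row(Vector[Any]("alice", 30)), expectedNullRow, expectedNullRow))
  }
}
```

One trade-off of the permissive approach is that a null row alone does not tell downstream consumers which input was malformed, so the warning log is the only trace of the failure in this sketch.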

Test file: PermissiveRecordExceptionHandlerSpec.scala

Linked issue: https://github.com/AbsaOSS/ABRiS/issues/318

willianmrs commented 1 year ago

I've encountered an issue with the GitHub Actions workflow for this pull request. The error message "Error: HttpError: Resource not accessible by integration" indicates that the integration or GitHub App used for running the workflow does not have the necessary permissions to access the specified resource.

cerveada commented 1 year ago

@miroslavpojer Hi, do you know what could cause this issue? It is happening during the test coverage phase.

miroslavpojer commented 1 year ago

I am analyzing what is happening here. For now it looks to be this issue. I am waiting for the DevOps team's opinion on the GitHub Actions token configuration. I ran the code coverage manually to confirm the state and to get the results that are missing here.

Build Scala 2.12, Spark 3.2: Reached 90% - OK.

Build Scala 2.13, Spark 3.2: Reached 90% - OK.

Zejnilovic commented 1 year ago

Hello @willianmrs, please update your branch with the newest master, which contains what I believe to be the fix for the GitHub action.

cerveada commented 1 year ago

I have merged master myself, but it still fails.

Zejnilovic commented 1 year ago

Permissions are still not getting read correctly 🤷‍♂️

cerveada commented 1 year ago

There is an issue with the code coverage plugin, but this PR LGTM.

willianmrs commented 1 year ago

Hello guys, I saw that @cerveada already merged master, but the error remains. Any idea?

github-actions[bot] commented 1 year ago

JaCoCo code coverage report - Scala 2.13 & Spark 3.2

There is no coverage information present for the Files changed

Total Project Coverage 68.67% :green_apple:

github-actions[bot] commented 1 year ago

JaCoCo code coverage report - Scala 2.12 & Spark 3.2

There is no coverage information present for the Files changed

Total Project Coverage 69.39% :green_apple:

miroslavpojer commented 1 year ago

Great work. The JaCoCo GitHub Action works now.