AbsaOSS / enceladus

Dynamic Conformance Engine
Apache License 2.0

SparkJobs return code based on data quality #1542

Open benedeki opened 3 years ago

benedeki commented 3 years ago

Background

SparkJobs return a non-zero return code when the data definition is incorrect (for example, data not adhering to the schema) or when some other exception occurs.

Feature

It has been suggested that, to allow more nuanced pipeline processing, it should be possible to define a data quality threshold. If that threshold is breached, SparkJobs would also end with a non-zero exit code (different from the one used for other exceptions).

Proposed Solution

We need to figure out how to specify the data-quality thresholds and then how to perform the checks. This may turn into an Epic.
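To make the idea concrete, here is a minimal sketch of one possible shape for the check, written in plain Python rather than as actual Enceladus code. Everything in it is an assumption made for illustration: the exit code values, the `QualityThreshold` type, the `exit_code_for` function, and the choice of an error ratio as the quality measure are all hypothetical and not taken from the project.

```python
from dataclasses import dataclass

# Hypothetical exit codes; the issue does not define actual values.
EXIT_OK = 0
EXIT_JOB_FAILED = 1        # schema mismatch or some other exception
EXIT_QUALITY_BREACHED = 2  # job finished, but too many records had errors


@dataclass(frozen=True)
class QualityThreshold:
    """Maximum tolerated share of records with errors (0.0 to 1.0)."""
    max_error_ratio: float


def exit_code_for(total_records: int, error_records: int,
                  threshold: QualityThreshold) -> int:
    """Pick the exit code once the job has finished and written its output."""
    if total_records == 0:
        # Assumption: an empty run does not count as a quality breach.
        return EXIT_OK
    error_ratio = error_records / total_records
    if error_ratio > threshold.max_error_ratio:
        return EXIT_QUALITY_BREACHED
    return EXIT_OK
```

A job wrapper could call `sys.exit(exit_code_for(...))` as its very last step, so a scheduler can tell a quality breach apart from a hard failure by the code alone.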

Zejnilovic commented 3 years ago

Collecting data from within a distributed app could be bad. However I look at it, this will need a third-party or external app monitoring the job. E.g. Menas would need to read the DataQuality messages provided by the Kafka plugin, monitor the jobs, and then kill a job using a yarn command if it sees too many "fails".

benedeki commented 3 years ago

I don't think the requested feature necessarily means killing the job once it gets over the threshold. (In that case your suggestion would indeed probably be the only one that works.) But killing the job would mean no result from the Spark job, and therefore only limited information about what the quality issues were. The requested feature was that it would be enough for the return code to be set only at the end, as a "result" of post-processing.