Describe the bug
Streaming Conformance takes far too long to warm up (20 minutes) and to process a single micro-batch (8 minutes).
This happens due to the Catalyst issue reported to Spark: https://issues.apache.org/jira/browse/SPARK-28090
We have a workaround for batch: #190, #413
The job hangs completely if no workarounds are used: #1306
To Reproduce
Steps to reproduce the behavior OR commands run:
Create a schema with nested arrays of structs.
Create a dataset with about 20 conformance rules. Some of the rules should operate inside an array.
Run streaming conformance.
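As a rough sketch of the kind of data the steps above describe (all names here are made up for illustration): each record carries an array of struct-like elements, and a conformance rule rewrites a field inside that array.

```python
# Hypothetical illustration of a nested array-of-structs record and one toy
# "conformance rule" operating inside the array (names are invented).
record = {
    "id": 1,
    "legs": [  # array of structs
        {"currency": "usd", "amount": 100},
        {"currency": "eur", "amount": 250},
    ],
}

def uppercase_currency(rec):
    """Toy rule: uppercase a field of every element inside the array."""
    for leg in rec["legs"]:
        leg["currency"] = leg["currency"].upper()
    return rec

conformed = uppercase_currency(record)
print(conformed["legs"][0]["currency"])  # USD
```

In the real job each such rule becomes its own Catalyst projection over the nested structure, which is where SPARK-28090 bites.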
Expected behaviour
Since all transformations are just projections, the performance of a conformance job should be good.
Additional context
Ideas of a solution:
[ ] Tweak Catalyst workaround
[ ] Rewrite the conformance interpreter so that it uses a single traversal for all conformance rules. We need to take into account that rules can depend on each other; all dependencies need to be resolved for the rules to be applicable in a single traversal.
[ ] Rewrite the conformance interpreter using RDDs of Row + schema. Use a mapping lambda function to do the conformance transformations in an imperative way. This way Spark is used only as a computation engine; since Catalyst is not involved, the Catalyst bug won't affect the computation.
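The single-traversal idea can be sketched in plain Python (the actual interpreter is Spark-based; rule names and the dependency layout here are invented for illustration): resolve rule dependencies up front with a topological sort, then apply every rule to a record in one pass instead of one projection per rule.

```python
from graphlib import TopologicalSorter

# Hypothetical rule set: name -> (names of rules it depends on, transformation).
# Transformations are applied to a plain record dict.
rules = {
    "upper_code": ((), lambda r: {**r, "code": r["code"].upper()}),
    "add_prefix": (("upper_code",), lambda r: {**r, "code": "X-" + r["code"]}),
}

def conform(record):
    """Apply all rules to one record in a single traversal, in dependency order."""
    order = TopologicalSorter(
        {name: set(deps) for name, (deps, _) in rules.items()}
    ).static_order()
    for name in order:
        record = rules[name][1](record)
    return record

print(conform({"code": "abc"}))  # {'code': 'X-ABC'}
```

In the RDD variant of the idea, a function like `conform` would be the lambda passed to a `map` over `RDD[Row]`, with the schema carried alongside, so Catalyst never sees the per-rule transformations.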