awslabs / python-deequ

Python API for Deequ
Apache License 2.0

Spark Operator on EKS - Spark 3.1.1 #110

Open sh-manish opened 1 year ago

sh-manish commented 1 year ago

Here is my configuration:

Spark: 3.1.1, Hadoop: 3.2, Deequ: 2.0.0-spark-3.1
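A quick sanity check on the versions above can be sketched in Python. The helper name `deequ_matches_spark` is hypothetical, and the sketch assumes the Deequ artifact version ends in a `-spark-<major>.<minor>` suffix, as it does in the config above. It only compares major and minor numbers; it does not diagnose the error itself.

```python
import re


def deequ_matches_spark(deequ_version: str, spark_version: str) -> bool:
    """Return True if the Deequ artifact's Spark suffix matches Spark's major.minor.

    Hypothetical helper; assumes a suffix of the form '-spark-X.Y'.
    """
    suffix = re.search(r"-spark-(\d+)\.(\d+)$", deequ_version)
    spark = re.match(r"(\d+)\.(\d+)", spark_version)
    if suffix is None or spark is None:
        raise ValueError("unrecognized version string")
    # Compare (major, minor) pairs as strings, e.g. ('3', '1') == ('3', '1').
    return suffix.groups() == spark.groups()


# Versions taken from the config above.
print(deequ_matches_spark("2.0.0-spark-3.1", "3.1.1"))
```

For the versions listed here the pair matches, so a plain version mismatch between the Deequ artifact and Spark does not appear to be the cause.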

The ColumnProfilerRunner fails with the error below (other methods work fine):

Traceback (most recent call last):
  File "/tmp/spark-fd2d4030-8ffc-41b9-a48d-864814ddfe79/profiler_v2.py", line 127, in start_profiling_job
    profiling_job.start()
  File "/tmp/spark-fd2d4030-8ffc-41b9-a48d-864814ddfe79/profiler_v2.py", line 173, in start
    ColumnProfilerRunner(self.spark)
  File "/usr/local/lib/python3.9/dist-packages/pydeequ/profiles.py", line 121, in run
    run = self._ColumnProfilerRunBuilder.run()
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
    return_value = get_return_value(
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o111.run.
: com.amazon.deequ.analyzers.runners.MetricCalculationRuntimeException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 (TID 7) (192.168.137.235 executor 4): java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.sql.execution.aggregate.HashAggregateExec.aggregateExpressions of type scala.collection.Seq in instance of org.apache.spark.sql.execution.aggregate.HashAggregateExec

Has anyone else encountered this error?

chenliu0831 commented 1 year ago

@sh-manish could you try a higher Spark version and a later Deequ release, using the mainline of PyDeequ?
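When trying a different Deequ release, the place to change is usually the Maven coordinate passed to Spark. The sketch below is only illustrative: `deequ_spark_conf` is a hypothetical helper, the version string is the one from this issue (not a recommended upgrade target), and it assumes the jars are resolved through `spark.jars.packages` rather than baked into the container image, which may not hold for a Spark Operator deployment. Recent PyDeequ releases also expect a `SPARK_VERSION` environment variable to be set before `pydeequ` is imported.

```python
def deequ_spark_conf(deequ_version: str) -> dict:
    """Build Spark conf entries that pull a given Deequ release from Maven.

    Hypothetical helper: the caller supplies whichever Deequ version
    matches the Spark version on the cluster.
    """
    return {
        # Maven coordinate of the Deequ jar to load on driver and executors.
        "spark.jars.packages": f"com.amazon.deequ:deequ:{deequ_version}",
        # Exclusion used in the PyDeequ README alongside the Deequ package.
        "spark.jars.excludes": "net.sourceforge.f2j:arpack_combined_all",
    }


# Version from the issue above; swap in a later release when testing an upgrade.
conf = deequ_spark_conf("2.0.0-spark-3.1")
for key, value in conf.items():
    print(f"--conf {key}={value}")
```

The same key/value pairs can be applied with `SparkSession.builder.config(key, value)` or in the `sparkConf` section of a SparkApplication manifest.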