awslabs / python-deequ

Python API for Deequ
Apache License 2.0

ColumnProfilerRunner of Pydeequ is not working appropriately #43

Open ManipalReddy opened 3 years ago

ManipalReddy commented 3 years ago

Describe the bug

ColumnProfilerRunner in PyDeequ fails on every dataset I have tested: both our internal data and the custom DataFrame given in the PyDeequ documentation. I ran the job in AWS Glue ETL.

The underlying Deequ jar version that we are using is: deequ-1.0.2.jar

I believe the PyDeequ version we are using is 0.1.5.

Expected behavior

I expected ColumnProfilerRunner to return the profiling results. Instead, it fails with:

py4j.protocol.Py4JError: An error occurred while calling o226.kll. Trace:
py4j.Py4JException: Method kll([]) does not exist

Screenshots

Screenshot from the Glue ETL logs: [image]

Additional context

ConstraintSuggestionRunner and the Analyzers work fine on the same data and in the same environment.
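
For anyone triaging similar logs: the Py4J message in this report names the JVM method that could not be found (here `kll`, called with no arguments). Below is a minimal sketch, using only the standard library, that pulls that name out of the error text. The helper `missing_jvm_method` and its regular expression are illustrative assumptions based on the single message quoted in this report. They are not part of PyDeequ or Py4J.

```python
import re

# Assumption: the message follows the shape quoted in this report,
# "Method <name>([<arg types>]) does not exist".
_MISSING_METHOD = re.compile(r"Method (\w+)\(\[(.*?)\]\) does not exist")


def missing_jvm_method(error_text):
    """Return (method_name, arg_types) for a missing-method error, else None."""
    match = _MISSING_METHOD.search(error_text)
    if match is None:
        return None
    name, raw_args = match.groups()
    # An empty "[]" means the method was called without arguments.
    args = [part.strip() for part in raw_args.split(",") if part.strip()]
    return name, args


# The error text from this report.
message = (
    "py4j.protocol.Py4JError: An error occurred while calling o226.kll. "
    "Trace: py4j.Py4JException: Method kll([]) does not exist"
)

print(missing_jvm_method(message))
```

A missing method on the JVM side is consistent with the Python wrapper calling something that the loaded Deequ jar does not provide, though the message alone does not prove that.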

jaoanan1126 commented 3 years ago

Hi @ManipalReddy, we have yet to test PyDeequ with Deequ 1.0.2. Currently, PyDeequ uses Deequ 1.1.0 - spark-2.4-scala-2.11. I'll add it to our to-do list!
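
Until other versions are tested, a quick sanity check on the jar name may help catch a mismatch before the job runs. This is only a sketch: `deequ_jar_version`, `matches_expected`, and `EXPECTED_DEEQU` are hypothetical names, not part of PyDeequ, and the expected version is taken from the comment above. The second jar name is just an example of a file name that carries the Spark and Scala suffix.

```python
import re

# Assumption: the Deequ version PyDeequ currently targets, per the comment above.
EXPECTED_DEEQU = (1, 1, 0)


def deequ_jar_version(jar_name):
    """Extract (major, minor, patch) from a name such as 'deequ-1.0.2.jar'."""
    match = re.search(r"deequ-(\d+)\.(\d+)\.(\d+)", jar_name)
    if match is None:
        raise ValueError(f"no Deequ version found in {jar_name!r}")
    return tuple(int(part) for part in match.groups())


def matches_expected(jar_name, expected=EXPECTED_DEEQU):
    """Return True if the jar's version equals the expected Deequ version."""
    return deequ_jar_version(jar_name) == expected


# The jar from this issue versus an example name for the 1.1.0 build.
for jar in ("deequ-1.0.2.jar", "deequ-1.1.0_spark-2.4-scala-2.11.jar"):
    print(jar, deequ_jar_version(jar), matches_expected(jar))
```

The check compares the version number only. It does not confirm that the Spark or Scala suffix matches the cluster.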

jinyang08 commented 3 years ago

Hi @jaoanan1126, can I use Deequ 1.1.0 with Spark 3.1.1? I am currently using Deequ 1.2.2, but a lot of functions are not working. Can I go back to 1.1.0 without switching the Spark version? Also, which PyDeequ version goes with Deequ 1.1.0?