awslabs / python-deequ

Python API for Deequ
Apache License 2.0
731 stars 138 forks source link

py4j.protocol.Py4JError while adding Analyzer in PyDeequ + PySpark #154

Open Indu-sharma opened 1 year ago

Indu-sharma commented 1 year ago

Describe the bug I'm trying to add analysers for the columns of a PySpark Data Frame, I'm getting py4j.protocol.Py4JError.

To Reproduce Steps to reproduce the behavior:

  1. Install pydeequ==1.1.0, py4j==0.10.9.7 , pyspark==3.4.1 and Spark == 3.3.0.
  2. Read CSV to PySpark DataFrame.
  3. Instantiate the Analsysis Runner :: analysis_runner = AnalysisRunner(spark).onData(df).addAnalyzer(Size())
  4. Add Completeness and Uniqueness DQ rules for every columns as:

for column in df.columns: analysis_runner.addAnalyzer(Uniqueness(column)) analysis_runner.addAnalyzer(Completeness(column)) analysisResult= analysis_runner.run()

  1. Run The Pyspark script to see following error:

    analysis_runner.addAnalyzer(Uniqueness(column)) File "/Users/indusharma/Library/Python/3.9/lib/python/site-packages/pydeequ/analyzers.py", line 134, in addAnalyzer _analyzer_jvm = analyzer._analyzer_jvm File "/Users/indusharma/Library/Python/3.9/lib/python/site-packages/pydeequ/analyzers.py", line 775, in _analyzer_jvm to_scala_seq(self._jvm, self.columns), self._jvm.scala.Option.apply(self.where) File "/Users/indusharma/Library/Python/3.9/lib/python/site-packages/pydeequ/scala_utils.py", line 80, in to_scala_seq return jvm.scala.collection.JavaConversions.iterableAsScalaIterable(iterable).toSeq() File "/Users/indusharma/Library/Python/3.9/lib/python/site-packages/py4j/java_gateway.py", line 1322, in call return_value = get_return_value( File "/Users/indusharma/Library/Python/3.9/lib/python/site-packages/pyspark/errors/exceptions/captured.py", line 169, in deco return f(*a, **kw) File "/Users/indusharma/Library/Python/3.9/lib/python/site-packages/py4j/protocol.py", line 330, in get_return_value raise Py4JError( py4j.protocol.Py4JError: An error occurred while calling z:scala.collection.JavaConversions.iterableAsScalaIterable. Trace: py4j.Py4JException: Method iterableAsScalaIterable([class java.lang.String]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:321) at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342) at py4j.Gateway.invoke(Gateway.java:276) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182) at py4j.ClientServerConnection.run(ClientServerConnection.java:106) at java.lang.Thread.run(Thread.java:750)

Expected behavior Analysers are run successfully.

Screenshots If applicable, add screenshots to help explain your problem.

Desktop (please complete the following information): N/A

Smartphone (please complete the following information): N/A

Additional context

Indu-sharma commented 1 year ago

I figured out that issue is with analysis_runner.addAnalyzer(Uniqueness(column)) ie. Uniqueness rule.

And
This line of code has the issue. return jvm.scala.collection.JavaConversions.iterableAsScalaIterable(iterable).toSeq()

chenliu0831 commented 1 year ago

Hmm, Uniqueness has been tested https://github.com/awslabs/python-deequ/blob/ef6e6afb2f1403d2528e9eb1bf67506754ec127d/tests/test_analyzers.py#L245. Could you take a look at the test case? Looks like you should pass a list of columns instead?