awslabs / python-deequ

Python API for Deequ
Apache License 2.0

Getting error while running pydeequ locally on Spark #35

Open sharwandhidariya opened 3 years ago

sharwandhidariya commented 3 years ago

I am trying to import pydeequ, and while running it locally on a Windows machine I am getting the following error.

answer = 'xro55', gateway_client = <py4j.java_gateway.GatewayClient object at 0x000001A0FFEA8B70>, target_id = None, name = 'com.amazon.deequ.analyzers.Completeness'

def get_return_value(answer, gateway_client, target_id=None, name=None):
    """Converts an answer received from the Java gateway into a Python object.

    For example, string representation of integers are converted to Python
    integer, string representation of objects are converted to JavaObject
    instances, etc.

    :param answer: the string returned by the Java gateway
    :param gateway_client: the gateway client used to communicate with the Java
        Gateway. Only necessary if the answer is a reference (e.g., object,
        list, map)
    :param target_id: the name of the object from which the answer comes from
        (e.g., *object1* in `object1.hello()`). Optional.
    :param name: the name of the member from which the answer comes from
        (e.g., *hello* in `object1.hello()`). Optional.
    """
    if is_error(answer)[0]:
        if len(answer) > 1:
            type = answer[1]
            value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
            if answer[1] == REFERENCE_TYPE:
                raise Py4JJavaError(
                    "An error occurred while calling {0}{1}{2}.\n".
                  format(target_id, ".", name), value)

E py4j.protocol.Py4JJavaError: An error occurred while calling None.com.amazon.deequ.analyzers.Completeness.
E : java.lang.NoClassDefFoundError: scala/Product$class
E     at com.amazon.deequ.analyzers.Completeness.<init>(Completeness.scala:27)
E     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
E     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
E     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
E     at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
E     at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
E     at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
E     at py4j.Gateway.invoke(Gateway.java:238)
E     at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
E     at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
E     at py4j.GatewayConnection.run(GatewayConnection.java:238)
E     at java.lang.Thread.run(Thread.java:748)
E Caused by: java.lang.ClassNotFoundException: scala.Product$class
E     at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
E     at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
E     at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:355)
E     at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
E     ... 12 more

C:\SPARK\spark-2.4.2-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py:328: Py4JJavaError

Here are my library versions: deequ-1.0.2.jar, pyspark 2.4.7
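For readers hitting the same trace: a NoClassDefFoundError for scala/Product$class typically indicates that a jar compiled for Scala 2.11 is being loaded on a Scala 2.12 runtime, while the NoSuchMethodError on a Spark class reported further down this thread typically indicates a Deequ jar built against a different Spark release than the one running. The sketch below is not part of pydeequ; `diagnose_jvm_error` and its return labels are hypothetical names, and the string matching is a rough heuristic rather than a definitive diagnosis.

```python
def diagnose_jvm_error(message: str) -> str:
    """Return a rough, hypothetical label for a Py4J-wrapped JVM error message.

    The labels are illustrative only; they are not produced by pydeequ or Spark.
    """
    # Scala 2.11 compiles traits to helper classes such as Product$class;
    # those classes are absent from a Scala 2.12 runtime.
    if "scala/Product$class" in message or "scala.Product$class" in message:
        return "scala-binary-mismatch"

    # A missing method on a Spark class suggests the jar was compiled
    # against a different Spark release than the one on the classpath.
    if "java.lang.NoSuchMethodError" in message and "org.apache.spark" in message:
        return "spark-api-mismatch"

    return "unknown"
```

Passing the text of the Py4JJavaError (for example `str(exc)` inside an `except` block) is one way to feed it; either label points at aligning the Deequ jar with the installed Spark and Scala versions.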

gustavorps commented 3 years ago

+1

$ pip install pydeequ
Collecting pydeequ
  Downloading pydeequ-0.1.7-py3-none-any.whl (34 kB)

>>> pyspark.__version__
'3.1.2'

>>> result = pydeequ.profiles.ColumnProfilerRunner(spark) \
...                               .onData(df) \
...                               .run()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/gustavorps/.local/miniconda3/lib/python3.9/site-packages/pydeequ/profiles.py", line 103, in run
    run = self._ColumnProfilerRunBuilder.run()
  File "/home/gustavorps/.local/miniconda3/lib/python3.9/site-packages/py4j/java_gateway.py", line 1304, in __call__
    return_value = get_return_value(
  File "/home/gustavorps/.local/miniconda3/lib/python3.9/site-packages/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/home/gustavorps/.local/miniconda3/lib/python3.9/site-packages/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o92.run.
: java.lang.NoClassDefFoundError: scala/Product$class
        at com.amazon.deequ.profiles.ColumnProfilerRunBuilderFileOutputOptions.<init>(ColumnProfilerRunner.scala:31)
        at com.amazon.deequ.profiles.ColumnProfilerRunBuilder.run(ColumnProfilerRunBuilder.scala:174)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.base/java.lang.Thread.run(Thread.java:829)

vinura commented 3 years ago

I get the same error.

Py4JJavaError: An error occurred while calling o285.run.
: java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.expressions.aggregate.AggregateFunction.toAggregateExpression(Z)Lorg/apache/spark/sql/catalyst/expressions/aggregate/AggregateExpression;
    at org.apache.spark.sql.DeequFunctions$.withAggregateFunction(DeequFunctions.scala:31)
    at org.apache.spark.sql.DeequFunctions$.stateful_approx_count_distinct(DeequFunctions.scala:60)
    at com.amazon.deequ.analyzers.ApproxCountDistinct.aggregationFunctions(ApproxCountDistinct.scala:52)
    at com.amazon.deequ.analyzers.runners.AnalysisRunner$.$anonfun$runScanningAnalyzers$3(AnalysisRunner.scala:319)
    at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
    at scala.collection.immutable.List.foreach(List.scala:392)
    at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
    at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
    at scala.collection.immutable.List.flatMap(List.scala:355)
    at com.amazon.deequ.analyzers.runners.AnalysisRunner$.liftedTree1$1(AnalysisRunner.scala:319)
    at com.amazon.deequ.analyzers.runners.AnalysisRunner$.runScanningAnalyzers(AnalysisRunner.scala:318)
    at com.amazon.deequ.analyzers.runners.AnalysisRunner$.doAnalysisRun(AnalysisRunner.scala:167)
    at com.amazon.deequ.analyzers.runners.AnalysisRunBuilder.run(AnalysisRunBuilder.scala:110)
    at com.amazon.deequ.profiles.ColumnProfiler$.profile(ColumnProfiler.scala:141)
    at com.amazon.deequ.profiles.ColumnProfilerRunner.run(ColumnProfilerRunner.scala:72)
    at com.amazon.deequ.profiles.ColumnProfilerRunBuilder.run(ColumnProfilerRunBuilder.scala:185)
    at com.amazon.deequ.suggestions.ConstraintSuggestionRunner.profileAndSuggest(ConstraintSuggestionRunner.scala:203)
    at com.amazon.deequ.suggestions.ConstraintSuggestionRunner.run(ConstraintSuggestionRunner.scala:102)
    at com.amazon.deequ.suggestions.ConstraintSuggestionRunBuilder.run(ConstraintSuggestionRunBuilder.scala:226)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Unknown Source)

harshareddy87 commented 2 years ago

Was anyone able to find a solution for this issue? I started running pydeequ locally and couldn't get it working.

codigoscupom commented 2 years ago

Fixed it. I am running it on macOS. I had to downgrade Spark, Scala, and PySpark. Current setup: pydeequ 1.0.1, Scala 2.11, Java 8.

Created the environment variable SPARK_VERSION="2.3.2".

Configured SPARK_HOME:

export SPARK_HOME=/usr/local/Cellar/apache-spark@2.3.2/2.3.2/libexec

Spark and PySpark are both on version 2.3.2
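The same environment setup can also be done from Python before the Spark session is created. This is a minimal sketch: `configure_spark_env` is a hypothetical helper, not a pydeequ function, and it assumes (as this comment reports) that pydeequ consults the SPARK_VERSION environment variable.

```python
import os
from typing import MutableMapping, Optional


def configure_spark_env(
    spark_version: str,
    spark_home: Optional[str] = None,
    env: Optional[MutableMapping[str, str]] = None,
) -> MutableMapping[str, str]:
    """Set SPARK_VERSION (and optionally SPARK_HOME) in an environment mapping.

    Hypothetical helper for illustration. `env` defaults to os.environ;
    passing a plain dict makes the function easy to try without side effects.
    """
    target = os.environ if env is None else env
    target["SPARK_VERSION"] = spark_version
    if spark_home is not None:
        target["SPARK_HOME"] = spark_home
    return target
```

With the values from this comment, the call would be `configure_spark_env("2.3.2", "/usr/local/Cellar/apache-spark@2.3.2/2.3.2/libexec")`, followed by `import pydeequ`. Calling it before the import is the cautious order, since whether pydeequ reads the variable at import time is an assumption here.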


same here: java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.expressions.aggregate.AggregateFunction.toAggregateExpression(Z)Lorg/apache/spark/sql/catalyst/expressions/aggregate/AggregateExpression

hugowschneider commented 2 years ago

Same here

Code is the same as the example

...
result = ColumnProfilerRunner(spark) \
    .onData(dataframe) \
    .run()

for col, profile in result.profiles.items():
    print(profile)
...

Error:

Traceback (most recent call last):
  File "/Users/___/Development/pyspark/tests/deequ_profile_test.py", line 24, in test_data_frame_schema
    result = ColumnProfilerRunner(self.spark) \
  File "/Users/___/opt/anaconda3/envs/pyspark/lib/python3.8/site-packages/pydeequ/profiles.py", line 121, in run
    run = self._ColumnProfilerRunBuilder.run()
  File "/Users/___/opt/anaconda3/envs/pyspark/lib/python3.8/site-packages/py4j/java_gateway.py", line 1309, in __call__
    return_value = get_return_value(
  File "/Users/___/opt/anaconda3/envs/pyspark/lib/python3.8/site-packages/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/Users/___/opt/anaconda3/envs/pyspark/lib/python3.8/site-packages/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o45.run.
: java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.expressions.aggregate.AggregateFunction.toAggregateExpression(Z)Lorg/apache/spark/sql/catalyst/expressions/aggregate/AggregateExpression;
    at org.apache.spark.sql.DeequFunctions$.withAggregateFunction(DeequFunctions.scala:31)
    at org.apache.spark.sql.DeequFunctions$.stateful_approx_count_distinct(DeequFunctions.scala:60)
    at com.amazon.deequ.analyzers.ApproxCountDistinct.aggregationFunctions(ApproxCountDistinct.scala:52)
    at com.amazon.deequ.analyzers.runners.AnalysisRunner$.$anonfun$runScanningAnalyzers$3(AnalysisRunner.scala:319)
    at scala.collection.immutable.List.flatMap(List.scala:366)
    at com.amazon.deequ.analyzers.runners.AnalysisRunner$.liftedTree1$1(AnalysisRunner.scala:319)
    at com.amazon.deequ.analyzers.runners.AnalysisRunner$.runScanningAnalyzers(AnalysisRunner.scala:318)
    at com.amazon.deequ.analyzers.runners.AnalysisRunner$.doAnalysisRun(AnalysisRunner.scala:167)
    at com.amazon.deequ.analyzers.runners.AnalysisRunBuilder.run(AnalysisRunBuilder.scala:110)
    at com.amazon.deequ.profiles.ColumnProfiler$.profile(ColumnProfiler.scala:141)
    at com.amazon.deequ.profiles.ColumnProfilerRunner.run(ColumnProfilerRunner.scala:72)
    at com.amazon.deequ.profiles.ColumnProfilerRunBuilder.run(ColumnProfilerRunBuilder.scala:185)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.lang.Thread.run(Thread.java:748)

Java: 1.8, Spark: 3.2.0, Pydeequ: 1.0.1

IanHopkinson commented 2 years ago

I had a problem that looked like this. I was using the LinkedIn DataHub data-lake ingestor, which is based on pydeequ/deequ, but I had previously experimented with pydeequ separately. My fix was just to change the SPARK_VERSION environment variable from "2.4.7" to "3.0.3": I'd provided an updated Spark (3.0.3) to work with DataHub but still had the version for my old install in the environment variable.
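A stale SPARK_VERSION like the one described here can be caught with a small check before creating the session. This is a sketch under assumptions: `major_minor` and `env_matches_runtime` are hypothetical helpers, and comparing only the major.minor part is an assumption about what matters when choosing a compatible Deequ build.

```python
import re


def major_minor(version: str) -> str:
    """Extract the 'major.minor' prefix from a version string such as '3.0.3'."""
    match = re.match(r"^(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    # Keep the digits as text so a malformed value is not silently normalized.
    return f"{match.group(1)}.{match.group(2)}"


def env_matches_runtime(env_version: str, runtime_version: str) -> bool:
    """Return True when both versions share the same major.minor prefix."""
    return major_minor(env_version) == major_minor(runtime_version)
```

In a real session the two arguments would be `os.environ["SPARK_VERSION"]` and `pyspark.__version__`; for the situation in this comment, `env_matches_runtime("2.4.7", "3.0.3")` is False, which flags the leftover setting.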

matef commented 2 years ago

I still have the issue. Is this library compatible with Spark 3.2.1 running on Java 11.0.14 and Scala 2.12.15?

ghirardinicola commented 2 years ago

No, it's not, but that should not be the reason for this error. Using the latest version (0.1.8) should solve it.

sshruti23 commented 2 years ago

I'm also facing the same issue. My system's configuration: pip install pyspark==3.0.1, Java 8, Python 3.7.9, Pydeequ 1.0.1.

I'm getting this error while using this step from the documentation guide:

result = ColumnProfilerRunner(spark) \
    .onData(df) \
    .run()

Can the team help by updating the documentation with the supported versions of Java, Scala, Python, and PySpark to be used with the latest pydeequ 1.0.1?

Error:

Traceback (most recent call last):
  File "PyDeequ_Checks.py", line 58, in <module>
    .addConstraintRule(DEFAULT()) \
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pydeequ/suggestions.py", line 81, in run
    result = self._ConstraintSuggestionRunBuilder.run()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/py4j/java_gateway.py", line 1322, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pyspark/sql/utils.py", line 190, in deco
    return f(*a, **kw)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o50.run.
: java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.expressions.aggregate.AggregateFunction.toAggregateExpression(Z)Lorg/apache/spark/sql/catalyst/expressions/aggregate/AggregateExpression;
    at org.apache.spark.sql.DeequFunctions$.withAggregateFunction(DeequFunctions.scala:31)
    at org.apache.spark.sql.DeequFunctions$.stateful_approx_count_distinct(DeequFunctions.scala:60)
    at com.amazon.deequ.analyzers.ApproxCountDistinct.aggregationFunctions(ApproxCountDistinct.scala:52)
    at com.amazon.deequ.analyzers.runners.AnalysisRunner$.$anonfun$runScanningAnalyzers$3(AnalysisRunner.scala:319)
    at scala.collection.immutable.List.flatMap(List.scala:366)
    at com.amazon.deequ.analyzers.runners.AnalysisRunner$.liftedTree1$1(AnalysisRunner.scala:319)
    at com.amazon.deequ.analyzers.runners.AnalysisRunner$.runScanningAnalyzers(AnalysisRunner.scala:318)
    at com.amazon.deequ.analyzers.runners.AnalysisRunner$.doAnalysisRun(AnalysisRunner.scala:167)
    at com.amazon.deequ.analyzers.runners.AnalysisRunBuilder.run(AnalysisRunBuilder.scala:110)
    at com.amazon.deequ.profiles.ColumnProfiler$.profile(ColumnProfiler.scala:141)
    at com.amazon.deequ.profiles.ColumnProfilerRunner.run(ColumnProfilerRunner.scala:72)
    at com.amazon.deequ.profiles.ColumnProfilerRunBuilder.run(ColumnProfilerRunBuilder.scala:185)
    at com.amazon.deequ.suggestions.ConstraintSuggestionRunner.profileAndSuggest(ConstraintSuggestionRunner.scala:203)
    at com.amazon.deequ.suggestions.ConstraintSuggestionRunner.run(ConstraintSuggestionRunner.scala:102)
    at com.amazon.deequ.suggestions.ConstraintSuggestionRunBuilder.run(ConstraintSuggestionRunBuilder.scala:226)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.lang.Thread.run(Thread.java:748)

dvrbanic-syntio commented 1 year ago

Has anyone been able to find a solution for this issue? I hit it when running:

result = ColumnProfilerRunner(spark).onData(df).run()