awslabs / python-deequ

Python API for Deequ
Apache License 2.0

NoSuchMethodError when attempting to set up analyzer #62

Open cbuffett opened 3 years ago

cbuffett commented 3 years ago

Describe the bug
Following the Jupyter tutorial at https://github.com/awslabs/python-deequ/blob/master/tutorials/test_data_quality_at_scale.ipynb, I receive the following exception when attempting to set up the analyzer:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
/tmp/ipykernel_2741/3771477838.py in <module>
      3 analysisResult = AnalysisRunner(spark) \
      4                     .onData(df) \
----> 5                     .addAnalyzer(Size()) \
      6                     .addAnalyzer(Completeness("review_id")) \
      7                     .addAnalyzer(ApproxCountDistinct("review_id")) \

~/.local/share/virtualenvs/pydeequ/lib/python3.7/site-packages/pydeequ/analyzers.py in addAnalyzer(self, analyzer)
    132         """
    133         analyzer._set_jvm(self._jvm)
--> 134         _analyzer_jvm = analyzer._analyzer_jvm
    135         self._AnalysisRunBuilder.addAnalyzer(_analyzer_jvm)
    136         return self

~/.local/share/virtualenvs/pydeequ/lib/python3.7/site-packages/pydeequ/analyzers.py in _analyzer_jvm(self)
    704         :return self
    705         """
--> 706         return self._deequAnalyzers.Size(self._jvm.scala.Option.apply(self.where))
    707 
    708 

~/.local/share/virtualenvs/pydeequ/lib/python3.7/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1523         answer = self._gateway_client.send_command(command)
   1524         return_value = get_return_value(
-> 1525             answer, self._gateway_client, None, self._fqn)
   1526 
   1527         for temp_arg in temp_args:

~/.local/share/virtualenvs/pydeequ/lib/python3.7/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

~/.local/share/virtualenvs/pydeequ/lib/python3.7/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling None.com.amazon.deequ.analyzers.Size.
: java.lang.NoSuchMethodError: scala.Product.$init$(Lscala/Product;)V
    at com.amazon.deequ.analyzers.Size.<init>(Size.scala:37)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:238)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

To Reproduce
Steps to reproduce the behavior:

  1. Follow along the tutorial https://github.com/awslabs/python-deequ/blob/master/tutorials/test_data_quality_at_scale.ipynb
  2. Attempt to run the following Jupyter cell
    
    from pydeequ.analyzers import *

    analysisResult = AnalysisRunner(spark) \
        .onData(df) \
        .addAnalyzer(Size()) \
        .addAnalyzer(Completeness("review_id")) \
        .addAnalyzer(ApproxCountDistinct("review_id")) \
        .addAnalyzer(Mean("star_rating")) \
        .addAnalyzer(Compliance("top star_rating", "star_rating >= 4.0")) \
        .addAnalyzer(Correlation("total_votes", "star_rating")) \
        .addAnalyzer(Correlation("total_votes", "helpful_votes")) \
        .run()

    analysisResult_df = AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult)
    analysisResult_df.show()


3. See error in description

Expected behavior
Analysis data frame returned per the tutorial.


Desktop:
 - OS: Ubuntu 20.04 WSL2
 - Browser: Chrome
 - Version: 1.0.0
 - Python: 3.7.9
 - Pyspark: 2.4.0
 - Sagemaker-pyspark: 1.4.2
 - Py4J: 0.10.7
 - Java: OpenJDK 1.8.0_275
cbuffett commented 3 years ago

I was able to work around this by modifying configs.py to return com.amazon.deequ:deequ:1.1.0_spark-2.4-scala-2.11 for key deequ_maven_coord_spark2_4.
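The workaround amounts to a one-line override. Sketched here against a mirror of pydeequ's configs dict (quoted in full later in this thread), rather than by patching the installed file:

```python
# Mirror of the relevant entry in pydeequ's configs.py (not the
# installed module itself):
configs = {
    "deequ_maven_coord_spark2_4": "com.amazon.deequ:deequ:1.2.2-spark-2.4",
}

# The edit: point the Spark 2.4 key at the Scala 2.11 build of Deequ,
# matching the Scala version that Spark 2.4 runs on.
configs["deequ_maven_coord_spark2_4"] = (
    "com.amazon.deequ:deequ:1.1.0_spark-2.4-scala-2.11"
)
print(configs["deequ_maven_coord_spark2_4"])
```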

jinyang08 commented 3 years ago

Hi @cbuffett, to add the key do I have to use an EMR cluster with Spark 2.4? I am currently on Spark 3.0.

cbuffett commented 3 years ago

I'm not using EMR, so I can't speak to that. I'm also not sure if this affects newer versions of Spark/Scala. I'm running Jupyter Lab on an Ubuntu VM inside a virtual env, so I think my setup is pretty bare bones. The key I'm referring to is in configs.py, part of the pydeequ package.
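To locate the configs.py that your interpreter actually uses, without hunting through site-packages by hand, importlib can report the module's path. Shown below with the stdlib json module so the snippet runs anywhere; substitute "pydeequ.configs" in an environment where pydeequ is installed:

```python
import importlib.util

# Substitute "pydeequ.configs" for "json" to find pydeequ's config file.
spec = importlib.util.find_spec("json")
print(spec.origin)  # filesystem path of the module
```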

cat ~/.local/share/virtualenvs/pydeequ/lib/python3.7/site-packages/pydeequ/configs.py
# -*- coding: utf-8 -*-
import logging
import os

logger = logging.getLogger("logger")
configs = {
    "deequ_maven_coord": "com.amazon.deequ:deequ:1.2.2-spark-3.0",
    "deequ_maven_coord_spark3": "com.amazon.deequ:deequ:1.2.2-spark-3.0",
    #"deequ_maven_coord_spark2_4": "com.amazon.deequ:deequ:1.2.2-spark-2.4",
    "deequ_maven_coord_spark2_4": "com.amazon.deequ:deequ:1.1.0_spark-2.4-scala-2.11",
    "deequ_maven_coord_spark2_2": "com.amazon.deequ:deequ:1.2.2-spark-2.2",
    "f2j_maven_coord": "net.sourceforge.f2j:arpack_combined_all",
}

def _get_spark_version():
    # TODO - Change this later [Use Spark API's instead of env var]
    spark_version: str = os.getenv("SPARK_VERSION")
    return spark_version

def set_deequ_maven_config():
    spark_version = _get_spark_version()
    if spark_version is None:
        logger.error("Please set env variable SPARK_VERSION")
        logger.info(f"Using deequ: {configs['deequ_maven_coord']}")
        return configs["deequ_maven_coord"]  # TODO
    if spark_version[0:3] == "3.0":
        logger.info("Setting spark-3.0 as default version of deequ")
        configs["deequ_maven_coord"] = configs["deequ_maven_coord_spark3"]
    elif spark_version[0:3] == "2.4":
        logger.info("Setting spark-2.4 as default version of deequ")
        configs["deequ_maven_coord"] = configs["deequ_maven_coord_spark2_4"]
    elif spark_version[0:3] == "2.2":
        logger.info("Setting spark3 as default version of deequ")
        configs["deequ_maven_coord"] = configs["deequ_maven_coord_spark2_2"]
    else:
        logger.error(f"Deequ is still not supported in spark version: {spark_version}")
        logger.info(f"Using deequ: {configs['deequ_maven_coord']}")
        return configs["deequ_maven_coord"]  # TODO

    logger.info(f"Using deequ: {configs['deequ_maven_coord']}")
    return configs["deequ_maven_coord"]
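As the file shows, the Maven coordinate is keyed off the first three characters of the SPARK_VERSION environment variable, so the variable must be set before pydeequ reads it. A minimal sketch of that selection logic, re-implemented here so it runs without pydeequ installed:

```python
import os

# Set before pydeequ reads it (e.g. at the top of the notebook).
os.environ["SPARK_VERSION"] = "2.4.0"

# Re-implementation of the prefix matching in configs.py above:
spark_version = os.getenv("SPARK_VERSION")
key = {
    "3.0": "deequ_maven_coord_spark3",
    "2.4": "deequ_maven_coord_spark2_4",
    "2.2": "deequ_maven_coord_spark2_2",
}.get(spark_version[0:3], "deequ_maven_coord")
print(key)  # deequ_maven_coord_spark2_4
```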

I'm forced to use PySpark 2.4 because the tutorial depends on Sagemaker-pyspark, though for my real use case I don't foresee having this dependency. But as it stands, the combination of tutorial steps and package dependencies leaves the tutorial broken.

jinyang08 commented 3 years ago

@cbuffett Thank you! I have to use Spark 3.1, which ships with Scala 2.12, due to other dependency constraints. I guess I will have to wait until a Scala 2.12-compatible pydeequ is released.

rodrigompp commented 2 years ago

I was able to work around this by modifying configs.py to return com.amazon.deequ:deequ:1.1.0_spark-2.4-scala-2.11 for key deequ_maven_coord_spark2_4.

This issue is usually caused by a Scala version mismatch, so Scala 2.11 is probably what's installed in your env.

cbuffett commented 2 years ago

I don't have Scala installed in my environment, at least not explicitly. My understanding is that this config file effectively dynamically loads the package from Maven, but the stock version that's specified doesn't work. It looks like a hotfix was submitted on Jul 29 (https://github.com/awslabs/python-deequ/commit/91e693d4eea00110aae2d2e8a4a14609298ef2ab), though at the time I encountered the issue, this fix hadn't been released.

NairPrasanth commented 2 years ago

For me, the same issue came up when trying this in Databricks. When I set the environment variable SPARK_VERSION=3.0.1, the issue got resolved. It's mentioned in the pydeequ release notes, but no forum lists this as the solution for this specific issue, so I had to try every other option before accidentally noticing that note and trying it out.
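In a notebook, one place to set the variable is at the very top, before any pydeequ import (a sketch; setting SPARK_VERSION in the cluster's environment configuration should work equally well):

```python
import os

# Must run before pydeequ is imported, since configs.py reads the
# variable when resolving the Deequ Maven coordinate.
os.environ["SPARK_VERSION"] = "3.0.1"

# import pydeequ  # uncomment in an environment with pydeequ installed
print(os.environ["SPARK_VERSION"])
```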