awslabs / python-deequ

Python API for Deequ
Apache License 2.0

Multiple pydeequ functions not working with "com.amazon.deequ:deequ:1.2.2-spark-3.0" #67

Open jinyang08 opened 3 years ago

jinyang08 commented 3 years ago

**Describe the bug**
Due to other constraints, I can only use EMR 6.2 or EMR 6.3, both of which ship Spark 3+ and Scala 2.12. I currently use pydeequ 1.0.1. With this version, many functions under analyzers, profiles, and suggestions do not work.

**To Reproduce**
Steps to reproduce the behavior:

  1. add config
    %%configure -f
    {
    "conf": {
        "spark.pyspark.python": "python3",
        "spark.pyspark.virtualenv.enabled": "true",
        "spark.pyspark.virtualenv.type":"native",
        "spark.pyspark.virtualenv.bin.path":"/usr/bin/virtualenv",
        "spark.jars.packages": "com.amazon.deequ:deequ:1.2.2-spark-3.0"
    }
    }
  2. install pydeequ
    sc.install_pypi_package("pydeequ")
  3. run analyzers
    from pydeequ.analyzers import *

    analysisResult = AnalysisRunner(spark) \
        .onData(df_bg) \
        .addAnalyzer(Correlation("total_votes", "star_rating")) \
        .addAnalyzer(ApproxCountDistinct("livongo_user_uuid")) \
        .addAnalyzer(ApproxQuantile("bg_value", 0.5)) \
        .addAnalyzer(ApproxQuantiles('bg_value', [0.25, 0.5, 0.75])) \
        .addAnalyzer(Correlation('bg_value', 'insulin_short')) \
        .addAnalyzer(StandardDeviation('bg_value')) \
        .addAnalyzer(UniqueValueRatio('bg_value')) \
        .addAnalyzer(PatternMatch('source', r"O+")) \
        .run()

None of the analyzers above work.

    from pydeequ.profiles import *

    result = ColumnProfilerRunner(spark) \
        .onData(df) \
        .run()

    for col, profile in result.profiles.items():
        print(profile)



**Expected behavior**
The calls above should run without errors. Instead, they fail with java.lang.NoClassDefFoundError.

**Screenshots**
![image](https://user-images.githubusercontent.com/71104563/128293676-a97751a1-e04a-4bbc-9714-cbd9ad20f38b.png)
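
A `NoClassDefFoundError` can indicate that a jar on the classpath was built against a different Spark or Scala version than the one running. As a quick sanity check, the Spark suffix of the Maven coordinate can be compared with the session's Spark version. This is a minimal sketch: `spark_suffix_matches` is a hypothetical helper (not part of pydeequ), it assumes the coordinate contains a `spark-<major.minor>` suffix like the one in the config above, and the version strings passed in are example inputs only.

```python
import re


def spark_suffix_matches(maven_coord: str, spark_version: str) -> bool:
    """Compare the 'spark-X.Y' suffix of a Maven coordinate with the
    major.minor part of a Spark version string.

    Hypothetical helper; assumes the coordinate embeds 'spark-X.Y'.
    """
    suffix = re.search(r"spark-(\d+\.\d+)", maven_coord)
    running = re.match(r"(\d+\.\d+)", spark_version)
    if suffix is None or running is None:
        raise ValueError("could not parse a major.minor Spark version")
    return suffix.group(1) == running.group(1)


coord = "com.amazon.deequ:deequ:1.2.2-spark-3.0"

# Example version strings; in the notebook, pass spark.version instead.
print(spark_suffix_matches(coord, "3.0.1"))  # True
print(spark_suffix_matches(coord, "3.1.1"))  # False
```

A `False` result does not prove that the jar is the cause, but it is a cheap first thing to rule out before digging into individual analyzers.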