awslabs / python-deequ

Python API for Deequ
Apache License 2.0
733 stars, 138 forks

Anomaly Check functions not defined when using PyDeequ on Glue #138

Open amalgaonkar opened 1 year ago

amalgaonkar commented 1 year ago

Describe the bug: While following the tutorial at https://github.com/awslabs/python-deequ/blob/master/tutorials/anomaly_detection.ipynb, the job fails with this error:

File "<stdin>", line 4, in <module>
NameError: name 'RateOfChangeStrategy' is not defined

The same error occurs for SimpleThresholdStrategy, RelativeRateOfChangeStrategy, etc.
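
A NameError like this means the name was never bound in the script's namespace, which usually points at a missing import rather than a missing class. Below is a minimal sketch for checking which names a module actually exposes. `missing_names` is a hypothetical helper, not part of PyDeequ, and the `pydeequ.anomaly_detection` module path in the trailing comment is an assumption based on the tutorial's package layout.

```python
import importlib


def missing_names(module_name, names):
    """Return the entries of `names` that `module_name` does not define."""
    module = importlib.import_module(module_name)
    return [name for name in names if not hasattr(module, name)]


# Illustration with a standard-library module so the sketch runs anywhere.
print(missing_names("os.path", ["join", "no_such_function"]))

# In the Glue job, the same check could be pointed at PyDeequ
# (assumed module path; SPARK_VERSION must already be set):
# missing_names("pydeequ.anomaly_detection",
#               ["RateOfChangeStrategy", "SimpleThresholdStrategy"])
```

An empty list would suggest the classes exist and only the import statement is missing from the job script.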

To Reproduce: Steps to reproduce the behavior:

  1. Create a Glue job with the same code as the tutorial: https://github.com/awslabs/python-deequ/blob/master/tutorials/anomaly_detection.ipynb

  2. Before importing pydeequ, set the SPARK_VERSION environment variable from Python:

    import os
    os.environ["SPARK_VERSION"] = "3.3"

  3. Use Glue version 4.0 (Spark 3.3).

  4. To include the PyDeequ module, create a setup.py as shown below:

from setuptools import setup

setup(
  name="pydeequ_module",
  version="0.1",
  packages=['pydeequ_module'],
  install_requires=['pydeequ==1.1.0rc0','sagemaker_pyspark']
)

  5. Copy the resulting .whl file to an S3 location and reference it as an additional Python library in the Glue job. Reference: https://repost.aws/knowledge-center/glue-import-error-no-module-named

  6. Use deequ-2.0.3-spark-3.3.jar as the dependent jar file.

  7. Run the job. It fails with: NameError: name 'RelativeRateOfChangeStrategy' is not defined
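
The steps above pair SPARK_VERSION "3.3" with a jar whose name also encodes Spark 3.3. A minimal sketch of a sanity check that the two agree is shown below. Both helper functions are hypothetical, and the file-name pattern is an assumption inferred only from the jar named in these steps, so other Deequ artifacts may not follow it.

```python
import os
import re

# Assumed naming pattern, inferred from the jar above:
# deequ-<deequ version>-spark-<spark version>.jar
_JAR_PATTERN = re.compile(r"deequ-(\d+(?:\.\d+)*)-spark-(\d+(?:\.\d+)*)\.jar$")


def spark_version_from_jar(jar_name):
    """Extract the Spark version encoded in a Deequ jar file name."""
    match = _JAR_PATTERN.search(jar_name)
    if match is None:
        raise ValueError(f"Unrecognized Deequ jar name: {jar_name!r}")
    return match.group(2)


def jar_matches_env(jar_name):
    """True when SPARK_VERSION is set and equals the jar's Spark version."""
    return os.environ.get("SPARK_VERSION") == spark_version_from_jar(jar_name)


os.environ["SPARK_VERSION"] = "3.3"
print(jar_matches_env("deequ-2.0.3-spark-3.3.jar"))
```

A check like this only catches a version mismatch between the variable and the jar; it does not detect a missing import.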

Expected behavior: The job should succeed and identify anomalies.

amalgaonkar commented 1 year ago

As a workaround, the following helped:

Glue's Spark version was not being picked up. I had to set it explicitly as a Python environment variable and then import the additional pydeequ packages.

import os
os.environ["SPARK_VERSION"] = "3.3"
from pydeequ.analyzers import *
from pydeequ.anomaly_detection import *
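
If the same script may also run where SPARK_VERSION is already exported, the assignment can be made conditional so an existing value is not overwritten. A minimal sketch follows. `ensure_spark_version` is a hypothetical helper, and the default of "3.3" is taken from the Glue 4.0 setup described in this issue. It has to run before any pydeequ import.

```python
import os


def ensure_spark_version(default="3.3"):
    """Set SPARK_VERSION only when the runtime has not provided one."""
    return os.environ.setdefault("SPARK_VERSION", default)


version = ensure_spark_version()
print(f"SPARK_VERSION={version}")

# The PyDeequ imports from the workaround would follow here,
# after the variable is set:
# from pydeequ.analyzers import *
# from pydeequ.anomaly_detection import *
```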

BertGuillemyn commented 1 year ago

I had the same problem on Databricks, thanks for the workaround!