awslabs / python-deequ

Python API for Deequ
Apache License 2.0

Example for running on AWS (Glue, EMR, SageMaker, etc) #140

Open chenliu0831 opened 1 year ago

chenliu0831 commented 1 year ago

Is your feature request related to a problem? Please describe. Running PyDeequ on Glue or EMR is not always easy because of packaging and environment settings that are unique to each service. Users commonly need a round of trial and error before it works.

Describe the solution you'd like

It would be good to provide a ready-to-use sample notebook of some sort, and perhaps some code snippets as well.

Alternatives

It may be worth considering an AWS service integration helper module (data sources and compute services).

Workaround available:

candicexu918 commented 11 months ago

The newly released pydeequ installs successfully in Glue, but importing it fails with the error "RuntimeError: SPARK_VERSION environment variable is required. Supported values are: dict_keys(['3.3', '3.2', '3.1', '3.0', '2.4'])", which comes from the get_spark_version function in the config file.

In Glue, when I ran the code below, it printed None:

import os
# Read the environment variable named in the error message
SPARK_VERSION = os.environ.get('SPARK_VERSION')
print(SPARK_VERSION)

But when I ran print(spark.version[:3]) it did return '3.1' as expected.

Is this something that can be improved in the package, or should I explore other ways to use pydeequ in Glue?

chenliu0831 commented 11 months ago

@candicexu918 There is a workaround here: https://github.com/awslabs/python-deequ/issues/138#issuecomment-1611575546. Glue may have its own way of passing environment variables to the Python process.
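
For readers hitting the same error, here is a minimal sketch of one way to supply the variable from inside the job itself. It is not taken from the linked comment, and it assumes that setting SPARK_VERSION in os.environ before pydeequ is imported is enough to satisfy the check described above. The helper name set_spark_version_env is hypothetical, and a literal version string stands in for spark.version so the snippet runs without a Spark session.

```python
import os


def set_spark_version_env(full_version: str) -> str:
    """Set SPARK_VERSION to the major.minor part of a full Spark version string.

    Hypothetical helper for illustration; it is not part of pydeequ.
    """
    parts = full_version.split(".")
    if len(parts) < 2:
        raise ValueError(f"Unexpected Spark version string: {full_version!r}")
    # Keep only "major.minor", e.g. "3.1.1" becomes "3.1"
    major_minor = ".".join(parts[:2])
    os.environ["SPARK_VERSION"] = major_minor
    return major_minor


# In a Glue job this would be called with the live session's version:
#   set_spark_version_env(spark.version)
#   import pydeequ  # assumption: import only after the variable is set
# Here a literal string stands in for spark.version.
print(set_spark_version_env("3.1.1"))
```

Splitting on dots rather than slicing the first three characters keeps the result correct if a minor version ever has two digits.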