Open chenliu0831 opened 1 year ago
The newly released pydeequ installs successfully in Glue but fails on import with the error "RuntimeError: SPARK_VERSION environment variable is required. Supported values are: dict_keys(['3.3', '3.2', '3.1', '3.0', '2.4'])", which is raised by the get_spark_version function in the config file.
In Glue, the code below printed None:
SPARK_VERSION = os.environ.get('SPARK_VERSION')
print(SPARK_VERSION)
But print(spark.version[:3]) did return '3.1', as expected.
Is this something that can be improved in the package, or should I explore other ways to use pydeequ in Glue?
@candicexu918 There is a workaround here: https://github.com/awslabs/python-deequ/issues/138#issuecomment-1611575546. Glue may have its own way of passing environment variables to the Python process.
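For readers hitting the same import error, here is a minimal sketch of one way to set the variable from inside the job itself. It is not a copy of the linked comment. It assumes, as the error message suggests, that pydeequ reads SPARK_VERSION at import time and accepts a major.minor string such as '3.1'. The helper names major_minor and ensure_spark_version_env are hypothetical and not part of pydeequ.

```python
import os
import re


def major_minor(version: str) -> str:
    """Return the 'major.minor' prefix of a Spark version string ('3.1.1' -> '3.1')."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognised Spark version: {version!r}")
    return f"{match.group(1)}.{match.group(2)}"


def ensure_spark_version_env(spark_version: str) -> str:
    """Hypothetical helper: set SPARK_VERSION for this process unless it is already set."""
    # setdefault keeps any value the platform or user has already provided.
    return os.environ.setdefault("SPARK_VERSION", major_minor(spark_version))


# In a Glue job, `spark` would be the already-created SparkSession:
#   ensure_spark_version_env(spark.version)
#   import pydeequ  # import only after the variable is set
```

The regex is used instead of spark.version[:3] so that a two-digit minor version (for example '3.10.0') is not truncated. The pydeequ import has to come after the call, since the check described above runs when the package is imported.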
Is your feature request related to a problem? Please describe.
Running PyDeequ on Glue or EMR is not always easy because of the packaging and the environment settings specific to each service. Users commonly need some trial and error to get it working.
Describe the solution you'd like
It would be good to provide a ready-to-use sample notebook of some sort, and perhaps some code snippets as well.
Alternatives
It may be worth considering an AWS service integration helper module (data sources and compute services).
Workaround available: