awslabs / python-deequ

Python API for Deequ
Apache License 2.0

Any Example of running it on EMR cluster !!! #2

Closed naveencha closed 3 years ago

naveencha commented 3 years ago

I couldn't find any example of running it on an EMR cluster.

It would be nice to have steps for running it on an EMR cluster from a Jupyter notebook.

gucciwang commented 3 years ago

We are working on it at the moment! Keep your eyes out :)

gucciwang commented 3 years ago

Hi @naveencha ! This should do the trick!

Your EMR cluster must be running Spark v2.4.6 in order to work with PyDeequ. Once you have a running cluster that meets this requirement and a SageMaker notebook with the necessary permissions, you can configure a SparkSession object from the template below to connect to your cluster. If you need a refresher on how to connect a SageMaker Notebook to EMR, check out this AWS blogpost on using Sparkmagic.
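
If you want to confirm the Spark version before going further, a small helper along these lines may help. This is only a sketch: the function names are made up for illustration, the exact-match rule is my reading of the 2.4.6 requirement above, and the vendor-suffixed version string in the comment is a hypothetical example.

```python
import re

# Version named in this thread as required for PyDeequ.
REQUIRED_SPARK = (2, 4, 6)


def parse_spark_version(version):
    """Extract (major, minor, patch) from a Spark version string.

    Anything after the patch number (for example a vendor suffix such as
    "-amzn-0", used here as a hypothetical example) is ignored.
    """
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognised Spark version: {version!r}")
    return tuple(int(group) for group in match.groups())


def is_supported(version, required=REQUIRED_SPARK):
    """Return True when the parsed version equals the required one."""
    return parse_spark_version(version) == required
```

Once the session is up, calling `is_supported(spark.version)` in a notebook cell should tell you whether the cluster matches.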

Once you’re in the SageMaker Notebook, run the following JSON in a cell before you start your SparkSession to configure your EMR cluster.

%%configure -f
{
  "conf": {
    "spark.pyspark.python": "python3",
    "spark.pyspark.virtualenv.enabled": "true",
    "spark.pyspark.virtualenv.type": "native",
    "spark.pyspark.virtualenv.bin.path": "/usr/bin/virtualenv",
    "spark.jars.packages": "com.amazon.deequ:deequ:1.0.3",
    "spark.jars.excludes": "net.sourceforge.f2j:arpack_combined_all"
  }
}

Start your SparkSession object in a cell after the above configuration by running spark, and you should be good to go!
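
If you would rather generate the configuration cell than paste it by hand (say, to try a different Deequ version), something like the sketch below should work. The helper names and the `deequ_version` parameter are my own and not part of Sparkmagic or PyDeequ; the settings themselves are copied from the cell above.

```python
import json


def build_sparkmagic_conf(deequ_version="1.0.3"):
    """Return the Spark settings from the configuration cell as a dict.

    `deequ_version` is a hypothetical convenience parameter; 1.0.3 is the
    version used in this thread.
    """
    return {
        "conf": {
            "spark.pyspark.python": "python3",
            "spark.pyspark.virtualenv.enabled": "true",
            "spark.pyspark.virtualenv.type": "native",
            "spark.pyspark.virtualenv.bin.path": "/usr/bin/virtualenv",
            "spark.jars.packages": f"com.amazon.deequ:deequ:{deequ_version}",
            "spark.jars.excludes": "net.sourceforge.f2j:arpack_combined_all",
        }
    }


def render_configure_cell(conf):
    """Render the dict as the text of a %%configure cell."""
    return "%%configure -f\n" + json.dumps(conf, indent=2)


cell_text = render_configure_cell(build_sparkmagic_conf())
print(cell_text)
```

The printed text can be pasted into a notebook cell as-is. Keeping the settings in a dict also makes it easy to compare two configurations when a session fails to start.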

gucciwang commented 3 years ago

Ah, I forgot one last step @naveencha: once you have your SparkSession started, use the SparkContext (named sc by default) to install PyDeequ onto your cluster like so:

sc.install_pypi_package('pydeequ')
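
After the install finishes, a quick smoke test along these lines may confirm that both the Python package and the Deequ jar are visible to the session. Treat it as a sketch: the function name, the toy DataFrame, and the column names are made up for illustration, and I have not verified it on every EMR release.

```python
def run_basic_check(spark):
    """Run a tiny PyDeequ verification and return the results as a DataFrame.

    `spark` is the SparkSession the notebook already provides.
    """
    # Imported inside the function so the definition itself does not fail
    # before pydeequ has been installed on the cluster.
    from pydeequ.checks import Check, CheckLevel
    from pydeequ.verification import VerificationResult, VerificationSuite

    # Toy data (illustrative only): "label" has one missing value.
    df = spark.createDataFrame(
        [(1, "a"), (2, "b"), (3, None)],
        ["id", "label"],
    )

    check = Check(spark, CheckLevel.Warning, "smoke test")
    result = (
        VerificationSuite(spark)
        .onData(df)
        .addCheck(check.isComplete("id").isUnique("id").isComplete("label"))
        .run()
    )
    return VerificationResult.checkResultsAsDataFrame(spark, result)
```

In the notebook, `run_basic_check(spark).show(truncate=False)` should print one row per constraint. If the import fails, the PyPI install did not take effect; if the run fails with a Java class error, the Deequ jar from the configuration cell is probably missing.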

DebanjanBanerjeeQB commented 1 year ago

Hi Team, is this part of the installation now? I am still facing this error in 2022 with pydeequ, and running it on EMR gives me the same error. I work on a managed EMR instance, so moving jars manually might not be feasible. Any workarounds? @gucciwang?