Closed: naveencha closed this issue 3 years ago
We are working on it at the moment! Keep your eyes out :)
Hi @naveencha ! This should do the trick!
Your EMR cluster must be running Spark v2.4.6 in order to work with PyDeequ. Once you have a running cluster with that Spark version and a SageMaker notebook with the necessary permissions, you can configure a SparkSession object from the template below to connect to your cluster. If you need a refresher on how to connect a SageMaker notebook to EMR, check out this AWS blog post on using Sparkmagic.
Once you’re in the SageMaker notebook, run the following `%%configure` cell before you start your SparkSession to configure your EMR cluster.
```
%%configure -f
{
  "conf": {
    "spark.pyspark.python": "python3",
    "spark.pyspark.virtualenv.enabled": "true",
    "spark.pyspark.virtualenv.type": "native",
    "spark.pyspark.virtualenv.bin.path": "/usr/bin/virtualenv",
    "spark.jars.packages": "com.amazon.deequ:deequ:1.0.3",
    "spark.jars.excludes": "net.sourceforge.f2j:arpack_combined_all"
  }
}
```
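If the cluster rejects the configuration, a malformed JSON body is a common cause. A quick way to sanity-check the `conf` block before pasting it is to round-trip it through Python's `json` module (the `%%configure -f` magic line itself is Sparkmagic syntax, not JSON, so it is excluded here):

```python
import json

# The JSON body of the %%configure cell above, without the magic line.
config_text = """
{
  "conf": {
    "spark.pyspark.python": "python3",
    "spark.pyspark.virtualenv.enabled": "true",
    "spark.pyspark.virtualenv.type": "native",
    "spark.pyspark.virtualenv.bin.path": "/usr/bin/virtualenv",
    "spark.jars.packages": "com.amazon.deequ:deequ:1.0.3",
    "spark.jars.excludes": "net.sourceforge.f2j:arpack_combined_all"
  }
}
"""

# json.loads raises a ValueError (json.JSONDecodeError) on malformed input.
conf = json.loads(config_text)["conf"]
print(conf["spark.jars.packages"])  # com.amazon.deequ:deequ:1.0.3
```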
Start your SparkSession object in a cell after the above configuration by running `spark`, and you should be good to go!
Ah, I forgot one last step, @naveencha: once your SparkSession has started, use the SparkContext (named `sc` by default) to install PyDeequ onto your cluster like so:

```python
sc.install_pypi_package('pydeequ')
```
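Once `install_pypi_package` succeeds, you can run a quick smoke test in a notebook cell attached to the cluster. This is a sketch, assuming the Sparkmagic-provided `spark` session and PyDeequ's analyzer API; it is not runnable outside the cluster:

```python
# Sketch of a PyDeequ smoke test (assumes `spark` is the Sparkmagic-provided
# SparkSession and sc.install_pypi_package('pydeequ') has already succeeded).
from pydeequ.analyzers import AnalysisRunner, AnalyzerContext, Completeness, Size

# Tiny throwaway DataFrame with one null in the "value" column.
df = spark.createDataFrame([(1, "a"), (2, None)], ["id", "value"])

result = (AnalysisRunner(spark)
          .onData(df)
          .addAnalyzer(Size())
          .addAnalyzer(Completeness("value"))
          .run())

# Metrics come back as a DataFrame; Completeness of "value" should be 0.5 here.
AnalyzerContext.successMetricsAsDataFrame(spark, result).show()
```

If this cell runs without a `ClassNotFoundException`, the deequ jar from the `%%configure` step was picked up correctly.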
Hi team, is this part of the installation now? I am still facing this error in 2022 with PyDeequ, and running it on EMR gives me the same error. I work on a managed EMR instance, so moving jars manually might not be feasible. Any workarounds? @gucciwang?
I didn't find any example of running it on an EMR cluster.
It would be nice if we had steps to run it on an EMR cluster from a Jupyter notebook.