awslabs / python-deequ

Python API for Deequ
Apache License 2.0

Not able to run basic_example.ipynb due to Exception: Java gateway process exited before sending its port number #69

Open oonisim opened 2 years ago

oonisim commented 2 years ago

Describe the bug

Cannot execute tutorials/basic_example.ipynb from within SageMaker Studio deployed in a VPC that has Internet access. VPC endpoints for S3, the SageMaker API, the SageMaker Runtime, and CloudWatch Logs have been created in the subnet where the Studio ENI resides.


import sagemaker_pyspark
import pydeequ
from pyspark.sql import SparkSession, Row

classpath = ":".join(sagemaker_pyspark.classpath_jars())  # AWS-specific jars

# Build a session with the Deequ jar resolved from Maven.
spark = (SparkSession
    .builder
    .config("spark.driver.extraClassPath", classpath)
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())

Result:

Please set env variable SPARK_VERSION
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-2-76077136385f> in <module>
     10     .config("spark.driver.extraClassPath", classpath)
     11     .config("spark.jars.packages", pydeequ.deequ_maven_coord)
---> 12     .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
     13     .getOrCreate())

/opt/conda/lib/python3.7/site-packages/pyspark/sql/session.py in getOrCreate(self)
    226                             sparkConf.set(key, value)
    227                         # This SparkContext may be an existing one.
--> 228                         sc = SparkContext.getOrCreate(sparkConf)
    229                     # Do not update `SparkConf` for existing `SparkContext`, as it's shared
    230                     # by all sessions.

/opt/conda/lib/python3.7/site-packages/pyspark/context.py in getOrCreate(cls, conf)
    382         with SparkContext._lock:
    383             if SparkContext._active_spark_context is None:
--> 384                 SparkContext(conf=conf or SparkConf())
    385             return SparkContext._active_spark_context
    386 

/opt/conda/lib/python3.7/site-packages/pyspark/context.py in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
    142                 " is not allowed as it is a security risk.")
    143 
--> 144         SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
    145         try:
    146             self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,

/opt/conda/lib/python3.7/site-packages/pyspark/context.py in _ensure_initialized(cls, instance, gateway, conf)
    329         with SparkContext._lock:
    330             if not SparkContext._gateway:
--> 331                 SparkContext._gateway = gateway or launch_gateway(conf)
    332                 SparkContext._jvm = SparkContext._gateway.jvm
    333 

/opt/conda/lib/python3.7/site-packages/pyspark/java_gateway.py in launch_gateway(conf, popen_kwargs)
    106 
    107             if not os.path.isfile(conn_info_file):
--> 108                 raise Exception("Java gateway process exited before sending its port number")
    109 
    110             with open(conn_info_file, "rb") as info:

Exception: Java gateway process exited before sending its port number
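
The traceback shows that the JVM launched by PySpark exited before it wrote its connection file, and the first line of the output asks for SPARK_VERSION. One way to narrow this down is a small pre-flight check of the kernel environment. The sketch below is only an illustration: the helper name `find_spark_prereq_problems` is hypothetical, and it assumes the two most likely culprits are a missing Java runtime and an unset SPARK_VERSION, which may not be the actual cause here.

```python
import os
import shutil


def find_spark_prereq_problems(env=None, which=shutil.which):
    """Return a list of problems found; an empty list means none were detected.

    Hypothetical helper for illustration; `env` and `which` are injectable
    so the check can be exercised without touching the real environment.
    """
    env = os.environ if env is None else env
    problems = []

    # PySpark launches a JVM via spark-submit, so a `java` executable must be
    # reachable either on PATH or under JAVA_HOME/bin.
    java_home = env.get("JAVA_HOME")
    java_on_path = which("java") is not None
    java_in_home = bool(java_home) and os.path.isfile(
        os.path.join(java_home, "bin", "java")
    )
    if not (java_on_path or java_in_home):
        problems.append("no java executable found on PATH or under JAVA_HOME")

    # The notebook output explicitly asks for this variable.
    if not env.get("SPARK_VERSION"):
        problems.append("SPARK_VERSION is not set")

    return problems


if __name__ == "__main__":
    for problem in find_spark_prereq_problems():
        print("problem:", problem)
```

Running this in the same kernel before creating the SparkSession would show whether either precondition is missing. An empty result does not rule out other causes, such as the JVM failing for a different reason at startup.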

To Reproduce

  1. Open tutorials/basic_example.ipynb in SageMaker Studio.
  2. Run all.

Expected behavior

The notebook runs without errors.

Screenshots

NA

Desktop

SageMaker Studio in the us-east-2 region, Python 3 (Data Science) kernel.

Question

Please specify the system requirements for running the tutorial notebooks.

  1. Do they work in SageMaker Studio deployed in a VPC?
  2. Does an EMR or other Spark cluster need to be provisioned? If so, what configuration is required?
  3. Are any other configuration or environment variable settings required?