aws / sagemaker-spark-container

The SageMaker Spark Container is a Docker image used to run data processing workloads with the Spark framework on Amazon SageMaker.

PySparkProcessor Error reading parquet from S3 (may be version compatibility issue) #63

Closed: carmelotony closed this issue 3 years ago

carmelotony commented 3 years ago

Getting the following error when submitting a processing job with PySparkProcessor, running in a VPC subnet with a security group.

```
Traceback (most recent call last):
  File "/opt/ml/processing/input/code/processing.py", line 100, in <module>
    main()
  File "/opt/ml/processing/input/code/processing.py", line 36, in main
    df = spark.read.parquet("s3://path-to-parquet/")
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 316, in parquet
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o47.parquet.
```
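For context, a minimal sketch of the failing script, reconstructed from the call path in the traceback: only the `spark.read.parquet` call is taken from the traceback itself; the SparkSession setup and the `df.show()` call are assumed boilerplate, and the S3 path is the placeholder from the traceback, not a real bucket.

```python
# Minimal sketch of processing.py, assuming a standard SparkSession setup.
from pyspark.sql import SparkSession

def main():
    spark = SparkSession.builder.appName("processing").getOrCreate()
    # This is the call that raised Py4JJavaError (processing.py, line 36).
    df = spark.read.parquet("s3://path-to-parquet/")
    df.show()
    spark.stop()

if __name__ == "__main__":
    main()
```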

```python
spark_processor = PySparkProcessor(
    base_job_name="one-ckd-poc",
    framework_version="2.4",
    py_version="py37",
    container_version="1.3",
    role=role,
    instance_count=10,
    volume_size_in_gb=100,
    instance_type="ml.m5.2xlarge",
    max_runtime_in_seconds=1200,
    network_config=network,
)
```
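The `network` object passed as `network_config` is not shown in the original post. Assuming the standard SageMaker Python SDK class, it would be built roughly as below; the subnet and security group IDs are hypothetical placeholders.

```python
# Hedged sketch: sagemaker.network.NetworkConfig from the SageMaker Python SDK.
# The subnet and security group IDs are placeholders, not from the original post.
from sagemaker.network import NetworkConfig

network = NetworkConfig(
    subnets=["subnet-0123456789abcdef0"],          # hypothetical VPC subnet
    security_group_ids=["sg-0123456789abcdef0"],   # hypothetical security group
)
```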

```python
spark_processor.run(
    submit_app="./processing.py",
    spark_event_logs_s3_uri="s3://{}/{}/spark_event_logs".format(bucket, s3_sparklog_prefix),
    logs=False,
)
```

carmelotony commented 3 years ago

Marking as resolved. This was a KMS permission problem, not a version compatibility issue as originally suspected.
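For readers hitting the same symptom: if the source bucket is encrypted with SSE-KMS, the processing job's execution role needs decrypt access to that key, or reads fail inside the Spark job. A hedged sketch of one way to grant it is below; the role name, policy name, and key ARN are hypothetical placeholders, not values from this issue.

```python
# Hedged sketch: attach an inline policy granting KMS permissions to the
# processing job's execution role. All names and ARNs below are placeholders.
import json
import boto3

iam = boto3.client("iam")
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        # kms:Decrypt covers reads; kms:GenerateDataKey is needed for writes.
        "Action": ["kms:Decrypt", "kms:DescribeKey", "kms:GenerateDataKey"],
        "Resource": "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",
    }],
}
iam.put_role_policy(
    RoleName="MySageMakerExecutionRole",           # hypothetical role name
    PolicyName="AllowKmsAccessForParquetBucket",   # hypothetical policy name
    PolicyDocument=json.dumps(policy),
)
```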