awslabs / autonomous-driving-data-framework

ADDF is a collection of modules, deployed using the SeedFarmer orchestration tool. ADDF modules enable users to quickly bootstrap environments for the processing and analysis of autonomous driving data.
Apache License 2.0

[BUG] PySpark job fails on EMR-on-EKS with the message "Internal Failure in retrieving data" #81

Closed manojrajpurohit closed 1 year ago

manojrajpurohit commented 1 year ago

Describe the bug When we try to submit a PySpark job on the EMR-on-EKS cluster, the job fails with the error message "Internal Failure in retrieving data".

To Reproduce

  1. Deploy ADDF by including the module modules/core/emr-on-eks/ in the manifest.
  2. Refer to the manifest YAML here -> https://github.com/manojrajpurohit/autonomous-driving-data-framework/blob/trigger-spark-on-eks/manifests/ros-image-demo/emr-modules.yaml
  3. Use the command aws emr-containers list-virtual-clusters to get the virtual cluster ID.
  4. Submit a PySpark script from Cloud9; you may use the command below for reference.
aws emr-containers start-job-run \
--virtual-cluster-id <virtual cluster id> \
--name scene_detection_manual \
--execution-role-arn <EMR execution role> \
--release-label emr-6.8.0-latest \
--job-driver '{
  "sparkSubmitJobDriver": {
    "entryPoint": "s3://<bucket_name>/path/to/pyspark/script.py",
    "entryPointArguments": ["--batch-metadata-table-name addf-ros-image-demo-dags-aws-drive-tracking --batch-id 6_dec_2022 --bucket addf-ros-image-demo-curated-bucket-75fe6115 --region ap-south-1 --output-dynamo-table addf-ros-image-demo-dags-aws-scenes"],
    "sparkSubmitParameters": "--conf spark.executor.instances=3 --conf spark.executor.memory=4G --conf spark.driver.memory=2G --conf spark.executor.cores=2 --conf spark.sql.shuffle.partitions=60 --conf spark.dynamicAllocation.enabled=false --packages com.audienceproject:spark-dynamodb_2.12:1.1.1"
  }
}'

Expected behavior The PySpark jobs should run successfully.

Screenshots EMR on EKS

Additional context

This execution was actually performed for scene detection, so you may use this PySpark code; the entryPointArguments above are provided to match the execution of that script. However, the script requires the previous ADDF steps to have completed, with data in the respective buckets and DynamoDB tables, so you may run any simple PySpark script to reproduce this error (see the sketch below).
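
For reference, a minimal PySpark script that could stand in for the scene-detection code (a sketch; the bucket and key in the upload command are placeholders, not names from ADDF):

cat > simple_job.py <<'EOF'
# Minimal PySpark job: builds a tiny in-memory DataFrame, counts rows, and exits.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("addf-emr-on-eks-repro").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
print("row count:", df.count())
spark.stop()
EOF

# Upload the script to a bucket the EMR execution role can read, then use it as the entryPoint.
aws s3 cp simple_job.py s3://<bucket_name>/path/to/pyspark/simple_job.py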

dgraeber commented 1 year ago

Is there any way to determine which bucket it cannot access? Is it the bucket where the PySpark code resides or a bucket where the data resides? Can you look in the logs for any more details?
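
One quick way to surface the failure reason from EMR itself (a sketch; the job-run ID comes from list-job-runs):

aws emr-containers list-job-runs --virtual-cluster-id <virtual cluster id>
aws emr-containers describe-job-run --virtual-cluster-id <virtual cluster id> --id <job-run-id>
# The failureReason and stateDetails fields in the describe-job-run output often indicate whether
# the failure happened while fetching the entryPoint script or after the driver started.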

manojrajpurohit commented 1 year ago

One thing that I am not able to understand: my EMR virtual cluster's EKS namespace is emr-eks-spark (refer to point 1 below), whereas there is no namespace called "emr-eks-spark" in the EKS cluster (refer to point 2 below). My understanding is that the EKS namespace must be created first, and only then can the EMR virtual cluster be created (I may be wrong). But if a namespace is a hard prerequisite for an EMR virtual cluster, then the namespace emr-eks-spark must have existed at some point in time, yet I am unable to find any place in ADDF where the EKS namespace emr-eks-spark is created.

1. Describe the virtual cluster to get the namespace:
(.venv) (base) aws emr-containers describe-virtual-cluster --id <masked>

{
    "virtualCluster": {
        "id": "<masked>",
        "name": "addf-ros-image-demo-emr-emr-<masked>",
        "arn": "arn:aws:emr-containers:<masked>:<masked>:/virtualclusters/<masked>",
        "state": "RUNNING",
        "containerProvider": {
            "type": "EKS",
            "id": "addf-ros-image-demo-core-eks-cluster",
            "info": {
                "eksInfo": {
                    "namespace": "**emr-eks-spark**"
                }
            }
        },
        "createdAt": "<masked>",
        "tags": {
            "Deployment": "addf-ros-image-demo"
        }
    }
}

2. Describe all namespaces:
(.venv) (base) kubectl describe ns
Name:         default
Labels:       kubernetes.io/metadata.name=default
Annotations:  <none>
Status:       Active

No resource quota.

No LimitRange resource.

Name:         kube-node-lease
Labels:       kubernetes.io/metadata.name=kube-node-lease
Annotations:  <none>
Status:       Active

No resource quota.

No LimitRange resource.

Name:         kube-public
Labels:       kubernetes.io/metadata.name=kube-public
Annotations:  <none>
Status:       Active

No resource quota.

No LimitRange resource.

Name:         kube-system

manojrajpurohit commented 1 year ago

> Is there any way to determine which bucket it cannot access? Is it the bucket where the PySpark code resides or a bucket where the data resides? Can you look in the logs for any more details?

I tried submitting the PySpark job with an S3 log bucket and CloudWatch logs specified (refer to the command below). The job submitted successfully but failed with the same error as stated in the issue, and no logs appeared in S3 or in CloudWatch Logs. I had granted temporarily elevated access to the execution-role-arn before submitting the job, so it does not seem like an IAM access issue on the EMR job's end.

aws emr-containers start-job-run \
--virtual-cluster-id <masked> \
--name scene_detection_manual \
--execution-role-arn arn:aws:iam::<masked>:role/addf-ros-image-demo-emr-e-<masked> \
--release-label emr-6.8.0-latest \
--job-driver '{
  "sparkSubmitJobDriver": {
    "entryPoint": "s3://addf-ros-image-demo-artifacts-bucket-<masked>/dags/ros-image-demo/dags-aws/spark_scripts/detect_scenes.py",
    "entryPointArguments": ["--batch-metadata-table-name addf-ros-image-demo-dags-aws-drive-tracking --batch-id 6_dec_2022 --bucket addf-ros-image-demo-curated-bucket-<masked>--region ap-south-1 --output-dynamo-table addf-ros-image-demo-dags-aws-scenes"],
    "sparkSubmitParameters": "--conf spark.executor.instances=3 --conf spark.executor.memory=4G --conf spark.driver.memory=2G --conf spark.executor.cores=2 --conf spark.sql.shuffle.partitions=60 --conf spark.dynamicAllocation.enabled=false --packages com.audienceproject:spark-dynamodb_2.12:1.1.1"
  }
}' --configuration-overrides '{
  "monitoringConfiguration": {
    "cloudWatchMonitoringConfiguration": {
      "logGroupName": "/emr-on-eks/emr-on-eks-to-delete",
      "logStreamNamePrefix": "detect_scenes_todelete"
    },
    "s3MonitoringConfiguration": {
       "logUri": "s3://addf-ros-image-demo-logs-bucket-<masked>/emr-on-eks"
    }
  }
}' 
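
To double-check whether anything was written to the configured destinations at all, they can be listed directly (a sketch; the log group and bucket names are the ones from the command above):

aws s3 ls s3://addf-ros-image-demo-logs-bucket-<masked>/emr-on-eks/ --recursive
aws logs describe-log-streams --log-group-name /emr-on-eks/emr-on-eks-to-delete
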
manojrajpurohit commented 1 year ago

Since the namespace 'emr-eks-spark' did not exist, we followed this doc link to test further:

  1. Created a new namespace "spark": kubectl create namespace spark
  2. Added emr-containers to the aws-auth ConfigMap of the EKS cluster: eksctl create iamidentitymapping --cluster addf-ros-image-demo-core-eks-cluster --namespace spark --service-name "emr-containers"
  3. Enabled the IAM OIDC provider: eksctl utils associate-iam-oidc-provider --cluster addf-ros-image-demo-core-eks-cluster --approve
  4. Created a new virtual cluster:

aws emr-containers create-virtual-cluster \
  --name to-delete \
  --container-provider '{ "id": "addf-ros-image-demo-core-eks-cluster", "type": "EKS", "info": { "eksInfo": { "namespace": "spark" } } }'


  5. I was able to submit Spark jobs to this new virtual cluster; the jobs stayed in the scheduled state for 15 minutes and then failed.

Observation: Earlier, the Spark jobs were failing as soon as they were submitted (<2 seconds). In the new virtual EMR cluster created with the proper namespace, the jobs stay in the scheduled state for 15 minutes and then fail. In the scheduled state, Spark's resource manager negotiates resource allocation with the cluster manager, so I think the communication or resource allocation between Spark's resource manager and the EKS cluster is the root cause of this issue (see the checks below).
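
A sketch of checks that could surface why the driver pod never gets scheduled (assuming the new namespace is spark; the pod name is a placeholder):

kubectl get pods -n spark
kubectl describe pod <driver-pod-name> -n spark        # the Events section shows scheduling or image-pull failures
kubectl get events -n spark --sort-by=.lastTimestamp   # recent events across the namespace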

PS: I had discussed the same with @kevinsoucy

dgraeber commented 1 year ago

@srinivasreddych @manojrajpurohit Sooo...I think you may have hosed your cluster when you ran the eksctl utils associate-iam-oidc-provider --cluster addf-ros-image-demo-core-eks-cluster --approve command, as the ADDF cluster already has an OIDC provider. It sounds like the service account for EMR-on-EKS was not installed correctly by the module? @srinivasreddych can you look, and we can circle back later?
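
A couple of read-only checks (a sketch; the cluster name is taken from this thread) that could confirm whether the OIDC provider and the emr-containers mapping are already in place:

# Show the OIDC issuer already associated with the ADDF cluster
aws eks describe-cluster --name addf-ros-image-demo-core-eks-cluster \
  --query "cluster.identity.oidc.issuer" --output text

# List the identity mappings in aws-auth (the emr-containers mapping for the namespace should appear here)
eksctl get iamidentitymapping --cluster addf-ros-image-demo-core-eks-cluster

# Check whether the expected namespace and its EMR role bindings exist
kubectl get ns emr-eks-spark
kubectl get role,rolebinding -n emr-eks-spark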

dgraeber commented 1 year ago

Closing due to inactivity. Please reopen once eyes can focus on it.