Closed manojrajpurohit closed 1 year ago
Is there any way to determine which bucket it cannot access? Is it the bucket where the PySpark code resides, or a bucket where the data resides? Can you look in the logs for any more details?
One thing I am not able to understand: my EMR virtual cluster's EKS namespace is emr-eks-spark (see point 1 below), yet there is no namespace called "emr-eks-spark" in the EKS cluster (see point 2 below). My understanding is that the EKS namespace must be created before the EMR virtual cluster can be created (I may be wrong). But if a namespace is a hard prerequisite for an EMR virtual cluster, then the namespace emr-eks-spark must have existed at some point in time, yet I cannot find any place in ADDF where the EKS namespace emr-eks-spark is created.
1. Describe the virtual cluster to get its namespace
(.venv) (base) aws emr-containers describe-virtual-cluster --id <masked>
{
    "virtualCluster": {
        "id": "<masked>",
        "name": "addf-ros-image-demo-emr-emr-<masked>",
        "arn": "arn:aws:emr-containers:<masked>:<masked>:/virtualclusters/<masked>",
        "state": "RUNNING",
        "containerProvider": {
            "type": "EKS",
            "id": "addf-ros-image-demo-core-eks-cluster",
            "info": {
                "eksInfo": {
                    "namespace": "emr-eks-spark"
                }
            }
        },
        "createdAt": "<masked>",
        "tags": {
            "Deployment": "addf-ros-image-demo"
        }
    }
}
2. Describe all namespaces
(.venv) (base) kubectl describe ns
Name: default
Labels: kubernetes.io/metadata.name=default
Annotations: <none>
Status: Active
No resource quota.
No LimitRange resource.
Name: kube-node-lease
Labels: kubernetes.io/metadata.name=kube-node-lease
Annotations: <none>
Status: Active
No resource quota.
No LimitRange resource.
Name: kube-public
Labels: kubernetes.io/metadata.name=kube-public
Annotations: <none>
Status: Active
No resource quota.
No LimitRange resource.
Name: kube-system
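The mismatch between the two outputs above can be checked programmatically. The following is a small sketch (the JSON is reduced to just the relevant fields, and the namespace set is taken from the kubectl output above) that compares the namespace registered on the virtual cluster against the namespaces actually present in the cluster:

```python
import json

def namespace_exists(describe_output: str, cluster_namespaces: set) -> bool:
    """Return True if the namespace registered on the EMR virtual cluster
    actually exists in the EKS cluster."""
    vc = json.loads(describe_output)
    ns = vc["virtualCluster"]["containerProvider"]["info"]["eksInfo"]["namespace"]
    return ns in cluster_namespaces

# Reduced form of the describe-virtual-cluster output shown above.
describe_output = json.dumps({
    "virtualCluster": {
        "containerProvider": {
            "info": {"eksInfo": {"namespace": "emr-eks-spark"}}
        }
    }
})

# Namespaces reported by `kubectl describe ns` above.
found = {"default", "kube-node-lease", "kube-public", "kube-system"}
print(namespace_exists(describe_output, found))
# -> False: the registered namespace is missing from the cluster
```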
I tried submitting the PySpark job with the S3 log bucket and CloudWatch logs specified (see the command below). The job submitted successfully but failed with the same error as stated in the issue, and no logs appeared in S3 or in CloudWatch Logs. I had granted temporary elevated access to the "execution-role-arn" before submitting the job, so it does not look like an IAM access issue on the EMR job's end.
aws emr-containers start-job-run \
--virtual-cluster-id <masked>\
--name scene_detection_manual \
--execution-role-arn arn:aws:iam::<masked>:role/addf-ros-image-demo-emr-e-<masked> \
--release-label emr-6.8.0-latest \
--job-driver '{
"sparkSubmitJobDriver": {
"entryPoint": "s3://addf-ros-image-demo-artifacts-bucket-<masked>/dags/ros-image-demo/dags-aws/spark_scripts/detect_scenes.py",
"entryPointArguments": ["--batch-metadata-table-name addf-ros-image-demo-dags-aws-drive-tracking --batch-id 6_dec_2022 --bucket addf-ros-image-demo-curated-bucket-<masked> --region ap-south-1 --output-dynamo-table addf-ros-image-demo-dags-aws-scenes"],
"sparkSubmitParameters": "--conf spark.executor.instances=3 --conf spark.executor.memory=4G --conf spark.driver.memory=2G --conf spark.executor.cores=2 --conf spark.sql.shuffle.partitions=60 --conf spark.dynamicAllocation.enabled=false --packages com.audienceproject:spark-dynamodb_2.12:1.1.1"
}
}' --configuration-overrides '{
"monitoringConfiguration": {
"cloudWatchMonitoringConfiguration": {
"logGroupName": "/emr-on-eks/emr-on-eks-to-delete",
"logStreamNamePrefix": "detect_scenes_todelete"
},
"s3MonitoringConfiguration": {
"logUri": "s3://addf-ros-image-demo-logs-bucket-<masked>/emr-on-eks"
}
}
}'
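One detail worth noting in the command above: entryPointArguments is a JSON array, and EMR on EKS passes each array element to the script as a single argv entry. Passing all flags as one space-joined string means detect_scenes.py receives one giant argument, which an argparse-based script cannot parse. A sketch of building the job-driver payload with each token as its own element (values copied from the command above, `<masked>` left as-is):

```python
import json

# Each flag and value must be its own list element; a single space-joined
# string arrives in the driver as one argv entry and breaks argparse.
entry_point_arguments = [
    "--batch-metadata-table-name", "addf-ros-image-demo-dags-aws-drive-tracking",
    "--batch-id", "6_dec_2022",
    "--bucket", "addf-ros-image-demo-curated-bucket-<masked>",
    "--region", "ap-south-1",
    "--output-dynamo-table", "addf-ros-image-demo-dags-aws-scenes",
]

job_driver = {
    "sparkSubmitJobDriver": {
        "entryPoint": "s3://addf-ros-image-demo-artifacts-bucket-<masked>/dags/ros-image-demo/dags-aws/spark_scripts/detect_scenes.py",
        "entryPointArguments": entry_point_arguments,
    }
}
print(json.dumps(job_driver, indent=2))
```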
Since the namespace 'emr-eks-spark' did not exist, we followed this doc link to test further:
aws emr-containers create-virtual-cluster \
    --name to-delete \
    --container-provider '{
        "id": "addf-ros-image-demo-core-eks-cluster",
        "type": "EKS",
        "info": { "eksInfo": { "namespace": "spark" } }
    }'
5. I was able to submit spark jobs to this new virtual cluster; the jobs stayed in the scheduled state for 15 minutes and then failed.
Observation: earlier, the spark jobs failed as soon as they were submitted (<2 seconds). In the new virtual EMR cluster created with a proper namespace, the jobs stay in the scheduled state for 15 minutes and then fail. While a job is in the scheduled state, Spark's resource manager negotiates resource allocation with the cluster manager, so I think the communication or resource allocation between Spark's resource manager and the EKS cluster is the root cause of this issue.
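A 15-minute wait followed by failure usually means the driver pod stayed Pending and never got scheduled. One way to see why is to look at the PodScheduled condition on pods in the job's namespace (e.g. `kubectl get pods -n spark -o json`). A sketch that extracts that information from such output (the sample pod and message below are illustrative, not from this cluster):

```python
import json

def pending_pods(pods_json: str):
    """Return (name, reason) pairs for pods stuck in Pending, using the
    PodScheduled condition message when present."""
    result = []
    for pod in json.loads(pods_json)["items"]:
        if pod["status"].get("phase") == "Pending":
            msg = ""
            for cond in pod["status"].get("conditions", []):
                if cond.get("type") == "PodScheduled" and cond.get("status") == "False":
                    msg = cond.get("message", "")
            result.append((pod["metadata"]["name"], msg))
    return result

# Illustrative sample of `kubectl get pods -o json` output.
sample = json.dumps({"items": [{
    "metadata": {"name": "spark-driver"},
    "status": {"phase": "Pending", "conditions": [
        {"type": "PodScheduled", "status": "False",
         "message": "0/3 nodes are available: 3 Insufficient memory."}]},
}]})
print(pending_pods(sample))
```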
PS: I had discussed the same with @kevinsoucy
@srinivasreddych @manojrajpurohit Sooo...I think you may have hosed your cluster when you ran the `eksctl utils associate-iam-oidc-provider --cluster addf-ros-image-demo-core-eks-cluster --approve` command, as the ADDF cluster already has an OIDC provider. It sounds like the service account for EMR-on-EKS was not installed correctly by the module? @srinivasreddych can you look and we can circle back later?
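Whether the cluster already had an IAM OIDC provider can be checked before running eksctl: compare the cluster's OIDC issuer (from `aws eks describe-cluster`) against the ARNs returned by `aws iam list-open-id-connect-providers`. A sketch of that comparison, with illustrative (not real) issuer and account values:

```python
def provider_registered(issuer_url: str, provider_arns: list) -> bool:
    """An IAM OIDC provider ARN ends with the issuer's host/path, e.g.
    ...:oidc-provider/oidc.eks.<region>.amazonaws.com/id/<ID>."""
    issuer = issuer_url.replace("https://", "")
    return any(arn.endswith(issuer) for arn in provider_arns)

# Illustrative values; substitute the real describe-cluster issuer and
# list-open-id-connect-providers output.
issuer = "https://oidc.eks.ap-south-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B7"
arns = ["arn:aws:iam::111122223333:oidc-provider/"
        "oidc.eks.ap-south-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B7"]
print(provider_registered(issuer, arns))  # True -> re-associating is redundant
```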
Closing due to inactivity. Please reopen once eyes can focus on it.
Describe the bug
When we try to submit a PySpark job on the EMR-on-EKS cluster, the job fails with the error message "Internal Failure in retrieving data".
To Reproduce
Run aws emr-containers list-virtual-clusters to get the virtual cluster id.
Expected behavior
The PySpark jobs should run successfully.
Screenshots
![EMR on EKS](https://user-images.githubusercontent.com/33446857/206183770-3bb04dbe-dd69-46dc-b4b8-cd3b0a8f9c37.JPG)
Additional context
This execution was performed for scene detection, so you may use this PySpark code; the entryPointArguments above are provided to match that script. However, the script requires earlier ADDF steps to be completed, with data in the respective buckets and DynamoDB tables, so you may run any simple PySpark script to reproduce this error.
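A minimal script of the kind suggested, for isolating EMR-on-EKS submission problems from application logic, might look like the sketch below. All names are placeholders, and the Spark-dependent part is behind an environment-variable guard so the file can be inspected without a Spark installation:

```python
import os

def make_rows(n):
    """Pure helper so the job produces something observable."""
    return [(i, i * i) for i in range(n)]

def run_job():
    # Import inside the function so the file can be read or tested
    # without pyspark installed; the EMR image provides it.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("emr-eks-smoke-test").getOrCreate()
    df = spark.createDataFrame(make_rows(10), ["i", "i_squared"])
    df.show()
    spark.stop()

# Guard so the Spark part only runs when submitted as a job; set
# RUN_SPARK_JOB=1 in the job environment (placeholder convention).
if os.environ.get("RUN_SPARK_JOB"):
    run_job()
```

Upload the file to S3 and point start-job-run's entryPoint at it; if even this fails, the problem is in the cluster setup rather than the application.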