feast-dev / feast

The Open Source Feature Store for Machine Learning
https://feast.dev
Apache License 2.0
5.55k stars · 993 forks

PySpark Launcher raises SparkJobFailure #1347

Closed beatgeek closed 3 years ago

beatgeek commented 3 years ago

Expected Behavior

Running this line from the feast-kubeflow notebook: `output_file_uri = job.get_output_file_uri()`. Expected the job to run and return the output file URI.

Current Behavior

---------------------------------------------------------------------------
SparkJobFailure                           Traceback (most recent call last)
<ipython-input-24-2150b25e4f35> in <module>
----> 1 output_file_uri = job.get_output_file_uri()

~/.local/lib/python3.6/site-packages/feast/pyspark/launchers/k8s/k8s.py in get_output_file_uri(self, timeout_sec, block)
    126             return self._output_file_uri
    127         else:
--> 128             raise SparkJobFailure("Spark job failed")
    129 
    130 

SparkJobFailure: Spark job failed
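The failure only surfaces as a generic `SparkJobFailure` when the output URI is requested. A generic polling sketch (not the Feast API; `get_status` here is an injected callable, purely illustrative) shows the idea of waiting for a terminal job state first, so the failure is observed with its status rather than inside the output-URI call:

```python
import time

# Generic polling sketch: wait for a job to reach a terminal state before
# asking for its output. `get_status` is any callable returning a status string.
def wait_for(get_status, terminal=frozenset({"COMPLETED", "FAILED"}),
             timeout_sec=600.0, poll=0.01):
    deadline = time.monotonic() + timeout_sec
    while time.monotonic() < deadline:
        status = get_status()
        if status in terminal:
            return status
        time.sleep(poll)
    raise TimeoutError("job did not reach a terminal state")
```

With a failed job this returns `"FAILED"` explicitly, which makes it easier to go look at the driver pod logs before touching the output file URI.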

Steps to reproduce

Following this notebook - Feast on Kubeflow Notebook

Client settings:

client = Client(
    core_url="feast-release-feast-core.feast.svc:6565",
    serving_url="feast-release-feast-online-serving.feast.svc:6566",
    redis_host="feast-release-redis-headless.feast.svc",
    historical_feature_output_location=f"{staging_bucket}historical",
    spark_launcher="k8s",
    spark_k8s_namespace="spark-operator",
    spark_staging_location=f"{staging_bucket}"
)
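Note that both locations are built by plain string concatenation with `staging_bucket`, so if the bucket URI lacks a trailing slash the path segments fuse together. A small hypothetical helper (not part of the Feast API) illustrates a safer construction:

```python
# Hypothetical helper: build the staging URIs the Client expects, ensuring the
# bucket URI ends with "/" so concatenated segments don't run together.
def staging_paths(staging_bucket: str) -> dict:
    base = staging_bucket if staging_bucket.endswith("/") else staging_bucket + "/"
    return {
        "spark_staging_location": base,
        "historical_feature_output_location": base + "historical",
    }
```

For example, `staging_paths("gs://bucket")` yields `gs://bucket/historical` rather than `gs://buckethistorical`.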

Specifications

Possible Solution

I've updated the RBAC permissions, so the job setup appears correct, but I suspect this is still a permissions issue.

woop commented 3 years ago

Hey @beatgeek,

Thanks for raising this. Do you have a spark service account in your namespace?

That notebook is used over here: https://github.com/kubeflow/manifests/pull/1733

Please note the ClusterRole creation that is necessary. I'm not sure if that is related to the problem you are experiencing.
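For reference, the shape of such a binding is roughly the following (all names here are illustrative; the actual ones come from the manifest in that PR):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: spark-operator-crb        # illustrative name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: edit                      # illustrative; use the ClusterRole from the manifest
subjects:
- kind: ServiceAccount
  name: spark-operator            # illustrative service account name
  namespace: spark-operator
```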

beatgeek commented 3 years ago

Yes, I do have a spark ServiceAccount in my namespace. I did see the settings mentioned in manifests PR #1733, but I'm not clear on this instruction: `kubectl edit clusterrolebinding spark-operatorsparkoperator-crb`. There is no such ClusterRoleBinding in my cluster.

I've created two sets of RBAC objects: one for spark and one for the operator.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: my-nms
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: my-nms
  name: spark-role
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["*"]
- apiGroups: [""]
  resources: ["services"]
  verbs: ["*"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-role-binding
  namespace: my-nms
subjects:
- kind: ServiceAccount
  name: spark
  namespace: my-nms
roleRef:
  kind: Role
  name: spark-role
  apiGroup: rbac.authorization.k8s.io

and

kind: Role
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: sparkop-release-spark-operator 
  namespace: spark-operator
rules:
- apiGroups: ["sparkoperator.k8s.io"]
  resources: ["sparkapplications"]
  verbs: ["create", "delete", "deletecollection", "get", "list", "update", "watch", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: RoleBinding
metadata:
  name: sparkop-release-spark-operator 
  namespace: spark-operator
roleRef:
  kind: Role
  name: sparkop-release-spark-operator
  apiGroup: rbac.authorization.k8s.io
subjects:
  - kind: ServiceAccount
    name: default-editor
    namespace: my-nms
  - kind: ServiceAccount
    name: spark
    namespace: my-nms

I was able to switch the connection to a Dataproc cluster and get it to run. It does, however, seem to hit a syntax issue in the Python job.

beatgeek commented 3 years ago

Update here: when I use `spark_launcher="dataproc"`, the job submits successfully, but I'm seeing a SparkJobFailure coming from the job code itself.

Job failed with message [SyntaxError: invalid syntax]

The example targets Python 3.7, while the out-of-the-box notebook server runs Python 3.6. I'll confirm, but this version mismatch is likely the issue.
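A guard along these lines (a hypothetical check, not part of Feast) can catch the mismatch in the notebook before a job is submitted:

```python
import sys

# Hypothetical guard: the example notebook's job code targets Python 3.7,
# so refuse to proceed from an older interpreter. `current` is injectable
# for testing; by default it reads the running interpreter's version.
def supports_job_syntax(required=(3, 7), current=None):
    current = tuple(current or sys.version_info[:2])
    return current >= required

if not supports_job_syntax():
    raise RuntimeError(
        f"Python {sys.version_info[0]}.{sys.version_info[1]} is older than the "
        "3.7 the example job code targets; expect SyntaxError on submission."
    )
```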

paravatha commented 3 years ago

@beatgeek I am seeing a similar issue. When I looked at the Spark driver logs, they complain about "projectId". Looking at the stack trace, it seems the Spark driver pod is trying to create a path under the storage bucket and throwing an error saying "projectId cannot be null".
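One thing worth checking for that symptom: the google-cloud client libraries fall back to the `GOOGLE_CLOUD_PROJECT` environment variable when no project is configured explicitly, so making sure it is set in the driver pod's environment is a plausible workaround. A sketch of the idea (the `env` dict stands in for the pod's environment; this is not Feast or operator configuration):

```python
# Sketch: ensure a GCP project id is resolvable via the environment.
# Existing values are preserved; the given project_id is only a fallback.
def ensure_project(env: dict, project_id: str) -> str:
    env.setdefault("GOOGLE_CLOUD_PROJECT", project_id)
    return env["GOOGLE_CLOUD_PROJECT"]
```

In practice this would mean adding the variable to the SparkApplication's driver env rather than to a Python dict, but the resolution order is the same.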

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.