JahstreetOrg / spark-on-kubernetes-helm

Spark on Kubernetes infrastructure Helm charts repo

Getting 503 error when accessing Livy #45

Closed: rdhara closed this issue 3 years ago

rdhara commented 3 years ago

Hello,

I'm working through the full instructions and am running into an issue accessing Livy. My k8s control plane is managed by AWS EKS, and I've configured a Route 53 CNAME record for k8s.mydomain.io to point to the Classic Load Balancer that AWS spins up when I install the cluster-base config. Because I'm not deploying locally, I've replaced all instances of my-cluster.example.com with k8s.mydomain.io. I do get security errors when accessing the site in the browser, but I'm assuming that's just because k8s is using a self-signed certificate.
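The DNS wiring can be sanity-checked with something like the following (the service name is the one created by the cluster-base chart, as used in step 3 below):

kubectl get service cluster-base-ingress-nginx-controller --namespace kube-system \
    -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'   # ELB hostname created for the ingress controller
dig +short k8s.mydomain.io CNAME                               # should return that ELB hostname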

I then installed spark-cluster and confirmed all the pods (including livy) were successfully running:

(venv) ➜  spark-cluster git:(master) ✗ kubectl get pods --watch --namespace spark-cluster
NAME                              READY   STATUS    RESTARTS   AGE
continuous-image-puller-24wqp     1/1     Running   0          2m10s
hub-5ffb5cb6cd-6g7qd              1/1     Running   0          2m10s
proxy-84549d5bd5-8g4mr            1/1     Running   0          2m10s
spark-cluster-livy-0              1/1     Running   0          2m10s
user-scheduler-5dd7cbc579-6m6ml   1/1     Running   0          2m10s
user-scheduler-5dd7cbc579-ff727   1/1     Running   0          2m10s

I am able to go to k8s.mydomain.io/jupyterhub, sign in, and launch the Python example notebook. But the Spark application never seems to start, and after 5 minutes I get a timeout error (screenshots attached).

When I try to go to k8s.mydomain.io/livy, I get an Nginx error page that says 503 Service Temporarily Unavailable. The metric dashboards in spark-monitoring don't work either: I can access the pages, but no metrics are available. I suspect all of these problems share a single root cause and that there's something I'm missing here.

Additional information that might be useful:

Any help would be greatly appreciated - thanks in advance!

KnutFr commented 3 years ago

Hi,

I encountered the same error; in my case I just added this to the custom-values-local.yaml file:

livy:
  fullnameOverride: livy-server
...

It looks like the backend of Livy's ingress doesn't match the service of the pod that's running, so forcing the name to match makes things work again.
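You can see the mismatch by comparing the ingress backend with the services that actually exist (a quick check; <livy-ingress-name> is a placeholder for whatever kubectl get ingress reports for Livy):

kubectl get ingress --namespace spark-cluster
kubectl describe ingress <livy-ingress-name> --namespace spark-cluster   # a backend with no endpoints explains the 503
kubectl get service --namespace spark-cluster                            # the backend name must match one of these services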

rdhara commented 3 years ago

@KnutFr Hmm interesting will give this a try, thanks!

jahstreet commented 3 years ago

@rdhara, currently (in the 3.0.X version) the Livy K8s service name is already overridden in the spark-cluster Helm chart values.yaml:

livy:
    service:
        name: livy-server

So you shouldn't be required to override it to make the livy-server discoverable.

@rdhara, could you please also share the Spark Driver pod logs (if available)? Please also check whether Driver and Executor pods have been created in the K8s cluster, so we can tell whether or not this is a Livy-to-Spark connectivity issue. Ideally it would be great to have a step-by-step guide on how to reproduce your issue; then I could give it a try on AWS.
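A minimal sketch of those checks, assuming the spark-cluster namespace from your output and the standard spark-role labels that Spark on K8s puts on its pods (<driver-pod-name> is a placeholder):

kubectl get pods --namespace spark-cluster -l spark-role=driver     # was a Driver pod created?
kubectl get pods --namespace spark-cluster -l spark-role=executor   # were any Executor pods created?
kubectl logs <driver-pod-name> --namespace spark-cluster            # Driver logs, once the pod has started
kubectl logs spark-cluster-livy-0 --namespace spark-cluster         # Livy server logs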

rdhara commented 3 years ago

Thank you for your response @jahstreet - by the way, this repo is a fantastic resource! The driver never exits the Pending state, so there are no logs. I also don't see any executor pods.

Here are my steps:

  1. Create an EKS cluster using k8s version 1.18. The cluster's endpoint access is set to "Public and private" and there is one node group with one t3.large instance. Configure kubectl to point to this cluster using aws eks --region region update-kubeconfig --name cluster_name.

  2. Using Helm 3, run the following:

    helm repo add jetstack https://charts.jetstack.io
    helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
    helm repo add jupyterhub https://jupyterhub.github.io/helm-chart
    helm repo add loki https://grafana.github.io/loki/charts
    helm repo add jahstreet https://jahstreet.github.io/helm-charts
    helm repo update

    Then run kubectl apply -f https://github.com/jetstack/cert-manager/releases/download/v0.15.2/cert-manager.crds.yaml

  3. Clone the spark-on-kubernetes-helm repo locally. Modify the values.yaml file in cluster-base by replacing all instances of my-cluster.example.com with k8s.mydomain.io. My exact code can be found in this fork. Then install the modified chart with helm upgrade --install cluster-base charts/cluster-base --namespace kube-system. Now running kubectl get service cluster-base-ingress-nginx-controller --namespace kube-system should display the public DNS for a Classic Load Balancer.

  4. Add a CNAME record in Route 53 pointing k8s.mydomain.io to the load balancer's public DNS address.

  5. Perform the same replace operation in the spark-cluster chart for the custom-values-local.yaml and custom-values-example.yaml files (the latter isn't strictly necessary). Comment out the last line in the chart (kubeVersion: 1.11.0 - 1.18.9) for EKS compatibility. Apply the chart with helm upgrade --install spark-cluster --namespace spark-cluster ./charts/spark-cluster -f ./charts/spark-cluster/examples/custom-values-local.yaml. You may have to run kubectl create namespace spark-cluster before applying the chart.

  6. Go to k8s.mydomain.io/jupyterhub, sign in, and attempt to run the sample Python notebook. You may have to override your browser's security settings to even access the page (likely due to the self-signed certificate); for instance in Chrome, type thisisunsafe while on the security error page and hit Enter.

Also, one more random question: Spark 3.0 seemingly supports Prometheus natively without the need for JMX sinks or pushgateways. I'm curious whether this is something you've considered for this repo, as it would likely simplify the charts.
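For context, my understanding is that the native support boils down to a couple of Spark properties rather than a JMX exporter sidecar (a sketch based on the Spark 3.0 monitoring docs, not on anything currently in this chart):

# Executor metrics exposed on the driver UI at /metrics/executors/prometheus
spark.ui.prometheus.enabled                          true
# PrometheusServlet sink (Spark 3.0+), serving metrics at the given path of the web UI
spark.metrics.conf.*.sink.prometheusServlet.class    org.apache.spark.metrics.sink.PrometheusServlet
spark.metrics.conf.*.sink.prometheusServlet.path     /metrics/prometheus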

Please let me know if I can clarify any details - thanks again!

jahstreet commented 3 years ago

@rdhara , I guess your cluster has RBAC enabled. Could you please share Livy logs to double check?

If this is the case, then you also need to configure the Livy and Spark Driver ServiceAccounts with the appropriate privileges. To do that automatically, it should be enough to pass --set livy.rbac.create=true when installing spark-cluster. Examples of the RBAC setup can be found in https://github.com/JahstreetOrg/spark-on-kubernetes-helm/blob/master/charts/livy/templates/rbac.yaml and https://github.com/JahstreetOrg/spark-on-kubernetes-helm/blob/master/charts/livy/templates/serviceaccount.yaml . Please let me know if it helps.
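For example, building on the install command from your step 5 (the same setting can also be put into the custom values file as rbac.create: true under the livy block):

helm upgrade --install spark-cluster ./charts/spark-cluster \
    --namespace spark-cluster \
    -f ./charts/spark-cluster/examples/custom-values-local.yaml \
    --set livy.rbac.create=true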

Note: the RBAC configs provided by the Helm chart can be used ONLY if you deploy the Livy and Spark containers to the same namespace; otherwise some modifications may be needed. Please also check this piece of YAML for context: https://github.com/JahstreetOrg/spark-on-kubernetes-helm/blob/master/charts/livy/templates/statefulset.yaml#L42-L47 .

rdhara commented 3 years ago

@jahstreet I tried adding the RBAC line but the result is the same - I get a timeout after 300 seconds. The driver again never exits Pending, so there are no driver logs, but here are the Livy logs:

20/11/03 17:48:35 INFO AccessManager: AccessControlManager acls disabled;users with view permission: ;users with modify permission: ;users with super permission: ;other allowed users: *
20/11/03 17:48:37 INFO LineBufferedStream: Welcome to
20/11/03 17:48:37 INFO LineBufferedStream:       ____              __
20/11/03 17:48:37 INFO LineBufferedStream:      / __/__  ___ _____/ /__
20/11/03 17:48:37 INFO LineBufferedStream:     _\ \/ _ \/ _ `/ __/  '_/
20/11/03 17:48:37 INFO LineBufferedStream:    /___/ .__/\_,_/_/ /_/\_\   version 3.0.1
20/11/03 17:48:37 INFO LineBufferedStream:       /_/
20/11/03 17:48:37 INFO LineBufferedStream:                         
20/11/03 17:48:37 INFO LineBufferedStream: Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 1.8.0_265
20/11/03 17:48:37 INFO LineBufferedStream: Branch HEAD
20/11/03 17:48:37 INFO LineBufferedStream: Compiled by user  on 2020-10-03T09:46:06Z
20/11/03 17:48:37 INFO LineBufferedStream: Revision 2b147c4cd50da32fe2b4167f97c8142102a0510d
20/11/03 17:48:37 INFO LineBufferedStream: Url https://github.com/apache/spark.git
20/11/03 17:48:37 INFO LineBufferedStream: Type --help for more information.
20/11/03 17:48:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/11/03 17:48:38 INFO StateStore$: Using FileSystemStateStore for recovery.
20/11/03 17:48:38 INFO BatchSessionManager: Recovered 0 batch sessions. Next session id: 0
20/11/03 17:48:38 INFO InteractiveSessionManager: Recovered 0 interactive sessions. Next session id: 0
20/11/03 17:48:38 INFO InteractiveSessionManager: Heartbeat watchdog thread started.
20/11/03 17:48:38 INFO WebServer: Starting server on http://spark-cluster-livy-0.spark-cluster-livy-headless.spark-cluster.svc.cluster.local:8998
20/11/03 17:50:14 WARN InteractiveSession$: sparkr.zip not found; cannot start R interpreter.
20/11/03 17:50:14 INFO InteractiveSession$: Creating Interactive session 0: [owner: null, request: [kind: pyspark, proxyUser: Some(jupyter_user), driverMemory: 2G, executorMemory: 2G, numExecutors: 2, name: _template_python, conf: spark.kubernetes.allocation.batch.size -> 10, heartbeatTimeoutInSecond: 0]]
20/11/03 17:50:15 INFO RpcServer: Connected to the port 10000
20/11/03 17:50:15 WARN RSCConf: Your hostname, spark-cluster-livy-0.spark-cluster-livy-headless.spark-cluster.svc.cluster.local, resolves to a loopback address, but we couldn't find any external IP address!
20/11/03 17:50:15 WARN RSCConf: Set livy.rsc.rpc.server.address if you need to bind to another address.
20/11/03 17:50:15 INFO InteractiveSessionManager: Registering new session 0
20/11/03 17:50:15 INFO InteractiveSessionManager: Registered new session 0
20/11/03 17:50:18 INFO LineBufferedStream: 20/11/03 17:50:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/11/03 17:50:18 INFO LineBufferedStream: 20/11/03 17:50:18 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using current context from users K8S config file
20/11/03 17:50:18 INFO LineBufferedStream: 20/11/03 17:50:18 INFO KerberosConfDriverFeatureStep: You have not specified a krb5.conf file locally or via a ConfigMap. Make sure that you have the krb5.conf locally on the driver image.
20/11/03 17:50:20 INFO LineBufferedStream: 20/11/03 17:50:20 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
20/11/03 17:50:20 INFO LineBufferedStream:   pod name: templatepython-a8c75f758f3b21a3-driver
20/11/03 17:50:20 INFO LineBufferedStream:   namespace: spark-cluster
20/11/03 17:50:20 INFO LineBufferedStream:   labels: created-by -> livy, name -> driver, spark-app-selector -> spark-8b7bc0ed0efb411fbcde1c4580bcc55c, spark-app-tag -> livy-session-0-hrIaLgmT, spark-role -> driver
20/11/03 17:50:20 INFO LineBufferedStream:   pod uid: 98817498-573a-4fea-a16b-8a99750dce0d
20/11/03 17:50:20 INFO LineBufferedStream:   creation time: 2020-11-03T17:50:20Z
20/11/03 17:50:20 INFO LineBufferedStream:   service account name: spark-cluster-livy-spark
20/11/03 17:50:20 INFO LineBufferedStream:   volumes: spark-local-dir-1, spark-conf-volume, spark-cluster-livy-spark-token-8jv8h
20/11/03 17:50:20 INFO LineBufferedStream:   node name: N/A
20/11/03 17:50:20 INFO LineBufferedStream:   start time: N/A
20/11/03 17:50:20 INFO LineBufferedStream:   phase: Pending
20/11/03 17:50:20 INFO LineBufferedStream:   container status: N/A
20/11/03 17:50:20 INFO LineBufferedStream: 20/11/03 17:50:20 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
20/11/03 17:50:20 INFO LineBufferedStream:   pod name: templatepython-a8c75f758f3b21a3-driver
20/11/03 17:50:20 INFO LineBufferedStream:   namespace: spark-cluster
20/11/03 17:50:20 INFO LineBufferedStream:   labels: created-by -> livy, name -> driver, spark-app-selector -> spark-8b7bc0ed0efb411fbcde1c4580bcc55c, spark-app-tag -> livy-session-0-hrIaLgmT, spark-role -> driver
20/11/03 17:50:20 INFO LineBufferedStream:   pod uid: 98817498-573a-4fea-a16b-8a99750dce0d
20/11/03 17:50:20 INFO LineBufferedStream:   creation time: 2020-11-03T17:50:20Z
20/11/03 17:50:20 INFO LineBufferedStream:   service account name: spark-cluster-livy-spark
20/11/03 17:50:20 INFO LineBufferedStream:   volumes: spark-local-dir-1, spark-conf-volume, spark-cluster-livy-spark-token-8jv8h
20/11/03 17:50:20 INFO LineBufferedStream:   node name: N/A
20/11/03 17:50:20 INFO LineBufferedStream:   start time: N/A
20/11/03 17:50:20 INFO LineBufferedStream:   phase: Pending
20/11/03 17:50:20 INFO LineBufferedStream:   container status: N/A
20/11/03 17:50:20 INFO LineBufferedStream: 20/11/03 17:50:20 INFO LoggingPodStatusWatcherImpl: Deployed Spark application _template_python with submission ID spark-cluster:templatepython-a8c75f758f3b21a3-driver into Kubernetes
20/11/03 17:50:20 INFO LineBufferedStream: 20/11/03 17:50:20 INFO ShutdownHookManager: Shutdown hook called
20/11/03 17:50:20 INFO LineBufferedStream: 20/11/03 17:50:20 INFO ShutdownHookManager: Deleting directory /tmp/spark-db60605f-359a-4fb0-bbc5-cfc606188033
20/11/03 17:50:32 WARN VersionUsageUtils: The client is using resource type 'ingresses' with unstable version 'v1beta1'

I should note that this looks different from the Livy log I had without the RBAC line (that one is in the Pastebin linked in my original post above), so it seems the change had some effect?

jahstreet commented 3 years ago

The driver again never exits Pending

Do you mean that there are no Driver pods at all, or that there is a Driver pod but it is always Pending (I assume the latter)?

According to the AWS docs, a t3.large has 2 vCPUs and 8.0 GB of memory. Just to compare, I usually launch a local Minikube for testing with 12 CPUs and 14 GB of memory; otherwise there can be a lack of resources to launch a 1 Driver + X Executors job. If my assumption that there is a Driver pod but it is always Pending is right, I think your cluster simply doesn't have enough CPU: each Spark container requests 1 CPU by default, so your session (1 Driver + 2 Executors) asks for about 3 CPUs on a node with only 2 vCPUs, before even counting the JupyterHub and Livy pods.

This can be verified by executing kubectl describe pod <spark-driver-pod>; at the bottom, the Events section may contain useful error messages.
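For example, with the driver pod name from your log (the event text is only illustrative of what an unschedulable pod typically shows):

kubectl describe pod templatepython-a8c75f758f3b21a3-driver --namespace spark-cluster
# Events (illustrative):
#   Warning  FailedScheduling  ...  0/1 nodes are available: 1 Insufficient cpu.
kubectl describe nodes | grep -A 7 "Allocated resources"   # requested vs. allocatable CPU on the node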

rdhara commented 3 years ago

@jahstreet That indeed was the problem. I used a cluster of 2 c5.2xl instances and was able to get the charts working. Thanks a lot for the help! Let me know if there are plans to migrate to the new Spark 3.0 native Prometheus setup, I can try helping with that.