Hi,
I encountered the same error. In my case I just added this to the `custom-values-local.yaml` file:
```yaml
livy:
  fullnameOverride: livy-server
...
```
It looks like the backend of Livy's ingress doesn't match the pod that is running, so forcing the names to match makes things work again.
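For anyone hitting the same mismatch, a quick way to compare the ingress backend with the services actually deployed (a sketch only; the `spark-cluster` namespace is an assumption based on this thread):

```bash
# List the services and ingresses created by the release
kubectl -n spark-cluster get svc,ingress

# The backend service named in the Livy ingress rules should match one of the
# services listed above (e.g. livy-server after the override)
kubectl -n spark-cluster describe ingress
```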
@KnutFr Hmm interesting will give this a try, thanks!
@rdhara , currently (3.0.X version) the Livy K8s service name is already overridden in the `spark-cluster` Helm chart `values.yaml`:
```yaml
livy:
  service:
    name: livy-server
```
So you shouldn't be required to override it to make the `livy-server` service discoverable.
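To double-check that the service is discoverable under that name, one option is a throwaway DNS lookup from inside the cluster (illustrative only; assumes the `spark-cluster` namespace):

```bash
# Should resolve to the ClusterIP of the Livy service
kubectl -n spark-cluster run dns-test --rm -it --restart=Never --image=busybox -- \
  nslookup livy-server.spark-cluster.svc.cluster.local
```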
@rdhara , could you please also share the Spark Driver pod logs (if available)? Please also check whether the Driver and Executor pods have been created in the K8s cluster, to confirm whether or not it is a Livy-to-Spark connectivity issue. Ideally it would be nice to see a step-by-step guide on how to reproduce your issue; then I could give it a try on AWS.
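A hedged way to check for those pods, assuming Livy submits into the `spark-cluster` namespace and the usual Spark-on-K8s labels are applied:

```bash
# Driver and executor pods created by spark-submit carry the spark-role label
kubectl -n spark-cluster get pods -l spark-role=driver
kubectl -n spark-cluster get pods -l spark-role=executor
```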
Thank you for your response @jahstreet - by the way this repo is a fantastic resource! The driver never exits the pending state so there are no logs. I also don't see any executor pods.
Here are my steps:
Create an EKS cluster using k8s version 1.18. The cluster's endpoint access is set to "Public and private" and there is one node group with one t3.large instance. Configure kubectl to point to this cluster using `aws eks --region <region> update-kubeconfig --name <cluster_name>`.
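For reference, one possible way to create such a cluster with `eksctl` (a sketch, not the exact commands I used; the name, region, and nodegroup name are placeholders):

```bash
eksctl create cluster \
  --name <cluster_name> \
  --region <region> \
  --version 1.18 \
  --nodegroup-name ng-1 \
  --node-type t3.large \
  --nodes 1
```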
Using Helm 3, run the following:

```bash
helm repo add jetstack https://charts.jetstack.io
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo add jupyterhub https://jupyterhub.github.io/helm-chart
helm repo add loki https://grafana.github.io/loki/charts
helm repo add jahstreet https://jahstreet.github.io/helm-charts
helm repo update
```
Then run `kubectl apply -f https://github.com/jetstack/cert-manager/releases/download/v0.15.2/cert-manager.crds.yaml`.
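A quick sanity check that the repos and CRDs are in place (optional; standard Helm/kubectl queries):

```bash
helm repo list
kubectl get crds | grep cert-manager.io
```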
Clone the spark-on-kubernetes-helm repo locally. Modify the `values.yaml` file in `cluster-base` by replacing all instances of `my-cluster.example.com` with `k8s.mydomain.io`. My exact code can be found in this fork. Then install the modified chart with `helm upgrade --install cluster-base charts/cluster-base --namespace kube-system`. Now running `kubectl get service cluster-base-ingress-nginx-controller --namespace kube-system` should display the public DNS for a Classic Load Balancer.
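A sketch of how the replacement and the load balancer lookup can be scripted (the `sed` form shown assumes GNU sed; the jsonpath query just extracts the ELB hostname from the service status):

```bash
# Replace the example domain across the cluster-base chart (GNU sed shown)
grep -rl 'my-cluster.example.com' charts/cluster-base \
  | xargs sed -i 's/my-cluster\.example\.com/k8s.mydomain.io/g'

# Print only the Classic Load Balancer's public DNS name
kubectl get service cluster-base-ingress-nginx-controller \
  --namespace kube-system \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
```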
Add a CNAME record in Route 53 pointing `k8s.mydomain.io` to the load balancer's public DNS address.
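This can also be done from the AWS CLI; a hypothetical sketch, with the hosted zone ID and load balancer DNS name as placeholders:

```bash
aws route53 change-resource-record-sets \
  --hosted-zone-id <hosted_zone_id> \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "k8s.mydomain.io",
        "Type": "CNAME",
        "TTL": 300,
        "ResourceRecords": [{"Value": "<load_balancer_public_dns>"}]
      }
    }]
  }'
```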
Perform the same replace operation in the `spark-cluster` chart for the `custom-values-local.yaml` and `custom-values-example.yaml` files (the latter not being strictly necessary). Comment out the last line of the chart's `Chart.yaml` (`kubeVersion: 1.11.0 - 1.18.9`) for EKS compatibility. Apply the chart with `helm upgrade --install spark-cluster --namespace spark-cluster ./charts/spark-cluster -f ./charts/spark-cluster/examples/custom-values-local.yaml`. You may have to run `kubectl create namespace spark-cluster` before applying the chart.
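Once the install settles, the release and pods can be checked with standard commands (illustrative):

```bash
helm status spark-cluster --namespace spark-cluster
kubectl -n spark-cluster get pods
```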
Go to `k8s.mydomain.io/jupyterhub`, sign in, and attempt to run the sample Python notebook. You may have to override your browser's security settings to even access the page (likely due to the self-signed certificate); for instance in Chrome, type `thisisunsafe` while on the security error page and hit Enter.
Also, one more random question: Spark 3.0 seemingly supports Prometheus natively, without the need for JMX sinks or a Pushgateway; curious whether this is something you've considered for this repo, as it would likely simplify the charts.
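As far as I understand (a hedged sketch, not necessarily how this repo would wire it up), the native support boils down to a few plain Spark properties:

```bash
# Expose driver metrics in Prometheus format and executor metrics via the driver UI.
# PrometheusServlet and spark.ui.prometheus.enabled were introduced in Spark 3.0.
cat >> "$SPARK_HOME/conf/spark-defaults.conf" <<'EOF'
spark.ui.prometheus.enabled                        true
spark.metrics.conf.*.sink.prometheusServlet.class  org.apache.spark.metrics.sink.PrometheusServlet
spark.metrics.conf.*.sink.prometheusServlet.path   /metrics/prometheus
EOF
# Driver metrics should then be scrapeable at <driver>:4040/metrics/prometheus and
# executor metrics at <driver>:4040/metrics/executors/prometheus.
```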
Please let me know if I can clarify any details - thanks again!
@rdhara , I guess your cluster has RBAC enabled. Could you please share Livy logs to double check?
In case this is true, then you also need to configure the Livy and Spark Driver `ServiceAccount`s with the appropriate privileges. To do that automatically it should be enough to provide `--set livy.rbac.create=true` when installing `spark-cluster`. The examples of the RBAC setup can be found in https://github.com/JahstreetOrg/spark-on-kubernetes-helm/blob/master/charts/livy/templates/rbac.yaml and https://github.com/JahstreetOrg/spark-on-kubernetes-helm/blob/master/charts/livy/templates/serviceaccount.yaml. Please let me know if it helps.
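For example (the same install command as above, with the RBAC flag added):

```bash
helm upgrade --install spark-cluster ./charts/spark-cluster \
  --namespace spark-cluster \
  -f ./charts/spark-cluster/examples/custom-values-local.yaml \
  --set livy.rbac.create=true
```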
Note: the RBAC configs provided by the Helm chart can be used ONLY if you deploy the Livy and Spark containers to the same namespace; otherwise some modifications may be needed. Please also check this piece of YAML to get some context: https://github.com/JahstreetOrg/spark-on-kubernetes-helm/blob/master/charts/livy/templates/statefulset.yaml#L42-L47.
@jahstreet I tried adding the RBAC line but the result is the same - I get a timeout after 300 seconds. The driver again never exits `Pending`, so there are no logs, but here are the Livy logs:
20/11/03 17:48:35 INFO AccessManager: AccessControlManager acls disabled;users with view permission: ;users with modify permission: ;users with super permission: ;other allowed users: *
20/11/03 17:48:37 INFO LineBufferedStream: Welcome to
20/11/03 17:48:37 INFO LineBufferedStream: ____ __
20/11/03 17:48:37 INFO LineBufferedStream: / __/__ ___ _____/ /__
20/11/03 17:48:37 INFO LineBufferedStream: _\ \/ _ \/ _ `/ __/ '_/
20/11/03 17:48:37 INFO LineBufferedStream: /___/ .__/\_,_/_/ /_/\_\ version 3.0.1
20/11/03 17:48:37 INFO LineBufferedStream: /_/
20/11/03 17:48:37 INFO LineBufferedStream:
20/11/03 17:48:37 INFO LineBufferedStream: Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 1.8.0_265
20/11/03 17:48:37 INFO LineBufferedStream: Branch HEAD
20/11/03 17:48:37 INFO LineBufferedStream: Compiled by user on 2020-10-03T09:46:06Z
20/11/03 17:48:37 INFO LineBufferedStream: Revision 2b147c4cd50da32fe2b4167f97c8142102a0510d
20/11/03 17:48:37 INFO LineBufferedStream: Url https://github.com/apache/spark.git
20/11/03 17:48:37 INFO LineBufferedStream: Type --help for more information.
20/11/03 17:48:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/11/03 17:48:38 INFO StateStore$: Using FileSystemStateStore for recovery.
20/11/03 17:48:38 INFO BatchSessionManager: Recovered 0 batch sessions. Next session id: 0
20/11/03 17:48:38 INFO InteractiveSessionManager: Recovered 0 interactive sessions. Next session id: 0
20/11/03 17:48:38 INFO InteractiveSessionManager: Heartbeat watchdog thread started.
20/11/03 17:48:38 INFO WebServer: Starting server on http://spark-cluster-livy-0.spark-cluster-livy-headless.spark-cluster.svc.cluster.local:8998
20/11/03 17:50:14 WARN InteractiveSession$: sparkr.zip not found; cannot start R interpreter.
20/11/03 17:50:14 INFO InteractiveSession$: Creating Interactive session 0: [owner: null, request: [kind: pyspark, proxyUser: Some(jupyter_user), driverMemory: 2G, executorMemory: 2G, numExecutors: 2, name: _template_python, conf: spark.kubernetes.allocation.batch.size -> 10, heartbeatTimeoutInSecond: 0]]
20/11/03 17:50:15 INFO RpcServer: Connected to the port 10000
20/11/03 17:50:15 WARN RSCConf: Your hostname, spark-cluster-livy-0.spark-cluster-livy-headless.spark-cluster.svc.cluster.local, resolves to a loopback address, but we couldn't find any external IP address!
20/11/03 17:50:15 WARN RSCConf: Set livy.rsc.rpc.server.address if you need to bind to another address.
20/11/03 17:50:15 INFO InteractiveSessionManager: Registering new session 0
20/11/03 17:50:15 INFO InteractiveSessionManager: Registered new session 0
20/11/03 17:50:18 INFO LineBufferedStream: 20/11/03 17:50:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/11/03 17:50:18 INFO LineBufferedStream: 20/11/03 17:50:18 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using current context from users K8S config file
20/11/03 17:50:18 INFO LineBufferedStream: 20/11/03 17:50:18 INFO KerberosConfDriverFeatureStep: You have not specified a krb5.conf file locally or via a ConfigMap. Make sure that you have the krb5.conf locally on the driver image.
20/11/03 17:50:20 INFO LineBufferedStream: 20/11/03 17:50:20 INFO LoggingPodStatusWatcherImpl: State changed, new state:
20/11/03 17:50:20 INFO LineBufferedStream: pod name: templatepython-a8c75f758f3b21a3-driver
20/11/03 17:50:20 INFO LineBufferedStream: namespace: spark-cluster
20/11/03 17:50:20 INFO LineBufferedStream: labels: created-by -> livy, name -> driver, spark-app-selector -> spark-8b7bc0ed0efb411fbcde1c4580bcc55c, spark-app-tag -> livy-session-0-hrIaLgmT, spark-role -> driver
20/11/03 17:50:20 INFO LineBufferedStream: pod uid: 98817498-573a-4fea-a16b-8a99750dce0d
20/11/03 17:50:20 INFO LineBufferedStream: creation time: 2020-11-03T17:50:20Z
20/11/03 17:50:20 INFO LineBufferedStream: service account name: spark-cluster-livy-spark
20/11/03 17:50:20 INFO LineBufferedStream: volumes: spark-local-dir-1, spark-conf-volume, spark-cluster-livy-spark-token-8jv8h
20/11/03 17:50:20 INFO LineBufferedStream: node name: N/A
20/11/03 17:50:20 INFO LineBufferedStream: start time: N/A
20/11/03 17:50:20 INFO LineBufferedStream: phase: Pending
20/11/03 17:50:20 INFO LineBufferedStream: container status: N/A
20/11/03 17:50:20 INFO LineBufferedStream: 20/11/03 17:50:20 INFO LoggingPodStatusWatcherImpl: State changed, new state:
20/11/03 17:50:20 INFO LineBufferedStream: pod name: templatepython-a8c75f758f3b21a3-driver
20/11/03 17:50:20 INFO LineBufferedStream: namespace: spark-cluster
20/11/03 17:50:20 INFO LineBufferedStream: labels: created-by -> livy, name -> driver, spark-app-selector -> spark-8b7bc0ed0efb411fbcde1c4580bcc55c, spark-app-tag -> livy-session-0-hrIaLgmT, spark-role -> driver
20/11/03 17:50:20 INFO LineBufferedStream: pod uid: 98817498-573a-4fea-a16b-8a99750dce0d
20/11/03 17:50:20 INFO LineBufferedStream: creation time: 2020-11-03T17:50:20Z
20/11/03 17:50:20 INFO LineBufferedStream: service account name: spark-cluster-livy-spark
20/11/03 17:50:20 INFO LineBufferedStream: volumes: spark-local-dir-1, spark-conf-volume, spark-cluster-livy-spark-token-8jv8h
20/11/03 17:50:20 INFO LineBufferedStream: node name: N/A
20/11/03 17:50:20 INFO LineBufferedStream: start time: N/A
20/11/03 17:50:20 INFO LineBufferedStream: phase: Pending
20/11/03 17:50:20 INFO LineBufferedStream: container status: N/A
20/11/03 17:50:20 INFO LineBufferedStream: 20/11/03 17:50:20 INFO LoggingPodStatusWatcherImpl: Deployed Spark application _template_python with submission ID spark-cluster:templatepython-a8c75f758f3b21a3-driver into Kubernetes
20/11/03 17:50:20 INFO LineBufferedStream: 20/11/03 17:50:20 INFO ShutdownHookManager: Shutdown hook called
20/11/03 17:50:20 INFO LineBufferedStream: 20/11/03 17:50:20 INFO ShutdownHookManager: Deleting directory /tmp/spark-db60605f-359a-4fb0-bbc5-cfc606188033
20/11/03 17:50:32 WARN VersionUsageUtils: The client is using resource type 'ingresses' with unstable version 'v1beta1'
I should note this looks different from the Livy log I had without the RBAC line - the Pastebin is linked in my original post above. So it seems it had some effect?
> The driver again never exits `Pending`

Do you mean that there are no Driver pods at all, or that there is a Driver pod but it is always pending (I assume the latter)?
According to the AWS docs, a t3.large has 2 vCPU and 8.0 GiB of memory. Just to compare, I usually launch a local Minikube for testing with 12 CPU and 14 GB of memory, otherwise there might be a lack of resources to launch a 1 Driver + X Executors job. In case my assumption that there is a Driver pod but it is always pending is right, I think your cluster just doesn't have enough CPU (each Spark container requires 1 CPU by default).
This can be verified by executing `kubectl describe pod <spark-driver-pod>`; the Events section at the bottom may contain useful error messages.
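Using the driver pod name from the log above, the checks could look like this (the node-capacity grep is just illustrative):

```bash
# The Events section usually shows scheduling failures such as "Insufficient cpu"
kubectl -n spark-cluster describe pod templatepython-a8c75f758f3b21a3-driver

# Recent events across the namespace, plus what each node can still allocate
kubectl -n spark-cluster get events --sort-by=.lastTimestamp
kubectl describe nodes | grep -A 7 'Allocated resources'
```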
@jahstreet That indeed was the problem. I used a cluster of 2 c5.2xlarge instances and was able to get the charts working. Thanks a lot for the help! Let me know if there are plans to migrate to the new Spark 3.0 native Prometheus setup; I can try helping with that.
Hello,
I'm working through the full instructions and am running into an issue accessing Livy. My k8s control plane is managed by AWS EKS and I've configured a Route 53 CNAME record for `k8s.mydomain.io` to point to the Classic Load Balancer that gets spun up by AWS when I install the `cluster-base` config. Because I'm not deploying locally, I've replaced all instances of `my-cluster.example.com` with `k8s.mydomain.io`. I do get security errors when accessing this page via the browser, but I'm assuming this is just because k8s is using a self-signed certificate.

I then installed `spark-cluster` and confirmed all the pods (including `livy`) were successfully running.

I am able to go to `k8s.mydomain.io/jupyterhub`, sign in, and launch the Python example notebook. But the Spark application never seems to start, and after 5 minutes I get a timeout.

When I try to go to `k8s.mydomain.io/livy`, I get an Nginx error page that says 503 Service Temporarily Unavailable. All the metric dashboards in `spark-monitoring` also don't work - I can access the pages but no metrics are available. I suspect all these problems likely have a singular root cause; there is something I'm missing here.

Additional information that might be useful:
- Config files used: the `values.yaml` file in `cluster-base` and the `custom-values-local.yaml` file in `spark-cluster`, both with the domain replacement mentioned above
- Node instance type: `t3.large`, which should probably be sufficient to run a PySpark notebook

Any help would be greatly appreciated - thanks in advance!