JahstreetOrg / spark-on-kubernetes-helm

Spark on Kubernetes infrastructure Helm charts repo

Change Deployment to StatefulSet #20

Closed: lucasces closed this 5 years ago

lucasces commented 5 years ago

Here is the PR with the StatefulSet change I promised you @jahstreet. I kept the old service and it still works when you have a single Livy pod, but a problem will emerge, with or without this PR, if you use more than one replica, since Livy does not support that either. With this PR you can use the address http://livy-N.livy-headless.default.svc.cluster.local:8998 inside the k8s cluster, or create a service and ingress for each replica manually to overcome this problem.
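For readers following along, here is a minimal sketch of the pattern this PR relies on: a headless Service paired with a StatefulSet gives each Livy pod a stable DNS record of the form `<pod-name>.<service-name>.<namespace>.svc.cluster.local`. The resource names, labels, and image below are illustrative assumptions, not the chart's actual templates.

```yaml
# Sketch only: headless Service + StatefulSet. Names, labels and the image
# reference are assumptions for illustration; the chart's real templates differ.
apiVersion: v1
kind: Service
metadata:
  name: livy-headless
spec:
  clusterIP: None            # headless: per-pod DNS records instead of a virtual IP
  selector:
    app: livy
  ports:
    - name: http
      port: 8998             # Livy REST API
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: livy
spec:
  serviceName: livy-headless # ties the per-pod DNS names to the headless Service
  replicas: 1
  selector:
    matchLabels:
      app: livy
  template:
    metadata:
      labels:
        app: livy
    spec:
      containers:
        - name: livy
          image: sasnouskikh/livy:latest   # placeholder image reference (assumption)
          ports:
            - name: http
              containerPort: 8998          # RSC RPC ports (10000+) are bound at runtime
```

With this, pod `livy-0` in the `default` namespace resolves as `livy-0.livy-headless.default.svc.cluster.local`, which is the addressing scheme described above.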

jahstreet commented 5 years ago

@lucasces How have you tested it? I wonder if you tried to run multiple interactive sessions on the same Livy pod at the same time, to be sure that port 10001 of the second session thread is reachable as well. It feels like you can reach port 10000 via the STS pod DNS because it is exposed at the Docker image level (if you haven't used your own image).

As for scaling Livy, I don't see the benefit of using StatefulSets here. You would anyway need to set up other infrastructure components (like JupyterHub, or some scheduler) to access the other Livy replicas and somehow orchestrate them externally. And when it comes to sharing cluster resources between Livy replicas, it may cause its own problems as well. I would prefer deploying Livy servers to separate namespaces and accessing them like livy.<namespace>.svc:<port>.

Have you got any specific case behind that PR rather than just simplifying port exposure?
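On the "service and ingress for each replica" workaround mentioned earlier, a hedged sketch of how a single replica could be exposed: a regular Service can target exactly one StatefulSet pod by selecting on the `statefulset.kubernetes.io/pod-name` label that Kubernetes sets on StatefulSet pods. The names below are assumptions for illustration.

```yaml
# Sketch only: a Service that fronts a single StatefulSet pod (livy-0) via the
# pod-name label added by the StatefulSet controller. An Ingress could then
# route to this Service to reach that specific replica from outside the cluster.
apiVersion: v1
kind: Service
metadata:
  name: livy-0-external      # illustrative name
spec:
  selector:
    statefulset.kubernetes.io/pod-name: livy-0
  ports:
    - name: http
      port: 8998
      targetPort: 8998
```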

lucasces commented 5 years ago

@jahstreet we have been running our test cluster since late May, and it runs 93 jobs daily, with up to 4 jobs in parallel, using interactive sessions. Many of the jobs are small and quick, while others take as much as 90 minutes to finish. The number of workers per job ranges from 10 to 25. I used custom Spark images based on yours and the default image for Livy, and looking into the drivers' logs I see these messages indicating that they can talk to ports other than 10000:

2019-08-21 09:37:37 19/08/21 12:37:37 INFO RSCDriver: Connecting to: livy-0.livy.datascience.svc.cluster.local:10001
2019-08-21 09:27:08 19/08/21 12:27:08 INFO RSCDriver: Connecting to: livy-0.livy.datascience.svc.cluster.local:10002

No other case than just simplifying port exposure. The scaling part was just a side note. People may eventually try to increase the replica count, and things will not work as expected with Deployments and StatefulSets alike.

jahstreet commented 5 years ago

I've created 3 parallel interactive sessions on Livy and they all were bound to the same Livy RPC server port, 10000. Though everything seems to work correctly, just keep that case in mind (I suspect it happened because the existing sessions were idle when I requested a new one). I hadn't expected such behaviour before. If you notice anything related to that case, please share your insights. If you create a new session while the others are still active, a different port gets bound.
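For reference, the port behaviour discussed here is governed on the Livy side by the RSC launcher port range (`livy.rsc.launcher.port.range`, which defaults to `10000~10010` in Apache Livy, to the best of my knowledge). Below is a hedged sketch of pinning it explicitly via a ConfigMap; whether and how the chart mounts this configuration into the Livy container is an assumption here.

```yaml
# Sketch only: a ConfigMap carrying livy-client.conf, assuming it gets mounted
# into Livy's conf directory (this chart may do it differently, or not at all).
# Each concurrent session that needs its own RPC listener takes the next free
# port in this range, matching the 10000/10001/10002 connections seen in the logs.
apiVersion: v1
kind: ConfigMap
metadata:
  name: livy-conf            # illustrative name
data:
  livy-client.conf: |
    livy.rsc.launcher.port.range = 10000~10010
```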

jahstreet commented 5 years ago

@lucasces Please check and confirm my changes.

lucasces commented 5 years ago

> I've created 3 parallel interactive sessions on Livy and they all were bound to the same Livy RPC server port, 10000. Though everything seems to work correctly, just keep that case in mind (I suspect it happened because the existing sessions were idle when I requested a new one). I hadn't expected such behaviour before. If you notice anything related to that case, please share your insights. If you create a new session while the others are still active, a different port gets bound.

I did not pay much attention to the RPC port, but I can from now on. We had some busy-port issues using the same approach for Livy on Hadoop, but I really don't know if it's related; they went away after upgrading to Livy on Kubernetes.