JahstreetOrg / spark-on-kubernetes-helm

Spark on Kubernetes infrastructure Helm charts repo
Apache License 2.0

Question: Does this install Spark? #53

Open daddydrac opened 3 years ago

daddydrac commented 3 years ago

Does the Helm chart for Livy deploy Spark? If not, how do we configure the Spark Helm chart and the Livy Helm chart so they can "talk" to each other / submit jobs via REST services?

jahstreet commented 3 years ago

Hi @joehoeller, thanks for the question. Basically there are two Livy job submission modes: Batch and Interactive. For Batch mode, the communication flow is the following:

  1. Livy talks to the Kubernetes API to request the Spark Driver
  2. The Spark Driver talks to the Kubernetes API to request the Spark Executors and resolves them once created to communicate tasks and track their status/progress
  3. Livy resolves the Spark Driver and Executors via the Kubernetes API and tracks their statuses. So there is no direct interaction between Livy and Spark in this mode (see the sketch after this list).
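
For illustration, here is a minimal sketch of driving this Batch flow through Livy's REST API (`POST /batches`, `GET /batches/{id}/state`) with Python's `requests` library. The `http://livy:8998` endpoint and the example JAR path are assumptions; substitute your own:

```python
import time
import requests

LIVY_URL = "http://livy:8998"  # assumed Livy service endpoint inside the cluster

# Submit a Batch job: Livy asks the Kubernetes API to create the Spark Driver pod
resp = requests.post(
    f"{LIVY_URL}/batches",
    json={
        # The JAR path is an assumption; point it at a JAR baked into your Spark image
        "file": "local:///opt/spark/examples/jars/spark-examples.jar",
        "className": "org.apache.spark.examples.SparkPi",
        "args": ["100"],
    },
)
batch_id = resp.json()["id"]

# Livy tracks the Driver and Executor pods via the Kubernetes API;
# the client only needs to poll the state Livy reports over REST
state = "starting"
while state not in ("success", "dead", "killed"):
    time.sleep(5)
    state = requests.get(f"{LIVY_URL}/batches/{batch_id}/state").json()["state"]
print(f"Batch {batch_id} finished with state: {state}")
```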

For interactive mode:

  1. Livy talks to the Kubernetes API to request the Spark Driver with the specific entrypoint JAR (for the interactive mode) and spins up an RPC server, asynchronously waiting for the client registration request
  2. The Spark Driver talks to Kubernetes to request the Executors and calls the Livy RPC server to register and share its own RPC server endpoint
  3. Livy communicates with the Spark Driver via their RPC servers (see the REST sketch after this list)
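
From the client side, this interactive flow is driven through Livy's sessions REST API (`POST /sessions`, `POST /sessions/{id}/statements`). A minimal sketch, again assuming Livy is reachable at `http://livy:8998`:

```python
import time
import requests

LIVY_URL = "http://livy:8998"  # assumed Livy service endpoint inside the cluster

# Create an interactive PySpark session: Livy requests the Driver pod with the
# interactive entrypoint and waits for the Driver to register over RPC
session_id = requests.post(f"{LIVY_URL}/sessions", json={"kind": "pyspark"}).json()["id"]

# The session becomes "idle" once the Driver has registered with Livy's RPC server
while requests.get(f"{LIVY_URL}/sessions/{session_id}/state").json()["state"] != "idle":
    time.sleep(5)

# Statements are relayed to the Driver over the Livy <-> Driver RPC channel;
# in real use, poll the statement until its state is "available"
stmt = requests.post(
    f"{LIVY_URL}/sessions/{session_id}/statements",
    json={"code": "sc.parallelize(range(100)).count()"},
).json()
output = requests.get(f"{LIVY_URL}/sessions/{session_id}/statements/{stmt['id']}").json()
```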

Note: both modes work in basically the same way on YARN; the only difference here is that we use Kubernetes as the resource manager instead.

Does that answer your question? Please let me know if you would like to know more about any specific part of these flows. Best.

daddydrac commented 3 years ago

Wow, that’s a really complete answer, thank you for that.

So, that leads me to my next two questions:

How can I expose the necessary NodePorts for the UI?

Is it possible to communicate RESTfully and send jobs via a notebook?

jahstreet commented 3 years ago

Awesome, happy to help.

To answer the other two questions, I would suggest giving this step-by-step installation guide on Minikube a try. It will show you how to spin up the required components and get JupyterHub, with Jupyter notebooks per user, exposed externally from the Kubernetes cluster, as well as direct access to the Livy UI with links to the Spark UI.

Some design details can also be found here.

In short: Sparkmagic is used to set up the Jupyter -> Livy -> Spark communication, and Ingresses backed by the Nginx Ingress Controller are used to expose the component endpoints.
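
As an example, the notebook side of that chain can look roughly like the sketch below, using Sparkmagic's IPython magics. The session name, Livy URL, and exact flag spellings are assumptions; check the Sparkmagic version installed in your notebook image:

```python
# Cell 1: load Sparkmagic's IPython magics (run in Jupyter cells, not a plain .py file)
%load_ext sparkmagic.magics

# Cell 2: register an interactive PySpark session against the Livy endpoint
%spark add -s demo -l python -u http://livy:8998

# Cell 3 (cell magic at the top of its own cell): this code runs on the remote cluster
%%spark
df = spark.range(1000)
print(df.count())
```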

jahstreet commented 1 year ago

Hi @salinaaaaaa, is that issue fixed for you?

Almenon commented 1 year ago

Hi @jahstreet, so if I understand correctly, in interactive mode (the mode you would probably use with Sparkmagic & Jupyter notebooks), Livy would:

  1. Ask Kubernetes to create a driver job
  2. The driver job would create a pod
  3. Driver asks Kubernetes to create executor pods
  4. Driver kills executor pods when they finish
  5. When the driver finishes, the pod automatically dies, as it is a Kubernetes job.

jahstreet commented 11 months ago

Hi @Almenon, you're almost correct. If we're speaking about interactive mode:

  1. Ask Kubernetes to create a driver pod and establish a web-socket connection between Livy and the Spark driver RPC server
  ...
  5. When the interactive session completes, Livy deletes the driver pod, which triggers deletion of the executor pods that reference the driver pod
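
For illustration, the user-triggered analogue of step 5 is deleting the session through Livy's REST API (`DELETE /sessions/{id}`); a minimal sketch, with the endpoint and session id as placeholder assumptions:

```python
import requests

LIVY_URL = "http://livy:8998"   # assumed Livy service endpoint
session_id = 0                  # the id returned when the session was created

# Deleting the session makes Livy remove the driver pod; Kubernetes then
# garbage-collects the executor pods that reference the driver as their owner
requests.delete(f"{LIVY_URL}/sessions/{session_id}")
```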

Almenon commented 11 months ago

Thanks! Sorry for the basic question, but what does it mean when you say the "interactive session completes"? In this example, would the session complete when the `get()` call finishes?

double pi = client.submit(new PiJob(samples)).get(); // submit() returns a Future; get() blocks until the PiJob result arrives

For context, I'm a DevOps engineer, not a Spark developer, so this stuff is new to me 😅