EclairJS / eclairjs-node

Node.js API for Apache Spark with Remote Client
Apache License 2.0

Session Management #14

Open Brian-Burns-Bose opened 8 years ago

Brian-Burns-Bose commented 8 years ago

EclairJS node instances need the ability to control how many kernels are created and which kernel they connect to, and to determine whether API calls have already been executed in a kernel.

Scenarios

  1. If multiple instances of a node application are created, they may all want to connect to the same kernel or to separate kernels. Instances configured to connect to the same kernel may need to know whether any startup code has already been executed by that kernel. For example, if a node app starts a Spark streaming context, that streaming context can only be started once.
  2. If a kernel dies, the node app needs to be able to launch a new kernel and re-execute its Spark startup code (a rough recovery sketch follows below).
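
A minimal sketch of scenario 2, assuming a hypothetical `launchKernel()` helper that creates a kernel connection, a `runStartupCode()` helper that replays the app's Spark setup, and an assumed `close` event for detecting a dead kernel:

```javascript
// Sketch only: launchKernel() and runStartupCode() are hypothetical helpers,
// and the 'close' event is an assumed way to detect a dead kernel.
async function withKernelRecovery(launchKernel, runStartupCode) {
  let kernel;

  async function start() {
    kernel = await launchKernel();
    await runStartupCode(kernel);
    // If this kernel dies, launch a replacement and replay the startup code.
    kernel.on('close', start);
  }

  await start();
  return () => kernel; // callers always read the current kernel through this
}
```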

Questions

How do we determine whether API calls have already been executed in a kernel? One option would be to check whether a SparkContext has already been created.
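
A rough sketch of that check, assuming a hypothetical `executeOnKernel()` helper that runs a snippet in the remote JavaScript kernel and resolves with its printed result; the variable name `sc` is also just an assumption:

```javascript
// Probe the kernel for an existing SparkContext before running startup code.
// executeOnKernel() and the variable name `sc` are assumptions for this sketch.
async function sparkContextExists(executeOnKernel) {
  const result = await executeOnKernel('typeof sc !== "undefined"');
  return String(result).trim() === 'true';
}
```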

How do we determine that a kernel id is associated with a particular node app? Kernel ids are generated by the kernel gateway, so we have no control over them.

doronrosenberg commented 8 years ago

There are some other questions as well: should a require("eclairjs") object be bound to only one kernel? Currently it is, and we create the kernel connection when the user loads the eclairjs module.

I would say no. There should be a formal API (require("eclairjs").create(), for example) that creates a kernel connection, and the user should be able to retrieve the kernel id and pass it to a later create() call to connect to an existing kernel. Note that this would break things as they stand (for example, our variable name generation code has no knowledge of existing variables and would start from 0 again, causing conflicts).
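
A usage sketch of that proposal, not the current module behaviour; the `kernelId` property and the create(kernelId) overload are assumptions about how the proposal might look:

```javascript
const eclairjs = require('eclairjs');

async function main() {
  // create() would start a new kernel and resolve with a Spark API bound to it
  const spark = await eclairjs.create();
  const kernelId = spark.kernelId;          // assumed accessor for the kernel id

  // a second app instance could pass that id to attach to the existing kernel
  const sameSpark = await eclairjs.create(kernelId);
}
```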

doronrosenberg commented 8 years ago

Elaborating on the above discussion:

What a node application that uses EclairJS needs:

In regard to session management, the best solution seems to be:

This would require eclairjs-node to change: require("eclairjs") would now return an object with several methods on it (create() and connect/stop/restart/status(uuid)), and only during create/connect would it return an instance of the Spark API bound to a specific kernel.
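
A rough sketch of what that reworked module surface could look like; only the method names come from the comment above, everything else is assumed:

```javascript
// Sketch of the proposed module shape; bodies are placeholders.
module.exports = {
  create()      { /* start a new kernel, resolve with a Spark API bound to it */ },
  connect(uuid) { /* attach to an existing kernel by id                       */ },
  stop(uuid)    { /* shut the kernel down                                     */ },
  restart(uuid) { /* restart the kernel; startup code must be re-executed     */ },
  status(uuid)  { /* report whether the kernel is alive                       */ },
};
```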

Brian-Burns-Bose commented 8 years ago

A couple of thoughts on the API. We are really talking about two things: kernel management and SparkContext management. We need to define the SparkContext-to-kernel relationship, and then we can make decisions from there. Talking about kernels to a Node.js developer will get confusing; in our case a SparkContext really means a remote kernel holding a SparkContext reference. Let's focus this discussion on the SparkContext rather than the kernel. A SparkContext can take an app name as a parameter; let's use that as the identifier.

We don't have control over kernel id creation, so how do we map a kernel id to a SparkContext application name?
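
For illustration, this is roughly what "app name as the identifier" looks like from the node side; the eclairjs-node constructor signature here is an assumption:

```javascript
// Assumed constructor shape: the second argument is the Spark application
// name, which would double as the identifier for the remote context.
const spark = require('eclairjs');
const sc = new spark.SparkContext('local[*]', 'orders-dashboard');
```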

Brian-Burns-Bose commented 8 years ago

Some information regarding the notebook server's sessions API. The sessions API is meant for session management of a particular notebook, where a notebook is identified by a "path" / file name. If you send a POST to /api/sessions with a notebook path of "foo" and a kernel spec name, a kernel is created for that session. We could use the notebook path as our SparkContext app name; essentially we would be treating the SparkContext as a notebook session. The issue is that the kernel gateway doesn't implement /api/sessions, but we could fix that.
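
A sketch of that flow, treating the app name as the notebook path. The exact payload field names vary across notebook server versions, and node-fetch plus the "eclair" kernel spec name are used purely for illustration:

```javascript
const fetch = require('node-fetch');

// Create a session whose "notebook" path is the SparkContext app name, and
// return the id of the kernel the server starts for it.
async function createSparkSession(baseUrl, appName) {
  const res = await fetch(`${baseUrl}/api/sessions`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      notebook: { path: appName },   // app name doubles as the notebook path
      kernel: { name: 'eclair' }     // kernel spec name is an assumption
    })
  });
  const session = await res.json();
  return session.kernel.id;          // kernel id assigned by the server
}
```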

doronrosenberg commented 8 years ago

One possible improvement from an API standpoint would be to allow passing in our own identifier during kernel creation, which we could query for later (in connectToKernel calls, for example). This would eliminate the need for Toree users to store the kernel id somewhere or to query the running kernels. In our case the identifier could simply be the SparkContext name.

This would have the additional benefit of making it possible to figure out what program a kernel is executing without having to look at the code the kernel has executed.
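
A hypothetical sketch of that improvement; neither the `name` option nor these method signatures exist today, they only illustrate tagging a kernel with the SparkContext name at creation time and finding it by that tag later:

```javascript
async function example(eclairjs) {
  // create a kernel tagged with the SparkContext name (hypothetical option)
  const spark = await eclairjs.create({ name: 'orders-dashboard' });

  // later, or from another app instance, look the kernel up by the same tag
  const existing = await eclairjs.connect({ name: 'orders-dashboard' });
}
```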

doronrosenberg commented 8 years ago

One thing I am seeing with the sessions API is that it always creates a Python kernel for me.