att / rcloud

Collaborative data analysis and visualization
http://rcloud.social
MIT License

Support heterogeneous back ends with notebook providing hints (e.g. require GPU support) #2714

Open s-u opened 4 years ago

s-u commented 4 years ago

Some notebooks may be using specialized facilities (such as GPUs) which are available only on some of the back-end nodes. We should provide a way for the notebook to store metadata requesting particular traits, so that RCloud picks suitable host(s) for the back-end.

Currently, the dispatch to nodes (i.e., websocket connection) is done by a single entry point (URL) which does not have any additional metadata. The client does support arbitrary URLs, but the dispatch logic would have to be passed somewhere. The idea is to fetch the notebook metadata first, and then based on that select the URL for the session.
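The selection step described above (metadata first, then URL) could look roughly like the sketch below. Everything in it is an assumption for illustration: the `requires` metadata field, the trait names, the example hosts, and the `selectSessionUrl` helper are hypothetical and not part of RCloud today.

```javascript
// Hypothetical back-end registry: which session URL serves which traits.
// The hosts, URLs, and trait names are made up for illustration.
const BACKENDS = [
  { url: "wss://cpu1.example.org/rcloud", traits: [] },
  { url: "wss://gpu1.example.org/rcloud", traits: ["gpu"] },
  { url: "wss://big1.example.org/rcloud", traits: ["gpu", "highmem"] },
];

// Return the URL of the first back end that offers every required trait,
// or null if no back end qualifies.
function selectSessionUrl(requiredTraits, backends) {
  const match = backends.find((b) =>
    requiredTraits.every((t) => b.traits.includes(t))
  );
  return match ? match.url : null;
}

// Notebook metadata as it might be stored; the "requires" field is an assumption.
const notebookMeta = { requires: ["gpu"] };
console.log(selectSessionUrl(notebookMeta.requires, BACKENDS));
```

A notebook with no requirements falls through to the first entry, so ordering the registry from cheapest to most capable keeps ordinary notebooks off the specialized nodes.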

s-u commented 4 years ago

@gordonwoodhull I'll need your support on this - this is mainly a front-end issue since the front-end code has to do all the work to find out which URL to connect to for the session. We have essentially two options:

  1. open a session just to call an RPC to find the notebook metadata, then close it and open a new session (or if the URL would be the same just continue with the current one).
  2. use a REST API to pull the metadata from some /foo.R script and then initiate the RCloud client with the corresponding URL

Technically, even 2. may implicitly have to create a session; it would just be local. It's unclear if either is preferable from a technical standpoint: 2. is maybe simpler (could just be an AJAX call), 1. is more flexible. In either case the bad news is that we definitely need an extra round-trip, so it will impact load performance.
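The "continue with the current one" shortcut in option 1 comes down to comparing two URLs once the metadata is known. A minimal sketch, where `planConnection` and its return values are hypothetical names rather than anything in the RCloud client:

```javascript
// Decide what to do once the notebook's preferred session URL is known.
// Both arguments are absolute WebSocket URLs; the function name and the
// returned action strings are hypothetical.
function planConnection(currentUrl, targetUrl) {
  // The URL parser lowercases the host and drops default ports,
  // so comparing the normalized hrefs avoids needless reconnects.
  const same = new URL(currentUrl).href === new URL(targetUrl).href;
  return same ? "reuse" : "reconnect";
}

console.log(planConnection(
  "wss://rcloud.example.org/ws",
  "wss://gpu1.example.org/ws"
));
```

With this check, the extra session in option 1 is only thrown away when the notebook really does need a different back end.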

s-u commented 4 years ago

Also note that this depends on #2713 (unless we use some cell-code magic - probably not a good idea, though)

gordonwoodhull commented 4 years ago

Somehow I thought that control would stay on the main server and only compute would go to the notebook node.

I don't think the client could handle changing control sessions when you click to a new notebook in the tree.

s-u commented 4 years ago

It already does that - the client creates a new WebSocket connection whenever the notebook is changed. The control/compute separation is not visible to the client, since they are just OCAPs on a single websocket connection, so it's actually easy to switch the server for each notebook.

This is where the URL is currently used: https://github.com/att/rcloud/blob/570ee40dbed67cbf8e44d3501b6e1a8d955044e4/htdocs/js/session.js#L191

But the challenging piece is that at that point we don't even know that a notebook is accessed - we just want to create a session for any purpose. Originally, I was just thinking we simply change the URL above and all is well - which is actually true, except it's unclear how to find out which URL to use...

What you mention is a much more challenging alternative - it would require the control process to connect to another server and then proxy the traffic from that server into its own WebSocket connection. The way we are running RCloud today it's not possible as you cannot connect to the QAP process on another machine (for security reasons among others), and even less to replace part of its OCAPs with your own. So if we want to consider that route, it would require a lot more thinking...

gordonwoodhull commented 4 years ago

Ahhh. Okay, I remember.

I have had this confused for a long time. This is the cause of bugs if extensions do not replace their ocaps when both sessions are replaced.

Option 2 sounds possibly faster because the RESTful call only initiates a basic session, whereas option 1 would create all the RCloud ocaps, ship em over, unpack them, make one call, then disconnect.

Something like 5 round trips for option 1 versus 3 for option 2, where currently we only need 2 to get to that state.
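To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes each round trip costs about one network RTT and ignores server-side processing; the round-trip counts are the estimates from this comment, and the 80 ms RTT is just an example value.

```javascript
// Round-trip counts quoted in the thread; treat them as rough estimates.
const ROUND_TRIPS = { current: 2, option1: 5, option2: 3 };

// Extra load time relative to today's flow, assuming every round trip costs
// one network RTT and ignoring server-side processing time.
function addedLatencyMs(option, rttMs) {
  return (ROUND_TRIPS[option] - ROUND_TRIPS.current) * rttMs;
}

console.log(addedLatencyMs("option1", 80)); // 240
console.log(addedLatencyMs("option2", 80)); // 80
```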

gordonwoodhull commented 4 years ago

We would like to default to round-robin or other load-balancing, but savvy users could specify a machine (or type of machine) on which they want the notebook to run.

I wonder if this could be supported by having a specific WS URL on the gateway/master which allows nginx to proxy for load balancing, but connect WS directly to a particular compute node if specified.
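On the client side, that idea might reduce to something like the following sketch. The gateway URL, the node names, and the `resolveWsUrl` helper are all hypothetical; the load balancing itself is assumed to happen in the proxy behind the gateway URL, not in this code.

```javascript
// Pick the WebSocket URL for a new session.
// gatewayUrl:    load-balanced entry point (the proxy decides which node serves it)
// nodeUrls:      Map from node name to that node's direct WebSocket URL
// requestedNode: node name chosen by the user, or null/undefined for "any"
// All names and URLs here are illustrative assumptions.
function resolveWsUrl(gatewayUrl, nodeUrls, requestedNode) {
  if (requestedNode == null) return gatewayUrl;
  const direct = nodeUrls.get(requestedNode);
  if (direct === undefined) {
    throw new Error(`unknown compute node: ${requestedNode}`);
  }
  return direct;
}

const nodes = new Map([
  ["gpu1", "wss://gpu1.example.org/ws"],
  ["big1", "wss://big1.example.org/ws"],
]);
console.log(resolveWsUrl("wss://rcloud.example.org/ws", nodes, null));
console.log(resolveWsUrl("wss://rcloud.example.org/ws", nodes, "gpu1"));
```

Failing loudly on an unknown node name is a design choice in this sketch; silently falling back to the gateway would hide typos from users who asked for specific hardware.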

gordonwoodhull commented 1 year ago

If we added something that waited for a node to boot up, this would have a cost model pretty much like Databricks clusters.

You would have one default node that is wimpy and cheap, so that scripts and dashboards always run.

Then users would select a beefier compute node to do serious data analysis. That node could be shared and would shut down after idle for some amount of time.
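The shared-node lifecycle could be modeled roughly as below. This is only a sketch: the `ComputeNode` class, its states, and the timeout handling are illustrative assumptions, and it treats booting as instantaneous, whereas a real node would need a "starting" phase that the client waits on.

```javascript
// Minimal model of a shared compute node that shuts down after being idle.
// Class name, states, and method names are illustrative assumptions.
class ComputeNode {
  constructor(idleTimeoutMs) {
    this.idleTimeoutMs = idleTimeoutMs;
    this.state = "stopped"; // "stopped" or "running"
    this.sessions = 0;
    this.idleSince = null; // timestamp at which the last session closed
  }

  // Called when a session attaches; starts the node if it was stopped.
  openSession() {
    if (this.state === "stopped") this.state = "running";
    this.sessions += 1;
    this.idleSince = null;
  }

  // Called when a session detaches; starts the idle clock at zero sessions.
  closeSession(nowMs) {
    if (this.sessions === 0) return;
    this.sessions -= 1;
    if (this.sessions === 0) this.idleSince = nowMs;
  }

  // Called periodically; stops the node once it has idled long enough.
  tick(nowMs) {
    if (
      this.state === "running" &&
      this.idleSince !== null &&
      nowMs - this.idleSince >= this.idleTimeoutMs
    ) {
      this.state = "stopped";
      this.idleSince = null;
    }
    return this.state;
  }
}
```

Because the idle clock only starts when the session count reaches zero, several users sharing the node keep it alive until the last one leaves.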

s-u commented 1 year ago

That is a good idea and it also plays well with the approach above, because the first step that determines the back-end could actually spin it up, wait, and only then return and connect. A little UI could help with the wait time.

One thing I'm noticing here is that this may not necessarily be tied to a notebook. It could be, but it doesn't have to be: the user could have some way of specifying where the next session would go. I don't have anything in mind yet and it would require some GUI, but conceptually, while the original issue was about notebooks defining the back-end, that's not a necessary requirement.