berkeley-dsep-infra / data8xhub

Infrastructure for creating & maintaining the data8x JupyterHub
BSD 3-Clause "New" or "Revised" License

Decide on LoadBalancing strategy for multiple hubs #8

Open yuvipanda opened 6 years ago

yuvipanda commented 6 years ago

It looks like we'll end up with 30-40 hubs at max for this deployment, so figuring out a load balancing strategy is important.

Requirements:

  1. Load balance users across hubs, rather than sharding them. The user home directory is really the only persistent storage we care about, and that is explicitly shared from outside the hubs. So a user can be assigned to a different hub each session.
  2. Sticky sessions - once a particular hub is chosen for a session, it should be used until the user session is over. Hubs themselves are stateful, so we can't load balance at the level of each user request - it has to be at the level of each user session. This means that the authentication needs to be at the level of the proxy, and the proxy needs to be aware of user authentication information to some extent.
  3. If a user closes their laptop and opens it in an hour, their notebook should sortof automagically come back to life in most cases. This is a problem because the user's pod has been culled by then, but somehow when a new request comes in, it needs to get the hub to trigger a spawn and route correctly. This already works for the single hub case, and we should try to make it work for the multiple hub case too! I think this is what constrains us most.
yuvipanda commented 6 years ago

Some form of consistent hashing should probably give us what we want, I suspect. Gonna read up.
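For reference, a consistent hash ring could look roughly like the sketch below (all names are illustrative, not anything in this repo). Each hub gets several virtual nodes on the ring, so load spreads evenly and adding or removing a hub only remaps a small fraction of users:

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring: maps usernames to hub names."""

    def __init__(self, hubs, vnodes=100):
        # Place `vnodes` virtual nodes per hub on the ring so that
        # load spreads evenly even with a small number of hubs.
        self._ring = sorted(
            (self._hash(f"{hub}-{i}"), hub)
            for hub in hubs
            for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def hub_for(self, username):
        # Walk clockwise to the first virtual node at or after the
        # username's hash, wrapping around at the end of the ring.
        idx = bisect.bisect(self._keys, self._hash(username)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing([f"hub-{i}" for i in range(4)])
assigned = ring.hub_for("student@example.com")
```

The useful property here is stability: removing one hub only moves the users that were on that hub, which matters if we want dynamically changeable hub counts.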

yuvipanda commented 6 years ago

Because of how LTI works, the only unauthenticated endpoint the proxy should allow is a POST to /lti/launch, and this should go to our central dispatcher app.

This dispatcher should do a bunch of things, but central to it is:

  1. Authenticate the LTI request
  2. Decide on a hub to send the user to
  3. Set a signed cookie for the whole domain that points to which hub the user should go to
  4. Make a JWT for the user and redirect them to the loadbalancer, with appropriate URL
  5. Hub logs them in, setting its own cookies that it can then read

This has the following consequences:

  1. requirement (1) and (2) are trivially satisfied
  2. requirement (3) is satisfied as much as possible when the user comes back on the same device - they already have a routing cookie and a login cookie, and the user still exists on the hub. Currently, hubs have performance issues if there are too many users on them, so that might cap our max hub size.
  3. When the user comes in from a new browser, the only way for them to log in is via the LTI URL, at which point our central dispatcher will already know where to send the user!

The sharder we built for the NFS storage situation automatically rebalances itself, so we might do something similar here too! However, we might need to do some performance optimization there to make sure we're not doing too many SQL queries.

This should give us a complete load balancing setup with dynamically changeable hub counts for an LTI-based setup.
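The signed routing cookie in steps 3-5 could be sketched roughly like this (a minimal sketch: the helper names and the plain-HMAC construction are illustrative assumptions, and the real dispatcher would presumably use a proper JWT library rather than hand-rolled signing):

```python
import base64
import hashlib
import hmac

# Secret shared between the dispatcher and the edge proxy so the proxy
# can verify routing cookies without calling back to the dispatcher.
SECRET = b"replace-with-shared-secret"


def make_routing_cookie(username, hub):
    # Signed "user -> hub" value set for the whole domain after a
    # successful LTI launch; gives us sticky sessions (requirement 2).
    # Assumes usernames don't contain "|".
    payload = base64.urlsafe_b64encode(f"{username}|{hub}".encode())
    sig = base64.urlsafe_b64encode(
        hmac.new(SECRET, payload, hashlib.sha256).digest()
    )
    return (payload + b"." + sig).decode()


def verify_routing_cookie(cookie):
    # Returns (username, hub) for a valid cookie, or None for a
    # tampered one - which should force a fresh login via /lti/launch.
    payload_b64, sig_b64 = cookie.encode().split(b".")
    expected = base64.urlsafe_b64encode(
        hmac.new(SECRET, payload_b64, hashlib.sha256).digest()
    )
    if not hmac.compare_digest(sig_b64, expected):
        return None
    username, hub = base64.urlsafe_b64decode(payload_b64).decode().split("|")
    return username, hub
```

Since the proxy only has to verify an HMAC per request, this keeps the hot path free of SQL queries; only the initial LTI launch hits the dispatcher's database.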

yuvipanda commented 6 years ago

https://github.com/berkeley-dsep-infra/data8xhub/commit/0dd75fc770766bc9a5822fb21aaa0a7919b90a36 takes a first stab at this.

The edge proxy will only route to hubs based on the secure routing cookie, and not do anything else.

This should work as-is to begin with, but in the long run will pose plenty of problems with respect to balancing. We need an edge-user-assigner, which can authenticate LTI requests and direct people to an appropriate hub based on its internal knowledge of the users on each hub.
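The edge proxy's entire routing decision then reduces to something like this sketch (backend URLs and function names are hypothetical): a verified routing cookie picks the backend hub, and the only thing allowed through without one is the LTI launch POST.

```python
# Hypothetical hub-name -> backend mapping held by the edge proxy.
HUB_BACKENDS = {
    "hub-0": "http://hub-0.internal",
    "hub-1": "http://hub-1.internal",
}

DISPATCHER = "http://dispatcher.internal"


def route(path, method, routing_cookie_hub):
    # routing_cookie_hub is the hub name recovered from an
    # already-verified signed cookie, or None if absent/invalid.
    if routing_cookie_hub in HUB_BACKENDS:
        return HUB_BACKENDS[routing_cookie_hub]
    if path == "/lti/launch" and method == "POST":
        return DISPATCHER
    # No other unauthenticated endpoints exist, per the LTI design.
    return None
```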

yuvipanda commented 6 years ago

We now need a load-balancing assigner of sorts. This should:

  1. Keep a running count of active servers per hub
  2. Route new users to least loaded hub
  3. Alert when all hubs start getting full!
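The core of that assigner could be as small as the sketch below (the capacity number and names are placeholders, not decided values); the `None` return is where an alert would hook into monitoring:

```python
def pick_hub(active_servers, capacity=100):
    """Return the least-loaded hub name, or None if every hub is full.

    active_servers maps hub name -> current count of active user
    servers, e.g. {"hub-0": 37, "hub-1": 12}. A None return means
    all hubs are at or over capacity and someone should be alerted.
    """
    hub, count = min(active_servers.items(), key=lambda kv: kv[1])
    if count >= capacity:
        return None  # even the emptiest hub is full: alert!
    return hub
```

Keeping the counts accurate is the harder part: they'd need to be updated as the hubs spawn and cull user servers, not just at assignment time.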