Write a simple service / library that provides sharding info for user homedirs

berkeley-dsep-infra / data8xhub

Infrastructure for creating & maintaining the data8x JupyterHub

BSD 3-Clause "New" or "Revised" License

11 stars 5 forks source link

Write a simple service / library that provides sharding info for user homedirs #5

Closed yuvipanda closed 6 years ago

yuvipanda commented 6 years ago

We will have N NFS servers, each of which can serve a capacity C_n users. We need a high performance, non-racy and simple way of doing the following:

For a given user, fetch what host their homedir is
If the user is new, assign them to a specific host based on some metric (least loaded, probably)
If we perform a migration, update the user info so it directs clients to the new location

It also needs to perform at least some sort of authentication, split for reads and writes.

All this also needs to be fairly race free, and we should have a plan for backing up the data + reconstructing it from scratch if we lose it.

yuvipanda commented 6 years ago

Brainstorming this a bit more, it looks like the data structure we need to maintain looks like a min heap of sets. Each set represents a shard, and we can always add new users to the top set (since it has the smallest number of users currently). To find where a user is, we would have to do a O(n) operation, checking for set containment across all the objects. However, n is usually small (maxing out at a few hundreds?) and I think the simplicity of this design makes it more worthwhile.

I strongly suspect that I can turn this into a simple RDBMS design that I can then implement on postgresql.

yuvipanda commented 6 years ago

A single table design might be enough, with the following columns: id, name, shard.

To add a new item, we simply do something along the lines of:

INSERT INTO items(name, shard) VALUES(:name, SELECT shard FROM items GROUP BY shard ORDER BY count(shard) LIMIT 1)

We can then read the value of the shard. This also lets us use the transactional capabilities of postgres (or sqlite or anything) to prevent races. I believe we can index this properly so this is quite fast.

Finding out which shard an item belongs to is also a trivial SQL query now.

This is much simpler than my previous comment about datastructures :D

yuvipanda commented 6 years ago

Also note that with this we no longer set 'capacity' for a shard. All shards are counted to be unlimited capacity, but we will always only assign new users to the shard with the smallest number of items - thus equalizing the number of users per shard over time. This also means that if we run out of capacity, users might get assigned to shards that are already 'full' - which might be quite bad if not planned for properly.

yuvipanda commented 6 years ago

In our config, we'll hook into this from the spawner to figure out where to place the user.

yuvipanda commented 6 years ago

Also this can totally be a python library rather than a service! Since all the locking is done at the db level, that'd be totally fine...

yuvipanda commented 6 years ago

wheee, I've a sharder object now!