JuliaCloud / JuliaBox

JuliaBox continues to run, but this codebase is no longer current.
http://www.juliabox.org/

RFC: Prevent data loss due to disk quotas #390

Open tanmaykm opened 8 years ago

tanmaykm commented 8 years ago

There's a subtle bug in JuliaBox that can cause loss of data when the user's disk storage is near the allotted quota.

JuliaBox today needs to create and/or update certain files in the user's home folder.

To restore the user's data:

  1. a blank disk is first primed with the above files
  2. data from user's backup are applied on it
  3. some files are updated (I think only .bashrc)

Steps 2 and 3 above can fail if the storage used in step 1 is more than it was when the user last backed up their data, which can happen between JuliaBox releases. And when the incompletely restored data is backed up, it overwrites the last good backup.
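One way to guard against this failure mode is to check that the backup will actually fit before applying it, and to abort the restore (leaving the last good backup untouched) otherwise. A minimal sketch, assuming backups are plain tar archives and a hypothetical `restore_user_data` entry point:

```python
import shutil
import tarfile


def restore_user_data(backup_tar: str, home: str) -> bool:
    """Hypothetical sketch: apply a user's backup onto a freshly primed
    disk only if it is guaranteed to fit, so a partial restore can never
    be backed up over the last good copy."""
    # Size the backup up front instead of discovering mid-extract that
    # the primed disk is too full.
    with tarfile.open(backup_tar) as tar:
        needed = sum(m.size for m in tar.getmembers())
    free = shutil.disk_usage(home).free
    if needed > free:
        # Abort instead of restoring partially; the previous backup
        # stays intact and can be retried after freeing space.
        return False
    with tarfile.open(backup_tar) as tar:
        tar.extractall(home)
    return True
```

The key design point is that the failure is detected *before* any files are written, so the "incomplete restore gets backed up" scenario cannot occur.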

So, I think the primary reason for this issue is that JuliaBox has to share disk space with user data. Keeping more than one backup would help restore data, but that's an added safety feature. Below are a few thoughts on addressing this:

  1. Separate mounts for user data and user home (larger data volume at /data, smaller volume at /home/juser)
    • user home is not backed up
    • data disk gets backed up
    • link some essentials from user home to mounted volume (e.g. .bashrc, .ssh, ...)
    • :+1: Faster boot time, as /data can be restored asynchronously
    • :+1: User (with appropriate privilege) can keep multiple disks and choose one to mount at run time
    • :+1: Going by this analogy, EBS volumes can be mounted at /data/disk-1 and such
    • :-1: Inconvenience of having a small home folder
    • :-1: Users can leave files in the home folder and forget that it is ephemeral
  2. Separate mounts for user data and user home (larger data volume at /data, docker container filesystem for /home/juser)
    • similar to above, except that /home/juser is not a separate mount
    • :+1: simpler in operation as one less mount point
    • :-1: can't enforce a limit on home folder size
    • :-1: container filesystem is slower, so writing to home folder will be slower
  3. Allocate a single disk of size (quota + reserved space for JuliaBox use), and enforce the quota in some way
    • reserve large enough additional space for JuliaBox use
    • enforce by alerting user after periodic checks and during backup
    • :-1: no easy way to have the OS enforce limits
    • :-1: not clear how often JuliaBox should monitor space usage
    • :-1: not clear what to do if the user ignores quota messages or the session gets disconnected
  4. Use linux quota with docker user namespaces
    • have a virtual uid on host system for each docker container, and map it to juser in the corresponding container
    • set up a quota on the host machine for the virtual uid to control allocation
    • :+1: simpler, no need to use loopback volumes
    • :-1: the real uid can be different on each login, making file permissions difficult to manage
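For approach 3, the "alerting user after periodic checks" part amounts to a du-style walk over the home folder compared against the quota. A rough sketch, with the `warn_at` threshold being an assumption for illustration:

```python
import os


def usage_bytes(path: str) -> int:
    """Sum file sizes under `path` (a du-style walk), skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if not os.path.islink(fp):
                total += os.path.getsize(fp)
    return total


def check_quota(path: str, quota_bytes: int, warn_at: float = 0.9) -> str:
    """Hypothetical periodic check run by JuliaBox: returns 'ok',
    'warn' (near quota, alert the user), or 'over' (quota exceeded)."""
    used = usage_bytes(path)
    if used > quota_bytes:
        return "over"
    if used > warn_at * quota_bytes:
        return "warn"
    return "ok"
```

This also illustrates the listed downside: the check only observes usage after the fact, so nothing stops a user from blowing past the quota between two checks.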

Approach 1 or 2 looks the best to me. Any other ideas are welcome.
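The "link some essentials from user home to mounted volume" step in approaches 1 and 2 could be sketched as follows. This is a minimal illustration, not JuliaBox code; the file list is an assumption (the real set would include .ssh and others):

```python
import os

# Illustrative list; the actual set of essentials would be larger.
ESSENTIALS = [".bashrc"]


def link_essentials(data_vol: str, home: str) -> None:
    """Keep dotfiles on the backed-up data volume and expose them in
    the (ephemeral) home folder via symlinks."""
    for name in ESSENTIALS:
        target = os.path.join(data_vol, name)
        link = os.path.join(home, name)
        if not os.path.exists(target):
            # Ensure the real file lives on the data volume, so it is
            # included in backups.
            open(target, "a").close()
        if os.path.lexists(link):
            os.remove(link)
        os.symlink(target, link)
```

With this layout the home folder can stay small (or live on the container filesystem, as in approach 2) while everything worth backing up resides under the data mount.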

tanmaykm commented 8 years ago

#421 fixes this to a good extent, but the loopback disk plugin with object store backup still has a practical size limitation.

Using GlusterFS with the JuliaBox hostdisk plugin could be a good solution for large amounts of reliable, reasonably fast data storage. It was responsive when I tried it on a small test setup. GlusterFS also supports folder-level quotas and user-serviceable snapshots.

The AWS equivalent, EFS, is easier to provision and manage, but it does not have snapshots or folder-level quotas, and access is restricted to within an AWS VPC.

It would be great to hear experiences from anyone who has used GlusterFS/EFS.

tanmaykm commented 8 years ago

Summarizing the possible storage types:

| disk type | plugin | attach | IO | cost | size |
| --- | --- | --- | --- | --- | --- |
| local disk | hostdisk | fast | fastest | low | large |
| object store (S3, GCS) | loopback | slow | fastest | low | small |
| block store (EBS) | vol_ebs | slow | fast | high | very large |
| network disk (GlusterFS, EFS) | hostdisk | fast | fast | high | unlimited |

GlusterFS/EFS can be mounted on multiple instances/containers (useful in sharing or running distributed applications).

ViralBShah commented 8 years ago

cc @aviks