berkeley-dsep-infra / datahub

JupyterHubs for use by Berkeley enrolled students
https://docs.datahub.berkeley.edu
BSD 3-Clause "New" or "Revised" License

Implement data archival policy for home directories #1633

Open yuvipanda opened 4 years ago

yuvipanda commented 4 years ago

We theoretically archive home directories of users who haven't touched them in 12 months, based on this policy: https://docs.datahub.berkeley.edu/en/latest/users/storage-retention.html.

We don't do that yet, but we should! I want this to be a fully self-serve setup - so students should be able to acquire & unzip their home directories whenever they choose to, without having to bug us.

| Hub | Total homedirs | Archivable homedirs | % | Total size | Archivable size | % |
|---|---|---|---|---|---|---|
| datahub | 21926 | 6827 | 31% | 7.5T | 1.8T | 23% |
| data100 | 5708 | 1340 | 23% | - | - | - |
| prob140 | 2326 | 547 | 23% | - | - | - |

Other hubs are new enough that I don't think it's worth doing anything about them.

yuvipanda commented 4 years ago

I have a small script that tells me how many users' home directories can be archived.

We have a total of 21926 home directories in datahub homes, of which only 6827 meet the criteria for archival. With a 6-month cutoff it becomes 12465, but let's not do that yet.
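
For reference, a rough shell sketch of that check - the path is an assumption, and the real script is linked in a later comment:

```bash
# List home directories in which no file has been modified in the last 365 days.
# /export/homes is an assumed path, not necessarily the real mount point.
for d in /export/homes/*/; do
  # -mtime -365 matches files modified within the last 365 days;
  # -print -quit stops at the first match, so empty output means "archivable".
  if [ -z "$(find "$d" -type f -mtime -365 -print -quit)" ]; then
    echo "archivable: $d"
  fi
done
```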

yuvipanda commented 4 years ago

For data100, from a total of 5708 home directories, we'll have to keep 4368.

yuvipanda commented 4 years ago

I'll just update the table above with numbers for different hubs.

yuvipanda commented 4 years ago

412GB is 'other hubs', including 100GB just for the EECS hub. I filed https://github.com/berkeley-dsep-infra/datahub/issues/1635 to give EECS its own disk. I think we should probably have one other disk that takes everything else.

felder commented 4 years ago

@yuvipanda it'd be good to note what the actual compression ratio of the compressed files is.

yuvipanda commented 3 years ago

I'm picking this back up now. Surprisingly, I never put up the script I used to calculate these numbers last time; I've published it now - https://github.com/yuvipanda/homedir-archiver. I'm running it to find users to archive, where a user is archivable if they have not modified any file in the last 12 months, counting back from today. Some hubs are too new to have any user files older than 12 months, and I've just put a - for them.

I'm updating this table as I run the script. I initially ran it under ionice to not disrupt regular operations, but am now running it without ionice - iowait seems ok.

| Hub | Total size (GB) | Archivable (GB) | Savings % |
|---|---|---|---|
| datahub (+ r hub) | 15,216 | 4,094 | 27% |
| data100 | 7,903 | 2,277 | 28% |
| data102 | 238 | 51 | 21% |
| eecs | 2,073 | - | - |
| cs194 | 78 | - | - |
| stat159 | 13 | - | - |
| biology | 891 | - | - |
| dlab | 338 | - | - |
| prob140 | 205 | 138 | 67% |
| workshop | 34 | 25 | 71% |
| julia | 26 | - | - |
| total | 27,015 | 6,585 | 24% |
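
For reference, the ionice runs looked roughly like this - the homedir-archiver invocation below is hypothetical, not its real CLI:

```bash
# Class 3 ("idle") only gets disk time when nothing else wants it, so live NFS
# traffic takes priority. The script path and flags are assumptions.
ionice -c 3 python3 homedir-archiver/archive.py /export/homes --inactive-days 365
```
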
yuvipanda commented 3 years ago

My suggestion now is that we:

  1. Tar these up and store them in Google Cloud Storage (see the sketch at the end of this comment). We'll transition them to the archival tier once we're sure this works - for 6,585 GB it'll cost about $8 a month, and it should go down further with compression.
  2. Leave a note asking users to email a list (ds-infrastructure, probably?) if they need their files. I suspect this will be a pretty small group, since we're explicitly only doing this for users who have not logged in for one full year. If the number of requests increases, we can work on a more automated solution.

This will need to be done carefully to make sure we do not lose any data.

I also want to see what happens if I make the cutoff 'just after spring 2020' rather than '12 months before today' - I suspect we can save more space. The overall plan remains the same, though.

Once we've cleaned this up, we'll create newer, smaller disks, copy the files over, and point the NFS server at them.
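
A minimal sketch of step 1, assuming gsutil is configured; the bucket name and paths are hypothetical:

```bash
# Compress each inactive home directory and upload it to GCS.
BUCKET=gs://datahub-homedir-archive   # hypothetical bucket name
for d in /export/homes/inactive/*/; do
  user=$(basename "$d")
  tar -czf "/tmp/${user}.tar.gz" -C "$(dirname "$d")" "$user"
  # gsutil performs integrity checks (checksums) on uploads by default.
  gsutil cp "/tmp/${user}.tar.gz" "$BUCKET/${user}.tar.gz"
  rm "/tmp/${user}.tar.gz"
done
```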

felder commented 3 years ago

@yuvipanda we should consider what the overall retention policy for these will be, or perhaps use object lifecycle management on the bucket to move items to cheaper long-term storage.

yuvipanda commented 3 years ago

yeah, I think we should start with standard storage and move it to archival storage in a week or two.
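
If we go the lifecycle route @felder suggested, a sketch of the config - the bucket name is hypothetical:

```bash
# Transition objects to the ARCHIVE storage class after 14 days using
# GCS object lifecycle management.
cat > lifecycle.json <<'EOF'
{
  "rule": [
    {
      "action": {"type": "SetStorageClass", "storageClass": "ARCHIVE"},
      "condition": {"age": 14}
    }
  ]
}
EOF
gsutil lifecycle set lifecycle.json gs://datahub-homedir-archive
```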

I think my current plan would be:

  1. Tar and push inactive home directories to GCS, using MD5 checksums to ensure nothing is corrupted in transfer
  2. Do another round: tar again and just verify the MD5s to make sure everything made it over OK. Our tarballs need to be reproducible for this to work (see the sketch below).
  3. Take a disk snapshot for additional (temporary) protection against human error
  4. Delete the contents of all inactive home directories, alongside another pass of (2) to make sure we aren't deleting anything we don't have backups of
  5. Add a note to the home directories with instructions on how to ask for this data.
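
A sketch of the verification in (2), assuming GNU tar and a hypothetical bucket; the flags pin file ordering, timestamps and ownership so a second run produces a byte-identical tarball (the original upload has to be built with the same flags for the hashes to match):

```bash
# Build the tarball deterministically so re-running this yields identical bytes.
tar --sort=name --mtime='2021-07-01 00:00:00' --owner=0 --group=0 --numeric-owner \
    -cf - -C /export/homes/inactive "$user" | gzip -n > "${user}.tar.gz"

# Compare the local MD5 (gsutil reports base64) with the hash stored on the
# uploaded object. Bucket name is hypothetical.
gsutil hash -m "${user}.tar.gz"
gsutil stat "gs://datahub-homedir-archive/${user}.tar.gz"   # output includes "Hash (md5)"
```
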
yuvipanda commented 3 years ago

I tested out gzip, bzip2 and xz on the workshop hub's inactive home directories.

| Type | Original size | Compressed size | Savings % | Time taken |
|---|---|---|---|---|
| gzip | 25.02GB | 16.25GB | 35% | 27 minutes |
| xz | 25.02GB | 14.75GB | 41% | a lot longer (see below) |
| bzip2 | 25.02GB | 15.89GB | 37% | 142 minutes |

I didn't run the xz test under time, unfortunately, but it took many, many hours - at least 6. This was all done with each compressor's default compression level, invoked via tar.
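
For reference, the comparison can be reproduced roughly like this - the source path is an assumption:

```bash
# Time each compressor on the same input at its default level:
# -z is gzip, -j is bzip2, -J is xz.
SRC=/export/workshop/inactive
time tar -czf workshop.tar.gz  -C "$SRC" .
time tar -cjf workshop.tar.bz2 -C "$SRC" .
time tar -cJf workshop.tar.xz  -C "$SRC" .
ls -lh workshop.tar.*
```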

yuvipanda commented 3 years ago

I'm doing this fully now, so we have the disk space cleared before the start of the semester.

I've discovered we had an old set of home directories from other hubs in the datahub home directory setup - from before the time we had separate disks for each hub.

Disk space usage:

root@nfsserver-01:/home/yuvipanda# ls /export/datahubhomes-2020-07-29/homes/ | rg '^_' | xargs -L1 -I{} du -d0 -h /export/datahubhomes-2020-07-29/homes/{}

7.7G    /export/datahubhomes-2020-07-29/homes/_buds-2020
17M     /export/datahubhomes-2020-07-29/homes/_canvas-test
4.5G    /export/datahubhomes-2020-07-29/homes/_cogneuro
13G     /export/datahubhomes-2020-07-29/homes/_data100
78G     /export/datahubhomes-2020-07-29/homes/_data102
74G     /export/datahubhomes-2020-07-29/homes/_eecs
20G     /export/datahubhomes-2020-07-29/homes/_math124
171G    /export/datahubhomes-2020-07-29/homes/_prob140
2.4G    /export/datahubhomes-2020-07-29/homes/_shared
12K     /export/datahubhomes-2020-07-29/homes/_stat131a
7.1G    /export/datahubhomes-2020-07-29/homes/_stat89a
33G     /export/datahubhomes-2020-07-29/homes/_workshop

File update times:

/export/ischool-2021-07-01/old-hub-files/:
total 156
drwxr-xr-x   72 ubuntu ubuntu  4096 Jul 13  2020 _buds-2020
drwxr-xr-x    7 ubuntu ubuntu    71 Jul  9  2019 _canvas-test
drwxrwxr-x    5 ubuntu ubuntu    89 Sep 23  2017 _cogneuro
drwxr-xr-x    5 ubuntu ubuntu    52 Jan 28  2019 _data100
drwxr-xr-x  412 ubuntu ubuntu 12288 Jul 27  2020 _data102
drwxr-xr-x  227 ubuntu ubuntu  8192 Jul 30  2020 _eecs
drwxr-xr-x   52 ubuntu ubuntu  4096 May 18  2019 _math124
drwxr-xr-x 2330 ubuntu ubuntu 61440 Jul 29  2020 _prob140
drwxr-xr-x    6 ubuntu ubuntu    78 Mar  2 10:11 _shared
drwxr-xr-x    3 ubuntu ubuntu    18 Aug 19  2019 _stat131a
drwxr-xr-x  111 ubuntu ubuntu  4096 May 28  2020 _stat89a
drwxr-xr-x  308 ubuntu ubuntu 12288 Jul 28  2020 _workshop

I've temporarily moved them over to /export/ischool-2021-07-01/old-hub-files/ so the datahub archiver can proceed (and run faster). I believe this data was already copied over when we moved to using multiple disks in our NFS server. Let's verify that and then get rid of these.

yuvipanda commented 3 years ago

prob140 was just cleaned!

Active: 1803, Inactive: 1916, Inactive uncompressed size: 142.26 GB, Inactive compressed size: 94.44 GB

data100 and datahub are going to be the biggies