berkeley-dsep-infra / datahub

JupyterHubs for use by Berkeley enrolled students
https://docs.datahub.berkeley.edu
BSD 3-Clause "New" or "Revised" License

Implement data archival policy for home directories #1633

Open yuvipanda opened 4 years ago

yuvipanda commented 4 years ago

We theoretically archive home directories of users who haven't touched them in 12 months, based on this policy: https://docs.datahub.berkeley.edu/en/latest/users/storage-retention.html.

We don't do that yet, but we should! I want this to be a fully self-serve setup - so students should be able to acquire & unzip their home directories whenever they choose to, without having to bug us.

| Hub | Total homedirs | Archivable homedirs | % | Total size | Archivable size | % |
|---|---|---|---|---|---|---|
| datahub | 21926 | 6827 | 31% | 7.5T | 1.8T | 23% |
| data100 | 5708 | 1340 | 23% | - | - | - |
| prob140 | 2326 | 547 | 23% | - | - | - |

Other hubs are new enough that I don't think it's worth doing anything about them.

yuvipanda commented 4 years ago

I have a small script that tells me how many users' home directories can be archived.

We have a total of 21926 home directories in datahub homes, of which only 6827 meet the criteria for archival. With a 6-month cutoff it becomes 12465, but let's not do that yet.
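
For reference, a rough shell sketch of that check - the path is an assumption, and the real script is linked in a later comment:

```bash
# List home directories in which no file has been modified in the last 365 days.
# /export/homes is an assumed path, not necessarily the real mount point.
for d in /export/homes/*/; do
  # -mtime -365 matches files modified within the last 365 days;
  # -print -quit stops at the first match, so empty output means "archivable".
  if [ -z "$(find "$d" -type f -mtime -365 -print -quit)" ]; then
    echo "archivable: $d"
  fi
done
```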

yuvipanda commented 4 years ago

For data100, from a total of 5708 home directories, we'll have to keep 4368.

yuvipanda commented 4 years ago

I'll just update the table above with numbers for different hubs.

yuvipanda commented 4 years ago

412GB is 'other hubs', including 100GB just for the EECS hub. I filed https://github.com/berkeley-dsep-infra/datahub/issues/1635 to give EECS its own disk. I think we should probably have one other disk that takes everything else.

felder commented 4 years ago

@yuvipanda it'd be good to note what the actual compression ratio of the compressed files is.

yuvipanda commented 3 years ago

I'm picking this back up now. Surprisingly, I never put up the script I used to calculate these numbers last time; I've published it now - https://github.com/yuvipanda/homedir-archiver. I'm running it to find users to archive, where a user is archivable if they have not modified any file in the last 12 months, counting back from today. Some hubs are too new to have any user files older than 12 months, and I've just put a - for them.

I'm updating this table as I run the script. I initially ran it under ionice to not disrupt regular operations, but am now running it without ionice - iowait seems ok.

| Hub | Total size (GB) | Archivable (GB) | Savings % |
|---|---|---|---|
| datahub (+ r hub) | 15,216 | 4,094 | 27% |
| data100 | 7,903 | 2,277 | 28% |
| data102 | 238 | 51 | 21% |
| eecs | 2,073 | - | - |
| cs194 | 78 | - | - |
| stat159 | 13 | - | - |
| biology | 891 | - | - |
| dlab | 338 | - | - |
| prob140 | 205 | 138 | 67% |
| workshop | 34 | 25 | 71% |
| julia | 26 | - | - |
| total | 27,015 | 6,585 | 24% |
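
For reference, the ionice runs looked roughly like this - the homedir-archiver invocation below is hypothetical, not its real CLI:

```bash
# Class 3 ("idle") only gets disk time when nothing else wants it, so live NFS
# traffic takes priority. The script path and flags are assumptions.
ionice -c 3 python3 homedir-archiver/archive.py /export/homes --inactive-days 365
```
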
yuvipanda commented 3 years ago

My suggestion now is that we:

  1. Tar these up and store them in Google Cloud Storage (see the sketch at the end of this comment). We'll transition them to the archival tier once we're sure this works - for 6,585 GB it'll cost about $8 a month, and it should go down further with compression.
  2. Leave a note asking users to email a list (ds-infrastructure, probably?) if they need their files. I suspect this will be a pretty small group, since we're explicitly only doing this for users who have not logged in for one full year. If the number of requests increases, we can work on a more automated solution.

This will need to be done carefully to make sure we do not lose any data.

I also want to see what happens if I make the cutoff 'just after spring 2020' rather than '12 months before today' - I suspect we can save more space. The overall plan remains the same, though.

Once we've cleaned this up, we'll create newer, smaller disks, copy the files over, and point the NFS server at them.
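
A minimal sketch of step 1, assuming gsutil is configured; the bucket name and paths are hypothetical:

```bash
# Compress each inactive home directory and upload it to GCS.
BUCKET=gs://datahub-homedir-archive   # hypothetical bucket name
for d in /export/homes/inactive/*/; do
  user=$(basename "$d")
  tar -czf "/tmp/${user}.tar.gz" -C "$(dirname "$d")" "$user"
  # gsutil performs integrity checks (checksums) on uploads by default.
  gsutil cp "/tmp/${user}.tar.gz" "$BUCKET/${user}.tar.gz"
  rm "/tmp/${user}.tar.gz"
done
```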

felder commented 3 years ago

@yuvipanda we should consider what the overall retention policy for these will be, or perhaps use object lifecycle management on the bucket to move items to cheaper long-term storage.

yuvipanda commented 3 years ago

yeah, I think we should start with standard storage and move it to archival storage in a week or two.
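
If we go the lifecycle route @felder suggested, a sketch of the config - the bucket name is hypothetical:

```bash
# Transition objects to the ARCHIVE storage class after 14 days using
# GCS object lifecycle management.
cat > lifecycle.json <<'EOF'
{
  "rule": [
    {
      "action": {"type": "SetStorageClass", "storageClass": "ARCHIVE"},
      "condition": {"age": 14}
    }
  ]
}
EOF
gsutil lifecycle set lifecycle.json gs://datahub-homedir-archive
```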

I think my current plan would be:

  1. Tar and push inactive home directories to GCS, using MD5 checksums to ensure nothing is corrupted in transfer
  2. Do another round: tar again and just verify the MD5s to make sure everything made it over OK. Our tarballs need to be reproducible for this to work (see the sketch below).
  3. Take a disk snapshot for additional (temporary) protection against human error
  4. Delete the contents of all inactive home directories, alongside another pass of (2) to make sure we aren't deleting anything we don't have backups of
  5. Add a note to the home directories with instructions on how to ask for this data.
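
A sketch of the verification in (2), assuming GNU tar and a hypothetical bucket; the flags pin file ordering, timestamps and ownership so a second run produces a byte-identical tarball (the original upload has to be built with the same flags for the hashes to match):

```bash
# Build the tarball deterministically so re-running this yields identical bytes.
tar --sort=name --mtime='2021-07-01 00:00:00' --owner=0 --group=0 --numeric-owner \
    -cf - -C /export/homes/inactive "$user" | gzip -n > "${user}.tar.gz"

# Compare the local MD5 (gsutil reports base64) with the hash stored on the
# uploaded object. Bucket name is hypothetical.
gsutil hash -m "${user}.tar.gz"
gsutil stat "gs://datahub-homedir-archive/${user}.tar.gz"   # output includes "Hash (md5)"
```
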
yuvipanda commented 3 years ago

I tested out gzip, bzip2 and xz on the workshop hub's inactive home directories.

| Type | Original size | Compressed size | Savings % | Time taken |
|---|---|---|---|---|
| gzip | 25.02GB | 16.25GB | 35% | 27 minutes |
| xz | 25.02GB | 14.75GB | 41% | a lot longer (see below) |
| bzip2 | 25.02GB | 15.89GB | 37% | 142 minutes |

I didn't run the xz test under time, unfortunately, but it took many, many hours - at least 6. This was all done with each compressor's default compression level, invoked via tar.
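
For reference, the comparison can be reproduced roughly like this - the source path is an assumption:

```bash
# Time each compressor on the same input at its default level:
# -z is gzip, -j is bzip2, -J is xz.
SRC=/export/workshop/inactive
time tar -czf workshop.tar.gz  -C "$SRC" .
time tar -cjf workshop.tar.bz2 -C "$SRC" .
time tar -cJf workshop.tar.xz  -C "$SRC" .
ls -lh workshop.tar.*
```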

yuvipanda commented 3 years ago

I'm doing this fully now, so we have the disk space cleared before the start of the semester.

I've discovered we had an old set of home directories from other hubs in the datahub home directory setup - from before the time we had separate disks for each hub.

Disk space usage:

root@nfsserver-01:/home/yuvipanda# ls /export/datahubhomes-2020-07-29/homes/ | rg '^_' | xargs -L1 -I{} du -d0 -h /export/datahubhomes-2020-07-29/homes/{}

7.7G    /export/datahubhomes-2020-07-29/homes/_buds-2020
17M     /export/datahubhomes-2020-07-29/homes/_canvas-test
4.5G    /export/datahubhomes-2020-07-29/homes/_cogneuro
13G     /export/datahubhomes-2020-07-29/homes/_data100
78G     /export/datahubhomes-2020-07-29/homes/_data102
74G     /export/datahubhomes-2020-07-29/homes/_eecs
20G     /export/datahubhomes-2020-07-29/homes/_math124
171G    /export/datahubhomes-2020-07-29/homes/_prob140
2.4G    /export/datahubhomes-2020-07-29/homes/_shared
12K     /export/datahubhomes-2020-07-29/homes/_stat131a
7.1G    /export/datahubhomes-2020-07-29/homes/_stat89a
33G     /export/datahubhomes-2020-07-29/homes/_workshop

File update times:

/export/ischool-2021-07-01/old-hub-files/:
total 156
drwxr-xr-x   72 ubuntu ubuntu  4096 Jul 13  2020 _buds-2020
drwxr-xr-x    7 ubuntu ubuntu    71 Jul  9  2019 _canvas-test
drwxrwxr-x    5 ubuntu ubuntu    89 Sep 23  2017 _cogneuro
drwxr-xr-x    5 ubuntu ubuntu    52 Jan 28  2019 _data100
drwxr-xr-x  412 ubuntu ubuntu 12288 Jul 27  2020 _data102
drwxr-xr-x  227 ubuntu ubuntu  8192 Jul 30  2020 _eecs
drwxr-xr-x   52 ubuntu ubuntu  4096 May 18  2019 _math124
drwxr-xr-x 2330 ubuntu ubuntu 61440 Jul 29  2020 _prob140
drwxr-xr-x    6 ubuntu ubuntu    78 Mar  2 10:11 _shared
drwxr-xr-x    3 ubuntu ubuntu    18 Aug 19  2019 _stat131a
drwxr-xr-x  111 ubuntu ubuntu  4096 May 28  2020 _stat89a
drwxr-xr-x  308 ubuntu ubuntu 12288 Jul 28  2020 _workshop

I've temporarily moved them over to /export/ischool-2021-07-01/old-hub-files/ so the datahub archiver can proceed (and run faster). I believe this data was already copied over when we moved to using multiple disks in our NFS server. Let's verify that and then get rid of these.

yuvipanda commented 3 years ago

prob140 was just cleaned!

Active: 1803, Inactive: 1916, Inactive uncompressed size: 142.26 GB, Inactive compressed size: 94.44 GB

data100 and datahub are going to be the biggies