yuvipanda opened this issue 4 years ago
I have a small script that tells me how many users' home directories can be archived.
We have a total of 21926 home directories in datahub homes, of which only 6827 meet the criteria for archival. With a 6-month cutoff it becomes 12465, but let's not do that yet.
For data100, of a total of 5708 home directories, we'll have to keep 4368.
I'll just update the table above with numbers for different hubs.
412GB is 'other hubs', including 100GB just for the EECS hub. I filed https://github.com/berkeley-dsep-infra/datahub/issues/1635 to have EECS get its own disk. I think we should probably have one other disk that takes everything else.
@yuvipanda it'd be good to note what the actual compression ratio of the compressed files is.
I'm picking this back up now. Surprisingly, I never put up the script I used to calculate these numbers last time. I've put it up this time: https://github.com/yuvipanda/homedir-archiver. I'm running this to find users to archive, where a user is archivable if they have not modified any file in the last 12 months, counting back from today. Some hubs are too new to have any user files over 12 months old, and I've just put a `-` for them.
I'm updating this table as I run the script. I initially ran it under `ionice` so as not to disrupt regular operations, but am now running it without `ionice`; iowait seems OK.
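The check the script performs can be sketched as follows. This is not yuvipanda's actual homedir-archiver, just a minimal illustration of the criterion: a home directory is archivable if no regular file inside it has been modified in the last 365 days. The path is hypothetical.

```shell
#!/bin/bash
# A home directory counts as archivable if no regular file in it
# was modified within the last 365 days.
HOMES="${1:-/export/datahubhomes/homes}"   # hypothetical mount point
for dir in "$HOMES"/*/; do
    [ -d "$dir" ] || continue
    # -print -quit: stop at the first recently-modified file, if any
    if [ -z "$(find "$dir" -type f -mtime -365 -print -quit)" ]; then
        echo "archivable: $dir"
    fi
done
```

Running it under `ionice -c3` would give the same behavior as the initial low-priority runs described above.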
Hub | Total size (GB) | Archivable (GB) | Savings % |
---|---|---|---|
datahub (+ r hub) | 15,216 | 4,094 | 27% |
data100 | 7,903 | 2,277 | 28% |
data102 | 238 | 51 | 21% |
eecs | 2,073 | - | - |
cs194 | 78 | - | - |
stat159 | 13 | - | - |
biology | 891 | - | - |
dlab | 338 | - | - |
prob140 | 205 | 138 | 67% |
workshop | 34 | 25 | 71% |
julia | 26 | - | - |
total | 27,015 | 6,585 | 24% |
My suggestion now is that we:
This will need to be done carefully to make sure we do not lose any data.
I also want to see what happens if I make the cutoff to be 'just after spring 2020' rather than '12 months before today' - I suspect we can save more space. The overall plan remains the same though.
Once we've cleaned this up, we'll create newer, smaller disks, copy the files over, and point the NFS server to use them.
@yuvipanda we should consider what the overall retention policy for these will be, or perhaps consider object lifecycle management on the bucket to move items to cheaper long-term storage.
Yeah, I think we should start with standard storage and move things to archival storage in a week or two.
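Assuming the archives land in a GCS bucket (bucket name hypothetical), a lifecycle rule along these lines would implement the "standard first, archival in a week or two" idea automatically:

```shell
# Sketch: move objects to ARCHIVE storage class 14 days after creation.
cat > lifecycle.json <<'EOF'
{
  "rule": [
    {
      "action": {"type": "SetStorageClass", "storageClass": "ARCHIVE"},
      "condition": {"age": 14}
    }
  ]
}
EOF
# Applied with: gsutil lifecycle set lifecycle.json gs://<archive-bucket>
```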
I think my current plan would be:
I tested out gzip, bzip2 and xz on the workshop hub's inactive home directories.
Type | Original Size | Compressed Size | Savings % | Time taken |
---|---|---|---|---|
gzip | 25.02GB | 16.25GB | 35% | 27 minutes |
xz | 25.02GB | 14.75GB | 41% | A lot longer |
bzip2 | 25.02GB | 15.89GB | 37% | 142 minutes |
I didn't run the xz test under `time`, unfortunately, but it was many, many hours: at least 6. This was all done with the default compression levels of `tar`.
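The comparison above can be reproduced with a one-liner per compressor, at `tar`'s default levels. "example-user" is a stand-in home directory created here just so the sketch runs; the real test ran over the workshop hub's inactive home directories.

```shell
# Create a throwaway stand-in home directory for the demo.
mkdir -p example-user && head -c 1000000 /dev/urandom > example-user/data.bin

time tar -czf example-user.tar.gz  example-user   # gzip
time tar -cjf example-user.tar.bz2 example-user   # bzip2
time tar -cJf example-user.tar.xz  example-user   # xz

du -h example-user.tar.*   # compare compressed sizes
```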
I'm doing this now fully, so we have the clear disk space before the start of the semester.
I've discovered we had an old set of home directories from other hubs in the datahub home directory setup - from before the time we had separate disks for each hub.
Disk space usage:
```
root@nfsserver-01:/home/yuvipanda# ls /export/datahubhomes-2020-07-29/homes/ | rg '^_' | xargs -L1 -I{} du -d0 -h /export/datahubhomes-2020-07-29/homes/{}
7.7G /export/datahubhomes-2020-07-29/homes/_buds-2020
17M /export/datahubhomes-2020-07-29/homes/_canvas-test
4.5G /export/datahubhomes-2020-07-29/homes/_cogneuro
13G /export/datahubhomes-2020-07-29/homes/_data100
78G /export/datahubhomes-2020-07-29/homes/_data102
74G /export/datahubhomes-2020-07-29/homes/_eecs
20G /export/datahubhomes-2020-07-29/homes/_math124
171G /export/datahubhomes-2020-07-29/homes/_prob140
2.4G /export/datahubhomes-2020-07-29/homes/_shared
12K /export/datahubhomes-2020-07-29/homes/_stat131a
7.1G /export/datahubhomes-2020-07-29/homes/_stat89a
33G /export/datahubhomes-2020-07-29/homes/_workshop
```
File update times:
```
/export/ischool-2021-07-01/old-hub-files/:
total 156
drwxr-xr-x   72 ubuntu ubuntu  4096 Jul 13  2020 _buds-2020
drwxr-xr-x    7 ubuntu ubuntu    71 Jul  9  2019 _canvas-test
drwxrwxr-x    5 ubuntu ubuntu    89 Sep 23  2017 _cogneuro
drwxr-xr-x    5 ubuntu ubuntu    52 Jan 28  2019 _data100
drwxr-xr-x  412 ubuntu ubuntu 12288 Jul 27  2020 _data102
drwxr-xr-x  227 ubuntu ubuntu  8192 Jul 30  2020 _eecs
drwxr-xr-x   52 ubuntu ubuntu  4096 May 18  2019 _math124
drwxr-xr-x 2330 ubuntu ubuntu 61440 Jul 29  2020 _prob140
drwxr-xr-x    6 ubuntu ubuntu    78 Mar  2 10:11 _shared
drwxr-xr-x    3 ubuntu ubuntu    18 Aug 19  2019 _stat131a
drwxr-xr-x  111 ubuntu ubuntu  4096 May 28  2020 _stat89a
drwxr-xr-x  308 ubuntu ubuntu 12288 Jul 28  2020 _workshop
```
I've temporarily moved them over to /export/ischool-2021-07-01/old-hub-files/ so the datahub archiver can proceed (and be faster). I believe this data was already copied over when we moved to using multiple disks in our NFS server. Let's verify that and then get rid of these.
prob140 was just cleaned!
Active: 1803, Inactive: 1916, Inactive uncompressed size: 142.26 GB, Inactive compressed size: 94.44 GB
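As a quick sanity check, the savings implied by those prob140 numbers:

```shell
# Compression savings = 1 - (compressed / uncompressed), from the
# figures above (94.44 GB compressed from 142.26 GB).
awk 'BEGIN { printf "%.0f%%\n", (1 - 94.44/142.26) * 100 }'
# → 34%
```

That's roughly in line with the gzip result from the workshop hub test earlier in this thread.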
data100 and datahub are going to be the biggies.
We theoretically archive home directories of users who haven't touched them in 12 months, based on this policy: https://docs.datahub.berkeley.edu/en/latest/users/storage-retention.html.
We don't do that yet, but we should! I want this to be a fully self-serve setup - so students should be able to acquire & unzip their home directories whenever they choose to, without having to bug us.
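The self-serve round trip could be as simple as the sketch below. All names are hypothetical, and the upload/download steps against object storage (e.g. `gsutil cp gs://<archive-bucket>/username.tar.gz .`) are elided; the point is that a student only needs a standard `tar` to restore their files, with no admin involvement.

```shell
mkdir -p username && echo demo > username/notebook.ipynb  # stand-in home dir
tar -czf username.tar.gz username   # archiver side: compress before upload
rm -rf username                     # home dir removed once safely archived
tar -xzf username.tar.gz            # student side: restore after download
```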
Other hubs are new enough that I don't think it's worth doing anything about them.