berkeley-dsep-infra / datahub

JupyterHubs for use by Berkeley enrolled students
https://docs.datahub.berkeley.edu
BSD 3-Clause "New" or "Revised" License

Generate storage reports of users across all hubs! #3376

Closed · balajialg closed this issue 2 years ago

balajialg commented 2 years ago

Summary

@felder reported data about the amount of storage used across the different hubs, and also generated per-hub reports of the users who are storing large amounts of data. We found that more than 10 users were each storing more than 100 GB of data, which easily adds up to over 1 TB. I have documented a few scenarios where users stored more than 100 GB, including storing genomic data for research purposes and climate data for assignments. Interestingly, many of these users had not even logged into their hub instance during the past six months. I am not completely sure about the cost implications for us. However, it would be valuable to generate monthly reports of users across all hubs who use more than 100 GB of data (or whatever threshold is meaningful to us) so that we can do further due diligence.

Obviously, the long-term automation goal would be to send alerts to users and to admins about users whose storage exceeds a specific threshold on a particular hub and who have not logged into that hub for the past 6 months (as per our defined policy).
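A minimal sketch of what such a monthly report could look like, assuming home directories are laid out as `/export/home/<hub>/<user>` on the NFS server and that "inactive" can be approximated by file modification times; the paths, layout, and thresholds here are illustrative assumptions, not anything we have settled on:

```python
#!/usr/bin/env python3
"""Hypothetical sketch of the monthly storage report described above.

Assumptions (not confirmed in this issue): hub home directories live
under /export/home/<hub>/<user> on the NFS server, and "inactive" is
approximated by the newest mtime anywhere in the home directory.
"""
import os
import time
from pathlib import Path

THRESHOLD_BYTES = 100 * 1024**3     # 100 GiB, per the discussion above
INACTIVE_SECONDS = 180 * 24 * 3600  # roughly 6 months
HUB_ROOT = Path("/export/home")     # hypothetical mount point

def usage_and_last_mtime(home: Path):
    """Return (total bytes, newest mtime) for one home directory."""
    total, newest = 0, 0.0
    for root, _dirs, files in os.walk(home, onerror=lambda e: None):
        for name in files:
            try:
                st = os.lstat(os.path.join(root, name))
            except OSError:
                continue
            total += st.st_size
            newest = max(newest, st.st_mtime)
    return total, newest

def report():
    now = time.time()
    for hub in sorted(p for p in HUB_ROOT.iterdir() if p.is_dir()):
        for home in sorted(p for p in hub.iterdir() if p.is_dir()):
            size, last = usage_and_last_mtime(home)
            if size > THRESHOLD_BYTES and (now - last) > INACTIVE_SECONDS:
                print(f"{hub.name}\t{home.name}\t{size / 1024**3:.1f} GiB")

if __name__ == "__main__":
    report()
```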

User Stories

As a hub administrator, I want to identify, on a monthly basis, inactive users who are storing large amounts of data in their home directories.

Important information

Storage across different hubs!

```
  3.2 TiB [##########] /eecs
  2.2 TiB [#######   ] /genomics
  1.3 TiB [###       ] /stat159
  1.2 TiB [###       ] /data8
435.6 GiB [#         ] /data102
344.4 GiB [#         ] /dlab
 92.4 GiB [          ] /prob140
 84.8 GiB [          ] /cs194
 80.8 GiB [          ] /stat20
 57.6 GiB [          ] /astro
 33.8 GiB [          ] /julia
 20.7 GiB [          ] /buds-2020
  7.6 GiB [          ] /workshop
  6.7 GiB [          ] /stat89a
408.0 KiB [          ] /xfsconfig
  4.0 KiB [          ] -
  4.0 KiB [          ] wat @
  0.0   B [          ] highschool @
  0.0   B [          ] biology e
  0.0   B [          ] /test
```

Tasks to complete

yuvipanda commented 2 years ago

https://rawgit.com/zevv/duc/master/doc/duc.1.html seems helpful for stuff like this
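For reference, a minimal sketch of how duc could be driven from a scheduled job to refresh its index and dump per-hub usage; the export path is an assumption, and only the documented `duc index` and `duc ls` subcommands are used:

```python
#!/usr/bin/env python3
"""Hypothetical sketch: drive duc from a scheduled job to refresh its
index and print per-directory usage for the hub home directories.
The export path is an assumption; `duc index` and `duc ls` are
documented duc subcommands."""
import subprocess

EXPORT = "/export/home"  # hypothetical NFS export holding the per-hub dirs

# Build or refresh the duc database for the whole export.
subprocess.run(["duc", "index", EXPORT], check=True)

# List usage for the top-level (per-hub) directories under the export.
listing = subprocess.run(
    ["duc", "ls", EXPORT],
    check=True, capture_output=True, text=True,
)
print(listing.stdout)
```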

balajialg commented 2 years ago

Copy-pasting this here for continuity:

One thing I realized during my discussion with @felder is the perverse incentive created in the long run by defining the storage limit as 100 GB for every user. We may be allowing scope creep if we communicate to instructors that anything under 100 GB is a reasonable amount to store (yes, we can choose not to communicate this policy, which is a reasonable pathway).

What are our goals with regard to storage? Allow our users to run computationally intensive workflows appropriate to their needs, while spending as little as possible on unused storage. Obviously, the primary action point is to address the outliers who are storing enormous amounts of coursework- or research-related data without accessing it regularly. That action alone would free up almost 2 TB of unused storage, which could save us roughly $2,000 - $2,500 per year. (We can debate whether these savings are worthwhile for the project if the cost of the staff hours invested in optimizing exceeds them.) So, in theory, this seems like a reasonable short-term action with good returns.

However, thinking about storage from a long-term perspective, setting 100 GB as the limit may not be reasonable, since storage needs vary with the nature of the course, the datasets used, the type of use case, etc., which makes it hard to generalize across hubs. Hypothetically, 100 GB could be a reasonable limit for genomics, while 5 GB could be the reasonable limit for a political science course. Another important problem it creates is edge cases. For example, if a user's home directory is at 97 GB, does our policy apply? Do we still consider their storage over the limit and include them under the storage limit policy? Should we reduce the limit further? This starts to feel like a Nash equilibrium problem.

One suggestion is to define a storage policy that is somewhat dynamic, for both per-course hubs (Data 8, Data 100, potentially the I School hub, etc.) and generic hubs (DataHub, R hub, Julia hub, etc.). Open to debate whether the arbitrary numbers defined below make sense.

At the per-course hub level, which may include computationally complex use cases, we could compute the median size of all user home directories on that hub and check whether each user's storage is below 3x that median (the multiplier could also be informed by interaction with faculty). At the generic hub level, which is expected to consist of foundational use cases, we could set the limit so that users' home directories must be smaller than 2x the median size of all home directories on that hub. Let me know what you all think about the above policy suggestions. Obviously, we can debate whether we need to optimize at this level for this policy, and whether the effort involved in defining it is worth the cost savings.
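A rough sketch of how the median-based check could work on top of per-user sizes collected by a report like the one above; the hub categories, multipliers, and example numbers are just the assumptions from this comment, used for illustration:

```python
#!/usr/bin/env python3
"""Hypothetical sketch of the median-based limits proposed above:
3x the hub median for per-course hubs, 2x for generic hubs."""
from statistics import median

# Hypothetical input: {user: bytes used}, e.g. from the report script.
MULTIPLIER = {"per-course": 3, "generic": 2}

def over_limit(sizes: dict[str, int], hub_kind: str) -> dict[str, int]:
    """Return users whose home directory exceeds the hub's dynamic limit."""
    limit = MULTIPLIER[hub_kind] * median(sizes.values())
    return {user: size for user, size in sizes.items() if size > limit}

# Example with made-up users and numbers (GiB expressed in bytes):
GiB = 1024**3
stat159 = {"alice": 2 * GiB, "bob": 3 * GiB, "carol": 40 * GiB}
print(over_limit(stat159, "per-course"))  # carol: 40 GiB > 3 * 3 GiB median
```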

balajialg commented 2 years ago

@felder already collects storage data across multiple hubs and visualizes it using the duc command on the NFS server. We should create a process around reviewing this report monthly :)

Closing this issue considering the progress made :)