asmacdo opened this issue 4 months ago
dendro_compute_resource/jobs/9f26d4e2.6a1f740e/tmp/working/recording.dat: 0.106723
Thanks Austin. Once we determine our policy, let's also update the DANDI Handbook to add this information.
After chatting with @yarikoptic, we probably should not be too ambitious with a cron job.
All above should be run only prior to a migration (or kicked off manually if we want to reclaim some space).
There may still be a good use for a cron job: cleaning up files older than X in a provided scratch dir.
I think we should indeed not clean up anything automatically, especially since we are not tracking "access time" but only "modification time" on files - we cannot judge whether anything is still in use or not.
What the cron job should do, per user:

- if we do not have up-to-date "statistics" for the user:
  - `du` of the entire /home/{user} and /shared/{user}
  - `__pycache__` and nwb-cache folders and pip cache with mtime > 30? days -- total sizes and a list of them, per each user
- if > 60 days (so didn't login/cleanup)
- if > 70 days (so didn't login/cleanup even after notification)

Something like that? (a sketch of the stats-collection step follows this list)
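For concreteness, here is a minimal Python sketch of the stats-collection step. It only reports sizes and deletes nothing; the /home/{user} and /shared/{user} roots, the __pycache__ / nwb-cache / pip cache names, and the 30-day threshold are taken from the list above, and anything else (function names, output format) is purely illustrative.

```python
import os
import time
from pathlib import Path

STALE_SECONDS = 30 * 86400  # the "mtime > 30? days" threshold from the list above


def dir_size(path: Path) -> int:
    """Total size in bytes of regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += (Path(root) / name).lstat().st_size
            except OSError:
                pass  # unreadable or vanished file; skip it
    return total


def stale_cache_dirs(root: Path):
    """Yield (path, size) for __pycache__ / nwb-cache dirs untouched for 30+ days."""
    cutoff = time.time() - STALE_SECONDS
    for dirpath, dirnames, _files in os.walk(root):
        for name in list(dirnames):
            if name in ("__pycache__", "nwb-cache"):
                cache = Path(dirpath) / name
                if cache.stat().st_mtime < cutoff:
                    yield cache, dir_size(cache)
                dirnames.remove(name)  # no need to descend into the cache itself


def report(user: str) -> None:
    """Print total usage and stale caches for one user (report only, no removal)."""
    for base in (Path("/home") / user, Path("/shared") / user):
        if not base.exists():
            continue
        print(f"{base}: {dir_size(base) / 1e9:.2f} GB total")
        for cache, size in stale_cache_dirs(base):
            print(f"  stale cache {cache}: {size / 1e6:.1f} MB")
    pip_cache = Path("/home") / user / ".cache" / "pip"
    if pip_cache.exists():
        print(f"  pip cache {pip_cache}: {dir_size(pip_cache) / 1e6:.1f} MB")
```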
please prepare a design doc @asmacdo with the above as a PR so we can iron out the behavior and then add it to the ToS etc.
Awesome, thanks @yarikoptic
> check when user last used/logged in to hub
This information is kept, but it's tracked by JupyterHub itself. I'll look into connecting to the REST API directly: https://jupyterhub.readthedocs.io/en/stable/reference/rest-api.html#operation/get-users
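As a rough sketch of what that could look like (the hub URL is a placeholder and an admin API token is assumed to be in the JUPYTERHUB_API_TOKEN environment variable; the GET /hub/api/users endpoint and the last_activity field are from the linked REST API docs):

```python
import os

import requests

HUB_URL = "https://hub.example.org"  # placeholder; use the actual hub URL
token = os.environ["JUPYTERHUB_API_TOKEN"]  # an admin-scoped API token

# GET /hub/api/users returns one user model per user, including last_activity.
resp = requests.get(
    f"{HUB_URL}/hub/api/users",
    headers={"Authorization": f"token {token}"},
    timeout=30,
)
resp.raise_for_status()

for user in resp.json():
    print(user["name"], user.get("last_activity"))
```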
IMO the choice of which files are safe to remove should come from the science side; it's hard to guess what is safe to remove.
Here's an initial list:
Let's also provide each user with a scratch directory that cleans up files more than 30 days old.
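A minimal sketch of such a scratch cleanup, suitable for a cron job, is below. The /scratch/some-user path is an assumption (the thread only says "a provided scratch dir"), the 30-day cutoff is the number mentioned above, and it runs in dry-run mode by default so nothing is removed until that flag is flipped.

```python
import os
import time
from pathlib import Path

CUTOFF = time.time() - 30 * 86400  # files untouched for 30+ days


def clean_scratch(scratch: Path, dry_run: bool = True) -> None:
    """Delete files older than the cutoff under scratch, then prune empty dirs."""
    for root, dirs, files in os.walk(scratch, topdown=False):
        for name in files:
            path = Path(root) / name
            try:
                if path.lstat().st_mtime < CUTOFF:
                    print(f"removing {path}")
                    if not dry_run:
                        path.unlink()
            except OSError:
                pass  # already gone or unreadable; skip it
        for name in dirs:
            d = Path(root) / name
            if not dry_run and not any(d.iterdir()):
                d.rmdir()  # prune directories left empty by the deletions


if __name__ == "__main__":
    # /scratch/some-user is hypothetical; point this at the real scratch dir.
    clean_scratch(Path("/scratch/some-user"), dry_run=True)
```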
The Plan (by @yarikoptic and @asmacdo)
Sample scripts from ChatGPT to collect and analyze stats are available at https://chatgpt.com/share/6732630c-4e54-8002-bf09-41df8175b6d0.