dandi / dandi-hub

Infrastructure and code for the dandihub
https://hub.dandiarchive.org

User data quota cron job #177

Open asmacdo opened 4 months ago

asmacdo commented 4 months ago

IMO the choice of which files are safe to remove should come from the science side; it's hard to guess what is safe to remove.

Here's an initial list:

Let's also provide each user with a scratch directory that automatically removes files more than 30 days old.

The Plan (by @yarikoptic and @asmacdo )

Sample scripts from ChatGPT to collect and analyze stats are available at https://chatgpt.com/share/6732630c-4e54-8002-bf09-41df8175b6d0
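For illustration only (this is not the linked script), a minimal sketch of the kind of per-directory stats such a script might collect could look like the following. The function name `collect_stats` and the choice of fields (total bytes, file count, newest modification time) are assumptions, not something decided in this thread.

```python
import os


def collect_stats(root):
    """Walk `root` and summarize its regular files and symlinks.

    Hypothetical helper: returns total size in bytes, number of files,
    and the newest modification time (seconds since the epoch, or None
    if the tree contains no files).
    """
    total_bytes = 0
    file_count = 0
    newest_mtime = None
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat so that symlinks are measured, not followed
                st = os.lstat(path)
            except OSError:
                # File vanished or is unreadable; skip it
                continue
            total_bytes += st.st_size
            file_count += 1
            if newest_mtime is None or st.st_mtime > newest_mtime:
                newest_mtime = st.st_mtime
    return {
        "total_bytes": total_bytes,
        "file_count": file_count,
        "newest_mtime": newest_mtime,
    }
```

Running it once per home directory would give a rough picture of who is using how much space, without deleting anything.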

kabilar commented 4 months ago

Thanks Austin. Once we determine our policy, let's also update the DANDI Handbook to add this information.

asmacdo commented 4 months ago

After chatting with @yarikoptic, we probably should not be too ambitious with a cron job.

All of the above should be run only prior to a migration (or kicked off manually if we want to reclaim some space).

There may still be a good use for a cron job: cleaning up files older than X in a provided scratch dir.
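A rough sketch of what such a scratch cleanup could look like, assuming age is judged by modification time and that X defaults to the 30 days mentioned earlier in this thread. The names `find_stale_files` and `clean_scratch` are hypothetical, and the sketch defaults to a dry run so nothing is deleted unless explicitly requested.

```python
import os
import time


def find_stale_files(scratch_dir, max_age_days=30, now=None):
    """Return sorted paths under `scratch_dir` whose mtime is older than the cutoff."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    for dirpath, _dirnames, filenames in os.walk(scratch_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.lstat(path).st_mtime < cutoff:
                    stale.append(path)
            except OSError:
                # File disappeared between listing and stat; ignore it
                continue
    return sorted(stale)


def clean_scratch(scratch_dir, max_age_days=30, dry_run=True, now=None):
    """Remove stale files (only when dry_run is False) and return their paths."""
    stale = find_stale_files(scratch_dir, max_age_days, now)
    if not dry_run:
        for path in stale:
            os.remove(path)
    return stale
```

A cron entry could call `clean_scratch(path, dry_run=False)` for each user's scratch directory. Empty directories are left in place in this sketch.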

yarikoptic commented 4 months ago

I think we should indeed not clean up anything automatically, especially since we are not tracking "access time" but only "modification time" on files, so we can't judge whether anything is still in use or not.

What the cron job should do, per user:

Something like that?

Please prepare a design doc, @asmacdo, with the above as a PR so we can iron out the behavior and then add it to the ToS etc.

asmacdo commented 4 months ago

Awesome, thanks @yarikoptic

asmacdo commented 4 months ago

Re: "check when user last used/logged in to hub"

This information is kept, but it's tracked by JupyterHub itself. I'll look into querying the REST API directly: https://jupyterhub.readthedocs.io/en/stable/reference/rest-api.html#operation/get-users
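As a starting point, here is a minimal sketch of reading `last_activity` from the users endpoint documented at the link above. It assumes an API token with permission to list users and that the hub API is served under `/hub/api` (the usual JupyterHub layout, not verified for this deployment). The helper names are hypothetical, and newer JupyterHub versions may paginate this endpoint, which the sketch does not handle.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def fetch_users(hub_api_url, token):
    """GET <hub_api_url>/users and return the decoded JSON list of user models."""
    req = urllib.request.Request(
        hub_api_url.rstrip("/") + "/users",
        headers={"Authorization": f"token {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def parse_last_activity(value):
    """Parse an ISO 8601 timestamp such as '2024-01-01T00:00:00.000000Z'."""
    if value is None:
        return None
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Assume UTC when the timestamp carries no offset
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def inactive_users(users, days, now=None):
    """Return names of users with no recorded activity in the last `days` days."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    names = []
    for user in users:
        last = parse_last_activity(user.get("last_activity"))
        if last is None or last < cutoff:
            names.append(user["name"])
    return names
```

Something like `inactive_users(fetch_users("https://hub.dandiarchive.org/hub/api", token), days=90)` would then list candidates for a notification, where the 90-day threshold is only a placeholder.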