asmacdo opened this issue 4 months ago
dendro_compute_resource/jobs/9f26d4e2.6a1f740e/tmp/working/recording.dat: 0.106723
Thanks Austin. Once we determine our policy, let's also update the DANDI Handbook to add this information.
After chatting with @yarikoptic, we probably should not be too ambitious with a cron job.
All above should be run only prior to a migration (or kicked off manually if we want to reclaim some space).
There may still be a good use for a cron job: cleaning up files older than X in a provided scratch dir.
I think we should indeed not clean up anything automatically, especially since we are not tracking "access time" but only "modification time" on files - we cannot judge whether anything is still in use or not.
What the cron job should do, per user:

- if we do not have up-to-date "statistics" for the user:
  - `du` of the entire /home/{user} and /shared/{user}
  - `__pycache__` and nwb-cache folders and pip cache with mtime > 30? days -- total sizes and a list of them, per each user
- if > 60 days (so didn't login/cleanup)
- if > 70 days (so didn't login/cleanup even after notification)

Something like that? (a sketch of the stats-collection step follows this list)
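For concreteness, here is a minimal Python sketch of the stats-collection step. It only reports sizes and deletes nothing; the /home/{user} and /shared/{user} roots, the __pycache__ / nwb-cache / pip cache names, and the 30-day threshold are taken from the list above, and anything else (function names, output format) is purely illustrative.

```python
import os
import time
from pathlib import Path

STALE_SECONDS = 30 * 86400  # the "mtime > 30? days" threshold from the list above


def dir_size(path: Path) -> int:
    """Total size in bytes of regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += (Path(root) / name).lstat().st_size
            except OSError:
                pass  # unreadable or vanished file; skip it
    return total


def stale_cache_dirs(root: Path):
    """Yield (path, size) for __pycache__ / nwb-cache dirs untouched for 30+ days."""
    cutoff = time.time() - STALE_SECONDS
    for dirpath, dirnames, _files in os.walk(root):
        for name in list(dirnames):
            if name in ("__pycache__", "nwb-cache"):
                cache = Path(dirpath) / name
                if cache.stat().st_mtime < cutoff:
                    yield cache, dir_size(cache)
                dirnames.remove(name)  # no need to descend into the cache itself


def report(user: str) -> None:
    """Print total usage and stale caches for one user (report only, no removal)."""
    for base in (Path("/home") / user, Path("/shared") / user):
        if not base.exists():
            continue
        print(f"{base}: {dir_size(base) / 1e9:.2f} GB total")
        for cache, size in stale_cache_dirs(base):
            print(f"  stale cache {cache}: {size / 1e6:.1f} MB")
    pip_cache = Path("/home") / user / ".cache" / "pip"
    if pip_cache.exists():
        print(f"  pip cache {pip_cache}: {dir_size(pip_cache) / 1e6:.1f} MB")
```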
please prepare a design doc @asmacdo with the above as a PR so we can iron out the behavior and then add it to the ToS etc.
Awesome, thanks @yarikoptic
> check when user last used/logged in to hub
This information is kept, but it's tracked by JupyterHub itself. I'll look into connecting to the REST API directly: https://jupyterhub.readthedocs.io/en/stable/reference/rest-api.html#operation/get-users
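As a rough sketch of what that could look like (the hub URL is a placeholder and an admin API token is assumed to be in the JUPYTERHUB_API_TOKEN environment variable; the GET /hub/api/users endpoint and the last_activity field are from the linked REST API docs):

```python
import os

import requests

HUB_URL = "https://hub.example.org"  # placeholder; use the actual hub URL
token = os.environ["JUPYTERHUB_API_TOKEN"]  # an admin-scoped API token

# GET /hub/api/users returns one user model per user, including last_activity.
resp = requests.get(
    f"{HUB_URL}/hub/api/users",
    headers={"Authorization": f"token {token}"},
    timeout=30,
)
resp.raise_for_status()

for user in resp.json():
    print(user["name"], user.get("last_activity"))
```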
IMO the choice of which files are safe to remove should come from the science side; it's hard to guess what is safe to remove.
Here's an initial list:
Let's also provide each user with a scratch directory that cleans up files more than 30 days old.
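A minimal sketch of such a scratch cleanup, suitable for a cron job, is below. The /scratch/some-user path is an assumption (the thread only says "a provided scratch dir"), the 30-day cutoff is the number mentioned above, and it runs in dry-run mode by default so nothing is removed until that flag is flipped.

```python
import os
import time
from pathlib import Path

CUTOFF = time.time() - 30 * 86400  # files untouched for 30+ days


def clean_scratch(scratch: Path, dry_run: bool = True) -> None:
    """Delete files older than the cutoff under scratch, then prune empty dirs."""
    for root, dirs, files in os.walk(scratch, topdown=False):
        for name in files:
            path = Path(root) / name
            try:
                if path.lstat().st_mtime < CUTOFF:
                    print(f"removing {path}")
                    if not dry_run:
                        path.unlink()
            except OSError:
                pass  # already gone or unreadable; skip it
        for name in dirs:
            d = Path(root) / name
            if not dry_run and not any(d.iterdir()):
                d.rmdir()  # prune directories left empty by the deletions


if __name__ == "__main__":
    # /scratch/some-user is hypothetical; point this at the real scratch dir.
    clean_scratch(Path("/scratch/some-user"), dry_run=True)
```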
The Plan (by @yarikoptic and @asmacdo)
Sample scripts from ChatGPT to collect and analyze stats are available at https://chatgpt.com/share/6732630c-4e54-8002-bf09-41df8175b6d0.