2i2c-org / docs

Documentation for 2i2c community JupyterHubs.
https://docs.2i2c.org

Document the usage of temp rather than $HOME for keeping temporary data files #218

Open jnywong opened 8 months ago

jnywong commented 8 months ago

Message from Yuvi on Slack:

... The issue here is that writing data to $HOME is very slow, and it is also shared across all users. The cell that got stuck was trying to write ~1GB of data to $HOME, and when multiplied across the 100+ users, it turned everything super slow! This is one of the reasons 'cloud native' workflows that work directly with object storage are faster, because they don't have to touch possibly slow local disks. $HOME is designed to store code, rather than data.

The solution here is to use the temporary directory to keep temporary data files. These are reset each time the user server restarts, and are also much faster to write. Plus they are not shared across all users. This also works across local machines and any cloud providers. Python's tempfile standard library module is probably very helpful here.

So the upshot here is: don't use $HOME to store data. Data kept in $HOME doesn't get cleaned up, and will cost money more or less indefinitely into the future. Plus, it leads to issues when doing workshops. Use tempfile if you need to download data locally.
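As a rough illustration of the tempfile suggestion, here is a minimal sketch that writes scratch data into a temporary directory instead of $HOME. The function name write_scratch_data and the file name scratch.bin are hypothetical, made up for this example; only the tempfile and pathlib calls come from the Python standard library.

```python
import tempfile
from pathlib import Path


def write_scratch_data(payload: bytes, filename: str = "scratch.bin") -> int:
    """Write payload into a fresh temporary directory and return its size on disk.

    Hypothetical helper for illustration only.
    """
    # TemporaryDirectory creates a new directory under tempfile.gettempdir()
    # and removes it, with its contents, when the with-block exits.
    with tempfile.TemporaryDirectory() as tmpdir:
        path = Path(tmpdir) / filename
        path.write_bytes(payload)
        size = path.stat().st_size
    return size


if __name__ == "__main__":
    # Write 1 MiB of zeros as stand-in data.
    print(write_scratch_data(b"\0" * 1024 * 1024))
```

If the data needs to outlive a single function call but not the server session, tempfile.mkdtemp() may be a better fit, since it leaves cleanup to the caller (or to the server restart).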

I hope this was helpful! I think it'll also be helpful for this to be set up in some sort of outside documentation, but not sure where.

yuvipanda commented 8 months ago

An additional point here is that while $HOME usually has a lot of space, /tmp does not. As an example, if I run the command that prints out disk usage (df -h) in a terminal on the openscapes hub, I get:

(notebook) jovyan@jupyter-yuvipanda:~$ df -h
Filesystem                                               Size  Used Avail Use% Mounted on
overlay                                                   80G   24G   57G  30% /
tmpfs                                                     64M     0   64M   0% /dev
tmpfs                                                     16G     0   16G   0% /sys/fs/cgroup
fs-b25253b5.efs.us-west-2.amazonaws.com:/prod/yuvipanda  8.0E  3.2T  8.0E   1% /home/jovyan
shm                                                       64M     0   64M   0% /dev/shm
/dev/nvme0n1p1                                            80G   24G   57G  30% /etc/hosts
fs-b25253b5.efs.us-west-2.amazonaws.com:/prod/_shared    8.0E  3.2T  8.0E   1% /home/jovyan/shared-readwrite
fs-b25253b5.efs.us-west-2.amazonaws.com:/prod/_shared    8.0E  3.2T  8.0E   1% /home/rstudio/shared
fs-b25253b5.efs.us-west-2.amazonaws.com:/prod/_shared    8.0E  3.2T  8.0E   1% /home/rstudio/shared-readwrite
fs-b25253b5.efs.us-west-2.amazonaws.com:/prod/_shared    8.0E  3.2T  8.0E   1% /home/jovyan/shared
tmpfs                                                     30G   16K   30G   1% /mnt/ghsa-w3vc-fx9p-wp4v/check-patch-run
tmpfs                                                     30G   12K   30G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs                                                     30G  4.0K   30G   1% /run/secrets/eks.amazonaws.com/serviceaccount
tmpfs                                                     16G     0   16G   0% /proc/acpi
tmpfs                                                     16G     0   16G   0% /sys/firmware

There's a bunch of extra stuff here (including stuff that says tmpfs, but you can ignore that), but the two lines of primary use to us are:

Filesystem                                               Size  Used Avail Use% Mounted on
overlay                                                   80G   24G   57G  30% /
fs-b25253b5.efs.us-west-2.amazonaws.com:/prod/yuvipanda  8.0E  3.2T  8.0E   1% /home/jovyan

We can tell that the temporary directory is under / because of the following Python code:

>>> import tempfile
>>> tempfile.gettempdir()
'/tmp'
>>> 

And looking at the list of various mountpoints in the df -h output, you can see that / is the one that /tmp is under. If there were a specific mount for /tmp, it would show up there!
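The same check can probably be done without reading the df -h output by eye. The sketch below walks up from the temporary directory until it reaches a mount point; find_mount_point is a hypothetical helper name, and the result depends on how a given hub is set up.

```python
import os
import tempfile


def find_mount_point(path: str) -> str:
    """Return the nearest mount point at or above path.

    Hypothetical helper for illustration only.
    """
    path = os.path.realpath(path)
    # Walk up one directory at a time; the filesystem root is always a
    # mount point, so the loop terminates.
    while not os.path.ismount(path):
        path = os.path.dirname(path)
    return path


if __name__ == "__main__":
    print(find_mount_point(tempfile.gettempdir()))
```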

Anyway, the thing to note here is that there is 57G available for all users on a single node when using /tmp. You can't see other people's /tmp, but it does take up space! And which users end up on the same node is not up to the users, so it is possible for one user to use up all the space available for /tmp on a particular node.
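Because that space is shared, it may be worth checking how much is free before writing a large file. This is a minimal sketch using shutil.disk_usage from the standard library; the helper name has_free_space is an assumption for illustration, and the check is only advisory since other users on the node can fill the disk in the meantime.

```python
import shutil
import tempfile
from typing import Optional


def has_free_space(required_bytes: int, path: Optional[str] = None) -> bool:
    """Report whether the filesystem holding path has at least required_bytes free.

    Hypothetical helper for illustration only; defaults to the temporary directory.
    """
    target = path or tempfile.gettempdir()
    # disk_usage returns a named tuple of (total, used, free) in bytes.
    return shutil.disk_usage(target).free >= required_bytes


if __name__ == "__main__":
    one_gb = 1024 ** 3
    if has_free_space(one_gb):
        print("Enough room in the temporary directory for ~1GB")
    else:
        print("Not enough free space in the temporary directory")
```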

yuvipanda commented 8 months ago

I've discussed some potential solutions in https://github.com/2i2c-org/infrastructure/issues/3833, but don't believe we should put more effort into them at this point.