SEL-Columbia / modelrunner

Framework for running models as long-running jobs via the web

ModelRunner node instances out of disk space #110

Closed chrisnatali closed 6 years ago

chrisnatali commented 6 years ago

Noticed that modelrunner nodes were low or out of disk space, causing jobs to fail:

$ df -H
Filesystem      Size  Used Avail Use% Mounted on
udev            4.2G     0  4.2G   0% /dev
tmpfs           838M   85M  753M  11% /run
/dev/xvda1      8.3G  8.3G     0 100% /
tmpfs           4.2G     0  4.2G   0% /dev/shm
tmpfs           5.3M     0  5.3M   0% /run/lock
tmpfs           4.2G     0  4.2G   0% /sys/fs/cgroup

Where /dev/xvda1 is the main volume for modelrunner code and data.

When I set these up, it seemed that 8G of storage would be sufficient.

Turns out that the conda packages required for the models being run take up a lot of space:

$ sudo du -h -d1 /home/mr/miniconda
308K    /home/mr/miniconda/conda-meta
4.4M    /home/mr/miniconda/bin
364K    /home/mr/miniconda/ssl
376K    /home/mr/miniconda/share
3.8M    /home/mr/miniconda/include
103M    /home/mr/miniconda/lib
4.0G    /home/mr/miniconda/pkgs
12K /home/mr/miniconda/etc
316M    /home/mr/miniconda/envs
4.4G    /home/mr/miniconda

cc @ingenieroariel @edwinadkinsdev
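Since `pkgs/` accounts for 4.0G of the 4.4G miniconda footprint above, one mitigation is to clear conda's package cache. A sketch (assumes the miniconda install at `/home/mr/miniconda` shown above; `conda clean` removes cached tarballs, index caches, and packages not linked into any environment, leaving `envs/` intact):

```shell
# Preview what conda would delete, without removing anything
conda clean --all --dry-run

# Actually remove cached tarballs, index caches, and unused packages
conda clean --all --yes

# Re-check the footprint
du -h -d1 /home/mr/miniconda
```

This would not help if the packages in `pkgs/` are still linked into active environments, but it is worth checking before resizing volumes.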

chrisnatali commented 6 years ago

Upping the EBS volume size to 16G for now:

$ df -H
Filesystem      Size  Used Avail Use% Mounted on
udev            4.2G     0  4.2G   0% /dev
tmpfs           838M   89M  749M  11% /run
/dev/xvda1       17G  8.3G  8.4G  50% /
tmpfs           4.2G     0  4.2G   0% /dev/shm
tmpfs           5.3M     0  5.3M   0% /run/lock
tmpfs           4.2G     0  4.2G   0% /sys/fs/cgroup

Note: For AWS, this required resizing the EBS volume and then extending the partition and filesystem as described here
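For reference, the extend step roughly amounts to the following sketch, run on the instance after enlarging the EBS volume in the AWS console or CLI. The device name `/dev/xvda` and partition number 1 match the `df` output above; the availability of `growpart` and an ext-family filesystem are assumptions:

```shell
# Grow partition 1 of /dev/xvda to fill the enlarged volume
# (growpart ships in the cloud-guest-utils / cloud-utils-growpart package)
sudo growpart /dev/xvda 1

# Grow the filesystem to fill the partition
# (assumes ext2/3/4; an XFS root would use xfs_growfs / instead)
sudo resize2fs /dev/xvda1

# Confirm the new size
df -H /
```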

ingenieroariel commented 6 years ago

Was this on the master or the worker nodes? (Our setup has a small sized master too but we have more flexibility on the workers)

chrisnatali commented 6 years ago

In the modelrunner.io configuration, I have the workers set up with much smaller permanent storage. They were set up too small in the case above because I had not considered how much space the conda packages required.

The primary server has a much larger permanent storage (80G) since all of the model inputs and outputs are retained on the primary. The idea is that workers may come and go (along with their storage), but the history of model runs and data is retained on the primary server for much longer.

That said, the whole system is only meant to act as a "runner" of models and users should not expect their data to be available forever. They should manage the inputs and outputs themselves.