Closed: chrisnatali closed this issue 6 years ago
Upping the EBS volume size to 16G for now:
$ df -H
Filesystem      Size  Used  Avail  Use%  Mounted on
udev            4.2G     0   4.2G    0%  /dev
tmpfs           838M   89M   749M   11%  /run
/dev/xvda1       17G  8.3G   8.4G   50%  /
tmpfs           4.2G     0   4.2G    0%  /dev/shm
tmpfs           5.3M     0   5.3M    0%  /run/lock
tmpfs           4.2G     0   4.2G    0%  /sys/fs/cgroup
Note: For AWS, this required resizing the EBS volume and then extending the partition and filesystem as described here
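A sketch of those resize steps, assuming the root device is /dev/xvda with partition 1 (as in the df output above) and an ext4 filesystem; the device names and volume ID are placeholders to adjust for the actual host:

```shell
# Step 1 (from a machine with AWS credentials): enlarge the EBS volume.
#   aws ec2 modify-volume --volume-id <vol-id> --size 16
# Steps 2-3 (on the instance, as root): grow the partition, then the filesystem.
if [ -b /dev/xvda1 ]; then
    growpart /dev/xvda 1    # growpart comes from the cloud-utils / cloud-guest-utils package
    resize2fs /dev/xvda1    # ext4 assumed; an XFS root would need xfs_growfs instead
    df -H /                 # confirm the new size
else
    echo "adjust device names for this host"
fi
```

No reboot is needed; both growpart and resize2fs work on a mounted root filesystem.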
Was this on the master or the worker nodes? (Our setup has a small master too, but we have more flexibility on the workers.)
In the modelrunner.io configuration, I have the workers set up with much smaller permanent storage. They were set up too small in the case above because I had not considered how much space the conda packages required.
The primary server has much larger permanent storage (80G), since all of the model inputs and outputs are retained on the primary. The idea is that workers may come and go (along with their storage), but the history of model runs and data is retained on the primary server for much longer.
That said, the whole system is only meant to act as a "runner" of models, and users should not expect their data to be available forever; they should manage the inputs and outputs themselves.
Noticed that modelrunner nodes were low or out of disk space, causing jobs to fail:
Where /dev/xvda1 is the main volume for modelrunner code and data. When I set these up, it seemed that 8G of storage would be sufficient.
Turns out that the conda packages required for the models being run take up a lot of space:
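A quick way to see how much of the disk conda is using; the miniconda install path here is an assumption, so point CONDA_DIR at the actual location on the node:

```shell
# Sum the size of conda's package cache and environments.
# CONDA_DIR is a guess; override it for the real install location.
CONDA_DIR="${CONDA_DIR:-$HOME/miniconda3}"
if [ -d "$CONDA_DIR" ]; then
    du -sh "$CONDA_DIR/pkgs" "$CONDA_DIR/envs"
else
    echo "no conda install at $CONDA_DIR"
fi
# Cached package tarballs and unused packages can be reclaimed with:
#   conda clean --all --yes
```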
cc @ingenieroariel @edwinadkinsdev