My computation is locking up, I believe because my workers are both out of memory and out of disk space to spill to. It appears the AWS VMs are given 30GB by default, with no option to customize that. (I know Fargate's ephemeral storage size can't be changed, so supporting this might be a pain, perhaps requiring setting up and tearing down something like EFS for every cluster.)
Why I think running out of disk space is (part of) the problem:
I'm seeing `OSError: [Errno 28] No space left on device` in the worker logs, raised from `zict/file.py`:
```
Traceback (most recent call last):
File "/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/worker.py", line 2063, in gather_dep
response = await get_data_from_worker(
File "/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/worker.py", line 3333, in get_data_from_worker
return await retry_operation(_get_data, operation="get_data_from_worker")
File "/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/utils_comm.py", line 384, in retry_operation
return await retry(
File "/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/utils_comm.py", line 369, in retry
return await coro()
File "/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/worker.py", line 3313, in _get_data
response = await send_recv(
File "/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/core.py", line 660, in send_recv
raise exc.with_traceback(tb)
File "/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/core.py", line 500, in handle_comm
result = await result
File "/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/worker.py", line 1342, in get_data
data = {k: self.data[k] for k in keys if k in self.data}
File "/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/worker.py", line 1342, in <dictcomp>
data = {k: self.data[k] for k in keys if k in self.data}
File "/opt/conda/envs/coiled/lib/python3.8/site-packages/zict/buffer.py", line 78, in __getitem__
return self.slow_to_fast(key)
File "/opt/conda/envs/coiled/lib/python3.8/site-packages/zict/buffer.py", line 69, in slow_to_fast
self.fast[key] = value
File "/opt/conda/envs/coiled/lib/python3.8/site-packages/zict/lru.py", line 70, in __setitem__
self.evict()
File "/opt/conda/envs/coiled/lib/python3.8/site-packages/zict/lru.py", line 89, in evict
cb(k, v)
File "/opt/conda/envs/coiled/lib/python3.8/site-packages/zict/buffer.py", line 60, in fast_to_slow
self.slow[key] = value
File "/opt/conda/envs/coiled/lib/python3.8/site-packages/zict/func.py", line 41, in __setitem__
self.d[key] = self.dump(value)
File "/opt/conda/envs/coiled/lib/python3.8/site-packages/zict/file.py", line 84, in __setitem__
f.write(value)
OSError: [Errno 28] No space left on device
```
`df` says there's no space left on `/`:
```python
def df():
    # import inside the function so it resolves when run on the workers
    import subprocess
    return subprocess.check_output(["df", "-h"])

for worker, out in client.run(df).items():
    print(worker)
    print(out.decode())
    print()
```
```
tls://10.4.11.171:34523
Filesystem Size Used Avail Use% Mounted on
overlay 30G 30G 0 100% /
tmpfs 64M 0 64M 0% /dev
shm 15G 24K 15G 1% /dev/shm
tmpfs 15G 0 15G 0% /sys/fs/cgroup
/dev/xvdcz 30G 30G 0 100% /etc/hosts
tmpfs 15G 0 15G 0% /proc/acpi
tmpfs 15G 0 15G 0% /sys/firmware
tmpfs 15G 0 15G 0% /proc/scsi
tls://10.4.11.27:37507
Filesystem Size Used Avail Use% Mounted on
overlay 30G 30G 0 100% /
tmpfs 64M 0 64M 0% /dev
shm 15G 24K 15G 1% /dev/shm
tmpfs 15G 0 15G 0% /sys/fs/cgroup
/dev/xvdcz 30G 30G 0 100% /etc/hosts
tmpfs 15G 0 15G 0% /proc/acpi
tmpfs 15G 0 15G 0% /sys/firmware
tmpfs 15G 0 15G 0% /proc/scsi
tls://10.4.12.139:33447
Filesystem Size Used Avail Use% Mounted on
overlay 30G 30G 0 100% /
tmpfs 64M 0 64M 0% /dev
shm 15G 24K 15G 1% /dev/shm
tmpfs 15G 0 15G 0% /sys/fs/cgroup
/dev/xvdcz 30G 30G 0 100% /etc/hosts
tmpfs 15G 0 15G 0% /proc/acpi
tmpfs 15G 0 15G 0% /sys/firmware
tmpfs 15G 0 15G 0% /proc/scsi
tls://10.4.12.243:37211
Filesystem Size Used Avail Use% Mounted on
overlay 30G 30G 0 100% /
tmpfs 64M 0 64M 0% /dev
shm 15G 24K 15G 1% /dev/shm
tmpfs 15G 0 15G 0% /sys/fs/cgroup
/dev/xvdcz 30G 30G 0 100% /etc/hosts
tmpfs 15G 0 15G 0% /proc/acpi
tmpfs 15G 0 15G 0% /sys/firmware
tmpfs 15G 0 15G 0% /proc/scsi
```
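(Rather than parsing `df -h` output, the free space could also be read programmatically with the stdlib. A sketch; the `disk_free` helper here is mine, not part of distributed:)

```python
import shutil

def disk_free(path="/"):
    # shutil.disk_usage returns total/used/free in bytes
    usage = shutil.disk_usage(path)
    return f"{usage.free / 2**30:.1f} GiB free of {usage.total / 2**30:.1f} GiB"

print(disk_free("/"))
# on a cluster this could be run on every worker, e.g.:
# client.run(lambda: disk_free("/"))
```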
Though I'm confused, because not all workers even appear to need to spill:
```python
>>> client.run(lambda dask_worker: len(dask_worker.data.fast))
{'tls://10.4.11.171:34523': 947,
'tls://10.4.11.27:37507': 951,
'tls://10.4.12.139:33447': 979,
'tls://10.4.12.243:37211': 905}
```
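The other half of the picture would be how many keys each worker has already spilled. Assuming `dask_worker.data` is a zict `Buffer` with a disk-backed `slow` mapping (which the traceback above suggests), a sketch:

```python
def spilled_keys(dask_worker):
    # data.slow is the on-disk side of the zict Buffer;
    # data.fast (checked above) is the in-memory side
    return len(dask_worker.data.slow)

# hypothetical usage on a connected client:
# client.run(spilled_keys)
```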
```python
def should_spill(dask_worker):
import psutil
rss = psutil.Process().memory_info().rss
frac = rss / dask_worker.memory_limit
spill_frac = dask_worker.memory_spill_fraction
return f"Should spill: {frac > spill_frac} == {frac} > {spill_frac}"
client.run(should_spill)
# {'tls://10.4.11.171:34523': 'Should spill: True == 0.7019927501678467 > 0.7',
# 'tls://10.4.11.27:37507': 'Should spill: False == 0.6774005889892578 > 0.7',
# 'tls://10.4.12.139:33447': 'Should spill: False == 0.6875629425048828 > 0.7',
# 'tls://10.4.12.243:37211': 'Should spill: True == 0.7100732326507568 > 0.7'}
```
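It might also be worth checking how much the workers have actually written to their spill directories. A sketch, assuming spilled data lives under each worker's `local_directory` (the `dir_size` helper is mine):

```python
import os

def dir_size(path):
    # sum the sizes of all files under `path`
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # a file may vanish mid-walk
    return total

# hypothetical usage on a connected client:
# client.run(lambda dask_worker: dir_size(dask_worker.local_directory))
```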
FWIW, adding observability to the distributed dashboard for available disk space (or for the rate of failures while spilling to disk) might also be worthwhile, particularly if/when this feature is added to Coiled, so users can tell when they actually need to increase the disk size. I could imagine changing the "bytes stored" dashboard into a unified view of both available memory and disk space, with a progress-bar-style indicator of how much of each is in use.
Let me know if you think that would be worthwhile, and whether you'd like me to open an issue on distributed for it.
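To make that concrete, here's a rough mockup (my own sketch, not existing dashboard code) of what a unified memory+disk progress-bar readout could look like:

```python
import shutil

def usage_bar(used, total, width=20):
    # textual stand-in for a dashboard-style progress bar
    filled = int(width * used / total)
    return "[" + "#" * filled + "-" * (width - filled) + f"] {used / total:.0%}"

du = shutil.disk_usage("/")
print("disk  ", usage_bar(du.used, du.total))
# memory could be shown the same way, e.g. from psutil.virtual_memory()
```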