coiled / feedback

A place to provide Coiled feedback

Allow specifying disk size when creating clusters #125

Closed · gjoseph92 closed this 1 year ago

gjoseph92 commented 3 years ago

My computation is locking up, I believe because my workers are both out of memory and out of disk space to spill to. It appears that the AWS VMs are given 30GB of disk by default, and there's no option to customize that. (I know Fargate's ephemeral storage size can't be changed, so supporting this might be a pain, perhaps requiring something like EFS to be set up and torn down for every cluster.)
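For context, here is a minimal sketch (not from the issue) of the Dask configuration that controls when a worker starts spilling to local disk; the threshold values shown are the library defaults as I understand them, not values measured on this cluster:

```python
import dask

# Fractions of the worker's memory limit at which distributed takes action.
# The keys are real distributed config options; the numbers are the
# documented defaults, quoted from memory.
dask.config.set({
    "distributed.worker.memory.target": 0.60,     # start moving data to disk
    "distributed.worker.memory.spill": 0.70,      # spill based on process memory (RSS)
    "distributed.worker.memory.pause": 0.80,      # pause accepting new tasks
    "distributed.worker.memory.terminate": 0.95,  # kill the worker
})
```

Once the spill threshold is crossed, data is written to the worker's local directory on the root volume, which is why the 30GB disk fills up alongside memory.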

Why I think running out of disk space is (part of) the problem:

Seeing `OSError: [Errno 28] No space left on device` in worker logs (from `zict/file.py`):

```
Traceback (most recent call last):
  File "/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/worker.py", line 2063, in gather_dep
    response = await get_data_from_worker(
  File "/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/worker.py", line 3333, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/utils_comm.py", line 384, in retry_operation
    return await retry(
  File "/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/utils_comm.py", line 369, in retry
    return await coro()
  File "/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/worker.py", line 3313, in _get_data
    response = await send_recv(
  File "/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/core.py", line 660, in send_recv
    raise exc.with_traceback(tb)
  File "/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/core.py", line 500, in handle_comm
    result = await result
  File "/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/worker.py", line 1342, in get_data
    data = {k: self.data[k] for k in keys if k in self.data}
  File "/opt/conda/envs/coiled/lib/python3.8/site-packages/distributed/worker.py", line 1342, in <dictcomp>
    data = {k: self.data[k] for k in keys if k in self.data}
  File "/opt/conda/envs/coiled/lib/python3.8/site-packages/zict/buffer.py", line 78, in __getitem__
    return self.slow_to_fast(key)
  File "/opt/conda/envs/coiled/lib/python3.8/site-packages/zict/buffer.py", line 69, in slow_to_fast
    self.fast[key] = value
  File "/opt/conda/envs/coiled/lib/python3.8/site-packages/zict/lru.py", line 70, in __setitem__
    self.evict()
  File "/opt/conda/envs/coiled/lib/python3.8/site-packages/zict/lru.py", line 89, in evict
    cb(k, v)
  File "/opt/conda/envs/coiled/lib/python3.8/site-packages/zict/buffer.py", line 60, in fast_to_slow
    self.slow[key] = value
  File "/opt/conda/envs/coiled/lib/python3.8/site-packages/zict/func.py", line 41, in __setitem__
    self.d[key] = self.dump(value)
  File "/opt/conda/envs/coiled/lib/python3.8/site-packages/zict/file.py", line 84, in __setitem__
    f.write(value)
OSError: [Errno 28] No space left on device
```
`df` says there's no space left on `/`:

```python
def df():
    import subprocess
    return subprocess.check_output(["df", "-h"])

for worker, out in client.run(df).items():
    print(worker)
    print(out.decode())
    print()
```

```
tls://10.4.11.171:34523
Filesystem      Size  Used Avail Use% Mounted on
overlay          30G   30G     0 100% /
tmpfs            64M     0   64M   0% /dev
shm              15G   24K   15G   1% /dev/shm
tmpfs            15G     0   15G   0% /sys/fs/cgroup
/dev/xvdcz       30G   30G     0 100% /etc/hosts
tmpfs            15G     0   15G   0% /proc/acpi
tmpfs            15G     0   15G   0% /sys/firmware
tmpfs            15G     0   15G   0% /proc/scsi

tls://10.4.11.27:37507
Filesystem      Size  Used Avail Use% Mounted on
overlay          30G   30G     0 100% /
tmpfs            64M     0   64M   0% /dev
shm              15G   24K   15G   1% /dev/shm
tmpfs            15G     0   15G   0% /sys/fs/cgroup
/dev/xvdcz       30G   30G     0 100% /etc/hosts
tmpfs            15G     0   15G   0% /proc/acpi
tmpfs            15G     0   15G   0% /sys/firmware
tmpfs            15G     0   15G   0% /proc/scsi

tls://10.4.12.139:33447
Filesystem      Size  Used Avail Use% Mounted on
overlay          30G   30G     0 100% /
tmpfs            64M     0   64M   0% /dev
shm              15G   24K   15G   1% /dev/shm
tmpfs            15G     0   15G   0% /sys/fs/cgroup
/dev/xvdcz       30G   30G     0 100% /etc/hosts
tmpfs            15G     0   15G   0% /proc/acpi
tmpfs            15G     0   15G   0% /sys/firmware
tmpfs            15G     0   15G   0% /proc/scsi

tls://10.4.12.243:37211
Filesystem      Size  Used Avail Use% Mounted on
overlay          30G   30G     0 100% /
tmpfs            64M     0   64M   0% /dev
shm              15G   24K   15G   1% /dev/shm
tmpfs            15G     0   15G   0% /sys/fs/cgroup
/dev/xvdcz       30G   30G     0 100% /etc/hosts
tmpfs            15G     0   15G   0% /proc/acpi
tmpfs            15G     0   15G   0% /sys/firmware
tmpfs            15G     0   15G   0% /proc/scsi
```

Though I'm confused that not all workers need to spill:

```python
>>> client.run(lambda dask_worker: len(dask_worker.data.fast))
{'tls://10.4.11.171:34523': 947,
 'tls://10.4.11.27:37507': 951,
 'tls://10.4.12.139:33447': 979,
 'tls://10.4.12.243:37211': 905}
```

```python
def should_spill(dask_worker):
    import psutil
    rss = psutil.Process().memory_info().rss
    frac = rss / dask_worker.memory_limit
    spill_frac = dask_worker.memory_spill_fraction
    return f"Should spill: {frac > spill_frac} == {frac} > {spill_frac}"

client.run(should_spill)
# {'tls://10.4.11.171:34523': 'Should spill: True == 0.7019927501678467 > 0.7',
#  'tls://10.4.11.27:37507': 'Should spill: False == 0.6774005889892578 > 0.7',
#  'tls://10.4.12.139:33447': 'Should spill: False == 0.6875629425048828 > 0.7',
#  'tls://10.4.12.243:37211': 'Should spill: True == 0.7100732326507568 > 0.7'}
```

FWIW, having observability in the distributed dashboard for available disk space (or the failure rate when spilling to disk) might also be worthwhile, particularly if/when this feature is added to Coiled, so users can tell they actually need to increase the disk size. I could imagine changing the "bytes stored" dashboard into a unified view of both available memory and disk space, with a progress-bar-style indicator of how much of each is in use. Let me know if you think that would be worthwhile, and whether you want me to open an issue on distributed for it.
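As a rough illustration of that unified view, the following sketch (not part of the original issue; `storage_report` is a hypothetical helper, and `client` is the connected `distributed.Client` from the snippets above) reports memory pressure, spilled data, and remaining disk per worker:

```python
import os
import shutil

import psutil


def storage_report(dask_worker):
    """Report memory pressure and disk headroom for one worker."""
    mem_frac = psutil.Process().memory_info().rss / dask_worker.memory_limit

    # Total size of everything spilled into the worker's local directory
    spilled = 0
    for root, _, files in os.walk(dask_worker.local_directory):
        for name in files:
            spilled += os.path.getsize(os.path.join(root, name))

    disk = shutil.disk_usage(dask_worker.local_directory)
    return {
        "memory_frac": round(mem_frac, 3),
        "spilled_gib": round(spilled / 2**30, 2),
        "disk_free_gib": round(disk.free / 2**30, 2),
    }


client.run(storage_report)
```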
shughes-uk commented 1 year ago

This is now available.
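For anyone landing here later, a minimal sketch of how the option can be passed when creating a cluster; the `worker_disk_size` keyword is an assumption based on the current Coiled documentation and may differ between versions, so check the `coiled.Cluster` API reference:

```python
import coiled

# Assumed API: worker_disk_size sets the size (in GiB) of the disk attached
# to each worker VM. Verify the exact name and accepted units against the
# current Coiled docs.
cluster = coiled.Cluster(
    n_workers=4,
    worker_disk_size=100,
)
```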