coiled / benchmarks

Integration tests: spill/unspill #136

Closed fjetter closed 2 years ago

fjetter commented 2 years ago

Persist a lot of data on disk

What happens when we load 10x more data than we have RAM?

  • Does Coiled have enough storage by default? How do we modify this?
  • How are we doing against theoretical performance? Is Dask's spill-to-disk efficient?

Pseudocode

import dask.array as da
from distributed import wait

x = da.random.random(...).persist()  # load the data (shape left unspecified here)
wait(x)                              # block until everything is in memory or spilled
x.sum().compute()                    # force us to read the data again

[EDIT by @crusaderky ] In addition to the trivial use case above, we would also like to have https://github.com/dask/distributed/blob/f7f650154fea29978906c65dd0225415da56ed11/distributed/tests/test_active_memory_manager.py#L1079-L1085

scaled up to production size. This will stress the important use case of spilled tasks that are taken out of the spill file and back into memory not to be computed, but to be transferred to another worker. This stress test should find a sizing that spills/unspills heavily but still completes successfully. Related:

mrocklin commented 2 years ago

This was mostly done by Naty and Hendrik. Thinks were fine.

mrocklin commented 2 years ago

*Things were fine

ntabris commented 2 years ago

Does Coiled have enough storage by default? How do we modify this?

Currently the default is 100 GiB of EBS, but if there's a local NVMe drive we'll attach it at /scratch and configure Dask to use that for temporary storage (so, spill).

There isn't currently a way to adjust the EBS size; I'd be interested to know if there's desire/need for that. This means that right now the best way to get a large disk (and a fast disk for disk-intensive workloads) is to use an instance with local NVMe, for example something from the i3 family.
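
If you want to try the NVMe route from the client side, the request looks roughly like this (a sketch; the parameter name worker_vm_types is per the Coiled API as I understand it, and the specific instance type is just an example):

import coiled

cluster = coiled.Cluster(
    n_workers=10,
    worker_vm_types=["i3.xlarge"],  # local NVMe gets mounted at /scratch and used for spill
)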

jrbourbeau commented 2 years ago

This was mostly done by Naty and Hendrik

@ncclementi where did this work end up?

ncclementi commented 2 years ago

We tried to persist a lot of data with @hendrikmakait and we were able to do so; things didn't crash, and in the process we discovered something that led to https://github.com/dask/distributed/pull/6280

We experimented with this, but we did not write a test. We were not quite sure what the test would look like, as the EBS kept expanding.

ntabris commented 2 years ago

the EBS kept expanding

As in, the amount written to disk was expanding? Or disk size was expanding? (That would make me very puzzled.)

ncclementi commented 2 years ago

@ntabris I think I made a bad choice of words; we just saw that it kept spilling and it seemed to never end, but I don't recall how close we got to 100GB. @hendrikmakait, do you remember?

fjetter commented 2 years ago

This was mostly done by Naty and Hendrik. Thinks were fine.

I would actually like to have a test as part of the coiled runtime, not only to confirm that this is not an issue right now but also to ensure that it never becomes one.

we were able to do so, things didn't crash, and in the process we discover something that lead to this

Is the code that led to this still available? Just because it didn't crash doesn't mean it isn't valuable

ncclementi commented 2 years ago

I would actually like to have a test as part of the coiled runtime, not only to confirm that this is not an issue right now but also to ensure that it never becomes one.

I think this sounds reasonable; the part I am struggling with is how to design a test for this. What is the expected behavior, and what counts as an issue? At the time we did something like da.random.random((1_000_000, 100_000), chunks=(10_000, 1_000)), which is an array of approximately 750 GB, and tried to persist it on a default cluster. But there was no proper test design around it.
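
For reference, the experiment looked roughly like this (reconstructed; the exact session is not preserved, the cluster was just a default Coiled cluster, and a connected distributed Client is assumed):

import dask.array as da
from distributed import wait

# ~1e11 float64 elements, i.e. roughly 750 GB spread over ~80 MB chunks
x = da.random.random((1_000_000, 100_000), chunks=(10_000, 1_000))
x = x.persist()
wait(x)  # the workers end up spilling far more data than their combined RAM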

Is the code that led to this still available? Just because it didn't crash doesn't mean it isn't valuable

At the moment I do not have the exact code we ran; we were experimenting in an IPython session. But Guido was able to reproduce this and created a test for it, which is in the PR: https://github.com/dask/distributed/pull/6280/files#diff-96777781dd54f26ed9441afb42909cf6f5393d6ef0b2b2a2e7e8dc46f074df93

mrocklin commented 2 years ago

Yup. I think that would work fine. You would probably persist, wait, and then call sum() or something else that pulls the data back from disk. That would be a good thing to test and time.
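
A hypothetical shape for such a test (the client fixture and the array size are assumptions, not the final benchmark):

import time
import dask.array as da
from distributed import wait

def test_spill_unspill(client):
    # pick a size that comfortably exceeds the cluster's total RAM
    x = da.random.random((100_000, 100_000), chunks=(5_000, 5_000)).persist()
    wait(x)                    # by now a large fraction of the chunks is spilled
    start = time.monotonic()
    x.sum().compute()          # forces the spilled chunks to be read back from disk
    print("unspill + sum took", time.monotonic() - start, "s")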

hendrikmakait commented 2 years ago

While working on this, dask/distributed#6783 was identified, which makes it hard to gauge disk space usage on workers.

hendrikmakait commented 2 years ago

XREF: dask/distributed#6835 makes it easier to evaluate disk I/O.

hendrikmakait commented 2 years ago

What happens when we load 10x more data than we have RAM?

It depends, as we will see below.

Does Coiled have enough storage by default?

Since https://docs.coiled.io/user_guide/cloud_changelog.html#june-2022, the answer to this question is mostly NO:

Previously we attached a 100GB disk to every instance, now the default size will be between 30GB and 100GB and depends on how much memory (RAM) the instance has.

For example, the default t3.medium instance, which has 4 GB of RAM, runs out of disk space at around 26.5 GB of spillage. In the worker logs, we then find output like this:

Aug  5 10:46:50 ip-10-0-15-123 rsyslogd: action 'action-3-builtin:omfile' (module 'builtin:omfile') message lost, could not be processed. Check for additional error messages before this one. [v8.2001.0 try https://www.rsyslog.com/e/2027 ]
Aug  5 10:46:50 ip-10-0-15-123 rsyslogd: file '/var/log/syslog'[7] write error - see https://www.rsyslog.com/solving-rsyslog-write-errors/ for help OS error: No space left on device [v8.2001.0 try https://www.rsyslog.com/e/2027 ]
Aug  5 10:46:52 ip-10-0-15-123 dockerd[887]: time="2022-08-05T10:46:52.205685231Z" level=error msg="Failed to log msg \"\" for logger json-file: error writing log entry: write /var/lib/docker/cAug  5 10:46:52 ip-10-0-15-123 cloud-init[1318]:     self.d[key] = pickled
Aug  5 10:46:52 ip-10-0-15-123 cloud-init[1318]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/zict/file.py", line 101, in __setitem__
Aug  5 10:46:52 ip-10-0-15-123 cloud-init[1318]:     fh.writelines(value)
Aug  5 10:46:52 ip-10-0-15-123 cloud-init[1318]: OSError: [Errno 28] No space left on device

For the user, this means that tasks eventually fail enough times and they receive a bunch of messages like the following:

2022-08-05 12:38:04,691 - distributed.protocol.pickle - INFO - Failed to deserialize b"\x80\x04\x95\xad\x19\x00\x00\x00\x00\x00\x00\x8c\x15distributed.scheduler\x94\x8c\x0cKilledWorker\x94\x93\x94\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 6, 23)\x94h\x00\x8c\x0bWorkerState\x94\x93\x94)\x81\x94N}\x94(\x8c\x07address\x94\x8c\[x17tls://10.0.15.123:40069\x94\x8c\x03pid\x94K7\x8c\x04name\x94\x8c\x1fhendrik-debug-worker-78663614bd\x94\x8c\x08nthreads\x94K\x02\x8c\x0cmemory_limit\x94\x8a\x05\x00\xc0\x9e\xf1\x00\x8c\x0flocal_directory\x94\x8c*/scratch/dask-worker-space/worker-r_h553hi\x94\x8c\x08services\x94}\x94\x8c\tdashboard\x94M\xd3\x99s\x8c\x08versions\x94}\x94\x8c\x05nanny\x94\x8c\x17tls://10.0.15.123:46585\x94\x8c\x06status\x94\x8c\x10distributed.core\x94\x8c\x06Status\x94\x93\x94\x8c\x06closed\x94\x85\x94R\x94\x8c\x05_hash\x94\x8a\x08\x1a\x82\xf0&E\x92(\xa2\x8c\x06nbytes\x94K\x00\x8c\toccupancy\x94K\x00\x8c\x15_memory_unmanaged_old\x94K\x00\x8c\x19_memory_unmanaged_history\x94\x8c\x0bcollections\x94\x8c\x05deque\x94\x93\x94)R\x94\x8c\x07metrics\x94}\x94\x8c\tlast_seen\x94K\x00\x8c\ntime_delay\x94K\x00\x8c\tbandwidth\x94J\x00\xe1\xf5\x05\x8c\x06actors\x94\x8f\x94\x8c\t_has_what\x94}\x94\x8c\nprocessing\x94}\x94(h\x03G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18](x17tls://10.0.15.123:40069\x94\x8c\x03pid\x94K7\x8c\x04name\x94\x8c\x1fhendrik-debug-worker-78663614bd\x94\x8c\x08nthreads\x94K\x02\x8c\x0cmemory_limit\x94\x8a\x05\x00\xc0\x9e\xf1\x00\x8c\x0flocal_directory\x94\x8c*/scratch/dask-worker-space/worker-r_h553hi\x94\x8c\x08services\x94}\x94\x8c\tdashboard\x94M\xd3\x99s\x8c\x08versions\x94}\x94\x8c\x05nanny\x94\x8c\x17tls://10.0.15.123:46585\x94\x8c\x06status\x94\x8c\x10distributed.core\x94\x8c\x06Status\x94\x93\x94\x8c\x06closed\x94\x85\x94R\x94\x8c\x05_hash\x94\x8a\x08\x1a\x82\xf0&E\x92(\xa2\x8c\x06nbytes\x94K\x00\x8c\toccupancy\x94K\x00\x8c\x15_memory_unmanaged_old\x94K\x00\x8c\x19_memory_unmanaged_history\x94\x8c\x0bcollections\x94\x8c\x05deque\x94\x93\x94)R\x94\x8c\x07metrics\x94}\x94\x8c\tlast_seen\x94K\x00\x8c\ntime_delay\x94K\x00\x8c\tbandwidth\x94J\x00\xe1\xf5\x05\x8c\x06actors\x94\x8f\x94\x8c\t_has_what\x94}\x94\x8c\nprocessing\x94}\x94(h\x03G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9(%27random_sample-950bfce4388763a088b491ac651d6b18)', 6, 24)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 6, 3)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 6, 4)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 6, 5)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 6, 6)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 6, 7)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 6, 8)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 6, 9)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 7, 0)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 7, 1)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 10)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 11)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 12)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 
13)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 14)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 15)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 16)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 17)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 18)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 19)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 7, 2)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 20)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 21)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 22)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 23)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 24)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 7, 3)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 7, 4)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 7, 5)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 7, 6)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 7, 7)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 7, 8)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 7, 9)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 8, 0)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 8, 1)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 10)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 11)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 12)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 13)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 14)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 15)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 16)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 17)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 18)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 19)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 8, 2)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 20)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 21)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 22)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 23)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 24)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 8, 
3)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 8, 4)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 8, 5)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 8, 6)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 8, 7)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 8, 8)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 8, 9)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 9, 0)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 9, 1)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 10)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 11)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 12)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 13)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 14)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 15)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 16)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 17)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 18)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 19)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 9, 2)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 20)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 21)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 22)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 23)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 24)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 9, 3)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 9, 4)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 9, 5)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 9, 6)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 9, 7)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 9, 8)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 9, 9)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8u\x8c\x0clong_running\x94\x8f\x94\x8c\texecuting\x94}\x94\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 6, 22)\x94G?\xd8hh\x00\x00\x00\x00s\x8c\tresources\x94}\x94\x8c\x0eused_resources\x94}\x94\x8c\x05extra\x94}\x94\x8c\tserver_id\x94\x8c+Worker-e1d62c65-e3d3-41ff-bcd1-97f8dafeeeea\x94u\x86\x94b\x86\x94R\x94."
Traceback (most recent call last):
  File "/opt/homebrew/Caskroom/mambaforge/base/envs/coiled-runtime/lib/python3.10/site-packages/distributed/protocol/pickle.py", line 66, in loads
    return pickle.loads(x)
AttributeError: 'WorkerState' object has no attribute 'server_id'

Note: This is not helpful at all and AttributeError: 'WorkerState' object has no attribute 'server_id' looks like a bug on its own.

How do we modify this?

We can use worker_disk_size when creating a cluster to allocate sufficient disk space. If we choose a sufficiently large size, everything works as expected.
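
For example (a sketch; the worker count is arbitrary and I am assuming the size is given in GB):

import coiled

cluster = coiled.Cluster(
    n_workers=10,
    worker_disk_size=200,  # GB per worker, enough headroom for heavy spilling
)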

How are we doing against theoretical performance? Is Dask's spill-to-disk efficient?

I'd like dask/distributed#6835 to be merged before diving into this, since we currently cannot copy values from the WorkerTable.

hendrikmakait commented 2 years ago

How are we doing against theoretical performance? Is Dask's spill-to-disk efficient?

In terms of hard numbers, we reach ~125 MiB/s on both exclusive reads and writes. It looks like performance numbers for EBS on EC2 instances are hard to come by (at the very least for average throughput). To check against theoretical performance, we would need to run some experiments on raw EC2 instances.
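
To get a raw-disk baseline directly on the workers, something like the following could work (a sketch; it assumes an already-connected client and writes a 1 GiB test file into each worker's spill directory):

import subprocess

def write_throughput(dask_worker):
    # dd reports its throughput summary on stderr
    result = subprocess.run(
        ["dd", "if=/dev/zero", f"of={dask_worker.local_directory}/dd-test",
         "bs=1M", "count=1024", "oflag=direct"],
        capture_output=True, text=True,
    )
    return result.stderr

client.run(write_throughput)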

One caveat I found is that if we keep referencing the large array and then calculate a sum on it, we see ~62 MiB/s of read and write at the same time. It looks like we are unspilling a chunk to calculate a sum on it, then spilling it back to disk because we need that memory for another chunk. Given that chunks are immutable, we may want to consider a lazier policy that keeps spilled data on disk until it actually has to be removed, i.e. until we want to drop it from both memory and disk.

hendrikmakait commented 2 years ago

In addition to the trivial use case above, we would also like to have https://github.com/dask/distributed/blob/f7f650154fea29978906c65dd0225415da56ed11/distributed/tests/test_active_memory_manager.py#L1079-L1085 scaled up to production size.
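
For reference, the linked test is roughly of the following shape (a paraphrase, not the actual test code), and "scaling up" means choosing sizes so that the intermediate results no longer fit in cluster memory:

import dask.array as da

n = 50_000  # assumption: large enough that a @ a.T cannot be held in memory
a = da.random.random((n, n), chunks=(4_000, 4_000))
b = (a @ a.T).sum()
b.compute()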

After taking the first shot at this, it looks like scaling tensordot_stress up is more of a memory than a disk problem:

Aug  5 14:30:29 ip-10-0-5-24 kernel: [  708.236580] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=c391e262f16fa83e76cb194623ffd8e4a9599f544effa9e34a78135b799ea07d,mems_allowed=0,global_oom,task_memcg=/docker/c391e262f16fa83e76cb194623ffd8e4a9599f544effa9e34a78135b799ea07d,task=python,pid=1914,uid=0
Aug  5 14:30:29 ip-10-0-5-24 kernel: [  708.236609] Out of memory: Killed process 1914 (python) total-vm:4358956kB, anon-rss:3259096kB, file-rss:0kB, shmem-rss:8kB, UID:0 pgtables:6660kB oom_score_adj:0

hendrikmakait commented 2 years ago

Does Coiled have enough storage by default?

Since https://docs.coiled.io/user_guide/cloud_changelog.html#june-2022, the answer to this question is mostly NO[...]

To clarify: I mean that we do not have enough storage by default to store 10x more data than we have in RAM, which is the initial question of this issue. The discussion around default disk sizes can be found here: https://github.com/coiled/oss-engineering/issues/123. Following the discussion on that issue and given how easy it is to adjust disk size with worker_disk_size, the default seems to be reasonable to me.

fjetter commented 2 years ago

To clarify: I mean that we do not have enough storage by default to store 10x more data than we have in RAM, which is the initial question of this issue

That's fine. I think any multiplier >1 is fine, assuming we can configure this on the Coiled side. The idea of this issue is to apply a workload that requires more memory than is available on the cluster but can finish successfully if data is stored on disk. Whether this is 1.5x, 2x, or 10x is not that important.

After taking the first shot at this, it looks like scaling tensordot_stress up is more of a memory than a disk problem:

Also an interesting find. If the chunks are small enough, this should always be able to finish. There is an edge case where disk is full, memory is full, and the entire cluster pauses. Beyond this edge case, the computation should always finish, and we should definitely never see an OOM kill, as long as the chunks are small enough.
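
For context, the thresholds involved are the worker memory fractions below, shown via dask.config (the numbers are the library defaults as I recall them, so treat them as assumptions):

import dask

dask.config.set({
    "distributed.worker.memory.target": 0.60,     # start spilling least-recently-used data
    "distributed.worker.memory.spill": 0.70,      # spill more aggressively based on process memory
    "distributed.worker.memory.pause": 0.80,      # stop executing new tasks
    "distributed.worker.memory.terminate": 0.95,  # the nanny restarts the worker
})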

hendrikmakait commented 2 years ago

@fjetter: I'm currently taking a look at what's going on in the tensordot_stress test, will keep you posted.

hendrikmakait commented 2 years ago

For the user, this means that tasks eventually fail enough times and they receive a bunch of messages like the following:

Traceback (most recent call last):
File "/opt/homebrew/Caskroom/mambaforge/base/envs/coiled-runtime/lib/python3.10/site-packages/distributed/protocol/pickle.py", line 66, in loads
return pickle.loads(x)
AttributeError: 'WorkerState' object has no attribute 'server_id'

Note: This is not helpful at all and AttributeError: 'WorkerState' object has no attribute 'server_id' looks like a bug on its own.

This was caused by a version mismatch with my custom software environment, which used the latest distributed from main. After fixing that, it takes a long time for all tasks to either end up in memory or erred, but when they do, the erred ones raise a KilledWorker exception:

distributed.scheduler.KilledWorker("('random_sample-98bbea72b3b5abb3f8bf909c540c55f0', 0, 428)",
                                   <WorkerState 'tls://10.0.3.246:45495', name: hendrik-debug-worker-bb94fa4efc, status: closed, memory: 0, processing: 222>

One reason it takes so long is that workers keep straggling when they reach the disk size limit, and we wait a while before deciding to reassign their tasks to other workers (and remove said worker). (Cluster ID for inspection: https://cloud.coiled.io/dask-engineering/clusters/48344/details)

hendrikmakait commented 2 years ago

One thing that might be helpful for the failing case is monitoring disk usage for each worker. We currently monitor how much data is spilled, but we do not track or display used/free disk space anywhere.
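
Until that exists, a stop-gap is to ask each worker directly (a sketch; it assumes an already-connected client and uses the worker's local_directory, which is where spilled data is written):

import shutil

def disk_usage(dask_worker):
    total, used, free = shutil.disk_usage(dask_worker.local_directory)
    return {"used_gib": used / 2**30, "free_gib": free / 2**30}

client.run(disk_usage)  # maps worker address -> disk usage of the spill volume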