Closed: fjetter closed this issue 2 years ago.
This was mostly done by Naty and Hendrik. Things were fine.
Does Coiled have enough storage by default? How do we modify this?
Currently the default is 100GiB EBS, but if there's a local NVMe we'll attach it at /scratch and set Dask to use that for temp storage (so, spill).
There isn't currently a way to adjust the size of the EBS volume; I'd be interested to know if there's desire/need for that. This means that the best way right now to get a large disk (and a fast disk for disk-intensive workloads) is to use an instance with local NVMe, for example something from the i3 family.
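For illustration, a minimal sketch of requesting such an instance from Coiled; the worker_vm_types argument and the specific i3.xlarge choice are assumptions for this example rather than anything prescribed in this thread:

import coiled
from dask.distributed import Client

# Assumption for illustration: worker_vm_types lets us pick an instance family
# with local NVMe, which Coiled then mounts at /scratch and uses for spill.
cluster = coiled.Cluster(
    n_workers=10,
    worker_vm_types=["i3.xlarge"],
)
client = Client(cluster)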
This was mostly done by Naty and Hendrik
@ncclementi where did this work end up?
We tried to persist a lot of data with @hendrikmakait and we were able to do so; things didn't crash, and in the process we discovered something that led to https://github.com/dask/distributed/pull/6280
We experimented with this, but we did not write a test. We were not quite sure what the test would look like, as the EBS kept expanding.
the EBS kept expanding
As in, the amount written to disk was expanding? Or disk size was expanding? (That would make me very puzzled.)
@ntabris I think I made a bad choice of words; we just saw that it kept spilling and it seemed to never end, but I don't recall how close we got to 100GB. @hendrikmakait do you remember?
This was mostly done by Naty and Hendrik. Things were fine.
I would actually like to have a test as part of the coiled runtime, not only to confirm that this is not an issue right now but also to make sure it never becomes one.
we were able to do so; things didn't crash, and in the process we discovered something that led to this
Is the code that led to this still available? Just because it didn't crash doesn't mean it isn't valuable
I would actually like to have a test as part of the coiled runtime, not only to confirm that this is not an issue right now but also to make sure it never becomes one.
I think this sounds reasonable; the part I am struggling with is how to design a test for this. What is expected, and what counts as an issue? At the moment we did something like
da.random.random((1_000_000, 100_000), chunks=(10_000, 1_000))
which is an array of approximately 750 GiB (1,000,000 × 100,000 × 8 bytes ≈ 800 GB) and tried to persist it on a default cluster. But there was no proper test design around it.
Is the code that led to this still available? Just because it didn't crash doesn't mean it isn't valuable
I do not have the exact code that we ran at the time; we were experimenting in an IPython session. But Guido was able to reproduce it and created a test for it, which is in the PR: https://github.com/dask/distributed/pull/6280/files#diff-96777781dd54f26ed9441afb42909cf6f5393d6ef0b2b2a2e7e8dc46f074df93
Yup, I think that would work fine. You would probably persist, wait, and then call sum() or something else that pulls the data out of memory. That would be a good thing to test and time.
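A minimal sketch of what such a test could look like, assuming an existing Coiled cluster bound to the name cluster; the array size and the timing code are illustrative, not the actual coiled-runtime test:

import time
import dask.array as da
from dask.distributed import Client, wait

client = Client(cluster)  # `cluster` is assumed to exist already

# An array substantially larger than cluster RAM (sizes are illustrative).
x = da.random.random((1_000_000, 100_000), chunks=(10_000, 1_000))

start = time.perf_counter()
x = x.persist()  # load the data, spilling most of it to disk
wait(x)
persist_seconds = time.perf_counter() - start

start = time.perf_counter()
x.sum().compute()  # force us to read the spilled data back from disk
read_seconds = time.perf_counter() - start

print(f"persist/spill took {persist_seconds:.1f}s, read-back took {read_seconds:.1f}s")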
While working on this, dask/distributed#6783 was identified, which makes it hard to gauge disk space usage on workers.
XREF: dask/distributed#6835 makes it easier to evaluate disk I/O.
What happens when we load 10x more data than we have RAM?
It depends, as we will see below.
Does Coiled have enough storage by default?
Since https://docs.coiled.io/user_guide/cloud_changelog.html#june-2022, the answer to this question is mostly NO:
Previously we attached a 100GB disk to every instance, now the default size will be between 30GB and 100GB and depends on how much memory (RAM) the instance has.
For example, the default t3.medium instance, which has 4 GB of RAM, runs out of disk space at around 26.5 GB of spillage. In the worker logs, we then find output like this:
Aug 5 10:46:50 ip-10-0-15-123 rsyslogd: action 'action-3-builtin:omfile' (module 'builtin:omfile') message lost, could not be processed. Check for additional error messages before this one. [v8.2001.0 try https://www.rsyslog.com/e/2027 ]
Aug 5 10:46:50 ip-10-0-15-123 rsyslogd: file '/var/log/syslog'[7] write error - see https://www.rsyslog.com/solving-rsyslog-write-errors/ for help OS error: No space left on device [v8.2001.0 try https://www.rsyslog.com/e/2027 ]
Aug 5 10:46:52 ip-10-0-15-123 dockerd[887]: time="2022-08-05T10:46:52.205685231Z" level=error msg="Failed to log msg \"\" for logger json-file: error writing log entry: write /var/lib/docker/c
Aug 5 10:46:52 ip-10-0-15-123 cloud-init[1318]: self.d[key] = pickled
Aug 5 10:46:52 ip-10-0-15-123 cloud-init[1318]: File "/opt/conda/envs/coiled/lib/python3.10/site-packages/zict/file.py", line 101, in __setitem__
Aug 5 10:46:52 ip-10-0-15-123 cloud-init[1318]: fh.writelines(value)
Aug 5 10:46:52 ip-10-0-15-123 cloud-init[1318]: OSError: [Errno 28] No space left on device
For the user, this means that the task eventually fails enough times and they receive a bunch of the following messages:
2022-08-05 12:38:04,691 - distributed.protocol.pickle - INFO - Failed to deserialize b"\x80\x04\x95\xad\x19\x00\x00\x00\x00\x00\x00\x8c\x15distributed.scheduler\x94\x8c\x0cKilledWorker\x94\x93\x94\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 6, 23)\x94h\x00\x8c\x0bWorkerState\x94\x93\x94)\x81\x94N}\x94(\x8c\x07address\x94\x8c\[x17tls://10.0.15.123:40069\x94\x8c\x03pid\x94K7\x8c\x04name\x94\x8c\x1fhendrik-debug-worker-78663614bd\x94\x8c\x08nthreads\x94K\x02\x8c\x0cmemory_limit\x94\x8a\x05\x00\xc0\x9e\xf1\x00\x8c\x0flocal_directory\x94\x8c*/scratch/dask-worker-space/worker-r_h553hi\x94\x8c\x08services\x94}\x94\x8c\tdashboard\x94M\xd3\x99s\x8c\x08versions\x94}\x94\x8c\x05nanny\x94\x8c\x17tls://10.0.15.123:46585\x94\x8c\x06status\x94\x8c\x10distributed.core\x94\x8c\x06Status\x94\x93\x94\x8c\x06closed\x94\x85\x94R\x94\x8c\x05_hash\x94\x8a\x08\x1a\x82\xf0&E\x92(\xa2\x8c\x06nbytes\x94K\x00\x8c\toccupancy\x94K\x00\x8c\x15_memory_unmanaged_old\x94K\x00\x8c\x19_memory_unmanaged_history\x94\x8c\x0bcollections\x94\x8c\x05deque\x94\x93\x94)R\x94\x8c\x07metrics\x94}\x94\x8c\tlast_seen\x94K\x00\x8c\ntime_delay\x94K\x00\x8c\tbandwidth\x94J\x00\xe1\xf5\x05\x8c\x06actors\x94\x8f\x94\x8c\t_has_what\x94}\x94\x8c\nprocessing\x94}\x94(h\x03G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18](x17tls://10.0.15.123:40069\x94\x8c\x03pid\x94K7\x8c\x04name\x94\x8c\x1fhendrik-debug-worker-78663614bd\x94\x8c\x08nthreads\x94K\x02\x8c\x0cmemory_limit\x94\x8a\x05\x00\xc0\x9e\xf1\x00\x8c\x0flocal_directory\x94\x8c*/scratch/dask-worker-space/worker-r_h553hi\x94\x8c\x08services\x94}\x94\x8c\tdashboard\x94M\xd3\x99s\x8c\x08versions\x94}\x94\x8c\x05nanny\x94\x8c\x17tls://10.0.15.123:46585\x94\x8c\x06status\x94\x8c\x10distributed.core\x94\x8c\x06Status\x94\x93\x94\x8c\x06closed\x94\x85\x94R\x94\x8c\x05_hash\x94\x8a\x08\x1a\x82\xf0&E\x92(\xa2\x8c\x06nbytes\x94K\x00\x8c\toccupancy\x94K\x00\x8c\x15_memory_unmanaged_old\x94K\x00\x8c\x19_memory_unmanaged_history\x94\x8c\x0bcollections\x94\x8c\x05deque\x94\x93\x94)R\x94\x8c\x07metrics\x94}\x94\x8c\tlast_seen\x94K\x00\x8c\ntime_delay\x94K\x00\x8c\tbandwidth\x94J\x00\xe1\xf5\x05\x8c\x06actors\x94\x8f\x94\x8c\t_has_what\x94}\x94\x8c\nprocessing\x94}\x94(h\x03G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9(%27random_sample-950bfce4388763a088b491ac651d6b18)', 6, 24)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 6, 3)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 6, 4)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 6, 5)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 6, 6)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 6, 7)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 6, 8)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 6, 9)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 7, 0)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 7, 1)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 10)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 11)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 12)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 
13)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 14)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 15)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 16)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 17)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 18)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 19)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 7, 2)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 20)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 21)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 22)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 23)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 7, 24)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 7, 3)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 7, 4)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 7, 5)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 7, 6)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 7, 7)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 7, 8)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 7, 9)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 8, 0)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 8, 1)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 10)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 11)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 12)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 13)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 14)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 15)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 16)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 17)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 18)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 19)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 8, 2)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 20)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 21)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 22)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 23)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 8, 24)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 8, 
3)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 8, 4)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 8, 5)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 8, 6)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 8, 7)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 8, 8)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 8, 9)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 9, 0)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 9, 1)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 10)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 11)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 12)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 13)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 14)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 15)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 16)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 17)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 18)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 19)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 9, 2)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 20)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 21)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 22)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 23)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 9, 24)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 9, 3)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 9, 4)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 9, 5)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 9, 6)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 9, 7)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 9, 8)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8\x8c8('random_sample-950bfce4388763a088b491ac651d6b18', 9, 9)\x94G?\xd9\xab\x8d\x1c\xa5i\xb8u\x8c\x0clong_running\x94\x8f\x94\x8c\texecuting\x94}\x94\x8c9('random_sample-950bfce4388763a088b491ac651d6b18', 6, 22)\x94G?\xd8hh\x00\x00\x00\x00s\x8c\tresources\x94}\x94\x8c\x0eused_resources\x94}\x94\x8c\x05extra\x94}\x94\x8c\tserver_id\x94\x8c+Worker-e1d62c65-e3d3-41ff-bcd1-97f8dafeeeea\x94u\x86\x94b\x86\x94R\x94."
Traceback (most recent call last):
File "/opt/homebrew/Caskroom/mambaforge/base/envs/coiled-runtime/lib/python3.10/site-packages/distributed/protocol/pickle.py", line 66, in loads
return pickle.loads(x)
AttributeError: 'WorkerState' object has no attribute 'server_id'
Note: This is not helpful at all, and AttributeError: 'WorkerState' object has no attribute 'server_id' looks like a bug on its own.
How do we modify this?
We can use worker_disk_size when creating a cluster to allocate sufficient disk space. If we choose a sufficiently large size, everything works as expected.
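A minimal sketch of doing that, assuming worker_disk_size takes the size in GiB as an integer; the specific numbers and the other arguments are illustrative:

import coiled
from dask.distributed import Client

# Request workers whose disk can hold roughly 10x their RAM.
# Assumption for illustration: worker_disk_size is interpreted as GiB.
cluster = coiled.Cluster(
    n_workers=10,
    worker_memory="16 GiB",
    worker_disk_size=200,
)
client = Client(cluster)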
How are we doing against theoretical performance? Is Dask's spill-to-disk efficient?
I'd like dask/distributed#6835 merged first before diving into this, since we cannot copy values of the WorkerTable.
How are we doing against theoretical performance? Is Dask's spill-to-disk efficient?
In terms of hard numbers, we reach ~125 MiB/s on both exclusive read and exclusive write. It looks like performance numbers for EBS on EC2 instances are hard to come by (at the very least for the average throughput). To check against theoretical performance, we need to perform some experiments on raw EC2 instances.
One caveat I found is that if we keep referencing the large array and then calculate a sum on it, we see ~62 MiB/s of read and write at the same time. It looks like we are unspilling a chunk to calculate the sum on it, then spilling it back to disk because we need that memory for another chunk. Given the immutability of the chunks, we may want to consider a lazier policy that keeps the spilled copy on disk until it actually has to be removed, i.e. when we want to get rid of the chunk both in memory and on disk.
In addition to the trivial use case above, we would also like to have https://github.com/dask/distributed/blob/f7f650154fea29978906c65dd0225415da56ed11/distributed/tests/test_active_memory_manager.py#L1079-L1085 scaled up to production size.
After taking the first shot at this, it looks like scaling tensordot_stress up is more of a memory than a disk problem:
Aug 5 14:30:29 ip-10-0-5-24 kernel: [ 708.236580] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=c391e262f16fa83e76cb194623ffd8e4a9599f544effa9e34a78135b799ea07d,mems_allowed=0,global_oom,task_memcg=/docker/c391e262f16fa83e76cb194623ffd8e4a9599f544effa9e34a78135b799ea07d,task=python,pid=1914,uid=0
Aug 5 14:30:29 ip-10-0-5-24 kernel: [ 708.236609] Out of memory: Killed process 1914 (python) total-vm:4358956kB, anon-rss:3259096kB, file-rss:0kB, shmem-rss:8kB, UID:0 pgtables:6660kB oom_score_adj:0
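For context, a rough sketch of the kind of tensordot-style workload meant here; this is not the actual tensordot_stress test from coiled-runtime, and the sizes and chunking are made up for illustration:

import dask.array as da
from dask.distributed import Client

client = Client(cluster)  # assumed Coiled cluster

# A contraction whose inputs and intermediates exceed cluster RAM, so it can
# only finish if spilling works; with small enough chunks it should never OOM.
a = da.random.random((200_000, 200_000), chunks=(5_000, 5_000))
result = (a @ a.T).sum()
print(result.compute())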
Does Coiled have enough storage by default?
Since https://docs.coiled.io/user_guide/cloud_changelog.html#june-2022, the answer to this question is mostly NO[...]
To clarify: I mean that we do not have enough storage by default to store 10x more data than we have in RAM, which is the initial question of this issue. The discussion around default disk sizes can be found here: https://github.com/coiled/oss-engineering/issues/123. Following the discussion on that issue, and given how easy it is to adjust the disk size with worker_disk_size, the default seems reasonable to me.
To clarify: I mean that we do not have enough storage by default to store 10x more data than we have in RAM, which is the initial question of this issue
That's fine. I think any multiplier >1 is fine, assuming we can configure this on the Coiled side. The idea of this issue is to apply a workload that requires more memory than is available on the cluster but can finish successfully if data is stored to disk. Whether this is 1.5x, 2x or 10x is not that important.
After taking the first shot at this, it looks like scaling tensordot_stress up is more of a memory than a disk problem:
Also an interesting find. If chunks are small enough, this should always be able to finish. There is an edge case where disk is full, memory is full, and the entire cluster pauses. Beyond this edge case, the computation should always finish and we should definitely never see an OOM exception, iff the chunks are small enough.
@fjetter: I'm currently taking a look at what's going on in the tensordot_stress test, will keep you posted.
For the user, this means that the task eventually fails enough times and they receive a bunch of the following messages:
Traceback (most recent call last):
  File "/opt/homebrew/Caskroom/mambaforge/base/envs/coiled-runtime/lib/python3.10/site-packages/distributed/protocol/pickle.py", line 66, in loads
    return pickle.loads(x)
AttributeError: 'WorkerState' object has no attribute 'server_id'
Note: This is not helpful at all and AttributeError: 'WorkerState' object has no attribute 'server_id' looks like a bug on its own.
This was caused by a version mismatch with my custom software environment that used the latest distributed from main. After fixing that, it takes a lot of time for all tasks to either end up in-memory or erred, but when they do, the erred ones return a KilledWorkerException:
distributed.scheduler.KilledWorker("('random_sample-98bbea72b3b5abb3f8bf909c540c55f0', 0, 428)",
<WorkerState 'tls://10.0.3.246:45495', name: hendrik-debug-worker-bb94fa4efc, status: closed, memory: 0, processing: 222>
One reason why it takes a lot of time is that workers keep straggling when reaching the disk size limit and we wait a while before deciding to re-assign those tasks to other workers (and removing said worker). (Cluster ID for inspection: https://cloud.coiled.io/dask-engineering/clusters/48344/details)
One thing that might be helpful for the failing case is monitoring of disk usage for each worker. We currently monitor how much data is spilled, but we do not track/display disk used/disk free anywhere.
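A minimal sketch of how one could poll this ad hoc from the client today, assuming shutil.disk_usage on each worker's local directory is an acceptable proxy (this is not the dashboard integration discussed here):

import shutil
from dask.distributed import Client

client = Client(cluster)  # assumed Coiled cluster

def disk_usage(dask_worker):
    # shutil.disk_usage reports total/used/free bytes for the filesystem
    # holding the worker's spill directory.
    usage = shutil.disk_usage(dask_worker.local_directory)
    return {"total": usage.total, "used": usage.used, "free": usage.free}

# client.run executes the function on every worker; the dask_worker keyword
# argument receives the Worker instance.
print(client.run(disk_usage))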
Persist a lot of data on disk
What happens when we load 10x more data than we have RAM?
- Does Coiled have enough storage by default? How do we modify this?
- How are we doing against theoretical performance? Is Dask's spill-to-disk efficient?
Pseudocode
x = da.random.random(...).persist()  # load the data
wait(x)
x.sum().compute()  # force us to read the data again
[EDIT by @crusaderky] In addition to the trivial use case above, we would also like to have https://github.com/dask/distributed/blob/f7f650154fea29978906c65dd0225415da56ed11/distributed/tests/test_active_memory_manager.py#L1079-L1085 scaled up to production size. This will stress the important use case of spilled tasks that are taken out of the spill file and back into memory not to be computed, but to be transferred to another worker. This stress test should find a sizing that is spilling/unspilling heavily but still completes successfully (a rough sketch of this pattern follows the related issues below). Related:
#135
#140
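A minimal sketch of that pattern (not the actual test): persist more data than fits in RAM, then force keys to move between workers, which requires unspilling them on the sending side. The sizes and the use of Client.rebalance are assumptions for illustration:

import dask.array as da
from dask.distributed import Client, wait

client = Client(cluster)  # assumed Coiled cluster

# Persist more data than fits in cluster RAM so a large share of it is spilled.
x = da.random.random((500_000, 100_000), chunks=(10_000, 1_000)).persist()
wait(x)

# Rebalancing moves keys between workers; spilled keys have to be read back
# from disk before they can be transferred.
client.rebalance()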