dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License

Unable to allocate 5.27 EiB ... when trying to access cluster dashboard through wrong URL #8368

Closed jonashaag closed 2 weeks ago

jonashaag commented 11 months ago

Sorry for screenshot, I don't have copy and paste or GitHub access on that machine.

Describe the issue:

When you try to open the dashboard through the link printed by print(client), you trigger this exception in the scheduler

Minimal Complete Verifiable Example:

client = Client()
print(client) # Prints a tcp:// URL that's NOT the dashboard URL

Try to open that URL in a browser (I thought it was the dashboard URL).
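To get the dashboard address instead of the scheduler's RPC address, the client exposes it directly. A minimal sketch, assuming `distributed` is installed and a default local cluster:

```python
from distributed import Client

client = Client()
print(client)                 # repr includes the scheduler's tcp:// address (not a web page)
print(client.dashboard_link)  # the dashboard URL, e.g. http://127.0.0.1:8787/status
client.close()
```

Opening `client.dashboard_link` in a browser shows the dashboard; opening the tcp:// address triggers the exception described in this issue.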

Anything else we need to know?:

Environment:

fjetter commented 11 months ago

What browser are you using for this?

If I just put in the IP, I get errors like ERR_INVALID_HTTP_RESPONSE or similar for chrome and safari

Firefox actually manages to get through to the server and just prints the handshake info (the stuff we're sending over the network to a connecting server)

[screenshot: Firefox displaying the raw handshake bytes]

I can see how your exception can be triggered from our server code but the browser must send something rather specific to trigger such a response.

jonashaag commented 11 months ago

Ah interesting, didn't consider this.

Chrome through JupyterLab proxy.

RaiinmakerWes commented 3 months ago

I ran into a very similar issue today. tl;dr Unable to allocate 6.34 EiB for an array with shape (7311138144931639129,) and data type uint8

Running Docker images for all three distributed layers:

def inc(x):
    print('yay')
    return x + 1

client = Client('xx.xxx.xx.xxx:8786')

x = client.submit(inc, 10)
L = client.map(inc, range(1000))

print('x result', x.result())
print('L gather', client.gather(L))


I see the printed "yay" in my local worker, and I see the task completion debug logs in my scheduler.
However, gathering the results fails with

2024-07-22 16:22:47,876 - distributed.core - DEBUG - Message from 'tcp://[local_static_ip]:61927': {'op': 'gather', 'keys': ('inc-0a4704b07c1765924dc76f5c705ae806',), 'reply': True}
2024-07-22 16:22:47,876 - distributed.core - DEBUG - Calling into handler gather
2024-07-22 16:22:47,877 - distributed.comm.core - DEBUG - Establishing connection to [ngrok_endpoint_address]:15721
2024-07-22 16:22:47,900 - distributed.comm.tcp - DEBUG - Setting TCP keepalive: nprobes=10, idle=10, interval=2
2024-07-22 16:22:47,900 - distributed.comm.tcp - DEBUG - Setting TCP user timeout: 30000 ms
2024-07-22 16:22:47,922 - distributed.utils_comm - ERROR - Unexpected error while collecting tasks ['inc-0a4704b07c1765924dc76f5c705ae806'] from tcp://[ngrok_endpoint_address]:15721
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/distributed/utils_comm.py", line 459, in retry_operation
    return await retry(
  File "/opt/conda/lib/python3.10/site-packages/distributed/utils_comm.py", line 438, in retry
    return await coro()
  File "/opt/conda/lib/python3.10/site-packages/distributed/worker.py", line 2866, in get_data_from_worker
    comm = await rpc.connect(worker)
  File "/opt/conda/lib/python3.10/site-packages/distributed/core.py", line 1533, in connect
    return connect_attempt.result()
  File "/opt/conda/lib/python3.10/site-packages/distributed/core.py", line 1423, in _connect
    comm = await connect(
  File "/opt/conda/lib/python3.10/site-packages/distributed/comm/core.py", line 377, in connect
    handshake = await comm.read()
  File "/opt/conda/lib/python3.10/site-packages/distributed/comm/tcp.py", line 227, in read
    frames_nosplit = await read_bytes_rw(stream, frames_nosplit_nbytes)
  File "/opt/conda/lib/python3.10/site-packages/distributed/comm/tcp.py", line 359, in read_bytes_rw
    buf = host_array(n)
  File "/opt/conda/lib/python3.10/site-packages/distributed/protocol/utils.py", line 29, in host_array
    return numpy.empty((n,), dtype="u1").data
numpy._core._exceptions._ArrayMemoryError: Unable to allocate 6.34 EiB for an array with shape (7311138144931639129,) and data type uint8
2024-07-22 16:22:47,924 - distributed.scheduler - ERROR - Couldn't gather keys: {'inc-0a4704b07c1765924dc76f5c705ae806': 'memory'}



I should also mention that, by monitoring the ngrok tunnel stats, I can see the TCP connections from the GCP scheduler into my local worker at the gather step, so I was able to verify connectivity.

>I can see how your exception can be triggered from our server code but the browser must send something rather specific to trigger such a response.

Does my error line up with your expectations? Any ideas on why this might be happening?

fjetter commented 3 months ago

I'm not familiar with ngrok so I can't tell what's going on in your case.

The way I think the original exception was triggered: the browser connected to the dask server, and the server tried to engage in its application-side handshake, where it reads from and writes to the TCP socket. Instead of receiving plain bytes that correspond to our protocol, it encountered an HTTP message. Our protocol uses the first few bytes of a message to infer how much data is incoming, and we use that information to allocate memory efficiently up front. If those first bytes are anything else (effectively random bytes), they are easily interpreted as a very large integer.

I'm not sure what ngrok does, but if it changes the bytestream even slightly, that could cause such an exception. It could also happen if ngrok erroneously assumes the connection is using HTTP.
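As an illustration of this failure mode (not dask's exact wire format), treating the first eight bytes of an HTTP request as a little-endian unsigned 64-bit length prefix yields an allocation request in the exabyte range:

```python
import struct

# The first eight bytes a browser sends when it speaks HTTP to a raw socket.
header = b"GET / HT"

# Interpreted as a hypothetical little-endian 8-byte length prefix:
n = struct.unpack("<Q", header)[0]
print(f"{n:,} bytes = {n / 2**60:.2f} EiB")  # ~5.27 EiB, as in the issue title
```

Any ASCII text has high bytes set in the most significant positions, so almost every HTTP request mis-parsed this way produces a multi-EiB "allocation".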

RaiinmakerWes commented 3 months ago

Ah, I see - that makes sense. I bet there is a connection problem in the scheduler -> ngrok -> worker direction and the error payload is triggering this.

Thanks for the insight :)

zoltan commented 2 months ago

just starting dask scheduler --host 0.0.0.0 in a conda environment and then trying to access http://ip:8786 will result in this on 2024.7.1 from conda-forge.
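For reference, the two ports serve different protocols. A sketch assuming the default ports of the `dask scheduler` CLI:

```shell
# Start the scheduler; both endpoints are printed at startup.
dask scheduler --host 0.0.0.0
# tcp://0.0.0.0:8786  -> binary RPC port; opening this in a browser triggers the error
# http://0.0.0.0:8787 -> the dashboard; open this one in the browser instead
```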

dimm0 commented 2 weeks ago

Any updates? Followed the guide to provision a new cluster with k8s operator and hitting this error

jacobtomlinson commented 2 weeks ago

As @fjetter says, I think a lot of people landing on this issue are coming here because this error happens when you try to open, in a browser, the Dask TCP port used for communication.

Reproducer steps

This results in the :��������������*�������ƒ«compressionÀ¦python“ ¯pickle-protocol message in the browser and the numpy.core._exceptions._ArrayMemoryError: Unable to allocate 7.40 EiB for an array with shape (8530211521808319815,) and data type uint8 exception in the scheduler.

This is expected behaviour. You're opening a TCP-only connection in a web browser. If you're trying to access the dashboard, you need to connect to a different port: http://localhost:8787.

The discussion about ngrok is interesting. Ngrok supports HTTP proxying (layer 7) and TCP proxying (layer 4). They support both modes as there are pros/cons to each, see this article to learn more. I assume that folks who are running into issues are using HTTP proxying instead of TCP proxying, which results in the same error as when you open the TCP port in a browser. The fix for this should just be to use the TCP proxying.
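A sketch of the distinction using the ngrok CLI (the port is dask's default scheduler port):

```shell
# Layer-4 (TCP) tunnel: forwards raw bytes unchanged, suitable for dask's protocol.
ngrok tcp 8786

# Layer-7 (HTTP) tunnel: speaks HTTP to the backend, which corrupts dask's
# binary handshake and produces the EiB allocation error described above.
ngrok http 8786
```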

I'm going to close this issue out as "wontfix", as hopefully this comment solves most folks' problems. I've also opened #8905 to track improving the failure mode of opening the TCP port in a browser.

If there are still ngrok-related issues that happen when using TCP proxying, then I encourage folks to open a new issue with steps to reproduce so we can look into it further.