What browser are you using for this?
If I just put in the IP, I get errors like ERR_INVALID_HTTP_RESPONSE or similar in Chrome and Safari.
Firefox actually manages to get through to the server and just prints the handshake info (the stuff we're sending over the network to a connecting server)
I can see how your exception can be triggered from our server code but the browser must send something rather specific to trigger such a response.
Ah interesting, didn't consider this.
Chrome through JupyterLab proxy.
I ran into a very similar issue today. tl;dr: Unable to allocate 6.34 EiB for an array with shape (7311138144931639129,) and data type uint8.
Running Docker images for all three distributed layers:
- Google Cloud VM running the scheduler via docker run --network host --mount type=bind,source="$(pwd)"/dask-env.yaml,target=/etc/dask/dask-env.yaml,readonly --name scheduler --rm ghcr.io/dask/dask dask-scheduler
- Local worker, reachable through an ngrok TCP tunnel, via docker run -p 13370:13370 ghcr.io/dask/dask dask worker --contact-address tcp://n.tcp.xx-xxx-n.ngrok.io:15721 --listen-address tcp://localhost:13370 tcp://xx.xxx.xx.xxx:8786
- Local Jupyter notebook via docker run -p 8888:8888 ghcr.io/dask/dask-notebook
I open the Jupyter notebook in Google Chrome via the http://127.0.0.1:8888/lab?token=5ff5.... URL printed in the Docker output, then add a cell in the notebook to connect to the scheduler and run some work.
import dask
from dask.distributed import Client

def inc(x):
    print('yay')
    return x + 1

client = Client('xx.xxx.xx.xxx:8786')
x = client.submit(inc, 10)
L = client.map(inc, range(1000))

print('x result', x.result())
print('L gather', client.gather(L))
I see the printed "yay" in my local worker, and I see the task completion debug logs in my scheduler.
However, gathering the results fails with
2024-07-22 16:22:47,876 - distributed.core - DEBUG - Message from 'tcp://[local_static_ip]:61927': {'op': 'gather', 'keys': ('inc-0a4704b07c1765924dc76f5c705ae806',), 'reply': True}
2024-07-22 16:22:47,876 - distributed.core - DEBUG - Calling into handler gather
2024-07-22 16:22:47,877 - distributed.comm.core - DEBUG - Establishing connection to [ngrok_endpoint_address]:15721
2024-07-22 16:22:47,900 - distributed.comm.tcp - DEBUG - Setting TCP keepalive: nprobes=10, idle=10, interval=2
2024-07-22 16:22:47,900 - distributed.comm.tcp - DEBUG - Setting TCP user timeout: 30000 ms
2024-07-22 16:22:47,922 - distributed.utils_comm - ERROR - Unexpected error while collecting tasks ['inc-0a4704b07c1765924dc76f5c705ae806'] from tcp://[ngrok_endpoint_address]:15721
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/distributed/utils_comm.py", line 459, in retry_operation
    return await retry(
  File "/opt/conda/lib/python3.10/site-packages/distributed/utils_comm.py", line 438, in retry
    return await coro()
  File "/opt/conda/lib/python3.10/site-packages/distributed/worker.py", line 2866, in get_data_from_worker
    comm = await rpc.connect(worker)
  File "/opt/conda/lib/python3.10/site-packages/distributed/core.py", line 1533, in connect
    return connect_attempt.result()
  File "/opt/conda/lib/python3.10/site-packages/distributed/core.py", line 1423, in _connect
    comm = await connect(
  File "/opt/conda/lib/python3.10/site-packages/distributed/comm/core.py", line 377, in connect
    handshake = await comm.read()
  File "/opt/conda/lib/python3.10/site-packages/distributed/comm/tcp.py", line 227, in read
    frames_nosplit = await read_bytes_rw(stream, frames_nosplit_nbytes)
  File "/opt/conda/lib/python3.10/site-packages/distributed/comm/tcp.py", line 359, in read_bytes_rw
    buf = host_array(n)
  File "/opt/conda/lib/python3.10/site-packages/distributed/protocol/utils.py", line 29, in host_array
    return numpy.empty((n,), dtype="u1").data
numpy._core._exceptions._ArrayMemoryError: Unable to allocate 6.34 EiB for an array with shape (7311138144931639129,) and data type uint8
2024-07-22 16:22:47,924 - distributed.scheduler - ERROR - Couldn't gather keys: {'inc-0a4704b07c1765924dc76f5c705ae806': 'memory'}
I should also mention that I am able to see the TCP connections from the GCP scheduler into my local worker at the gather step by monitoring the ngrok tunnel stats, so I was able to verify connectivity.
>I can see how your exception can be triggered from our server code but the browser must send something rather specific to trigger such a response.
Does my error line up with your expectations here? Any ideas on why this might be happening here?
I'm not familiar with ngrok so I can't tell what's going on in your case.
The way I think the original exception was triggered is that the browser connected to the Dask server, and the server tried to engage in its application-side handshake (where it reads and writes on the TCP socket). However, instead of receiving plain bytes corresponding to our protocol, it encountered an HTTP message, which ended up triggering this exception. Our protocol uses the first couple of bytes in a message to infer how much data is incoming, and we use that information to allocate memory efficiently; if those first bytes are anything else / random bytes, they are easily interpreted as a very big integer.
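To make that failure mode concrete, here is a small, hypothetical sketch (not Dask's actual framing code; the 8-byte little-endian length prefix is an assumption for illustration) of how a length prefix read from a non-protocol byte stream blows up into an impossible allocation:

```python
import struct

import numpy

# Hypothetical illustration: the receiver expects the message to start with a
# binary frame length, but a plain-text HTTP response arrives instead.
payload = b"HTTP/1.1 400 Bad Request\r\n"

# Interpreting the first 8 ASCII bytes as a little-endian uint64 yields a
# gigantic "frame length" (in the exabyte range).
(frame_length,) = struct.unpack("<Q", payload[:8])
print(f"{frame_length} bytes ~= {frame_length / 2**60:.2f} EiB")

# Pre-allocating a receive buffer of that size then fails, much like the
# _ArrayMemoryError tracebacks quoted above.
numpy.empty((frame_length,), dtype="u1")  # raises a MemoryError
```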
I'm not sure what ngrok does, but if it is changing the bytestream even slightly, this could cause such an exception. It could also happen if ngrok erroneously treats the connection as HTTP.
Ah, I see - that makes sense. I bet there is a connection problem in the scheduler -> ngrok -> worker direction and the error payload is triggering this.
Thanks for the insight :)
Just starting dask scheduler --host 0.0.0.0 in a conda environment and then trying to access http://ip:8786 will result in this on 2024.7.1 from conda-forge.
Any updates? I followed the guide to provision a new cluster with the k8s operator and am hitting this error.
As @fjetter says, I think a lot of people landing on this issue are coming here because this error happens when you try to open the Dask TCP port used for communication in a browser.
Reproducer steps:
1. Run dask scheduler.
2. Open the scheduler's TCP address (e.g. http://localhost:8786) in a browser.
This results in the :��������������*�������ƒ«compressionÀ¦python“¯pickle-protocol message in the browser and the numpy.core._exceptions._ArrayMemoryError: Unable to allocate 7.40 EiB for an array with shape (8530211521808319815,) and data type uint8 exception in the scheduler.
This is expected behaviour. You're opening a TCP-only connection in a web browser. If you're trying to access the dashboard, you need to connect to a different port: http://localhost:8787.
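For anyone landing here while looking for the dashboard, a minimal sketch (assuming a standard dask.distributed Client; the address is a placeholder) of how to find the browsable dashboard URL rather than the TCP scheduler address:

```python
from dask.distributed import Client

client = Client("tcp://xx.xxx.xx.xxx:8786")  # scheduler's TCP port, not browsable
print(client.dashboard_link)                 # HTTP dashboard URL, port 8787 by default
```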
The discussion about ngrok is interesting. Ngrok supports HTTP proxying (layer 7) and TCP proxying (layer 4). It supports both modes because there are pros and cons to each; see this article to learn more. I assume that the folks who are running into issues are using HTTP proxying instead of TCP proxying, which results in the same error as when you open the TCP port in a browser. The fix should just be to use TCP proxying.
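As a sketch of that distinction (assuming the standard ngrok CLI; exact invocation may vary by version and plan), the two tunnel modes look like this:

```sh
# Layer-7 tunnel: ngrok speaks HTTP to the client, which mangles Dask's binary protocol
ngrok http 8786

# Layer-4 tunnel: raw bytes are forwarded unchanged, which is what Dask's comms need
ngrok tcp 8786
```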
I'm going to close this issue out as "wontfix", as hopefully this comment solves most folks' problems. I've also opened #8905 to track improving the failure mode of opening the TCP port in a browser.
If there are still ngrok-related issues that happen when using TCP proxying, then I encourage folks to open a new issue with steps to reproduce so we can look into it further.
Sorry for the screenshot; I don't have copy-and-paste or GitHub access on that machine.
Describe the issue:
When you try to open the dashboard through the link printed by print(client), you trigger this exception in the scheduler.
Minimal Complete Verifiable Example:
Try to open that URL in the browser (I thought it was the dashboard URL).
Anything else we need to know?:
Environment: