dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License
1.55k stars, 712 forks

Upgrading distributed from 2022.03.0 to 2024.2.0 causes performance issues. #8646

Closed yiershanxll closed 1 month ago

yiershanxll commented 1 month ago

Problem: We tested 5 times, and each time the problem occurred at the 25-minute mark. The error message "distributed.comm.core.CommClosedError: in <TLS (closed) Scheduler Broadcast local=tls://182.10.4.6:58090 remote=tls://182.10.2.6:18715>: Stream is closed" is displayed. This problem does not exist in the earlier version, dask==2022.03.0. Although the task reports an error, the background worker keeps executing it properly until the calculation is complete.

Environment information:

Number of nodes: Two containers with 8 vCPUs and 16 GB of memory are deployed.

Number of workers: Two workers are started using the dask command. Each worker has five processes with one thread each. Memory usage is limited to 90%. In total, 10 worker processes run in the background.

Distributed computing: We use the client.run method to submit tasks to each worker for processing. Each worker takes a file as input and processes it with pandas (dask.dataframe is not used). The output is also a file.

yiershanxll commented 1 month ago

The worker-ttl parameter in distributed.yaml needs to be set to null.
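For anyone who would rather apply this workaround from Python than edit distributed.yaml, here is a minimal sketch. It assumes the setting meant above is the `distributed.scheduler.worker-ttl` configuration key, and that the code runs in the process that starts the scheduler, before the scheduler is created. Setting it only on the client side would not change a scheduler that is already running elsewhere.

```python
import dask

# Assumption: the "worker-ttl" mentioned above is the
# distributed.scheduler.worker-ttl configuration key.
# None in Python corresponds to null in distributed.yaml.
dask.config.set({"distributed.scheduler.worker-ttl": None})

# Read the value back to confirm what this process will use.
ttl = dask.config.get("distributed.scheduler.worker-ttl")
print("worker-ttl:", ttl)
```

The worker TTL controls how long the scheduler waits for a worker heartbeat before it considers the worker lost, so disabling it also removes that safety net for workers that really have hung.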

fjetter commented 1 month ago

Just driving by: Client.run is not necessarily meant for users to run their computations. It is mostly used for diagnostic purposes, debugging, and occasionally for more exotic things. As the docs for Client.run already suggest, this function runs outside of the task scheduling system.

Users should instead use Client.submit to schedule individual functions.

You will also notice that with Client.run the dashboard does not actually work, just as many other features will not work.
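A rough sketch of what the Client.submit approach could look like for the one-file-per-task workload described earlier in this thread. The `process_file` function, the output naming, and the file names are hypothetical placeholders, not code from the reporter. The in-process `Client(processes=False)` is only there to keep the example self-contained; a real deployment would connect to the existing scheduler address instead.

```python
import pandas as pd
from distributed import Client


def process_file(path):
    """Hypothetical per-file job: read one input file with pandas, write one output file."""
    df = pd.read_csv(path)
    out_path = path + ".out.csv"
    df.describe().to_csv(out_path)  # placeholder for the real processing
    return out_path


def run_all(client, paths):
    """Schedule one task per input file through the scheduler, then wait for the results."""
    futures = [client.submit(process_file, p) for p in paths]
    return client.gather(futures)


if __name__ == "__main__":
    # Assumption: a throwaway in-process cluster, used only for illustration.
    client = Client(processes=False)
    try:
        # Hypothetical input files; replace with real paths.
        print(run_all(client, ["part-0.csv", "part-1.csv"]))
    finally:
        client.close()
```

Because these tasks go through the scheduler, they show up on the dashboard and the scheduler decides which worker runs each one, whereas Client.run simply calls the function once on every worker.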