dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License

What might cause a dask graph to freeze/hang when there's only 1 worker but work with no problems when there are 2 or more workers #3730

Open CMCDragonkai opened 4 years ago

CMCDragonkai commented 4 years ago

In my application I've written a test fixture to set up a dask cluster.

I noticed that when my local cluster only had 1 worker in it, the dask graph would compute up to a point and then just hang forever with no errors. The dashboard just shows the whole graph waiting on several tasks in the middle of the whole graph.

Debugging this led nowhere, so I just added a second worker. As soon as the cluster had 2 workers, the graph completed.

The workers have no memory limit or --memory-limit=0 and there are no constraints on the execution.

Unfortunately the graph is quite large, and so I'm not able to replicate it in this issue.

jcrist commented 4 years ago

Hi @CMCDragonkai,

The workers have no memory limit or --memory-limit=0 and there are no constraints on the execution.

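I'm having a bit of trouble parsing this sentence. Do you mean:

  • You passed in --memory-limit=0
  • You didn't pass anything for --memory-limit, which defaults to auto, a memory limit based on the memory available on the system?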

The dask graph would compute up to a point and then just hang forever with no errors. The dashboard just shows the whole graph waiting on several tasks in the middle of the whole graph.

Can you describe how it hangs? Are the tasks still computing, or is nothing happening? How is your cluster set up? Do you have multiple threads per worker?
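One way to answer these questions from a Python session attached to the stuck cluster is sketched below. It is only an illustration: `hang_snapshot` is a made-up helper name, and the scheduler address in `demo` is a placeholder. The calls it relies on (`Client.scheduler_info`, `Client.processing`, `Client.call_stack`) are part of the `dask.distributed.Client` API.

```python
def hang_snapshot(client):
    """Summarise what a seemingly stuck cluster is doing right now.

    `client` is expected to be a dask.distributed.Client.
    """
    workers = client.scheduler_info()["workers"]
    return {
        # worker address -> number of threads on that worker
        "threads": {addr: info.get("nthreads") for addr, info in workers.items()},
        # worker address -> task keys the scheduler reports as processing there
        "processing": client.processing(),
        # worker address -> call stacks of tasks executing at this moment
        "call_stack": client.call_stack(),
    }


def demo():
    """Hypothetical usage; needs dask and distributed installed."""
    from dask.distributed import Client

    client = Client("tcp://127.0.0.1:8786")  # placeholder scheduler address
    for section, value in hang_snapshot(client).items():
        print(section, value)
```

If `processing` lists tasks while `call_stack` comes back empty, that would suggest tasks are assigned but nothing is actually executing, which is a different situation from a task that is running and blocked.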

CMCDragonkai commented 4 years ago

The memory limit is 0. There are still pending tasks, and some tasks have completed. The cluster was 1 scheduler and 1 worker with 1 thread, and it froze every time. With 2 workers, each with 1 thread, it worked with no problems.

On 21 April 2020 08:37:38 GMT+10:00, Jim Crist-Harif notifications@github.com wrote:

Hi @CMCDragonkai,

The workers have no memory limit or --memory-limit=0 and there are no constraints on the execution.

I'm having a bit of trouble parsing this sentence. Do you mean:

  • You passed in --memory-limit=0
  • You didn't pass anything for --memory-limit, which defaults to auto, a memory limit based on the memory available on the system?

The dask graph would compute up to a point and then just hang forever with no errors. The dashboard just shows the whole graph waiting on several tasks in the middle of the whole graph.

Can you describe how it hangs? Are the tasks still computing, or is nothing happening? How is your cluster set up? Do you have multiple threads per worker?


jcrist commented 4 years ago

What happens if you give the one worker 2 threads?

CMCDragonkai commented 4 years ago

Nope, with 1 process and 2 threads it still doesn't work:

image

CMCDragonkai commented 4 years ago

With 2 processes it then works perfectly:

image
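For anyone trying to reproduce this, the three layouts tried so far can be written down as `LocalCluster` arguments. This is a sketch under assumptions: the original fixture is not shown in the thread, so `LAYOUTS`, `local_cluster_kwargs`, `run_layout` and `build_graph` are hypothetical names, and the "2 processes" case is assumed to use 1 thread per process as in the earlier report.

```python
# Worker layouts reported in this thread: (worker processes, threads per process).
LAYOUTS = {
    "1 process x 1 thread": (1, 1),    # reported to hang
    "1 process x 2 threads": (1, 2),   # reported to hang
    "2 processes x 1 thread": (2, 1),  # reported to complete
}


def local_cluster_kwargs(n_workers, threads_per_worker):
    """Keyword arguments for distributed.LocalCluster matching the report."""
    return {
        "n_workers": n_workers,
        "threads_per_worker": threads_per_worker,
        "processes": True,   # separate worker processes, not threads in one process
        "memory_limit": 0,   # 0 disables the per-worker memory limit
    }


def run_layout(name, build_graph, timeout=60):
    """Run a graph on one layout; raises a timeout error if it does not finish.

    `build_graph` is a hypothetical callable returning a dask collection.
    The import is local so the table above can be used without dask installed.
    """
    from dask.distributed import Client, LocalCluster

    n_workers, threads = LAYOUTS[name]
    kwargs = local_cluster_kwargs(n_workers, threads)
    with LocalCluster(**kwargs) as cluster, Client(cluster) as client:
        future = client.compute(build_graph())
        return future.result(timeout=timeout)
```

On platforms that spawn worker processes (Windows, macOS), `run_layout` should be called from under an `if __name__ == "__main__":` guard.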

quasiben commented 4 years ago

Is it possible to call wait/persist earlier in the task generation and batch the job onto one worker to help debug?
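A rough sketch of what that could look like: split the graph into stages, persist them one at a time, and wait on each with a timeout to see where it first gets stuck. The names here (`first_stuck_stage`, the `stages` list, the toy `inc` graph in `demo`) are hypothetical, and the split into stages is an assumption about how the real graph is built. `wait` is passed in and is expected to be `dask.distributed.wait`.

```python
import asyncio


def first_stuck_stage(client, stages, wait, timeout=30.0):
    """Persist each stage in order and return the name of the first one that
    does not finish within `timeout` seconds, or None if all of them finish.

    `stages` is a list of (name, dask collection) pairs, ordered from the
    earliest part of the graph to the latest.
    """
    kept = []  # hold references so finished results are not released
    for name, collection in stages:
        persisted = client.persist(collection)  # start computing this stage now
        kept.append(persisted)
        try:
            wait(persisted, timeout=timeout)
        except (TimeoutError, asyncio.TimeoutError):
            return name
    return None


def demo():
    """Hypothetical usage on a toy graph; needs dask and distributed installed."""
    import dask
    from dask.distributed import Client, LocalCluster, wait

    @dask.delayed
    def inc(x):
        return x + 1

    early = inc(1)      # stand-in for the first part of the real graph
    late = inc(early)   # stand-in for a later part that depends on it

    with LocalCluster(n_workers=1, threads_per_worker=1, memory_limit=0) as cluster:
        with Client(cluster) as client:
            print(first_stuck_stage(client, [("early", early), ("late", late)], wait))
```

Running this against the 1-worker cluster and noting which stage name comes back would narrow down which tasks to look at in the dashboard.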