dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License

Dask really slow towards the end of the job #2835

Closed ohashmi1 closed 3 weeks ago

ohashmi1 commented 5 years ago

Hi, I am using Dask distributed with 300 workers. I have a cluster of 32 machines with over 300 CPUs, and I want the computation to be distributed across all workers. Currently, when I run my job it distributes the tasks fairly well, but towards the end the workers stop working and only one or two of them show high CPU usage.

Generally, I have about 8,000 time series that I want to process, and I want the computation to be distributed, so I use dask.delayed and then call compute on the resulting list.

predictions = nostradamus.run_prophet(data,
                                      brands_list=brands_1)
x = client.compute(predictions, retries=10)
wait(x)
ds = [i.result() for i in x]  # constructs a list of dataframes
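
For context, the overall pattern is roughly the following minimal sketch; process_series and the generated data are placeholders for the real Prophet-based function and the 8,000 series:

```python
from dask import delayed
from dask.distributed import Client, wait

def process_series(series):
    # Placeholder for the real per-series Prophet fit/predict
    return sum(series)

if __name__ == "__main__":
    client = Client()  # or Client("tcp://scheduler-address:8786") for a real cluster
    series_list = [list(range(100)) for _ in range(8000)]  # placeholder data
    predictions = [delayed(process_series)(s) for s in series_list]
    x = client.compute(predictions, retries=10)
    wait(x)
    ds = [f.result() for f in x]  # list of per-series results
```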

I have the following images from the dashboard. I just don't understand why it stops distributing at the very end.

Any help would be appreciated!

[Dashboard screenshots: Screen Shot 2019-07-12 at 11 21 24 AM, Screen Shot 2019-07-12 at 11 21 34 AM]
mrocklin commented 5 years ago

You might try looking at what the workers are working on using the Client.call_stack method or looking in the worker pages in the "Info" tab of the dashboard.

As always, if you're able to provide a minimal reproducible example, that's often the most helpful approach. See this blog post.
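
For reference, a minimal sketch of pulling the call stacks from the client side (the scheduler address is a placeholder):

```python
from pprint import pprint
from dask.distributed import Client

client = Client("tcp://scheduler-address:8786")  # placeholder scheduler address

# Call stacks of everything currently running on the workers; an empty dict
# for a worker means it is not actively executing anything.
pprint(client.call_stack())
```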

ohashmi1 commented 5 years ago

I looked at the Info tab, but it just shows that none of them are processing anything. I am pretty new to Dask and don't really know how to debug this, and I'm not sure what to look for in the call stack. Could it be because I persist dataframes in memory? The memory is never 100% used, though. It only happens at the very end; the workers progressively do less work as the computation advances.

There isn't a whole lot of code I can reproduce it with. It also only happens when I have a large dataset; it works fine on smaller sets.

Screen Shot 2019-07-12 at 12 30 27 PM
mrocklin commented 5 years ago

In that case I recommend trying to produce a minimal reproducible example, so that maintainers are able to help you.


birdsarah commented 5 years ago

I think I'm getting something similar. I ended up with workers that looked like below.

One worker has collected a whole bunch of tasks that are all in "processing" but not doing anything, and nothing else is processing.

Manually using client.retire_worker helped to unstick things for me.

There were definitely tasks on the queue for the worker I retired that had been put there after another worker died: one I clicked on had been marked "suspicious: 1", which I believe happens when a task was previously running on a worker that was killed.

Screenshot from 2019-07-25 22-46-56
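
A minimal sketch of that manual workaround, for anyone else hitting this (both addresses below are placeholders):

```python
from dask.distributed import Client

client = Client("tcp://scheduler-address:8786")  # placeholder scheduler address

# Retire the stuck worker; its tasks should be rescheduled onto other workers.
stuck_worker = "tcp://10.0.0.1:12345"  # placeholder worker address
client.retire_workers([stuck_worker])
```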

mrocklin commented 5 years ago

That's odd. Do you happen to have logs from that worker?

I imagine that this is hard, but if anyone has a reproducible example, that would be helpful.


mrocklin commented 5 years ago

Also, getting the output of Client.call_stack (also available through the worker info pages) would be helpful in determining what that odd worker is up to.


rbubley commented 5 years ago

I’ve also had this, and in my case it seemed to be some sort of race condition in the scheduler. For me, it was fixed by moving the scheduler from toolz to cytoolz, which also seemed to imply it was some sort of timing issue.

birdsarah commented 5 years ago

moving the scheduler from toolz to cytoolz

@rbubley can you share info on how you did that?


rbubley commented 5 years ago

@birdsarah, it is literally as simple as a pip/conda install of cytoolz: distributed attempts to import cytoolz in preference to toolz.

(Various bits of the distributed library code use this pattern:

try:
  from cytoolz import reduce
except ImportError:
  from toolz import reduce

)
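
For completeness, a quick sketch of checking which implementation an environment would pick up under that pattern:

```python
# Rough check of which toolz implementation the try/except pattern above
# would resolve to in the current environment.
try:
    import cytoolz as tlz
except ImportError:
    import toolz as tlz

print(tlz.__name__, tlz.__version__)
```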

birdsarah commented 5 years ago

In a recent run with the same problem, I had about 450 tasks out of 100k+ remaining and listed as ready, but they weren't on any worker. A call to client.call_stack returned an empty dict.

I discovered that, presumably due to some packaging conflicts, I had both toolz and cytoolz installed. I have removed toolz and am going to see what happens.

birdsarah commented 5 years ago

Oh... okay, so the options are toolz only, or toolz and cytoolz (dask won't run without plain toolz).

birdsarah commented 5 years ago

Holy moly, that seems to have done the trick. I've only got a couple of results so far and am loath to get too excited, but I think you might be on for hero of the week, @rbubley.

@mrocklin I also observed the loads redistributing as you described once I removed cytoolz.

mrocklin commented 5 years ago

Just so I'm clear, was there a mismatch in environments where some workers/clients/schedulers had cytoolz and some didn't, or were environments homogeneous before, and another homogeneous change in those environments helped restore things?

rbubley commented 5 years ago

@mrocklin, in my case, they were homogeneous (no cytoolz on client/scheduler/worker) and I saw the problem; the problem was fixed when it became inhomogeneous (cytoolz only on scheduler). But my understanding is that cytoolz is supposed to have the same results as toolz (up to the documented return values; non-guaranteed orderings can change), and conceptually this shouldn't be a library where homogeneity is required. (Although something subtle around orderings is always possible.)

What we should expect is that switching between toolz and cytoolz affects timings, and could therefore surface previously unobserved race conditions. Given the observed effects, i.e. some workers being idle when they shouldn't be, I would guess that the issue is in the work-stealing code - perhaps some sequence of actions is assumed to be atomic when it isn't.
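
If the stealing hypothesis needs testing, one option (a sketch; it requires restarting the scheduler with this configuration in place) is to disable work stealing entirely:

```python
import dask

# Turn off work stealing on the scheduler; this must be set in the
# scheduler's config/environment before the scheduler process starts.
dask.config.set({"distributed.scheduler.work-stealing": False})
```

The equivalent environment variable (DASK_DISTRIBUTED__SCHEDULER__WORK_STEALING=False) or a dask YAML config file should also work when the scheduler is launched as a separate process.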

birdsarah commented 5 years ago

Things were homogeneous. What I meant was that on initial environment creation, dependency resolution meant that both toolz and cytoolz were installed. That environment was then packaged with conda-pack and shipped to all workers. When I removed cytoolz, repackaged with conda-pack, and restarted the Dask-Yarn cluster, things improved.

mrocklin commented 5 years ago

So @rbubley 's situation improved when he added cytoolz and @birdsarah 's situation improved when she removed it?

birdsarah commented 5 years ago

To be clear, I couldn't test having only cytoolz in my environment; dask / dask-yarn wouldn't import.

Edit: I couldn't readily test.

birdsarah commented 5 years ago

I have been reliably removing cytoolz from my environment for a while.

Just to note that today I experienced this issue for the first time in a while, with cytoolz removed. It may be unrelated, but I thought I'd document it. It was on write_parquet tasks which seemed to have completed but were not registered as such. When I retired the affected worker, the job could finish with no apparent loss of data.

nsmith- commented 4 years ago

I'm not 100% sure if I am seeing the same thing discussed here, but often I see one or a handful of jobs stalled in "processing" on some number of workers. If I go to the task in the dashboard I see, e.g.

Status  processing
Processing on   tls://10.128.2.162:40554
Call stack <link>
Priority    (0, 1, 360)
Retries     3 

If I click the link to the call stack, I see "Task not actively running. It may be finished or not yet started", which seems odd. If I retire the worker (e.g. in this case, client.retire_workers(["tls://10.128.2.162:40554"])), then the task gets unstuck and the job completes.

Zaharid commented 3 years ago

I am having the same problem.

Some task (one of several similar ones launched with client.map, so nothing particular) gets stuck as described above by others, while neither the worker nor the scheduler is using much CPU, and trying to get the call stack reports the task as not running. Shutting down the worker seems to help.

It happens quite frequently to the point that dask is hard to use without manual intervention. Is there any debugging information I can provide?

The environment I have in all my workers is the latest dask from conda forge, which installs cytoolz as a dependency.

``` # packages in environment at /home/zah/anaconda3/envs/dask: # # Name Version Build Channel _libgcc_mutex 0.1 conda_forge conda-forge _openmp_mutex 4.5 1_llvm conda-forge aiohttp 3.7.3 py38h497a2fe_0 conda-forge argon2-cffi 20.1.0 py38h25fe258_2 conda-forge async-timeout 3.0.1 py_1000 conda-forge async_generator 1.10 py_0 conda-forge attrs 20.3.0 pyhd3deb0d_0 conda-forge backcall 0.2.0 pyh9f0ad1d_0 conda-forge backports 1.0 py_2 conda-forge backports.functools_lru_cache 1.6.1 py_0 conda-forge bidict 0.21.2 py_0 conda-forge blas 2.17 openblas conda-forge bleach 3.2.1 pyh9f0ad1d_0 conda-forge bokeh 2.2.3 py38h578d9bd_0 conda-forge ca-certificates 2020.12.5 ha878542_0 conda-forge cairo 1.16.0 h9f066cc_1006 conda-forge certifi 2020.12.5 py38h578d9bd_0 conda-forge cffi 1.14.4 py38ha65f79e_1 conda-forge chardet 3.0.4 py38h924ce5b_1008 conda-forge click 7.1.2 pyh9f0ad1d_0 conda-forge cloudpickle 1.6.0 py_0 conda-forge cycler 0.10.0 py_2 conda-forge cytoolz 0.11.0 py38h25fe258_1 conda-forge dask 2020.12.0 pyhd8ed1ab_0 conda-forge dask-core 2020.12.0 pyhd8ed1ab_0 conda-forge dbus 1.13.6 hfdff14a_1 conda-forge decorator 4.4.2 py_0 conda-forge defusedxml 0.6.0 py_0 conda-forge distributed 2020.12.0 py38h578d9bd_0 conda-forge entrypoints 0.3 pyhd8ed1ab_1003 conda-forge expat 2.2.9 he1b5a44_2 conda-forge flask 1.1.2 pyh9f0ad1d_0 conda-forge flask-sqlalchemy 2.4.4 pyh9f0ad1d_0 conda-forge fontconfig 2.13.1 h7e3eb15_1002 conda-forge freetype 2.10.4 h7ca028e_0 conda-forge fribidi 1.0.10 h36c2ea0_0 conda-forge fsspec 0.8.5 pyhd8ed1ab_0 conda-forge gettext 0.19.8.1 h0b5b191_1005 conda-forge glib 2.66.4 hcd2ae1e_1 conda-forge graphite2 1.3.13 h58526e2_1001 conda-forge graphviz 2.42.3 h0511662_0 conda-forge gst-plugins-base 1.14.5 h0935bb2_2 conda-forge gstreamer 1.18.2 ha23517c_0 conda-forge harfbuzz 2.7.2 ha5b49bf_1 conda-forge heapdict 1.0.1 py_0 conda-forge icu 67.1 he1b5a44_0 conda-forge idna 2.10 pyh9f0ad1d_0 conda-forge importlib-metadata 3.3.0 py38h578d9bd_2 conda-forge importlib_metadata 3.3.0 hd8ed1ab_2 conda-forge ipykernel 5.4.2 py38h81c977d_0 conda-forge ipython 7.19.0 py38h81c977d_0 conda-forge ipython_genutils 0.2.0 py_1 conda-forge isodate 0.6.0 py_1 conda-forge itsdangerous 1.1.0 py_0 conda-forge jedi 0.17.2 py38h578d9bd_1 conda-forge jinja2 2.11.2 pyh9f0ad1d_0 conda-forge jpeg 9d h36c2ea0_0 conda-forge jsonschema 3.2.0 py_2 conda-forge jupyter-server-proxy 1.5.2 pyhd8ed1ab_0 conda-forge jupyter_client 6.1.7 py_0 conda-forge jupyter_core 4.7.0 py38h578d9bd_0 conda-forge jupyterlab_pygments 0.1.2 pyh9f0ad1d_0 conda-forge kiwisolver 1.3.1 py38h82cb98a_0 conda-forge krb5 1.17.2 h926e7f8_0 conda-forge lcms2 2.11 hcbb858e_1 conda-forge ld_impl_linux-64 2.35.1 hea4e1c9_1 conda-forge libblas 3.8.0 17_openblas conda-forge libcblas 3.8.0 17_openblas conda-forge libclang 11.0.0 default_ha5c780c_2 conda-forge libedit 3.1.20191231 he28a2e2_2 conda-forge libevent 2.1.10 hcdb4288_3 conda-forge libffi 3.3 h58526e2_2 conda-forge libgcc-ng 9.3.0 h5dbcf3e_17 conda-forge libgfortran-ng 7.5.0 hae1eefd_17 conda-forge libgfortran4 7.5.0 hae1eefd_17 conda-forge libglib 2.66.4 h164308a_1 conda-forge libiconv 1.16 h516909a_0 conda-forge liblapack 3.8.0 17_openblas conda-forge liblapacke 3.8.0 17_openblas conda-forge libllvm10 10.0.1 he513fc3_3 conda-forge libllvm11 11.0.0 he513fc3_0 conda-forge libopenblas 0.3.10 pthreads_hb3c22a3_5 conda-forge libpng 1.6.37 h21135ba_2 conda-forge libpq 12.3 h255efa7_3 conda-forge libsodium 1.0.18 h36c2ea0_1 conda-forge libstdcxx-ng 9.3.0 h2ae2ef3_17 conda-forge libtiff 4.2.0 
hdc55705_0 conda-forge libtool 2.4.6 h58526e2_1007 conda-forge libuuid 2.32.1 h7f98852_1000 conda-forge libwebp-base 1.1.0 h36c2ea0_3 conda-forge libxcb 1.13 h14c3975_1002 conda-forge libxkbcommon 1.0.3 he3ba5ed_0 conda-forge libxml2 2.9.10 h68273f3_2 conda-forge llvm-openmp 11.0.0 hfc4b9b4_1 conda-forge llvmlite 0.35.0 py38h4630a5e_0 conda-forge locket 0.2.0 py_2 conda-forge lz4-c 1.9.2 he1b5a44_3 conda-forge markupsafe 1.1.1 py38h8df0ef7_2 conda-forge matplotlib 3.3.3 py38h578d9bd_0 conda-forge matplotlib-base 3.3.3 py38h5c7f4ab_0 conda-forge mistune 0.8.4 py38h25fe258_1002 conda-forge msgpack-python 1.0.2 py38h1fd1430_0 conda-forge multidict 5.1.0 py38h497a2fe_0 conda-forge mysql-common 8.0.22 ha770c72_1 conda-forge mysql-libs 8.0.22 h1fd7589_1 conda-forge nbclient 0.5.1 py_0 conda-forge nbconvert 6.0.7 py38h578d9bd_3 conda-forge nbformat 5.0.8 py_0 conda-forge ncurses 6.2 h58526e2_4 conda-forge nest-asyncio 1.4.3 pyhd8ed1ab_0 conda-forge notebook 6.1.5 py38h578d9bd_0 conda-forge nspr 4.29 he1b5a44_1 conda-forge nss 3.60 hb5efdd6_0 conda-forge numba 0.52.0 py38h51da96c_0 conda-forge numpy 1.19.2 py38h6163131_0 numpy-base 1.19.2 py38h75fe3a5_0 olefile 0.46 pyh9f0ad1d_1 conda-forge openssl 1.1.1i h7f98852_0 conda-forge packaging 20.8 pyhd3deb0d_0 conda-forge pandas 0.25.3 py38hb3f55d8_0 conda-forge pandoc 2.11.3.1 h7f98852_0 conda-forge pandocfilters 1.4.2 py_1 conda-forge pango 1.42.4 h69149e4_5 conda-forge parso 0.7.1 pyh9f0ad1d_0 conda-forge partd 1.1.0 py_0 conda-forge pcre 8.44 he1b5a44_0 conda-forge pexpect 4.8.0 pyh9f0ad1d_2 conda-forge pickleshare 0.7.5 py_1003 conda-forge pillow 8.0.1 py38h70fbd49_0 conda-forge pip 20.3.3 pyhd8ed1ab_0 conda-forge pixman 0.40.0 h36c2ea0_0 conda-forge prometheus_client 0.9.0 pyhd3deb0d_0 conda-forge prompt-toolkit 3.0.8 pyha770c72_0 conda-forge psutil 5.8.0 py38h497a2fe_0 conda-forge pthread-stubs 0.4 h36c2ea0_1001 conda-forge ptyprocess 0.6.0 py_1001 conda-forge pycparser 2.20 pyh9f0ad1d_2 conda-forge pygments 2.7.3 pyhd8ed1ab_0 conda-forge pygraphviz 1.6 py38h25c7686_1 conda-forge pyparsing 2.4.7 pyh9f0ad1d_0 conda-forge pyqt 5.12.3 py38h578d9bd_6 conda-forge pyqt-impl 5.12.3 py38h7400c14_6 conda-forge pyqt5-sip 4.19.18 py38h709712a_6 conda-forge pyqtchart 5.12 py38h7400c14_6 conda-forge pyqtwebengine 5.12.1 py38h7400c14_6 conda-forge pyrsistent 0.17.3 py38h25fe258_1 conda-forge python 3.8.5 h7579374_1 python-dateutil 2.8.1 py_0 conda-forge python-dotenv 0.15.0 pyhd8ed1ab_0 conda-forge python-graphviz 0.15 pyhd3deb0d_0 conda-forge python_abi 3.8 1_cp38 conda-forge pytz 2020.4 pyhd8ed1ab_0 conda-forge pyyaml 5.3.1 py38h8df0ef7_1 conda-forge pyzmq 20.0.0 py38h1d1b12f_1 conda-forge qt 5.12.9 h763d07f_1 conda-forge readline 8.0 he28a2e2_2 conda-forge scipy 1.5.3 py38h828c644_0 conda-forge send2trash 1.5.0 py_0 conda-forge setuptools 49.6.0 py38h924ce5b_2 conda-forge simpervisor 0.3 py_1 conda-forge simpy 4.0.1 pypi_0 pypi six 1.15.0 pyh9f0ad1d_0 conda-forge sortedcontainers 2.3.0 pyhd8ed1ab_0 conda-forge sqlalchemy 1.3.22 py38h497a2fe_0 conda-forge sqlite 3.34.0 h74cdb3f_0 conda-forge tblib 1.6.0 py_0 conda-forge terminado 0.9.1 py38h32f6830_1 conda-forge testpath 0.4.4 py_0 conda-forge tk 8.6.10 h21135ba_1 conda-forge toolz 0.11.1 py_0 conda-forge tornado 6.1 py38h25fe258_0 conda-forge traitlets 5.0.5 py_0 conda-forge typing-extensions 3.7.4.3 0 conda-forge typing_extensions 3.7.4.3 py_0 conda-forge wcwidth 0.2.5 pyh9f0ad1d_2 conda-forge webencodings 0.5.1 py_1 conda-forge werkzeug 1.0.1 pyh9f0ad1d_0 conda-forge wheel 0.36.2 pyhd3deb0d_0 conda-forge 
xorg-kbproto 1.0.7 h14c3975_1002 conda-forge xorg-libice 1.0.10 h516909a_0 conda-forge xorg-libsm 1.2.3 h84519dc_1000 conda-forge xorg-libx11 1.6.12 h516909a_0 conda-forge xorg-libxau 1.0.9 h14c3975_0 conda-forge xorg-libxdmcp 1.1.3 h516909a_0 conda-forge xorg-libxext 1.3.4 h516909a_0 conda-forge xorg-libxpm 3.5.13 h516909a_0 conda-forge xorg-libxrender 0.9.10 h516909a_1002 conda-forge xorg-libxt 1.1.5 h516909a_1003 conda-forge xorg-renderproto 0.11.1 h14c3975_1002 conda-forge xorg-xextproto 7.3.0 h14c3975_1002 conda-forge xorg-xproto 7.0.31 h7f98852_1007 conda-forge xz 5.2.5 h516909a_1 conda-forge yaml 0.2.5 h516909a_0 conda-forge yarl 1.6.3 py38h497a2fe_0 conda-forge zeromq 4.3.3 h58526e2_3 conda-forge zict 2.0.0 py_0 conda-forge zipp 3.4.0 py_0 conda-forge zlib 1.2.11 h516909a_1010 conda-forge zstd 1.4.5 h6597ccf_2 conda-forge ```
Zaharid commented 3 years ago

After some investigation, I think what is happening is that some memory limit (80%?) is hit, and that somehow causes tasks to be marked as started but never actually processed. This can happen because of e.g. #3530.
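
For reference, the 80% figure lines up with the worker's pause threshold; the thresholds are configurable, roughly like this (a sketch; the values shown are my understanding of the defaults and are for illustration only):

```python
import dask

# Worker memory thresholds, expressed as fractions of the memory limit.
dask.config.set({
    "distributed.worker.memory.target": 0.60,     # start spilling managed data to disk
    "distributed.worker.memory.spill": 0.70,      # spill based on process memory
    "distributed.worker.memory.pause": 0.80,      # pause executing new tasks
    "distributed.worker.memory.terminate": 0.95,  # nanny restarts the worker
})
```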

Zaharid commented 3 years ago

Here is a way to reproduce the problem reliably on my machine:

from dask.distributed import Client

N = 1_000_000_000  # worker memory limit in bytes

def leak(mem):
    # Allocate and touch `mem` bytes via libc so the memory is held outside
    # of any Dask-managed data and can never be spilled to disk.
    import ctypes

    libc = ctypes.CDLL("libc.so.6")
    libc.malloc.restype = ctypes.c_void_p
    x = libc.malloc(mem)
    x = ctypes.cast(x, ctypes.c_void_p)
    libc.memset(x, 1, mem)

def f():
    return "Hello"

if __name__ == '__main__':
    c = Client(threads_per_worker=1, n_workers=1, memory_limit=N)
    # Leak just past the 80% pause threshold of the single worker
    leak_future = c.submit(leak, int(N * 0.81))
    # This is needed here for some reason
    import time
    time.sleep(1)
    future = c.submit(f)
    leak_future.result()
    print("Leaked memory")
    future.result()
    print("Never reached this. Cluster deadlocked")

Running the script above, I see a continuous stream of warnings:

distributed.worker - WARNING - Memory use is high but worker has no data to store to disk.  Perhaps some other process is leaking memory?  Process memory: 864.24 MB -- Worker memory limit: 1000.00 MB

but I do not see them when the cluster is similarly deadlocked with the production code.

simone-codeluppi commented 3 years ago

Hi, I have a similar issue to the ones reported above.

I am processing 800-1000 3D microscopy images using dask/dask-jobqueue on an HTCondor-managed cluster. Each 3D image is processed in parallel (an embarrassingly parallel application), with each image handled in a separate job managed by a worker. The minimum and maximum number of jobs/workers is regulated by adaptive scaling (min, max). The expected processing time of a task is quite long (~1 hr).
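
For reference, the adaptive setup is roughly the following (a sketch; the resource numbers and scaling bounds are placeholders, not my real configuration):

```python
from dask.distributed import Client
from dask_jobqueue import HTCondorCluster

# Placeholder resources per job; each job runs one worker processing one image.
cluster = HTCondorCluster(cores=1, memory="8GB", disk="10GB")
cluster.adapt(minimum=1, maximum=100)  # adaptive scaling between min and max jobs/workers
client = Client(cluster)
```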

The processing time of each task increases over time, and usually the last tasks take hours (~5-10 hrs) to complete, or the processing stalls. The tasks that run for a long time or stall are random (different images can be processed last). This is similar to what is reported here: #2835, but having cytoolz didn't solve the issue.

The task log for these long-running tasks shows that the task status cycles between waiting --> running --> processing --> released many, many times. The processing resources aren't an issue. In some cases an error related to the connection to the scheduler is printed (see below), so it seems that the worker (still up in the dashboard) temporarily loses the connection to the scheduler.


distributed.utils - ERROR - Timed out during handshake while connecting to tcp://192.168.0.2:35688 after 10 s
Traceback (most recent call last):
  File "/home/simone/mini/envs/test_d/lib/python3.8/site-packages/distributed/comm/core.py", line 319, in connect
    handshake = await asyncio.wait_for(comm.read(), time_left())
  File "/home/simone/mini/envs/test_d/lib/python3.8/asyncio/tasks.py", line 490, in wait_for
    raise exceptions.TimeoutError()
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/simone/mini/envs/test_d/lib/python3.8/site-packages/distributed/utils.py", line 655, in log_errors
    yield
  File "/home/simone/mini/envs/test_d/lib/python3.8/site-packages/distributed/scheduler.py", line 3515, in retire_workers
    await self.replicate(
  File "/home/simone/mini/envs/test_d/lib/python3.8/site-packages/distributed/scheduler.py", line 3281, in replicate
    results = await asyncio.gather(
  File "/home/simone/mini/envs/test_d/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation
    return await retry(
  File "/home/simone/mini/envs/test_d/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/home/simone/mini/envs/test_d/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc
    comm = await self.pool.connect(self.addr)
  File "/home/simone/mini/envs/test_d/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect
    comm = await connect(
  File "/home/simone/mini/envs/test_d/lib/python3.8/site-packages/distributed/comm/core.py", line 324, in connect
    raise IOError(
OSError: Timed out during handshake while connecting to tcp://192.168.0.2:35688 after 10 s

It is difficult for me to provide a working minimal example because the issue happens only when the processing has been going on for quite a while (hrs) and not always after the same amount of time.

Any hint as to what to look for in order to do proper debugging would be helpful.

Thanks.

jacobtomlinson commented 3 weeks ago

This issue still requires more info but there has been no activity here for many years, so I'm going to close it out.