pseudotensor opened this issue 3 years ago (status: Open)
Thanks for the report @pseudotensor
Cc @quasiben @jakirkham since this involves RAPIDS (dask-cudf 0.19/21.06); hoping you can help triage or ping the correct people on the team.
FYI, the main problem with the xgboost issue was that I could not make an MRE that produced the same hang. So I was asking for help to debug it directly, but they did not know how to help with that. I googled, of course, and couldn't find anything useful for debugging the hang.
> but they did not know how to help with that.
Yeah, suggestions are appreciated.
I didn't read through the entire linked issue, so I'm guessing a bit from what I can parse from this description.
I assume your computation deadlocks at some point and you cannot find the reason for it?
We recently merged a PR (https://github.com/dask/distributed/pull/4784) which addresses three kinds of deadlock situations. These deadlocks are sometimes loud (a worker died) but sometimes completely silent. I encourage you to try out our main branch. These changes will be officially released later today (https://github.com/dask/community/issues/165).
Thanks. How long does it take until conda-forge has it?
My best guess is somewhere around 19:00 UTC (our release manager is in the CST/CDT timezone), but that's just a rough estimate and it might be later than that.
@pseudotensor are you able to try the latest in main before the release?
@pseudotensor you can also grab the 2021.06.1 release from PyPI now if that helps. Things should be on conda-forge in a couple of hours
@jrbourbeau FYI I'd like to try it, but there is no RAPIDS release that works with it. I just get compatibility errors and/or other CUDA-related errors that look like API mismatches.
@jrbourbeau Tried the RAPIDS nightly 21.08 with dask/distributed 2021.6.2 and I still see the same hangs.
How do I go about debugging this problem? It is not easy to make a repro for some reason, so I'd like to debug the hung state directly. I shared gdb and Python stack traces above. What can I do?
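One generic way to inspect a hung Python process, independent of dask, is to pre-register a handler that dumps every thread's Python stack. Below is a minimal sketch using only the standard library; the choice of signal and the idea of baking this into the worker/client startup code are assumptions here, not an official dask recommendation:

```python
import faulthandler
import signal
import sys

# After this, `kill -USR1 <pid>` makes the process print the Python stack of
# every thread to stderr and then continue running.
faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)

# Or schedule a periodic dump and cancel it once the run completes normally.
faulthandler.dump_traceback_later(timeout=600, repeat=True)
# ... run the workload ...
faulthandler.cancel_dump_traceback_later()
```

Adding something like this before starting the workload gives Python-level stack traces at the moment the hang is observed, complementing the gdb traces already collected.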
@jrbourbeau updated to 2021.7 and still hangs. Need help debugging!
Checking in again on this issue. We've made a lot of progress towards stability for deadlock situations like yours. I would strongly encourage you to try out our latest versions to see if your issue still persists. If a cluster deadlocks now, we have some utilities to create a dump of the cluster state to allow debugging after the fact, see http://distributed.dask.org/en/stable/api.html#distributed.Client.dump_cluster_state
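For reference, a minimal invocation of that utility could look like the following (the scheduler address and output filename here are placeholders, not values from this thread):

```python
from distributed import Client

# Connect to the already-running (possibly deadlocked) cluster.
client = Client("tcp://127.0.0.1:8786")

# Write scheduler and worker state to disk so the stuck tasks can be
# inspected after the fact.
client.dump_cluster_state(filename="dask-cluster-dump", format="msgpack")
```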
I've yet to make a repro, but a basic code example that does the same thing we do is: https://github.com/dmlc/xgboost/issues/7032#issuecomment-858175004
It's hard to make a repro since the hang doesn't always happen, and the repro code I've attempted never hits it.
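For context, the pattern in that link is distributed XGBoost training on Dask. A rough sketch of what such a workload looks like (not the exact code from the linked comment; the scheduler address, data, and parameters below are placeholders, and the real workload uses dask-cudf data on GPUs):

```python
import xgboost as xgb
import dask.array as da
from distributed import Client

client = Client("tcp://127.0.0.1:8786")  # placeholder scheduler address

# Placeholder random data; the real workload uses dask-cudf DataFrames on GPUs.
X = da.random.random((100_000, 20), chunks=(10_000, 20))
y = da.random.random(100_000, chunks=(10_000,))

dtrain = xgb.dask.DaskDMatrix(client, X, y)

# xgb.dask.train submits one training task per Dask worker and coordinates
# them; per the thread above, the hang happens somewhere inside this call.
output = xgb.dask.train(
    client,
    {"tree_method": "hist", "objective": "reg:squarederror"},
    dtrain,
    num_boost_round=100,
)
booster = output["booster"]
```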
See https://github.com/dmlc/xgboost/issues/7032 for a backtrace of all threads during the hang, etc. It's hanging inside xgboost, but @trivialfis suggests it must be a bug in dask/distributed: https://github.com/dmlc/xgboost/issues/7032#issuecomment-859222913
1) Using the CLI to launch the dask scheduler and workers, as in https://github.com/dmlc/xgboost/issues/7032
2) dask 2021.5 and distributed 2021.5. RAPIDS 0.19 and 21.06 both hit the hang, but RAPIDS 0.14 did not hang like this with the same production code.
3) Python 3.8
4) Ubuntu 18.04
5) conda install