pseudotensor opened this issue 3 years ago (status: Open)
Thanks for the report @pseudotensor
Cc @quasiben @jakirkham since this involves RAPIDS (dask-cudf 0.19/21.06); hoping you can help triage or ping the correct people on the team.
FYI, the main problem with the xgboost issue was that I could not make an MRE that produced the same hang. So I was asking for help to debug it directly, but they did not know how to help with that. I googled, of course, and couldn't find anything useful for debugging the hang.
> but they did not know how to help with that.
Yeah, suggestions are appreciated.
I didn't read through the entire linked issue, so I'm guessing a bit from what I can parse from this description.
I assume your computation deadlocks at some point and you cannot find the reason for it?
We recently merged a PR (https://github.com/dask/distributed/pull/4784) which addresses three kinds of deadlock situations. These deadlocks are sometimes loud (a worker died) but sometimes completely silent. I encourage you to try out our main branch. These changes will be officially released later today (https://github.com/dask/community/issues/165).
Thanks. How long does it take until conda-forge has it?
My best guess is somewhere around 19:00 UTC (our release manager is in the CST/CDT timezone), but that's just a rough estimate and it might be later than that.
@pseudotensor are you able to try the latest in main before the release?
@pseudotensor you can also grab the 2021.06.1 release from PyPI now if that helps. Things should be on conda-forge in a couple of hours
@jrbourbeau FYI I'd like to try it, but there is no RAPIDS release that works with it. I just get compatibility errors and/or other CUDA-related errors that look like API mismatches.
@jrbourbeau Tried the RAPIDS nightly 21.08 with dask/distributed 2021.6.2 and I still see the same hangs.
How do I go about debugging this problem? It is not easy to make a repro for some reason, so I'd like to debug the hung state directly. I shared gdb and Python stack traces above. What can I do?
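One generic way to inspect a hung Python process, independent of dask, is to pre-register a handler that dumps every thread's Python stack. Below is a minimal sketch using only the standard library; the choice of signal and the idea of baking this into the worker/client startup code are assumptions here, not an official dask recommendation:

```python
import faulthandler
import signal
import sys

# After this, `kill -USR1 <pid>` makes the process print the Python stack of
# every thread to stderr and then continue running.
faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)

# Or schedule a periodic dump and cancel it once the run completes normally.
faulthandler.dump_traceback_later(timeout=600, repeat=True)
# ... run the workload ...
faulthandler.cancel_dump_traceback_later()
```

Adding something like this before starting the workload gives Python-level stack traces at the moment the hang is observed, complementing the gdb traces already collected.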
@jrbourbeau updated to 2021.7 and still hangs. Need help debugging!
Checking in again on this issue. We've made a lot of progress towards stability for deadlock situations like yours. I would strongly encourage you to try out our latest versions to see if your issue still persists. If a cluster deadlocks now, we have some utilities to create a dump of the cluster state to allow debugging after the fact, see http://distributed.dask.org/en/stable/api.html#distributed.Client.dump_cluster_state
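For reference, a minimal invocation of that utility could look like the following (the scheduler address and output filename here are placeholders, not values from this thread):

```python
from distributed import Client

# Connect to the already-running (possibly deadlocked) cluster.
client = Client("tcp://127.0.0.1:8786")

# Write scheduler and worker state to disk so the stuck tasks can be
# inspected after the fact.
client.dump_cluster_state(filename="dask-cluster-dump", format="msgpack")
```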
I've yet to make a repro, but a basic code example that does the same thing we do is: https://github.com/dmlc/xgboost/issues/7032#issuecomment-858175004
It's hard to make a repro since the hang doesn't always happen, and the repro code I've attempted never hits it.
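For context, the pattern in that link is distributed XGBoost training on Dask. A rough sketch of what such a workload looks like (not the exact code from the linked comment; the scheduler address, data, and parameters below are placeholders, and the real workload uses dask-cudf data on GPUs):

```python
import xgboost as xgb
import dask.array as da
from distributed import Client

client = Client("tcp://127.0.0.1:8786")  # placeholder scheduler address

# Placeholder random data; the real workload uses dask-cudf DataFrames on GPUs.
X = da.random.random((100_000, 20), chunks=(10_000, 20))
y = da.random.random(100_000, chunks=(10_000,))

dtrain = xgb.dask.DaskDMatrix(client, X, y)

# xgb.dask.train submits one training task per Dask worker and coordinates
# them; per the thread above, the hang happens somewhere inside this call.
output = xgb.dask.train(
    client,
    {"tree_method": "hist", "objective": "reg:squarederror"},
    dtrain,
    num_boost_round=100,
)
booster = output["booster"]
```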
See https://github.com/dmlc/xgboost/issues/7032 for a backtrace of all threads during the hang, etc. It's hanging inside xgboost, but @trivialfis suggests it must be a bug in dask/distributed: https://github.com/dmlc/xgboost/issues/7032#issuecomment-859222913
1) Using the CLI to launch the dask scheduler and workers, as in https://github.com/dmlc/xgboost/issues/7032
2) dask 2021.5 and distributed 2021.5. RAPIDS 0.19 and 21.06 both hit the hang, but RAPIDS 0.14 did not hang like this with the same production code.
3) Python 3.8
4) Ubuntu 18.04
5) conda install