rjzamora opened 5 months ago
Just a note that the reproducer is certainly submitting tasks from the workers. Therefore this documentation is probably relevant: https://distributed.dask.org/en/stable/task-launch.html#submit-tasks-from-worker. Perhaps there is a deadlock that can be avoided if xgboost knows it is running from a worker and can secede/rejoin when necessary?
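As a sketch of that idea (the function names and toy computation here are illustrative, not the actual xgboost.dask code): a task that submits further tasks can call `secede()` before blocking on their results, freeing its worker thread, and `rejoin()` before resuming compute-heavy work:

```python
from distributed import Client, get_client, rejoin, secede

def parent_task(xs):
    # Hypothetical task that itself submits work to the cluster.
    client = get_client()
    futures = client.map(lambda x: x + 1, xs)
    secede()                           # give this worker thread back to the pool
    results = client.gather(futures)   # safe to block: we have seceded
    rejoin()                           # wait for a free thread slot before continuing
    return sum(results)

if __name__ == "__main__":
    with Client(n_workers=1, threads_per_worker=1, processes=False) as client:
        print(client.submit(parent_task, [1, 2, 3]).result())  # prints 9
```

Even with a single one-threaded worker, the nested `gather` cannot starve the worker because the parent has seceded while it waits.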
You might try making lots of Dask clusters? This is done in this example: https://github.com/coiled/dask-xgboost-nyctaxi/blob/main/Modeling%203%20-%20Parallel%20HPO%20of%20XGBoost%20with%20Optuna%20and%20Dask%20(multi%20cluster).ipynb
Thanks @mrocklin ! I saw this repository the other day, and actually did suggest the multi-cluster approach to the original user who ran into this issue. It is clear to me that the multi-cluster approach should work well. What is not clear to me is the specific reason that the single-cluster approach will hang.
If it is possible for XGBoost to make the single-cluster approach work well, the XGBoost developers seem interested in knowing what that change would look like.
I agree that it would be good if the xgboost.dask approach could happily share the cluster. If your question is "does it?" then I think that the answer is "no", at least as of the first prototype. I think that you'd have to check with the current source code or folks who wrote that code to be sure.
If the question is "could it?" then sure, the answer is yes, although there are likely tricky things you'd want to figure out, like whether xgboost jobs should queue, shard the cluster, or run concurrently, and how to manage that.
> I think that you'd have to check with the current source code or folks who wrote that code to be sure.
Thank you for the suggestion. Is there any particular pattern I should avoid (to not hang)?
> If the question is "could it?" then sure, the answer is yes,
Let's say it should run sequentially, which is the simplest case. Currently, XGBoost uses the `distributed.MultiLock` before launching a training task to ensure all requested workers are ready for it. As a result, I expect multiple training tasks to run sequentially when the data is balanced across all workers, instead of hanging.
If nested parallelism is not supported, an error would also help.
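For reference, the pattern described is roughly the following (a hedged sketch; `do_train` and the worker names are placeholders, not the actual xgboost.dask internals):

```python
from distributed import MultiLock

def train_exclusively(worker_addresses, do_train):
    # Take one named lock per requested worker; training only starts once
    # every lock is held, and all locks are released on exit. Two such
    # calls with overlapping worker sets therefore run one after another.
    with MultiLock(names=list(worker_addresses)):
        return do_train()
```

MultiLock coordinates through the scheduler, so the exclusion holds across clients and processes, not just within one process.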
I'm not sure I understand the question that is being asked. If the question is "how would you design xgboost.dask so that it could be run multiple times at once?" then I'd need to think a lot longer (although maybe someone else can help). That's an interesting question that probably deserves a lot of thought.
If the question is "how do you enforce sequential work" then it probably depends on what you're doing. For example, you could use a `threading.Lock` or something similar around your train method to make sure that it isn't run multiple times at once. That would be a very simple solution that has nothing to do with Dask.
If you're doing something like the above then you'd want to swap out `threading.Lock` for `dask.distributed.Lock`, and then yes, you'd want to make sure that the threads weren't totally locked. The MultiLock is waiting on all of the workers to be ready to run, but the workers are busy waiting on your initial Lock.
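A minimal version of that suggestion (the wrapper name is made up for illustration): serialize calls within one process with `threading.Lock`, and swap in `dask.distributed.Lock` when the callers live in different processes:

```python
import threading

# For cross-process exclusion, swap this for dask.distributed.Lock("train"),
# which coordinates through the scheduler; the `with` pattern is the same.
_train_lock = threading.Lock()

def train_serialized(train_fn, *args, **kwargs):
    # Callers block here until the previous training call has finished,
    # so train_fn never runs more than once at a time in this process.
    with _train_lock:
        return train_fn(*args, **kwargs)
```

The caveat from the comment applies: if the thread holding this lock is also the only thread available to run the work it then waits on, the two waits deadlock.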
If your question is "is this a bug" then I think my answer is "no"; you're just dealing with lots of locks and you've set them up in such a way that they deadlock. Using something like `secede`/`rejoin` as @rjzamora described above is one way to start working through this. You'll probably have to be careful, though.
> The MultiLock is waiting on all of the workers to be ready to run but the workers are busy waiting on your initial Lock.
That makes sense. So, if I understand it correctly, the observed hang is caused by the worker having a single thread, and that sole thread is being used to wait for the lock?
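That starvation mechanism can be demonstrated without Dask at all; here a one-thread `ThreadPoolExecutor` stands in for a worker with `threads_per_worker=1`, and a timeout substitutes for the real hang:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeout

pool = ThreadPoolExecutor(max_workers=1)  # like threads_per_worker=1

def outer():
    # Submit follow-up work to the same single-thread pool, then block on it.
    # The only thread is the one currently running outer(), so the inner
    # task can never start while we wait on it.
    inner = pool.submit(lambda: "done")
    try:
        return inner.result(timeout=0.5)
    except FuturesTimeout:
        return "deadlocked"

print(pool.submit(outer).result())  # prints "deadlocked"
```

With `max_workers=2` the inner task gets the second thread and `outer` returns "done", mirroring the observation that the hang only appears with `threads_per_worker=1`.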
> If your question is "is this a bug" then I think my answer is "no", you're just dealing with lots of Locks and you've set them up in such a way that they deadlock.
There's only one lock in xgboost. I can't think of another way to ensure the workers' availability.
> I'm not sure I understand the question that is being asked.
Since the underlying implementation of `xgboost.dask` is still quite similar to the original `dask-xgboost`, with the addition of a multi-lock, my question is whether there's an anti-pattern that prevents xgboost from being used in this nested way. But if it's caused by the thread, then the question is moot. Apologies for the ambiguity.
If my understanding of the underlying cause is correct, then I agree that it's not a bug but more a limitation of the current design. This will bring us back to the feature request for dask to support collective/mpi tasks so that xgboost can remove/replace the lock. Will leave that for another dask sync. ;-)
> So, if I understand it correctly, the observed hang is caused by the worker having a single thread, and that sole thread is being used to wait for the lock?
I don't know for sure. I'd have to look at the xgboost.dask code to see what is going on.
> Will leave that for another dask sync. ;-)
I recommend raising an issue. Most technical conversation for Dask happens in issues, not at the monthly meetings. Alternatively, if you want to talk synchronously with someone you could e-mail that person and ask for a conversation. (I am not the person to talk to here FWIW).
I very briefly looked over the implementation of the xgboost dask integration. It's a little difficult to wrap my head around what's going on. However, I'm pretty sure it's because you are indeed submitting tasks from within tasks without seceding.
Two other notes from looking at the integration:

- Creating `Client`s inside tasks sometimes does not work very well (e.g. when the `asynchronous` keyword is not properly set) and is sometimes a little off when working with collections. This is just a recommendation and probably not related to the deadlock.
- Prefer `worker_client`, which wraps the correct `secede` and `rejoin` calls in a context manager and makes sure you get a cached client on the worker. This is typically more robust and less error-prone than implementing anything yourself with `secede`/`rejoin`.

> I recommend raising an issue.
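A sketch of the `worker_client` pattern recommended above (the function name and toy computation are illustrative):

```python
from distributed import worker_client

def nested_fit(xs):
    # worker_client() secedes from the worker's thread pool on entry and
    # rejoins on exit, and hands back a cached client, so blocking on the
    # nested futures below cannot starve the worker's threads.
    with worker_client() as client:
        futures = client.map(lambda x: x + 1, xs)
        return client.gather(futures)
```

Submitting this from the driver with `client.submit(nested_fit, [1, 2, 3])` returns `[2, 3, 4]` even on a one-thread worker, since the outer task is seceded while it waits.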
Thank you for the suggestion, will try to write a feature request.
Description of possible bug
When I try to fit/evaluate many `xgboost.dask` models in parallel, one or more of the fit/evaluation futures hangs forever. The hang only occurs when `threads_per_worker=1` (which is the default for GPU-enabled clusters in dask-cuda). Is this a bug, or is the reproducer shown below known to be wrong or dangerous for a specific reason?
Reproducer
Further background
There is a great blog article from Coiled that demonstrates Optuna-based HPO with XGBoost and Dask. The article states: "the current `xgboost.dask` implementation takes over the entire Dask cluster, so running many of these at once is problematic." Note that the "problematic" practice described in that article is exactly what the reproducer above is doing. With that said, it is not clear to me why one or more workers might hang.

NOTE: I realize this is probably more of an `xgboost` issue than a `distributed` issue. However, it seems clear that significant dask/distributed knowledge is needed to pin down the actual problem. Any and all help, advice, or intuition is greatly appreciated!