cjnolet opened 3 years ago
The idea is to soon move away from requiring users to specify transports and instead let UCX decide. It seems that with UCX 1.11 we may be able to do so, as many of the limitations we had in the past have been solved, with the exception of creating a CUDA context (for which cuda_copy still needs to be specified today). Since that's the case, we may resolve that issue in a more elegant and less error-prone way by simply not requiring users to specify those options.
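For context, here is a minimal sketch of what that explicit configuration looks like from Python today, assuming the distributed.comm.ucx.* keys Dask uses for these transports (the exact key names are an assumption; check the version you have installed):

```python
import dask
from distributed import Client

# Today: every transport is spelled out explicitly, and the same values have
# to be repeated for the scheduler, the workers, and the client.
# (Key names are assumptions based on the distributed.comm.ucx config section.)
dask.config.set({
    "distributed.comm.ucx.tcp": True,
    "distributed.comm.ucx.nvlink": True,
    "distributed.comm.ucx.infiniband": True,
    "distributed.comm.ucx.rdmacm": True,
    "distributed.comm.ucx.cuda-copy": True,
})

# The hope with UCX 1.11+: drop the transport list, let UCX decide, and keep
# only cuda-copy, which is still needed so a CUDA context gets created.
# dask.config.set({"distributed.comm.ucx.cuda-copy": True})

client = Client(scheduler_file="scheduler.json")
```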
However, for general UCX variables such as UCX_MAX_RNDV_RAILS, that becomes more complex: we would need to parse the variables and potentially deal with overrides, for example when a user wants or needs to specify different variables for some workers. Finally, putting UCX-specific data in scheduler.json may be a bit too intrusive for Dask; I would be OK with that, but I'm not sure how others feel about it.
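To make the override concern concrete, here is a purely hypothetical sketch (the helper below is illustrative, not an existing API) of what collecting general UCX_* variables with per-worker overrides might involve:

```python
import os

def collect_ucx_env(overrides=None):
    """Hypothetical helper: gather all UCX_* environment variables (e.g.
    UCX_MAX_RNDV_RAILS) so they could be recorded next to scheduler.json,
    while still letting individual workers override specific values."""
    env = {k: v for k, v in os.environ.items() if k.startswith("UCX_")}
    if overrides:
        # Per-worker overrides win over whatever the scheduler recorded.
        env.update(overrides)
    return env

# A worker that needs a different rendezvous-rails setting than the rest:
print(collect_ucx_env({"UCX_MAX_RNDV_RAILS": "1"}))
```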
Also, I think this is a more appropriate discussion for https://github.com/dask/distributed; maybe someone with permissions could move it there?
cc @quasiben @jakirkham @madsbk for visibility
An alternative would be to put those values into configuration files. In situations where we're using the scheduler file, I suspect that we also have a shared file system, and so using a shared configuration file might work OK?
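A sketch of that idea, assuming the scheduler file lives on a shared filesystem (the paths and key names below are illustrative only):

```python
import os
import yaml

# Write the UCX settings once to a dask config file on the same shared
# filesystem that already holds scheduler.json, then start every scheduler,
# worker, and client process with DASK_CONFIG pointing at that directory so
# they all read identical settings.
shared_dir = "/shared/cluster/dask-config"
os.makedirs(shared_dir, exist_ok=True)
config = {
    "distributed": {
        "comm": {
            "ucx": {
                "tcp": True,
                "nvlink": True,
                "infiniband": True,
                "rdmacm": True,
                "cuda-copy": True,
            }
        }
    }
}
with open(os.path.join(shared_dir, "distributed.yaml"), "w") as f:
    yaml.safe_dump(config, f)

# Then, for example:
#   DASK_CONFIG=/shared/cluster/dask-config dask-scheduler --protocol ucx
```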
If we want to add support for arguments in the scheduler file then I would ask that we think about how to do this holistically, rather than just for UCX (although I suspect that this is what you were thinking anyway).
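Purely as a strawman for that holistic approach (nothing below is an existing format), scheduler.json could carry a generic config fragment rather than UCX-specific fields:

```python
import json

# Hypothetical layout: the scheduler writes its address plus an arbitrary
# dask config fragment; workers and clients merge that fragment into their
# own configuration when they connect.
scheduler_info = {
    "address": "ucx://10.0.0.1:8786",
    "config": {
        "distributed": {
            "comm": {"ucx": {"infiniband": True, "nvlink": True}}
        }
    },
}
with open("scheduler.json", "w") as f:
    json.dump(scheduler_info, f, indent=2)
```

A worker or client reading this file could then merge the fragment into its own configuration (e.g. with dask.config.update) before opening any UCX comms.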
Currently, enabling InfiniBand support for a Dask cluster using UCX requires that the workers and client be configured with the same connection arguments as the scheduler. As an example of a realistic configuration in a CUDA environment, a user might need to specify options to enable tcp-over-ucx, nvlink, infiniband, and potentially rdmacm. In addition, they might need to manually specify a value for UCX_MAX_RNDV_RAILS. These options currently need to be specified for the scheduler, the workers, and the client, which becomes painful for users who might be testing multiple different configurations and might not know to specify one or more of them.

The more I think through the problem, the less I can see a reason why the values for these arguments would ever differ between a client/worker and the scheduler. Storing these configuration arguments in the scheduler.json file would make deployment much more straightforward and alleviate a lot of the pain.

Tagging @rlratzel and @pentschev for awareness and further thoughts.
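As a rough illustration of that duplication on the client side, a sketch assuming dask_cuda's initialize helper and its enable_* flags as used at the time (the UCX_MAX_RNDV_RAILS value is an arbitrary example):

```python
import os
from distributed import Client
from dask_cuda.initialize import initialize  # dask-cuda's UCX setup helper

# Client side: repeat by hand the same transports the scheduler and workers
# were started with, plus any raw UCX variables such as UCX_MAX_RNDV_RAILS.
os.environ["UCX_MAX_RNDV_RAILS"] = "1"  # example value only
initialize(
    enable_tcp_over_ucx=True,
    enable_infiniband=True,
    enable_nvlink=True,
    enable_rdmacm=True,
)

# The proposal: let the client recover all of the above from scheduler.json
# instead of having the user repeat it here.
client = Client(scheduler_file="scheduler.json")
```

The scheduler and each worker currently need the matching options passed separately (for example via CLI flags or environment variables), which is exactly the duplication this proposal aims to remove.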