songqiqqq opened this issue 4 years ago
@songqiqqq I'm very excited to see a UCX/dask-jobqueue experiment. You are correct that you will need to add the --protocol ucx flag,
but you will also need a handful of env variables to enable specific devices such as InfiniBand and/or NVLink. We started to outline those configurations here:
https://ucx-py.readthedocs.io/en/latest/configuration.html
I believe for just IB you will need the following:
UCX_CUDA_IPC_CACHE=n UCX_MEMTYPE_CACHE=n UCX_TLS=rc,tcp,sockcm,cuda_copy UCX_SOCKADDR_TLS_PRIORITY=sockcm
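For illustration, one way to wire those env vars into a dask-jobqueue cluster today might be via env_extra, along these lines; the cores/memory values are placeholders, and note that env_extra only affects the generated worker job scripts, not the locally running scheduler:

```python
# Illustrative sketch (untested): pass the UCX protocol plus the InfiniBand
# env vars through a dask-jobqueue cluster. The cores/memory values are
# placeholders, and env_extra only affects the generated worker job scripts.
from dask_jobqueue import SLURMCluster

ucx_env = [
    "export UCX_CUDA_IPC_CACHE=n",
    "export UCX_MEMTYPE_CACHE=n",
    "export UCX_TLS=rc,tcp,sockcm,cuda_copy",
    "export UCX_SOCKADDR_TLS_PRIORITY=sockcm",
]

cluster = SLURMCluster(
    cores=1,
    memory="4GB",
    protocol="ucx://",   # scheduler and workers communicate over UCX instead of TCP
    env_extra=ucx_env,   # exported at the top of each worker job script
)
cluster.scale(jobs=2)
```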
dask-jobqueue may also be interested in building a nicer abstraction for configuring UCX. dask-cuda
has been experimenting with this here:
https://github.com/rapidsai/dask-cuda/blob/fbf3ef2cde8f42147945597d5bee81e6e388d5de/dask_cuda/dgx.py#L53-L61
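To sketch what such an abstraction might look like, it could be as simple as turning a couple of flags into the corresponding exports. The function and flag names below are purely illustrative, not an existing dask-cuda or dask-jobqueue API:

```python
# Purely illustrative sketch of a nicer abstraction: turn high-level flags into
# the UCX env var exports that would be injected via env_extra. The function
# and flag names are hypothetical, not an existing API.
def ucx_env_extra(enable_infiniband=False, enable_nvlink=False):
    tls = ["tcp", "sockcm"]
    if enable_infiniband:
        tls.insert(0, "rc")         # RDMA transport used over InfiniBand
    if enable_infiniband or enable_nvlink:
        tls.append("cuda_copy")     # allow UCX to move GPU memory
    if enable_nvlink:
        tls.append("cuda_ipc")      # peer-to-peer GPU transfers over NVLink
    return [
        "export UCX_TLS=" + ",".join(tls),
        "export UCX_SOCKADDR_TLS_PRIORITY=sockcm",
    ]
```

A cluster could then be created with something like env_extra=ucx_env_extra(enable_infiniband=True).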
It would be really nice if some people could dive into this -- it would be very helpful for the community.
@mrocklin didn't you try things on Cheyenne?
I did. Things worked, but weren't yet any faster. The Dask + UCX team within RAPIDS (which @quasiben leads) is working on profiling and performance now, so hopefully we'll see some larger speedups soon.
As @quasiben states above, I did this just by adding the protocol="ucx://" keyword to the FooCluster classes.
Sorry, I am swamped with lots of work at the end of the year. I will try to keep up with you and report here if there are any advances.
@quasiben was looking at this at one point at the Dask workshop. @quasiben if you manage to make progress on this, let us know!
Thanks for the ping @lesteve. I pushed up some recent work here: https://github.com/dask/dask-jobqueue/pull/390. Still early days -- I probably won't be able to spend much time on it until later this month.
Thanks a lot!
This is a bit of a hack but may get you started more quickly with your benchmarks. This is the approach I had in mind at the workshop (which I probably did not explain as well as I could have):
submission-script.sh
(Submission script)
#!/usr/bin/env bash
#SBATCH -J dask-worker
#SBATCH -n 1
#SBATCH --cpus-per-task=1
#SBATCH --mem=954M
#SBATCH -t 00:30:00
export UCX_CUDA_IPC_CACHE=n
export UCX_MEMTYPE_CACHE=n
export UCX_TLS=rc,tcp,sockcm,cuda_copy
export UCX_SOCKADDR_TLS_PRIORITY=sockcm
JOB_ID=${SLURM_JOB_ID%;*}
python your-dask-jobqueue-script.py
your-dask-jobqueue-script.py
(Python script creating the Dask scheduler and workers)
from dask_jobqueue import SLURMCluster
cluster = SLURMCluster(..., env_extra=['all the UCX env var exports go here'])
...
And then you submit your submission script with sbatch:
sbatch submission-script.sh
This way the Dask scheduler will live on a cluster compute node with InfiniBand.
If anything is not clear or does not quite work, let me know!
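To make your-dask-jobqueue-script.py a bit more concrete, here is a rough sketch of what it could contain under this approach; the cluster parameters, the UCX env var list, and the example computation are all placeholder assumptions:

```python
# Rough sketch of your-dask-jobqueue-script.py (placeholder values throughout).
# It runs on a compute node, so the scheduler it creates also lives there,
# on a machine with InfiniBand.
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

ucx_env = [
    "export UCX_CUDA_IPC_CACHE=n",
    "export UCX_MEMTYPE_CACHE=n",
    "export UCX_TLS=rc,tcp,sockcm,cuda_copy",
    "export UCX_SOCKADDR_TLS_PRIORITY=sockcm",
]

cluster = SLURMCluster(
    cores=1,
    memory="954MB",
    protocol="ucx://",
    env_extra=ucx_env,   # re-export the UCX vars in each worker job script
)
cluster.scale(jobs=4)

client = Client(cluster)
client.wait_for_workers(4)

# whatever benchmark you want to time goes here
import dask.array as da

x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))
print(x.dot(x.T).mean().compute())
```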
@lesteve the following is what I've been doing (note: my nodes are currently DGX machines -- 8-GPU boxes). The scheduler and workers are executed together, again because my nodes are DGX machines and the granularity is one node.
#!/usr/bin/env bash
#SBATCH -J dask-scheduler
#SBATCH -n 1
#SBATCH -t 00:30:00
JOB_ID=${SLURM_JOB_ID%;*}
LOCAL_DIRECTORY=/gpfs/fs1/bzaitlen/dask-local-directory
UCX_NET_DEVICES=mlx5_0:1 DASK_RMM__POOL_SIZE=1GB DASK_UCX__ENABLE_INFINIBAND="True" DASK_UCX__ENABLE_NVLINK="True" \
/gpfs/fs1/bzaitlen/miniconda3/envs/dask-jq/bin/python -m distributed.cli.dask_scheduler --protocol ucx \
--scheduler-file $LOCAL_DIRECTORY/dask-scheduler.json &
unset UCX_NET_DEVICES DASK_RMM__POOL_SIZE DASK_UCX__ENABLE_INFINIBAND DASK_UCX__ENABLE_NVLINK
sleep 5
/gpfs/fs1/bzaitlen/miniconda3/envs/dask-jq/bin/python -m dask_cuda.dask_cuda_worker \
--scheduler-file $LOCAL_DIRECTORY/dask-scheduler.json \
--rmm-pool-size=1GB --enable-nvlink --enable-tcp-over-ucx --enable-infiniband --net-devices="auto" \
--local-directory=$LOCAL_DIRECTORY
#!/usr/bin/env bash
#SBATCH -J dask-worker
#SBATCH -N 1
JOB_ID=${SLURM_JOB_ID%;*}
export DASK_DISTRIBUTED__WORKER__MEMORY__Terminate="False"
export DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT="60s"
export DASK_DISTRIBUTED__COMM__TIMEOUTS__TCP="600s"
export LOCAL_DIRECTORY=/gpfs/fs1/bzaitlen/dask-local-directory
/gpfs/fs1/bzaitlen/miniconda3/envs/dask-jq/bin/python -m dask_cuda.dask_cuda_worker \
--scheduler-file $LOCAL_DIRECTORY/dask-scheduler.json \
--rmm-pool-size=1GB --enable-nvlink --enable-tcp-over-ucx --enable-infiniband --net-devices="auto" \
--local-directory=$LOCAL_DIRECTORY
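Once those two scripts have started the scheduler and workers, a client can attach through the same scheduler file, for example along these lines (using the GPFS path from the scripts above):

```python
# Sketch of a client attaching to the scheduler started by the scripts above,
# via the scheduler file they both write/read on GPFS.
from dask.distributed import Client

client = Client(scheduler_file="/gpfs/fs1/bzaitlen/dask-local-directory/dask-scheduler.json")
print(client)
```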
@mrocklin, could you point me to the setup you used on Cheyenne/Casper? I've been trying to launch a Dask cluster with the UCX protocol for communication. All my attempts have failed.
Running the following
cluster = PBSCluster(protocol="ucx://", env_extra=["export UCX_TLS=tcp,sockcm",
"export UCX_SOCKADDR_TLS_PRIORITY=sockcm",
'export UCXPY_IFNAME="ib0"'])
Results in a timeout error.
I tried launching the scheduler from the command line, and I ran into a different error.
Am I making a trivial error, or do I need to do some extra setup for things to work properly?
Ccing @quasiben in case he has some suggestions, too.
@andersy005 can you post the versions of ucx / ucx-py and what ifconfig returns on one of the compute nodes? When using IB (InfiniBand) you will also need to add rc to the UCX_TLS env var.
@quasiben,
Here are the versions I am using
# Name Version Build Channel
ucx 1.9.0+gcd9efd3 cuda11.0_0 rapidsai
ucx-proc 1.0.0 gpu conda-forge
ucx-py 0.19.0 py38_gcd9efd3_0 rapidsai
Here's the ifconfig output
I noticed that in my previous failed attempt I was on the login node. I switched to a compute node, and launching the scheduler from the command line appears to be working. One thing to note here is that the scheduler's IP address that ends up being used is the public IP address of the compute node:
$ dask-scheduler --protocol ucx
distributed.scheduler - INFO - -----------------------------------------------
/glade/work/abanihi/opt/miniconda/envs/dask-gpu/lib/python3.8/site-packages/distributed/node.py:160: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 42672 instead
warnings.warn(
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Clear task state
[1621529384.933221] [crhtc53:187174:0] ucp_context.c:735 UCX WARN network device 'mlx5_0:1' is not available, please use one or more of: 'ext'(tcp), 'ib0'(tcp), 'mgt'(tcp)
distributed.scheduler - INFO - Scheduler at: ucx://PUBLIC_IP:8786
distributed.scheduler - INFO - dashboard at: :42672
When I try to set up a cluster using dask-jobqueue on the same compute node, I get the same error as before:
In [3]: cluster = PBSCluster(protocol="ucx://", env_extra=["export UCX_TLS=rc,tcp,sockcm",
...: "export UCX_SOCKADDR_TLS_PRIORITY=sockcm",
...: 'export UCXPY_IFNAME="ib0"'])
...
RuntimeError: Cluster failed to start. Timed out trying to connect to ucx://PRIVATE_IP:38572 after 10 s
In this case, the scheduler appears to be using the private IP address instead of the public IP address.
UCXPY_IFNAME should have no effect in Dask. If you want to use a specific interface you should pass --interface/interface when launching the Dask scheduler; I'm not sure how/if that's exposed in dask-jobqueue. At the UCX level, you can limit the interfaces that the process can see with UCX_NET_DEVICES, for example UCX_NET_DEVICES=ib0, but I think you would still need to pass interface to the scheduler. Also note that you do not have InfiniBand support built in, as we don't ship conda packages with that enabled, so if that's what you're trying to test you'll have to build UCX from source.
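Translated into dask-jobqueue terms, that advice would look roughly like the sketch below; the interface name, resource values, and UCX settings are system-specific assumptions, and the rc transport still requires a UCX build with InfiniBand enabled:

```python
# Sketch following the advice above (system-specific assumptions): pin Dask to
# the ib0 interface and restrict UCX to that device in the worker job scripts.
# The rc transport only works with a UCX build that has InfiniBand enabled.
from dask_jobqueue import PBSCluster

cluster = PBSCluster(
    cores=1,           # placeholder resources
    memory="4GB",
    protocol="ucx://",
    interface="ib0",   # which network interface the scheduler/workers bind to
    env_extra=[
        "export UCX_TLS=rc,tcp,sockcm",
        "export UCX_SOCKADDR_TLS_PRIORITY=sockcm",
        "export UCX_NET_DEVICES=ib0",
    ],
)
```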
I'm glad you are trying to test this, @andersy005. It looks like some more testing and a good documentation effort are needed to make all this work.
> If you want to use a specific interface you should pass --interface/interface when launching the Dask scheduler; I'm not sure how/if that's exposed in dask-jobqueue.

Yes it is, so we should be OK on this part.

> Also note that you do not have InfiniBand support built in, as we don't ship conda packages with that enabled, so if that's what you're trying to test you'll have to build UCX from source.

I think this is what we want!
I should add here that we also tested this a few months ago and found it to give no performance benefit (at least in our use case). We also found that it kills resilience, though this may have since changed.
It seems that dask.distributed now supports the UCX protocol for communication between workers and the scheduler, which seems to have large advantages over TCP when InfiniBand is available. How can I use that with jobqueue? It seems like it should not be hard, because jobqueue is based on dask.distributed. If I add the --protocol ucx option to the scheduler and worker commands, would that be OK?