dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org

Dask Merge Benchmark w/ IB #3951

Open quasiben opened 4 years ago

quasiben commented 4 years ago

In dask-cuda we benchmark how Dask+RAPIDS (cuDF) performs merges (https://medium.com/rapids-ai/high-performance-python-communication-with-ucx-py-221ac9623a6a) with a combination of accelerated computing (GPUs) and accelerated networking (NVLink / InfiniBand). Some Dask users (especially those at national labs and supercomputing facilities) have access to InfiniBand devices but not GPUs. InfiniBand NICs typically provide faster transfer speeds compared with standard Ethernet devices, but what makes them more interesting is their RDMA (https://en.wikipedia.org/wiki/Remote_direct_memory_access) capabilities.

I am interested in exploring CPU-based workloads with InfiniBand -- this would be Dask (NumPy/Pandas)+UCX-Py (with IB/RDMA) -- and comparing the same workload over TCP. I started https://github.com/rapidsai/dask-cuda/pull/338 to use NumPy/Pandas instead of only CuPy/cuDF. I need some additional help with configuring an appropriate Dask setup. I'll be running on a DGX-1 machine, which has 4 IB devices, 40 cores (80 threads), and 1 TB of memory. I was not planning on tweaking Dask too much, but I am happy to test a variety of setups if there are suggestions. I was thinking of two tests for the IB case:

1. workers are evenly divided across IB devices
2. workers are all configured on the same IB device

This is mostly due to naive assumptions about national labs using only one IB NIC per node.
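For concreteness, here is a minimal sketch of what case 2 could look like with UCX-Py on a single node. It assumes a UCX build with InfiniBand support; the device name (mlx5_0:1), the interface (ib0), and the worker counts are placeholders, not settings anyone has validated here:

    # Sketch only: pin every worker to one IB NIC via UCX environment
    # variables, then start a UCX-protocol LocalCluster on the IP-over-IB
    # interface. Device and interface names are placeholders.
    import os

    os.environ.setdefault("UCX_TLS", "rc,sm,tcp")         # allow the RC (InfiniBand) transport
    os.environ.setdefault("UCX_NET_DEVICES", "mlx5_0:1")  # same IB device for all workers

    from dask.distributed import Client, LocalCluster

    cluster = LocalCluster(
        protocol="ucx",      # UCX instead of TCP for scheduler/worker comms
        interface="ib0",     # IP-over-IB interface used to resolve addresses
        n_workers=10,
        threads_per_worker=8,
    )
    client = Client(cluster)

For case 1, the workers would presumably need to be launched as separate dask-worker processes, each with its own UCX_NET_DEVICES value, rather than from a single LocalCluster.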

In a Stack Overflow question on configuring workers (https://stackoverflow.com/a/51100959/2258766), @mrocklin generally suggested a ratio of about 8 threads per process.

dask-worker tests:

1. 1 process with 80 threads
2. 2 processes with 40 threads each
3. 10 processes with 8 threads each
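For a single-node run, these three layouts map onto LocalCluster arguments (or, equivalently, dask-worker's --nprocs/--nthreads flags). A rough sketch, with the benchmark body left as a placeholder:

    # Sketch: the three process/thread splits above, each using the DGX-1's
    # 80 hardware threads in total.
    from dask.distributed import Client, LocalCluster

    layouts = [
        dict(n_workers=1,  threads_per_worker=80),   # 1 process  x 80 threads
        dict(n_workers=2,  threads_per_worker=40),   # 2 processes x 40 threads
        dict(n_workers=10, threads_per_worker=8),    # 10 processes x 8 threads
    ]

    for kwargs in layouts:
        with LocalCluster(**kwargs) as cluster, Client(cluster) as client:
            print(cluster)  # run the merge benchmark against `client` here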

I am not as familiar with tuning Dask for CPU workloads, so if others have suggestions I would be very appreciative.

mrocklin commented 4 years ago

I'd be happy to take a look and play here. My guess is that the right thread/process ratio will depend on which particular operations turn out to be the bottleneck. Most of pandas releases the GIL pretty well, but not perfectly.
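One rough way to probe that locally (a sketch with arbitrary data sizes, not the distributed benchmark itself) is to time the same dask.dataframe merge under the threaded and the process-based schedulers; a large gap would suggest the merge is GIL-bound and favors more processes per node:

    # Sketch: compare the threaded and process-based local schedulers on the
    # same join to see how well the merge releases the GIL.
    import time
    import numpy as np
    import pandas as pd
    import dask.dataframe as dd

    def make_frame(n, npartitions, seed):
        rng = np.random.default_rng(seed)
        pdf = pd.DataFrame({"key": rng.integers(0, n, size=n),
                            "payload": rng.random(n)})
        return dd.from_pandas(pdf, npartitions=npartitions)

    left = make_frame(1_000_000, 8, seed=1)
    right = make_frame(1_000_000, 8, seed=2)
    joined = left.merge(right, on="key")

    for scheduler in ("threads", "processes"):
        start = time.time()
        joined.compute(scheduler=scheduler)
        print(f"{scheduler}: {time.time() - start:.2f} s")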
