Multi-GPU not supported for Windows

KonradDanielewski commented 10 months ago

This is more of an informational note than a bug report. A native Windows build of NCCL is not available via conda-forge (nor is one available according to the Nvidia docs). I don't know whether there is a build precompiled with CUDA or one made specifically for Windows.

I'll try to compile a system-agnostic one and check whether it works. I ran into this because I have a dataset of 84 recordings (45k frames each) that doesn't fit on one 4090. I tried enabling two GPUs with:

from jax_moseq.utils import set_mixed_map_gpus
set_mixed_map_gpus(2)

Then, when running:

model = kpms.init_model(data, pca=pca, **config())

it throws:

C:\anaconda3\envs\keypoint_moseq_gpu\lib\site-packages\jax\_src\dispatch.py:380: UserWarning:

The jitted function resample_discrete_stateseqs includes a pmap. Using jit-of-pmap can lead to inefficient data movement, as the outer jit does not preserve sharded data representations and instead collects input and output arrays onto a single device. Consider removing the outer jit unless you know what you're doing. See https://github.com/google/jax/issues/2926.

2023-11-16 15:30:23.348349: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2153] Execution of replica 1 failed: UNIMPLEMENTED: NCCL support is not available: this binary was not built with a CUDA compiler, which is necessary to build the NCCL source library.
2023-11-16 15:30:34.961307: F external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2298] Replicated computation launch failed, but not all replicas terminated. Aborting process to work around deadlock. Failure message (there may have been multiple failures, see the error log for all failures):

NCCL support is not available: this binary was not built with a CUDA compiler, which is necessary to build the NCCL source library.

I found this issue, which may be useful in implementing a solution for Windows: https://github.com/tensorflow/tensorflow/issues/21470
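Until a Windows-compatible NCCL build exists, one way to avoid the hard abort shown in the log is to decide the GPU count before enabling multi-GPU mode. The following is only a minimal sketch: `usable_gpu_count` is a hypothetical helper, not part of keypoint-moseq or jax_moseq, and it assumes that any multi-GPU request on native Windows would fail the way it did here.

```python
import platform


def usable_gpu_count(requested, available, system=None):
    """Return how many GPUs to ask for, given the platform.

    Hypothetical helper. Assumption: on native Windows the JAX binary
    has no NCCL support, so requesting more than one GPU would abort.
    """
    if requested < 1:
        raise ValueError("requested must be at least 1")
    # Default to the operating system reported by the standard library.
    system = system or platform.system()
    # Never ask for more devices than are present (and never fewer than one).
    count = min(requested, max(available, 1))
    if system == "Windows" and count > 1:
        # Assumed limitation: fall back to a single device on Windows.
        return 1
    return count
```

With JAX installed, `available` could be taken from `jax.local_device_count()`, and `set_mixed_map_gpus` would then be called only when the returned count is greater than 1.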

calebweinreb commented 10 months ago

Thanks for the info! So far we haven't had many users trying to use multiple GPUs on Windows so haven't seen this yet. Keep me posted if you figure out a solution! I wonder if using system installs of CUDA/cudnn would help?

KonradDanielewski commented 10 months ago

Multi-GPU support relies on NCCL (NVIDIA Collective Communications Library). There is apparently a system-agnostic version available; I'll try to compile it, add it to my CUDA installation, and see if it works.

Technically it's not a big issue: I can just train the model on part of the data, which should be fine as long as I shuffle properly across all the groups.
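The subsampling workaround could look roughly like the sketch below. It assumes the recordings are identified by name and already grouped in a dictionary; `subsample_recordings` and `recordings_by_group` are hypothetical names for illustration, not keypoint-moseq APIs. Each group is shuffled separately so that every group stays represented in the training subset.

```python
import random


def subsample_recordings(recordings_by_group, fraction, seed=0):
    """Pick a shuffled subset of recording names from every group.

    Hypothetical helper: recordings_by_group maps a group label to a
    list of recording names; fraction is the share to keep per group.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    selected = []
    for group in sorted(recordings_by_group):
        names = list(recordings_by_group[group])
        rng.shuffle(names)
        # Keep at least one recording per group so no group is dropped.
        keep = max(1, round(len(names) * fraction))
        selected.extend(names[:keep])
    return selected
```

The returned names would then be used to restrict the dataset to those recordings before fitting, so that the model fits on a single GPU.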

dattalab / keypoint-moseq

Multi-GPU not supported for Windows #111