-
Hi,
I am trying to run NVAE on my machine with your CIFAR-10 command line (changing only the .. from 8 to 4, since I own 4 GPUs):
```
export EXPR_ID=/home/dsi/eyalbetzalel/NVAE/logs
export…
```
-
When I compile Caffe with NCCL, there are errors:
src/caffe/parallel.cpp: In instantiation of ‘void caffe::NCCL::Run(const std::vector&, const char*) [with Dtype = float]’:
src/caffe/parallel.cpp:37…
-
The unit test named in the title has been using a fixed seed to mask flakiness. Suggested action:
1. Evaluate whether the test is flaky without the fixed seed. If not, remove the seed; otherwise move to 2.
2. If test is fla…
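Step 1 above can be checked mechanically: rerun the assertion body many times without seeding and count failures. This is only a sketch; the sampled values and the assertion are hypothetical stand-ins for the real test logic.

```python
import random

def run_flaky_candidate(trials=200):
    """Run the assertion body `trials` times with no fixed seed and
    return the failure count; 0 suggests the seed can be dropped."""
    failures = 0
    for _ in range(trials):
        # Hypothetical assertion body: replace with the real test's logic.
        sample = [random.random() for _ in range(10)]
        ok = sum(sample) < 10  # trivially true here; a real test may not be
        if not ok:
            failures += 1
    return failures
```

If the count is consistently zero across many runs, the seed is masking nothing and can be removed.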
-
When I run _Meta-Llama-3-8B-Instruct_ or _Meta-Llama-3.1-8B-Instruct_ with
1. python 3.12.5
2. scalellm 0.1.9+cu118torch2.2.2
3. torch 2.2.2+cu1…
-
Currently, we build two wheel variants: `xgboost-cpu` (which excludes GPU code) and `xgboost` (where the GPU code targets CUDA 12.4). In #10729, `xgboost` is found to conflict with another package us…
-
Across scenarios including PCIe, RDMA, TCP/IP, and others, I am not sure what kind of test is appropriate.
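One way to cover each transport scenario is to run the same nccl-tests binary while steering NCCL with its standard environment variables; the binary path, GPU counts, and hostnames below are placeholders:

```shell
# TCP/IP sockets only: disable InfiniBand/RoCE and GPU peer-to-peer
NCCL_IB_DISABLE=1 NCCL_P2P_DISABLE=1 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4

# PCIe peer-to-peer (no NVLink): cap P2P at the PCIe level
NCCL_P2P_LEVEL=PXB ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4

# RDMA (InfiniBand/RoCE) across two nodes via MPI, one rank per node
mpirun -np 2 -H host1:1,host2:1 -x NCCL_IB_DISABLE=0 \
    ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
```

Comparing the bus-bandwidth numbers across these runs shows which transport NCCL actually used in each case.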
-
I set max_steps=500 and save_steps=100.
When training reaches step 100, the checkpoint is saved successfully, but an NCCL timeout is reported.
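A checkpoint save on one rank can stall the other ranks past NCCL's default watchdog timeout. One common workaround is to raise the timeout when initializing the process group; the two-hour value below is an assumption to tune to your actual save duration:

```python
from datetime import timedelta

# Assumed value: long enough to cover the slowest checkpoint save.
NCCL_TIMEOUT = timedelta(hours=2)

def init_distributed():
    # Imported lazily so this sketch loads even without torch installed.
    import torch.distributed as dist
    # env:// rendezvous assumes torchrun-style MASTER_ADDR/RANK/WORLD_SIZE.
    dist.init_process_group(backend="nccl", timeout=NCCL_TIMEOUT)
```

Trainers that call `init_process_group` themselves usually expose this timeout through their own config instead.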
-
Hi. PyTorch distributed training and inference across multiple NVIDIA GPUs relies on the NCCL communication framework (https://github.com/NVIDIA/nccl). I eagerly need NCCL support in javacpp-pytorch. Thanks!
-
I am using the `mpirun` command to test the all_reduce_perf binary from nccl-tests on two servers within the same local area network. I can run other files normally with `mpirun`, but w…
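For reference, a minimal two-node invocation looks like the following; the hostnames, binary path, and network interface are placeholders for the actual setup:

```shell
# One rank per server (-np 2, -H host:slots), one GPU per rank (-g 1).
# NCCL_DEBUG=INFO prints which transport and interface NCCL selects;
# NCCL_SOCKET_IFNAME pins the bootstrap/socket traffic to one NIC.
mpirun -np 2 -H server1:1,server2:1 \
    -x NCCL_DEBUG=INFO \
    -x NCCL_SOCKET_IFNAME=eth0 \
    ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
```

If only all_reduce_perf hangs across nodes, the NCCL_DEBUG output usually shows whether the ranks fail while opening connections (firewall or wrong interface) rather than during the collective itself.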
-
I can add an NCCL tests example, but before I do it would be great to know whether that's something that would be accepted.