-
I'm running into cases where I have to allgather a few extra bytes (say 4 bytes), which makes the data not perfectly 32-byte or 64-byte aligned. While doing this, substantial performance degradation was observe…
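For reference, this kind of size can be exercised directly with nccl-tests; below is a minimal sketch, assuming nccl-tests is built in `./build` and 8 local GPUs are visible (the sizes and the 4-byte offset are illustrative, not taken from the report):

```shell
# Baseline: a 64-byte-aligned allgather size
./build/all_gather_perf -b 1M -e 1M -g 8

# Same size plus 4 extra bytes, so the buffer is no longer a multiple of 32 or 64 bytes
./build/all_gather_perf -b 1048580 -e 1048580 -g 8
```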
-
I was debugging the following issue in PyTorch with regard to NCCL send/recv: https://github.com/pytorch/pytorch/issues/50092. I tried to see if I could somehow reproduce the issue in NCCL itself to …
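For anyone attempting something similar, point-to-point traffic can also be driven outside PyTorch with the sendrecv test from nccl-tests; a minimal sketch, assuming nccl-tests is built in `./build` and two GPUs are visible (sizes are illustrative):

```shell
# Plain NCCL send/recv sweep between two local GPUs, independent of PyTorch
NCCL_DEBUG=INFO ./build/sendrecv_perf -b 8 -e 128M -f 2 -g 2
```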
-
Motivation for building functionality similar to NCCLX, as described in sections 3.3.3 Collective Communication and 3.3.4 Reliability and Operational Challenges of the [Llama-3 paper](https://arxiv.org/abs/2407.21783).
-
### Describe the bug
This time I set the number of steps to 2 to make sure it correctly saves the model after an hour of training, but it does not.
### Reproduction
Run `accelerate config`
```
comp…
```
-
We hit a strange issue when running benchmark tests on A100 GPUs. The command is as follows:
mpirun -np 16 -H rdma1:8,rdma2:8 --allow-run-as-root -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x NCCL_…
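The command above is truncated in the preview; for comparison, a typical full two-node nccl-tests invocation looks roughly like the sketch below (the exported variables after NCCL_DEBUG and the choice of all_reduce_perf are assumptions, not taken from the report):

```shell
mpirun -np 16 -H rdma1:8,rdma2:8 --allow-run-as-root \
  -bind-to none -map-by slot \
  -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
  ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
```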
-
I am trying to train models on multiple nodes with SLURM as the workload manager. The issue seems to be that the Python virtual environment is not available on all nodes. Please find more details below. …
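A common workaround is to place the virtual environment on a filesystem mounted by every node and activate it inside the batch script; a minimal sketch, assuming a shared path such as `/shared/venv` and a hypothetical `train.py`:

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8

# Activate the environment from shared storage on every node, then launch training
srun bash -c 'source /shared/venv/bin/activate && python train.py'
```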
-
Hi,
I use Arch Linux with dual GPUs connected with NVLink. I installed `cuda` and `nccl` from the community repo.
```
cuda 11.8.0-1
nccl 2.15.5-1
```
I use the following command:
`CUDA_…
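One way to sanity-check NCCL over NVLink on a dual-GPU box is with nccl-tests; a minimal sketch, assuming nccl-tests is built in `./build` (the sizes are illustrative, not from the original report):

```shell
# Confirm the NVLink connection between the two GPUs
nvidia-smi topo -m

# Run an all-reduce sweep across both GPUs with debug logging
CUDA_VISIBLE_DEVICES=0,1 NCCL_DEBUG=INFO \
  ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 2
```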
-
Hello,
After a few experiments, it seems that NCCL uses a double-ring topology for data transfer. Is double ring the default? Or is there an option to change to a single-ring topology? I am investigatin…
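For what it's worth, the number of rings/channels can be inspected and constrained through NCCL's documented environment variables; a minimal sketch using nccl-tests (the binary and sizes are illustrative):

```shell
# Inspect how many channels NCCL builds (look for the "Channel 00/..." lines)
NCCL_DEBUG=INFO ./build/all_reduce_perf -b 128M -e 128M -g 2

# Constrain NCCL to a single channel and compare bandwidth
NCCL_DEBUG=INFO NCCL_MIN_NCHANNELS=1 NCCL_MAX_NCHANNELS=1 \
  ./build/all_reduce_perf -b 128M -e 128M -g 2
```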
-
Hi, developers:
I run nccl-tests (all_reduce_perf) with 8 GPUs (NVIDIA A30) and 2 NICs (100G) between two identical GPU servers (PCIe 4.0). The topology of each GPU server is as follows:
```shell
GPU0 GPU…
```
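With two NICs per server, it can also help to make the NIC selection explicit and check which interfaces NCCL picks; a minimal sketch, with placeholder hostnames and HCA names:

```shell
# Show how GPUs and NICs attach to the PCIe/NUMA topology
nvidia-smi topo -m

# Pin NCCL to specific HCAs and log which NIC each channel uses
mpirun -np 16 -H host1:8,host2:8 --allow-run-as-root \
  -x NCCL_DEBUG=INFO -x NCCL_IB_HCA=mlx5_0,mlx5_1 \
  ./build/all_reduce_perf -b 8 -e 4G -f 2 -g 1
```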
-
File "./tools/train_net_multi_gpu.py", line 109, in
max_iter=args.max_iters, gpus=gpus)
File "/home/jzheng/PycharmProjects/bottom-up-attention/tools/../lib/fast_rcnn/train_multi_gpu.py", li…