-
Hello, I want to use two non-blocking GPU streams, one for communication and one for cuMemcpyAsync, to speed things up.
GPU: V100 32GB. NCCL: version 2.13.4+cuda11.7, and I use IB.
I mean, does NCCL use …
-
Hello,
I encountered an error while trying to install and run the script on Linux. The script ran fine, but when I tried to start the app I got an error message.
```
Traceback (most recent call last…
```
-
(base) root@6633711ec9b0:/home/data/VisualGLM-6B# bash finetune/finetune_visualglm_qlora.sh
NCCL_DEBUG=info NCCL_IB_DISABLE=0 NCCL_NET_GDR_LEVEL=2 deepspeed --master_port 16666 --include localhost:0 …
-
I am working on a plugin to use a different algorithm for allreduce. While I have been able to understand most of the code required, I still have a few questions:
1) I defined my plugin and run the…
-
[2021-08-17T22:56:28.664111] Starting Linux command : python train.py --epochs 1 --data-dir /mnt/batch/tasks/shared/LS_root/jobs/opendatasetspmworkspace/azureml/6215701e-b1ef-42d0-91d1-864583d0db…
-
Details:
Traceback (most recent call last):
  File "/gf3/home/lei/zhenghao/Autoplanner/test/manual_pp/pipeline2x4_ptip.py", line 178, in <module>
    run_stage()
  File "/gf3/home/lei/zhenghao/Autoplanner…
-
**Describe the bug**
I am trying to use CUTLASS Python and build it from source.
My environment consists of Ubuntu 18.04, CUDA 11.8, an NVIDIA Tesla V100 (Volta) GPU, Python 3.10, make 3.19 and GCC versio…
-
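1. Following the official tutorial, installation and the image classification test worked without any problems, but my own object detection test fails. The code I ran is as follows:
![WeChat image_20240718103926](https://github.com/user-attachments/assets/7151813e-43dc-45c0-bdaa-b0b2dc0221a3)
Starting the Docker container: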
docker run --name paddlex -v /model_p…
-
I am launching nccl.collective_permute on a trn1.32xlarge. Within the workload, each neuron core sends data to a neighboring worker following a pre-specified topology. However, some of the workers exper…
-
### Description
I first encountered this error when running multi-node. Then, after I ran some other code, single-node runs, which had worked before, also hit this problem. I think this error has s…