-
I have two servers, a Dell and a FusionServer, and nccl-test doesn't work across them. If all servers are the same model, nccl-test works.
My environment:
```
os: ubuntu 22.04
cuda: 12.4
NV driver: 550
```
wh…
SdEnd updated
2 months ago
-
MVAPICH requires a hostfile to be passed at the mpirun launch. If you are bundling tasks, you currently give each of them the complete hostlist for the job. Would it be possible to have METAQ split t…
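
To illustrate the kind of split being asked about, here is a minimal sketch in Python. It is not METAQ code: `split_hosts` and `write_hostfile` are hypothetical helpers, and the sketch assumes a plain hostfile with one hostname per line and tasks that each take whole nodes.

```python
from typing import List


def split_hosts(hosts: List[str], nodes_per_task: List[int]) -> List[List[str]]:
    """Partition a job's host list into disjoint, contiguous per-task lists.

    Hypothetical helper: hosts are handed out in order, nodes_per_task[i]
    hosts to task i. Hosts left over after the last task stay unassigned.
    """
    if any(n <= 0 for n in nodes_per_task):
        raise ValueError("each task needs at least one node")
    if sum(nodes_per_task) > len(hosts):
        raise ValueError("tasks request more nodes than the job has")
    chunks = []
    start = 0
    for n in nodes_per_task:
        chunks.append(hosts[start:start + n])
        start += n
    return chunks


def write_hostfile(path: str, hosts: List[str]) -> None:
    """Write one hostname per line (assumed hostfile layout)."""
    with open(path, "w") as fh:
        fh.write("\n".join(hosts) + "\n")


if __name__ == "__main__":
    job_hosts = ["node01", "node02", "node03", "node04"]
    # Two bundled tasks: one needs 1 node, the other needs 3.
    for i, chunk in enumerate(split_hosts(job_hosts, [1, 3])):
        write_hostfile(f"hostfile.task{i}", chunk)
```

Each task would then be launched with its own `hostfile.task<i>` instead of the complete job hostlist.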
-
**Is your feature request related to a problem? Please describe.**
I can't get the IP for each node before…
-
http://pi-star/admin/update_HostFile_DMRIds.php
![image](https://github.com/JTA-STAR/J-STAR/assets/22002824/b35adbb6-d67e-4be1-aa2c-53ca03ddd7cb)
-
Is it possible in zapret (I am mainly interested in nfqws) to send several different, explicitly specified fakes in a particular sequence, similar to how the latest GoodbyeDPI allows specifying several …
-
When using deepspeed for distributed training and writing a hostfile, the first line is the master, right? When running the command deepspeed --master, do these two conflict?
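
For reference, DeepSpeed's documented hostfile format is one host per line, written as `hostname slots=N`. The sketch below only shows how such a file can be read while preserving line order. `parse_hostfile` is a hypothetical helper, not part of DeepSpeed's API, and the sketch does not settle which node the launcher treats as master.

```python
from typing import Dict


def parse_hostfile(text: str) -> Dict[str, int]:
    """Parse '<hostname> slots=<n>' lines into {hostname: slots}.

    Hypothetical helper. Blank lines and '#' comments are skipped, and the
    insertion order of the returned dict follows the order of the file.
    """
    resources: Dict[str, int] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        parts = line.split()
        if len(parts) != 2 or not parts[1].startswith("slots="):
            raise ValueError(f"malformed hostfile line: {raw!r}")
        host = parts[0]
        if host in resources:
            raise ValueError(f"duplicate host: {host}")
        resources[host] = int(parts[1][len("slots="):])
    return resources


if __name__ == "__main__":
    example = "worker-1 slots=4\nworker-2 slots=4\n"
    hosts = parse_hostfile(example)
    first_host = next(iter(hosts))  # first line of the hostfile
    print(first_host, hosts)
```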
-
Paul Dando from ECMWF detected a bug in the esm-tool derived distribution of processors on ECMWF atos after I reported extremely slow execution times of AWI-CM 3.1 on the ECMWF-atos machine.
…
-
A user said in an email:
> My code doesn't actually use openMPI for communication (and none is needed for single gpu jobs), the only reason I use `mpirun ` is because it's the only way (afaik) to i…
-
I am using the Nsight Systems tool to observe the behavior of allreduce_perf on a server with 8 H800 GPUs. I found that when NCCL_P2P_USE_CUDA_MEMCPY is enabled, the nsys profile command w…
-
### System Info
2 * 4 L40s load llama2-70B, 1 model: tensorrt_llm.
using image: nvcr.io/nvidia/tritonserver:23.11-trtllm-python-py3
### Who can help?
_No response_
### Information
- […