-
### Anything you want to discuss about vllm.
I'm experiencing a segmentation fault while running the vLLM API server with Ray for distributed inference. The issue seems to be related to NCCL initiali…
-
Currently, the submitit launcher is used to spawn multiple processes. We want to use multiprocesses.spawn if the user doesn't want to use submitit.
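
A minimal sketch of what the non-submitit path could look like, assuming plain local processes are enough. It uses the standard-library `multiprocessing` module with the `spawn` start method rather than any project-specific launcher. The names `worker_env`, `launch_local`, and `launch` are hypothetical, and the `RANK` / `WORLD_SIZE` / `MASTER_ADDR` / `MASTER_PORT` variables assume the workers join a process group via environment variables, which the request above does not state.

```python
import multiprocessing as mp
import os


def worker_env(rank, world_size, master_addr="127.0.0.1", master_port=29500):
    """Environment variables for one worker (assumed rendezvous scheme)."""
    return {
        "RANK": str(rank),
        "WORLD_SIZE": str(world_size),
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
    }


def _run_worker(rank, world_size, fn):
    # Runs inside the child process: export the rank info, then call the job.
    os.environ.update(worker_env(rank, world_size))
    fn(rank, world_size)


def launch_local(fn, world_size):
    """Spawn `world_size` local processes and wait for all of them."""
    ctx = mp.get_context("spawn")
    procs = [
        ctx.Process(target=_run_worker, args=(rank, world_size, fn))
        for rank in range(world_size)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    failed = [p.exitcode for p in procs if p.exitcode != 0]
    if failed:
        raise RuntimeError(f"worker(s) exited with codes {failed}")


def launch(fn, world_size, use_submitit=False):
    """Pick a launcher; the submitit branch is a placeholder in this sketch."""
    if use_submitit:
        raise NotImplementedError("keep the existing submitit path here")
    launch_local(fn, world_size)
```

With the `spawn` start method, `fn` has to be a picklable module-level function, and the call to `launch(fn, world_size)` should sit under an `if __name__ == "__main__":` guard.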
-
I'm running nccl-tests on 2 nodes, with one A10G on each node and GDR disabled.
Why do I see the following line in the logs: "DMA-BUF is available on GPU device 0"? Will DMA_BUF be used when GDR is disa…
-
Why wasn't the method I generated from the XML using msccl-tools invoked when I executed the following command:
>mpirun --allow-run-as-root -np 8 -x LD_LIBRARY_PATH=/home/msccl-tool/msccl/executor/msccl-exe…
-
**TL;DR:**
**Set the environment variable NCCL_ALGO=Tree if you encounter accuracy problems with NCCL on A800 hardware.**
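
For a job launched from Python, one way to apply this is sketched below. NCCL reads `NCCL_ALGO` from the environment when the communicator is created, so the variable has to be set before any NCCL initialization in the process. The helper name `nccl_env` is hypothetical and only illustrates building an environment for a child process.

```python
import os


def nccl_env(algo="Tree", base=None):
    """Return a copy of the environment with NCCL_ALGO set.

    `base` defaults to the current process environment; the original
    mapping is not modified.
    """
    env = dict(os.environ if base is None else base)
    env["NCCL_ALGO"] = algo
    return env


def force_nccl_algo(algo="Tree"):
    """Set NCCL_ALGO for the current process.

    Call this before the NCCL communicator is initialized, otherwise
    the setting has no effect on that communicator.
    """
    os.environ["NCCL_ALGO"] = algo
```

The returned mapping can be passed as `env=` to `subprocess.run` when the training script runs as a child process. From a shell, `export NCCL_ALGO=Tree` before launching has the same effect.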
-----------------------------------------------------------------------------------------…
-
I use one machine with 4 GPUs to run GPT-3.
The first iteration runs without any errors,
but the second iteration fails: it gets stuck at the second step of the second iteration.
The errors as…
-
After running successfully for several minutes, this error occurred:
**RuntimeError: NCCL communicator was aborted on rank 2. Original reason for failure was: [Rank 2] Watchdog caught collect…
-
I'm running nccl-test `all-reduce` between two nodes, and I've found that the tree algorithm performs much better than the ring algorithm. However, through reading the NCCL source code, I noticed tha…
-
I tested allgather using the NVLS algorithm and found that its performance is poor compared to allreduce using NVLS on H20 with 8 GPUs.
The bandwidth of allgather using NVLS is only 300GB while …
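
When comparing the two collectives, it helps to check that both numbers are the same kind of bandwidth. nccl-tests reports an algorithm bandwidth (`algbw`, data size divided by time) and a bus bandwidth (`busbw`), and the correction factor differs per collective: `2(n-1)/n` for allreduce and `(n-1)/n` for allgather. A small sketch of that conversion follows; the function names are hypothetical, and whether the figures above are `algbw` or `busbw` is not stated.

```python
def alg_bw(size_bytes, seconds):
    """Algorithm bandwidth in GB/s: data size divided by elapsed time."""
    return size_bytes / seconds / 1e9


def bus_bw(collective, algbw, n_ranks):
    """Bus bandwidth from algorithm bandwidth, using the nccl-tests factors."""
    n = n_ranks
    factors = {
        "allreduce": 2 * (n - 1) / n,
        "allgather": (n - 1) / n,
        "reducescatter": (n - 1) / n,
    }
    return algbw * factors[collective]
```

For 8 ranks the factors are 1.75 for allreduce and 0.875 for allgather, so equal `algbw` values correspond to a 2x difference in `busbw`.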
-
### Description
The end user may get the impression that a type hint is applied to a DAG node, as opposed to the edge between DAG nodes/tasks.
This might be partially due to the way we name t…