-
**Environment:**
1. Framework: TensorFlow
2. Framework version: 2.12.0
3. Horovod version: horovod-0.27.0
4. MPI version: openmpi-4.1.5
5. CUDA version: 11.8
6. NCCL version:
![image](https://…
-
Environment:
NCCL Version: 2.4.8 and 2.5.7
CUDA Version: 10.0
OS Version: CentOS 7
Problem:
We are running 4 containers on the same node, with 1 GPU per container. The docker run command is:
`docke…
-
Is there a reason for using Standard_NC24s_v3 rather than the RDMA capable Standard_NC24rs_v3?
-
### Description
I use Ray on an HPC cluster. The cluster has InfiniBand, which offers low latency and high bandwidth. Ray is built on gRPC, and its data transfer uses gRPC too. I can use IPoIB (Internet …
-
Since some features of PyTorch are not yet supported for ComplexFloat tensors, it would be desirable to have a "switch" to turn off complex tensors completely (maybe in a different branch?).
My part…
-
**Describe the bug**
From the file `Megatron-LM/megatron/training/arguments.py`:
```
group.add_argument('--no-position-embedding',
                   action='store_false',
…
```
-
The example runs on the NCCL backend for distributed GPU settings. Can it also profile correctly in a multi-node (multiple CPU servers) distributed CPU setting with the Gloo backend?
…
-
**Environment:**
1. Framework: TensorFlow
2. Framework version: 1.15.0
3. Horovod version: 0.19.1
4. MPI version: 4.0.2
5. CUDA version: 10.0
6. NCCL version: 2.4.7
7. Python version: 3.6.8
8…
-
## 🚀 Feature
Consider the following piece of code
```python
def write_preds_to_file(predictions, filename):
    prediction_tensor = torch.tensor(predictions)
    prediction_tensor = idist.all_g…
```
-
### Description of the bug
The first run, as suggested by step 8 in `README_Ubuntu_CUDA_Acceleration_en_US.md`, fails.
### How to reproduce the bug
- Explain the steps require…