-
### Checklist
- [X] 1. I have searched related issues but cannot get the expected help.
- [X] 2. The bug has not been fixed in the latest version.
- [X] 3. Please note that if the bug-related issue y…
-
We have a server with 8 H100 GPUs, CUDA version 12.6, and NCCL version 2.23.4.
When we run the NCCL tests using the command provided at https://github.com/nvidia/nccl-tests, we are facing belo…
-
**Describe the bug**
The SoftDiceclDiceLoss implementation differs from Dice loss and, in its current form, cannot be swapped with Dice or the other popular losses offered. There is no option for …
-
### Your current environment
```text
PyTorch version: 2.5.1+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.6 LTS (x86_64)
GCC ve…
-
docker.io/utkuozdemir/nvidia_gpu_exporter:v1.2.1
-
Hi UCC maintainer,
I am wondering whether UCC could support collective communications among NVIDIA and AMD GPUs in one ML workload. Say the collective ring has half NVIDIA and half AMD GPUs.
Best,
…
-
### Solution to issue cannot be found in the documentation.
- [x] I checked the documentation.
### Issue
I initially opened this issue on the [JAX repo](https://github.com/jax-ml/jax/issues/24604) …
-
I am using EndeavourOS. My laptop has both an Intel and an NVIDIA GPU. I activated NVIDIA mode and restarted. When I run `envycontrol --query`, it reports nvidia, but the NVIDIA GPU doesn't work.
I can confirm nvid…
-
### Description
We have been using Bottlerocket 1.25 for the past 3 days. Since the upgrade, some of our GPU nodes (`g5.xlarge`) are failing to initialize the NVIDIA driver on Kubernetes 1.30. The af…
-
### What is the issue?
Hello,
This bug is related to a boot failure on Ubuntu Server 22.04 with an `Out of memory` error.
I'm trying to install Ollama on Ubuntu Server 22.04 to run a local dedi…