-
We had an unplanned service disruption on the production cluster this Tuesday, 11/19. I wanted to document what happened because I think there are several important lessons we can learn from this inci…
-
Hello. I'm building software that needs a scalable linear equation solver on a cluster with multiple GPUs. The documentation for `linalg.solve` (https://docs.nvidia.com/cupynumeric/latest/api/generated/cupynumeri…
-
Hello,
I have a FASTA file containing thousands of peptide sequences. I wanted to predict their 3D structures using LocalColabFold 1.5.5 installed on an HPC cluster, and I have access to GPU clus…
-
When building a brand new cluster using these settings:
- [AWS with OpenShift Environment](https://catalog.demo.redhat.com/catalog?search=aws&item=babylon-catalog-prod%2Fsandboxes-gpte.sandbox-oc…
-
**Is your feature request related to a problem? Please describe.**
When running GEMM on Hopper, we need to choose the grid size based on the problem size, the cluster shape, and the Hopper architecture.
Curren…
-
Hi,
I was following the tutorial to get DRA running, and everything worked as expected until I installed the driver.
The kubelet plugin directly fails with:
```
Error: error…
-
Hi,
Would you be able to provide a script to download the model weights? We are running Chai1 on a cluster whose job nodes have no internet access. Having such a script would really help us!
(Sometim…
-
The RKE2 docs only mention passing the config for RKE2's internal CONTAINERD_SOCKET: https://docs.rke2.io/advanced?_highlight=gpu#deploy-nvidia-operator
NVIDIA's docs also mention CONTAINERD_CONFIG: https://…
-
### Is there an existing issue for this?
- [X] I have searched the existing issues
### Environment
```markdown
- Milvus version: 2.4.12
- Deployment mode(standalone or cluster): both
- MQ type(ro…
-
Hello, NVIDIA Team.
I'm facing an issue while configuring `dcgm-exporter` from `gpu-operator`. I have two Kubernetes clusters: one is a cluster where GPU jobs run, and the other is used for managing…