-
### Description
**Context**
I have a GPU node pool, which defaults to 0 active nodes in order to save compute resources.
When I submit tasks that require a GPU, that node pool is scaled on deman…
-
`mig-parted apply` returns the following error in some circumstances:
```
time="2024-09-30T19:49:46Z" level=error msg="\nThe following GPUs could not be reset:\n GPU 00000000:00:06.0: In use by anoth…
-
Hi,
Is there any metric that exposes information about the MIG devices? I have a MIG setup on a DGX A100, but I am not sure whether the devices should be identified automatically or whether I need to do something extra.
Many thanks
-
**Description**
When there are multiple GPUs, only one GPU is used.
**Triton Information**
Container: nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3
**To Reproduce**
Follow the instrcutio…
-
**What happened**: node-feature-discovery of gpu-operator sends excessive LIST requests to the API server
**What you expected to happen**:
Recently I got several alerts from K8S cluster which desc…
-
Hi 👋
We've upgraded from kops 1.23 to 1.26 (provider `1.26.0-rc1`). The upgrade was successful after some trial and error. Now, when we run apply again, the updater is always triggered:
```
#…
-
### 🐛 Describe the bug
When processing a complex data type, torch.linalg.vector_norm raises an overflow error.
```python
import torch
>>> torch.linalg.vector_norm(torch.randn(3, 3), torch.tensor(2…
-
**Description**
I noticed that a model with several instances is slower than with one. I believe that this should not be the case, but throughput and latency indicators say the opposite.
**Triton …
-
### Description
**Observed Behavior**:
Nodes have been running for 15h without actual workloads. Only daemonset pods are running on them.
**Expected Behavior**:
Karpenter deletes the underutilize…
-
When querying VM specs for NHNCloud's KR1 region, the following items are returned.
Among them, g2.v100.xxx, g2.t4.yyy, and similar entries are GPU instances, so GPU-related details should be returned along with them.
![image](https://github.com/cloud-barista/cb-spider/assets/2516326…