Closed liujin1993 closed 3 weeks ago
The cluster's remaining capacity can be queried via `curl {scheduler node ip}:31993/metrics`. There is no `nvidia.com/gpumem` because device memory currently exists as a parameter of `nvidia.com/gpu` rather than as an independently schedulable resource, so it is configured as `ignore`.
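As a rough sketch of how the scraped metrics can be turned into "remaining memory per card", the snippet below parses Prometheus exposition text and subtracts allocated from total. The sample text and the `deviceuuid` label name are assumptions for illustration; check the actual label names in your scheduler's `/metrics` output.

```python
# Sketch: compute per-card remaining GPU memory from the scheduler's
# Prometheus metrics. SAMPLE_METRICS is hypothetical stand-in data;
# real output comes from `curl {scheduler node ip}:31993/metrics`.
import re

SAMPLE_METRICS = """\
GPUDeviceMemoryLimit{deviceuuid="GPU-0"} 32768
GPUDeviceMemoryAllocated{deviceuuid="GPU-0"} 12288
GPUDeviceMemoryLimit{deviceuuid="GPU-1"} 32768
GPUDeviceMemoryAllocated{deviceuuid="GPU-1"} 0
"""

def parse(text):
    """Parse lines like NAME{deviceuuid="..."} VALUE into {(name, uuid): value}."""
    pattern = re.compile(r'^(\w+)\{deviceuuid="([^"]+)"\}\s+([\d.]+)$')
    out = {}
    for line in text.splitlines():
        m = pattern.match(line)
        if m:
            out[(m.group(1), m.group(2))] = float(m.group(3))
    return out

def remaining(metrics):
    """Remaining memory per card = GPUDeviceMemoryLimit - GPUDeviceMemoryAllocated."""
    result = {}
    for (name, uuid), limit in metrics.items():
        if name == "GPUDeviceMemoryLimit":
            allocated = metrics.get(("GPUDeviceMemoryAllocated", uuid), 0.0)
            result[uuid] = limit - allocated
    return result

print(remaining(parse(SAMPLE_METRICS)))
# → {'GPU-0': 20480.0, 'GPU-1': 32768.0}
```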
@archlitchi Thanks! Found it. `GPUDeviceMemoryLimit` and `GPUDeviceMemoryAllocated` are the per-card total and allocated amounts, respectively. Note that a ServiceMonitor is needed to scrape the scheduler metrics; the Helm deployment does not create one automatically:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: hami-scheduler-monitor
  namespace: kube-system
  labels:
    app.kubernetes.io/component: hami-scheduler
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: hami-scheduler
  namespaceSelector:
    matchNames:
      - kube-system
  endpoints:
    - port: monitor
```
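Once Prometheus picks up this ServiceMonitor, the remaining schedulable memory per card can be computed with a query along these lines (a sketch: the join label `deviceuuid` is an assumption, so adjust it to whatever label the two series actually share):

```promql
# Remaining schedulable GPU memory per card:
# total minus allocated, joined on the assumed deviceuuid label.
GPUDeviceMemoryLimit - on(deviceuuid) GPUDeviceMemoryAllocated
```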
Why do the node's Capacity and Allocatable only list `nvidia.com/gpu` and not `nvidia.com/gpumem`? Sometimes, before creating a pod, I want to know how much memory there is in total and how much remains.