Project-HAMi / HAMi

Heterogeneous AI Computing Virtualization Middleware
http://project-hami.io/
Apache License 2.0

Why do the node's Capacity and Allocatable contain only nvidia.com/gpu and not nvidia.com/gpumem? #579

Closed liujin1993 closed 3 weeks ago

liujin1993 commented 3 weeks ago

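Why do the node's Capacity and Allocatable contain only nvidia.com/gpu and not nvidia.com/gpumem? Sometimes, before creating a pod, I want to know how much GPU memory there is in total and how much is still free.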


archlitchi commented 3 weeks ago

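You can query the cluster's remaining capacity with curl {scheduler node ip}:31993/metrics. There is no gpumem entry because it currently exists as a parameter of nvidia.com/gpu rather than as an independently schedulable resource, so it is configured to be ignored.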

liujin1993 commented 3 weeks ago

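@archlitchi Thanks! Found them: GPUDeviceMemoryLimit and GPUDeviceMemoryAllocated are the per-card total and allocated amounts, respectively. A ServiceMonitor is needed to scrape the scheduler metrics, though; deploying with Helm does not create one automatically.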

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: hami-scheduler-monitor
  namespace: kube-system
  labels:
    app.kubernetes.io/component: hami-scheduler
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: hami-scheduler
  namespaceSelector:
    matchNames:
    - kube-system
  endpoints:
  - port: monitor
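
For a quick check without Prometheus, the two metrics mentioned above can also be read straight from the scheduler's /metrics endpoint. The following is only a rough sketch, not part of HAMi: the helper names (parse_samples, remaining_memory, fetch_metrics) are made up for illustration, and it assumes that GPUDeviceMemoryLimit and GPUDeviceMemoryAllocated carry identical label sets for the same card, which I have not verified. If the labels differ, you would need to match on a specific label instead.

```python
import re
from urllib.request import urlopen

# One sample line of the Prometheus text exposition format:
#   metric_name{label="value",...} 123.0
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)'
)

def parse_samples(text, wanted):
    """Return {metric_name: {label_string: value}} for the wanted metric names."""
    out = {name: {} for name in wanted}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = SAMPLE_RE.match(line)
        if m and m.group("name") in out:
            labels = m.group("labels") or ""
            out[m.group("name")][labels] = float(m.group("value"))
    return out

def remaining_memory(text):
    """Per-card remaining memory: limit minus allocated, keyed by label string.

    Assumption: both metrics use the same label set for a given card.
    """
    samples = parse_samples(text, ("GPUDeviceMemoryLimit", "GPUDeviceMemoryAllocated"))
    limits = samples["GPUDeviceMemoryLimit"]
    allocated = samples["GPUDeviceMemoryAllocated"]
    return {labels: limit - allocated.get(labels, 0.0) for labels, limit in limits.items()}

def fetch_metrics(url):
    """Download the raw metrics text from the scheduler endpoint."""
    with urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")
```

Usage would be something like remaining_memory(fetch_metrics("http://{scheduler node ip}:31993/metrics")), with the placeholder replaced by the real address. The values are in whatever unit the scheduler reports.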
