Project-HAMi / HAMi

Heterogeneous AI Computing Virtualization Middleware
http://project-hami.io/
Apache License 2.0

hami-scheduler cannot deploy a Knative service #630

Closed · 120L020430 closed this 11 hours ago

120L020430 commented 1 day ago

What happened: When I deploy a Knative function, the pod fails to be scheduled.

Events:
  Type     Reason            Age   From            Message
  ----     ------            ----  ----            -------
  Warning  FailedScheduling  20s   hami-scheduler  0/4 nodes are available: 3 node(s) didn't match Pod's node affinity/selector. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling.
  Warning  FilteringFailed   21s   hami-scheduler  no available node, all node scores do not meet

What you expected to happen: The service should be deployed successfully on the corresponding node.

How to reproduce it (as minimally and precisely as possible): I am using k3s + Knative. Knative is a serverless function management framework; it can be installed by following https://knative.dev/docs/install/yaml-install/serving/install-serving-with-yaml/. My service's deployment manifest is as follows:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: feature-extractor
spec:
  template:
    metadata:
      annotations:
        # Knative concurrency-based autoscaling (default).
        autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
        autoscaling.knative.dev/metric: concurrency
        # Target 40 requests in-flight per pod.
        autoscaling.knative.dev/target: "40"
        # Disable scale to zero with a min scale of 1.
        autoscaling.knative.dev/min-scale: "1"
        autoscaling.knative.dev/initial-scale: "1"
        # Limit scaling to 1 pod.
        autoscaling.knative.dev/max-scale: "1"
        autoscaling.knative.dev/scale-down-delay: "40s"
    spec:
      containerConcurrency: 40
#      nodeSelector:
#        node-role.kubernetes.io/master: "true"
      containers:
        - image: 192.168.10.82:5000/feature-extractor-gpu:v3.0
          ports:
            - containerPort: 6001
          imagePullPolicy: IfNotPresent
          name: feature-extractor
          resources:
            limits:
              nvidia.com/gpu: 8
              nvidia.com/gpumem-percentage: 80
          volumeMounts:
            - mountPath: /home/data/
              name: reid
      volumes:
        - name: reid
          persistentVolumeClaim:
            claimName: reid-pvc
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.io/hostname
                    operator: In
                    values:
                      - pve

I tried deploying three ways: not specifying a node at all, using nodeAffinity, and using a nodeSelector, but all of them fail with the error shown at the top. By contrast, deploying an ordinary Kubernetes service, or the NVIDIA pod examples from examples/, works as expected.
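For comparison, a minimal sketch of the kind of plain pod that does schedule on the same node (modeled loosely on the HAMi examples; the image, name, and values here are illustrative, not the exact manifest I used):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: ubuntu-container
      image: ubuntu:20.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          nvidia.com/gpu: 1                  # one vGPU
          nvidia.com/gpumem-percentage: 80   # 80% of the card's memory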

Anything else we need to know?:

Environment:

Nimbus318 commented 12 hours ago

Do you have 8 physical GPUs on that node?

120L020430 commented 12 hours ago

> Do you have 8 physical GPUs on that node?

Only one, but doesn't nvidia.com/gpu here refer to the vGPUs that HAMi splits a card into? The default split factor is 10, and the allocatable nvidia.com/gpu on this node is 10. And as I said, the NVIDIA examples in examples/ deploy fine; it is only in combination with Knative that deployment fails.
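As I understand it, that split factor comes from the device plugin's deviceSplitCount setting. A sketch of the relevant Helm values, assuming the key names from the HAMi configuration docs (defaults shown; verify against the chart before relying on them):

# Assumed excerpt of the HAMi Helm chart values
devicePlugin:
  deviceSplitCount: 10   # each physical GPU is advertised as 10 nvidia.com/gpu slots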

Nimbus318 commented 12 hours ago

I'm writing an FAQ about HAMi at the moment, though it's still in draft: https://v6eky86feo.feishu.cn/wiki/ViC8wFcItiCCzjkx2c3cgXMhnwh

Once it's finished it will go into the relevant documentation.

It contains an answer to this question:

(screenshot of the FAQ answer)
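The gist, as a rough sketch (the screenshot has the exact wording, and the resource names below simply mirror the manifest above): the nvidia.com/gpu limit on a single container is the number of separate GPUs to bind to it, not slices of one card, so on a node with one physical GPU a container can request at most 1. A resources block that fits one card might look like:

# Sketch only; assumes nvidia.com/gpu counts distinct GPUs per container
resources:
  limits:
    nvidia.com/gpu: 1                  # one GPU, i.e. one physical card
    nvidia.com/gpumem-percentage: 80   # 80% of that card's memory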

120L020430 commented 11 hours ago

> I'm writing an FAQ about HAMi at the moment, though it's still in draft: https://v6eky86feo.feishu.cn/wiki/ViC8wFcItiCCzjkx2c3cgXMhnwh
>
> Once it's finished it will go into the relevant documentation.
>
> It contains an answer to this question:
>
> (screenshot of the FAQ answer)

Understood, thanks a lot!