Project-HAMi / HAMi

Heterogeneous AI Computing Virtualization Middleware
http://project-hami.io/
Apache License 2.0

hami-scheduler cannot deploy a Knative service #630

Closed · 120L020430 closed this 11 hours ago

120L020430 commented 1 day ago

What happened: When I deploy a Knative function, the pod fails to be scheduled.

Events:
  Type     Reason            Age   From            Message
  ----     ------            ----  ----            -------
  Warning  FailedScheduling  20s   hami-scheduler  0/4 nodes are available: 3 node(s) didn't match Pod's node affinity/selector. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling.
  Warning  FilteringFailed   21s   hami-scheduler  no available node, all node scores do not meet

What you expected to happen: The service should be deployed successfully on the corresponding node.

How to reproduce it (as minimally and precisely as possible): I am using k3s + Knative. Knative is a serverless function management framework; it can be installed by following https://knative.dev/docs/install/yaml-install/serving/install-serving-with-yaml/. My service's deployment manifest is as follows:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: feature-extractor
spec:
  template:
    metadata:
      annotations:
        # Knative concurrency-based autoscaling (default).
        autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
        autoscaling.knative.dev/metric: concurrency
        # Target 40 requests in-flight per pod.
        autoscaling.knative.dev/target: "40"
        # Disable scale to zero with a min scale of 1.
        autoscaling.knative.dev/min-scale: "1"
        autoscaling.knative.dev/initial-scale: "1"
        # Limit scaling to 1 pod.
        autoscaling.knative.dev/max-scale: "1"
        autoscaling.knative.dev/scale-down-delay: "40s"
    spec:
      containerConcurrency: 40
#      nodeSelector:
#        node-role.kubernetes.io/master: "true"
      containers:
        - image: 192.168.10.82:5000/feature-extractor-gpu:v3.0
          ports:
            - containerPort: 6001
          imagePullPolicy: IfNotPresent
          name: feature-extractor
          resources:
            limits:
              nvidia.com/gpu: 8
              nvidia.com/gpumem-percentage: 80
          volumeMounts:
            - mountPath: /home/data/
              name: reid
      volumes:
        - name: reid
          persistentVolumeClaim:
            claimName: reid-pvc
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.io/hostname
                    operator: In
                    values:
                      - pve

I tried deploying three ways: not specifying a node at all, using nodeAffinity, and using a nodeSelector, but all of them fail with the error shown at the top. By contrast, deploying an ordinary Kubernetes service, or the NVIDIA pod examples from examples/, works as expected.
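For comparison, a minimal sketch of the kind of plain pod that does schedule on the same node (modeled loosely on the HAMi examples; the image, name, and values here are illustrative, not the exact manifest I used):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: ubuntu-container
      image: ubuntu:20.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          nvidia.com/gpu: 1                  # one vGPU
          nvidia.com/gpumem-percentage: 80   # 80% of the card's memory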

Anything else we need to know?:

Environment:

Nimbus318 commented 12 hours ago

Do you have 8 physical GPUs on that node?

120L020430 commented 12 hours ago

> Do you have 8 physical GPUs on that node?

Only one, but doesn't nvidia.com/gpu here refer to the vGPUs that HAMi splits a card into? The default split factor is 10, and the allocatable nvidia.com/gpu on this node is 10. And as I said, the NVIDIA examples in examples/ deploy fine; it is only in combination with Knative that deployment fails.
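As I understand it, that split factor comes from the device plugin's deviceSplitCount setting. A sketch of the relevant Helm values, assuming the key names from the HAMi configuration docs (defaults shown; verify against the chart before relying on them):

# Assumed excerpt of the HAMi Helm chart values
devicePlugin:
  deviceSplitCount: 10   # each physical GPU is advertised as 10 nvidia.com/gpu slots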

Nimbus318 commented 12 hours ago

I'm writing an FAQ about HAMi at the moment, though it's still in draft: https://v6eky86feo.feishu.cn/wiki/ViC8wFcItiCCzjkx2c3cgXMhnwh

Once it's finished it will go into the relevant documentation.

It contains an answer to this question:

(screenshot of the FAQ answer)
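The gist, as a rough sketch (the screenshot has the exact wording, and the resource names below simply mirror the manifest above): the nvidia.com/gpu limit on a single container is the number of separate GPUs to bind to it, not slices of one card, so on a node with one physical GPU a container can request at most 1. A resources block that fits one card might look like:

# Sketch only; assumes nvidia.com/gpu counts distinct GPUs per container
resources:
  limits:
    nvidia.com/gpu: 1                  # one GPU, i.e. one physical card
    nvidia.com/gpumem-percentage: 80   # 80% of that card's memory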

120L020430 commented 11 hours ago

> I'm writing an FAQ about HAMi at the moment, though it's still in draft: https://v6eky86feo.feishu.cn/wiki/ViC8wFcItiCCzjkx2c3cgXMhnwh
>
> Once it's finished it will go into the relevant documentation.
>
> It contains an answer to this question:
>
> (screenshot of the FAQ answer)

Understood, thanks a lot!