Project-HAMi / HAMi

Heterogeneous AI Computing Virtualization Middleware
http://project-hami.io/
Apache License 2.0

InvalidImageName when deploying HAMi after latest update #621

Closed musoles closed 5 days ago

musoles commented 1 week ago

What happened: I noticed that since the last update to 2.4.1, the Helm chart has changed significantly; for instance, the fields scheduler.kubeScheduler.imageTag, devicePlugin.deviceMemoryScaling and devicePlugin.deviceSplitCount are no longer there. The worst outcome is that when I deploy now, I get an InvalidImageName error in the hami-vgpu-scheduler pod.

Here's the pod description:

Failed to apply default image tag "registry.cn-hangzhou.aliyuncs.com/google_containers/kube-scheduler:v1.31.1+k3s1": couldn't parse image name "registry.cn-hangzhou.aliyuncs.com/google_containers/kube-scheduler:v1.31.1+k3s1": invalid reference format

What you expected to happen: I expect the HAMi scheduler to be deployed as it was before the change. It would also be good to have a guide on how to transition to the newer version (how do I specify device split count and memory scaling?) and on how to stick to the old version (I've tried chart version 2.4.0 but I see the same error).

How to reproduce it (as minimally and precisely as possible): Run the installer in a Kubernetes 1.31.1 cluster:

helm install hami-vgpu hami-charts/hami -n kalavai

Anything else we need to know?:

N/A

Environment:

musoles commented 1 week ago

I suspect the issue is with the image tag v1.31.1+k3s1; as far as I know, Docker does not allow "+" symbols in tags. Where is this tag being captured? I don't know if I can define it in the chart.
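The k3s server version string (`v1.31.1+k3s1`) does indeed fail the image reference grammar: per the Docker/OCI distribution reference spec, a tag is one word character followed by up to 127 characters from `[A-Za-z0-9_.-]`, so "+" is never valid. A minimal sketch of that rule:

```python
import re

# Tag grammar from the Docker / OCI distribution reference spec:
# one word character, then up to 127 of [A-Za-z0-9_.-]. "+" is not allowed.
TAG_RE = re.compile(r"[\w][\w.-]{0,127}")

def is_valid_tag(tag: str) -> bool:
    """Return True if `tag` is a syntactically valid image tag."""
    return TAG_RE.fullmatch(tag) is not None

print(is_valid_tag("v1.31.1+k3s1"))  # False: "+" breaks the grammar
print(is_valid_tag("v1.31.1-k3s1"))  # True: "-" is allowed
```

This is why the scheduler pod reports "invalid reference format" as soon as the auto-detected cluster version is used verbatim as a tag.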

Luigi600 commented 1 week ago

Previously, it was possible to set the version using a Helm value. However, that was removed in commit 7c6b722 in favor of automatic detection. There was a similar K3s-related bug in the Rancher Turtles project; it was solved there with a custom Helm function, see rancher/turtles#283.
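That kind of fix boils down to sanitizing the detected cluster version before using it as an image tag. A hypothetical Helm template sketch (not HAMi's actual chart; the value paths are taken from the old chart and the `replace` call is Sprig's standard string function):

```yaml
{{- /* Prefer an explicit override; otherwise derive the tag from the cluster
       version, replacing "+" (e.g. v1.31.1+k3s1 -> v1.31.1-k3s1) so the
       resulting image reference stays valid. */ -}}
{{- $tag := .Values.scheduler.kubeScheduler.imageTag | default (.Capabilities.KubeVersion.Version | replace "+" "-") }}
image: {{ .Values.scheduler.kubeScheduler.image }}:{{ $tag }}
```

`.Capabilities.KubeVersion.Version` is Helm's built-in view of the API server version, which is where the `+k3s1` suffix comes from on k3s clusters.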

Nimbus318 commented 1 week ago

@musoles I have created a new PR to fix this bug. The PR also restores compatibility with the parameters scheduler.kubeScheduler.imageTag, devicePlugin.deviceMemoryScaling, and devicePlugin.deviceSplitCount, allowing users to configure them via helm install or helm upgrade.

I've also added a brief guide in docs/config.md on how to modify other configurations, including examples for both Helm and ConfigMap edits.
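With those values restored, overriding them at install time could look like the following (the parameter names come from this thread; the concrete values are placeholders, not recommendations):

```shell
# Pin the kube-scheduler image tag explicitly and configure the device plugin.
helm upgrade --install hami-vgpu hami-charts/hami -n kalavai \
  --set scheduler.kubeScheduler.imageTag=v1.31.1 \
  --set devicePlugin.deviceSplitCount=10 \
  --set devicePlugin.deviceMemoryScaling=2
```

Pinning the tag sidesteps the auto-detection entirely, which also works as a workaround on k3s clusters where the detected version carries a "+" suffix.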