cncf / sandbox


[Sandbox] HAMi #97

Open wawa0210 opened 3 months ago

wawa0210 commented 3 months ago

Application contact emails

limengxuan@4paradigm.com, xiaozhang0210@hotmail.com

Project Summary

Heterogeneous AI Computing Virtualization Middleware (HAMi) is an "all-in-one" tool designed to manage heterogeneous AI computing devices in a Kubernetes cluster.

Project Description

Heterogeneous AI Computing Virtualization Middleware (HAMi) is an "all-in-one" tool designed to manage Heterogeneous AI Computing Devices in a k8s cluster. It includes everything you would expect, such as:

  1. Heterogeneous AI computing device support; currently supported vendors: NVIDIA, Cambricon, Hygon, Huawei Ascend, and Iluvatar.
  2. Device sharing: Each task can allocate a portion of a device instead of the entire device, allowing a device to be shared among multiple tasks.
  3. Device Memory Control: A task can be allocated a specific amount of device memory (e.g., 3000M) or a percentage of the whole GPU's memory (e.g., 50%), and is prevented from exceeding the specified boundary.
  4. Device Type Specification: You can specify the type of device to use or avoid for a particular task by setting annotations, such as nvidia.com/use-gputype or nvidia.com/nouse-gputype.
  5. Device UUID Specification: You can specify the UUID of device to use or avoid for a particular task by setting annotations, such as nvidia.com/use-gpuuuid or nvidia.com/nouse-gpuuuid.
  6. Task priority: Tasks sharing the same AI computing device can be assigned different priorities; when resources are preempted, high-priority tasks receive higher QoS.
  7. CUDA unified memory: When GPU memory is insufficient, node (host) memory can be used to expand the memory available to a task.
  8. Easy to use: You don't need to modify your task YAML to use our scheduler. All your jobs will be automatically supported after installation. Additionally, you can specify a resource name other than nvidia.com/gpu if you prefer.
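The device-sharing and memory-control features above can be sketched as a pod spec. This is a minimal illustration, not an official example: the resource names `nvidia.com/gpumem` and `nvidia.com/gpucores` follow the conventions used in HAMi's README, but they should be verified against the version you deploy.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-shared-pod
spec:
  containers:
    - name: cuda-app
      image: nvidia/cuda:11.8.0-base-ubuntu22.04
      command: ["sleep", "infinity"]
      resources:
        limits:
          nvidia.com/gpu: 1        # request one (possibly shared) GPU
          nvidia.com/gpumem: 3000  # cap device memory at 3000M for this task
          nvidia.com/gpucores: 50  # cap at 50% of the GPU's compute cores
```

Because the limits apply per task rather than per physical device, several such pods can land on the same GPU, each fenced into its own memory and compute budget.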

The core features of HAMi are as follows:

(core features diagram)

The HAMi architecture is as follows:

(architecture diagram)

Application Scenarios

  1. Device sharing (or device virtualization) on Kubernetes.
  2. Scenarios where pods must be allocated a specific amount of device memory.
  3. Balancing GPU usage across a cluster with multiple GPU nodes.
  4. Low utilization of device memory and compute units, such as running 10 TensorFlow Serving instances on one GPU.
  5. Situations that require many small GPU slices, such as teaching scenarios where one GPU is shared by multiple students, or cloud platforms that offer small GPU instances.

Org repo URL (provide if all repos under the org are in scope of the application)

https://github.com/Project-HAMi

Project repo URL in scope of application

Core repo: https://github.com/Project-HAMi/HAMi

All other public repos under the org https://github.com/Project-HAMi/ are also in scope.

Additional repos in scope of the application

No response

Website URL

http://project-hami.io/

Roadmap

https://github.com/Project-HAMi/HAMi?tab=readme-ov-file#roadmap

Roadmap context

| Product | Manufacturer | Memory Isolation | Core Isolation | Multi-Card Support |
|---------|--------------|------------------|----------------|--------------------|
| GPU     | NVIDIA       | βœ…               | βœ…             | βœ…                 |
| MLU     | Cambricon    | βœ…               | ❌             | ❌                 |
| DCU     | Hygon        | βœ…               | βœ…             | ❌                 |
| Ascend  | Huawei       | In progress      | In progress    | ❌                 |
| GPU     | Iluvatar     | In progress      | In progress    | ❌                 |
| DPU     | Teco         | In progress      | In progress    | ❌                 |

Contributing Guide

https://github.com/Project-HAMi/HAMi/blob/master/CONTRIBUTING.md

Here are our community meeting minutes

https://docs.google.com/document/d/1YC6hco03_oXbF9IOUPJ29VWEddmITIKIfSmBX8JtGBw/edit?usp=sharing

Code of Conduct (CoC)

https://github.com/Project-HAMi/HAMi/blob/master/CODE_OF_CONDUCT.md

Adopters

We conducted a survey and found that dozens of adopters are already using HAMi; we will maintain the adopter list in the HAMi documentation. Online survey results

Contributing or Sponsoring Org

4Paradigm, DaoCloud, HuaweiCloud, Rise Union

Maintainers file

https://github.com/Project-HAMi/HAMi/blob/master/MAINTAINERS.md

IP Policy

Trademark and accounts

Why CNCF?

The CNCF is the premier organization for cloud-native technologies and is backed by many leading companies in the industry. It also provides a platform for collaboration and community-building, which can lead to increased visibility, adoption, and contributions to HAMi.

At the same time, HAMi can be combined with other outstanding CNCF projects (such as Volcano, KubeRay, and Kueue) to provide a one-stop service for AI infrastructure.

Benefit to the Landscape

As AI becomes more and more popular, many kinds of AI computing devices are emerging, with NVIDIA's being the most prominent, and many other device vendors are also actively embracing Kubernetes and the CNCF. How these numerous GPUs, NPUs, and other devices can provide a consistent experience on one platform is particularly important, and this is exactly what HAMi focuses on. With HAMi, managing and operating these GPUs and NPUs on Kubernetes is greatly simplified, and the application layer does not need to be aware of differences in the underlying hardware.

Cloud Native 'Fit'

HAMi is built with cloud-native technology. It currently uses scheduler plugins, webhooks, device plugins, and other Kubernetes mechanisms to manage and schedule heterogeneous AI computing devices. In the future, it will consider adopting DRA (Dynamic Resource Allocation) to optimize its architecture.

Cloud Native 'Integration'

HAMi reuses part of the source code of the NVIDIA device-plugin project to support basic NVIDIA GPU features. On top of this, we add the following NVIDIA GPU extensions:

  1. Device sharing: Each task can allocate a portion of a device instead of the entire device, allowing a device to be shared among multiple tasks.
  2. Device Memory Control: Devices can be allocated a specific device memory size (e.g., 3000M) or a percentage of the whole GPU's memory (e.g., 50%), ensuring it does not exceed the specified boundaries.
  3. Device Type Specification: You can specify the type of device to use or avoid for a particular task by setting annotations, such as nvidia.com/use-gputype or nvidia.com/nouse-gputype.
  4. Device UUID Specification: You can specify the UUID of device to use or avoid for a particular task by setting annotations, such as nvidia.com/use-gpuuuid or nvidia.com/nouse-gpuuuid.
  5. Scheduling enhancements: HAMi provides scheduling enhancements based on kube-scheduler, supporting binpack and spread policies at both the node and GPU device levels.
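The annotation-based device selection described in items 3 and 4 can be sketched like this. The annotation keys come from the list above; the GPU type names and the UUID value are placeholders for illustration only.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-typed-pod
  annotations:
    nvidia.com/use-gputype: "A100,V100"  # schedule only onto these GPU types (example values)
    nvidia.com/nouse-gpuuuid: "GPU-xxxx" # avoid this specific device (placeholder UUID)
spec:
  containers:
    - name: cuda-app
      image: nvidia/cuda:11.8.0-base-ubuntu22.04
      resources:
        limits:
          nvidia.com/gpu: 1
```

The selection lives in annotations rather than the resource request, so existing task YAML only needs these extra metadata lines to steer placement.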

Cloud Native Overlap

We do not think there is direct overlap with other CNCF projects at this time. However, we do touch on some areas that other projects are investigating, such as device plugins and scheduler enhancements.

Volcano also provides the ability to share GPUs. In v1.8, the volcano-vgpu features were contributed to the Volcano repo by a HAMi maintainer. However, after discussions with the Volcano maintainers, and in order to support the independent development of the HAMi community, it was decided to release them in v1.9. This functionality was later transferred to the HAMi project and is now maintained by the HAMi community (repo: https://github.com/Project-HAMi/volcano-vgpu-device-plugin).

Similar projects

Some comparisons between HAMi and similar projects:

(feature comparison image)

Comparison of GPU sharing solutions:

(comparison image)

Landscape

yes

(CNCF landscape screenshot)

HAMi is in the CNCF landscape and also in the CNAI group:

(CNAI group screenshot)

https://landscape.cncf.io/?group=cnai

Business Product or Service to Project separation

N/A

Project presentations

No response

Project champions

No response

Additional information

No response

raravena80 commented 3 weeks ago

TAG-Runtime

dims commented 6 days ago
  • Project repo URL in scope of application lists just the main repo; are the other repos out of scope for donation?
  • Is the k8s-dra-driver fork for convenience, or is it really going to be a fork?

archlitchi commented 5 days ago

All public repos are in scope for donation.

k8s-dra-driver is forked for convenience; we plan to build our own DRA driver.

wawa0210 commented 4 days ago
  • Project repo URL in scope of application lists just the main repo; are the other repos out of scope for donation?
  • Is the k8s-dra-driver fork for convenience, or is it really going to be a fork?

We've been exploring the combination of HAMi with DRA, and it is on our roadmap as well.