cncf / sandbox


[Sandbox] HAMi #97

Open wawa0210 opened 3 months ago

wawa0210 commented 3 months ago

Application contact emails

limengxuan@4paradigm.com, xiaozhang0210@hotmail.com

Project Summary

Heterogeneous AI Computing Virtualization Middleware (HAMi) is an "all-in-one" tool designed to manage heterogeneous AI computing devices in a Kubernetes cluster.

Project Description

Heterogeneous AI Computing Virtualization Middleware (HAMi) is an "all-in-one" tool designed to manage Heterogeneous AI Computing Devices in a k8s cluster. It includes everything you would expect, such as:

  1. Heterogeneous AI computing device support; currently supported vendors: NVIDIA, Cambricon, Hygon, Huawei Ascend, and Iluvatar.
  2. Device sharing: Each task can allocate a portion of a device instead of the entire device, allowing a device to be shared among multiple tasks.
  3. Device Memory Control: A task can be allocated a specific amount of device memory (e.g., 3000M) or a percentage of the whole GPU's memory (e.g., 50%), and is prevented from exceeding the specified boundary.
  4. Device Type Specification: You can specify the type of device to use or avoid for a particular task by setting annotations, such as nvidia.com/use-gputype or nvidia.com/nouse-gputype.
  5. Device UUID Specification: You can specify the UUID of device to use or avoid for a particular task by setting annotations, such as nvidia.com/use-gpuuuid or nvidia.com/nouse-gpuuuid.
  6. Task priority: Tasks sharing the same AI computing device can be assigned different priorities; when resources are preempted, high-priority tasks receive higher QoS.
  7. CUDA unified memory: When GPU memory is insufficient, node (host) memory can be used to expand the memory available to a task.
  8. Easy to use: You don't need to modify your task YAML to use our scheduler. All your jobs will be automatically supported after installation. Additionally, you can specify a resource name other than nvidia.com/gpu if you prefer.
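The device-sharing and memory-control features above can be sketched as a pod spec. This is a minimal illustration, not an official example: the resource names `nvidia.com/gpumem` and `nvidia.com/gpucores` follow the conventions used in HAMi's README, but they should be verified against the version you deploy.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-shared-pod
spec:
  containers:
    - name: cuda-app
      image: nvidia/cuda:11.8.0-base-ubuntu22.04
      command: ["sleep", "infinity"]
      resources:
        limits:
          nvidia.com/gpu: 1        # request one (possibly shared) GPU
          nvidia.com/gpumem: 3000  # cap device memory at 3000M for this task
          nvidia.com/gpucores: 50  # cap at 50% of the GPU's compute cores
```

Because the limits apply per task rather than per physical device, several such pods can land on the same GPU, each fenced into its own memory and compute budget.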

The core features of HAMi are as follows:

(core features diagram)

The HAMi architecture is as follows:

(architecture diagram)

Application Scenarios

  1. Device sharing (or device virtualization) on Kubernetes.
  2. Scenarios where pods must be allocated a specific amount of device memory.
  3. Balancing GPU usage across a cluster with multiple GPU nodes.
  4. Low utilization of device memory and compute units, such as running 10 TensorFlow Serving instances on one GPU.
  5. Situations that require many small GPU slices, such as teaching scenarios where one GPU is shared by multiple students, or cloud platforms that offer small GPU instances.

Org repo URL (provide if all repos under the org are in scope of the application)

https://github.com/Project-HAMi

Project repo URL in scope of application

Core repo: https://github.com/Project-HAMi/HAMi

All other public repos under the org https://github.com/Project-HAMi/ are also in scope.

Additional repos in scope of the application

No response

Website URL

http://project-hami.io/

Roadmap

https://github.com/Project-HAMi/HAMi?tab=readme-ov-file#roadmap

Roadmap context

| Product | Manufacturer | Memory Isolation | Core Isolation | Multi-Card Support |
|---------|--------------|------------------|----------------|--------------------|
| GPU     | NVIDIA       | βœ…               | βœ…             | βœ…                 |
| MLU     | Cambricon    | βœ…               | ❌             | ❌                 |
| DCU     | Hygon        | βœ…               | βœ…             | ❌                 |
| Ascend  | Huawei       | In progress      | In progress    | ❌                 |
| GPU     | Iluvatar     | In progress      | In progress    | ❌                 |
| DPU     | Teco         | In progress      | In progress    | ❌                 |

Contributing Guide

https://github.com/Project-HAMi/HAMi/blob/master/CONTRIBUTING.md

Here are our community meeting minutes

https://docs.google.com/document/d/1YC6hco03_oXbF9IOUPJ29VWEddmITIKIfSmBX8JtGBw/edit?usp=sharing

Code of Conduct (CoC)

https://github.com/Project-HAMi/HAMi/blob/master/CODE_OF_CONDUCT.md

Adopters

We conducted a survey and found that dozens of adopters are already using HAMi; we will maintain the adopter list in the HAMi documentation. Online survey results

Contributing or Sponsoring Org

4Paradigm, DaoCloud, HuaweiCloud, Rise Union

Maintainers file

https://github.com/Project-HAMi/HAMi/blob/master/MAINTAINERS.md

IP Policy

Trademark and accounts

Why CNCF?

The CNCF is the premier organization for cloud-native technologies and is backed by many leading companies in the industry. It also provides a platform for collaboration and community-building, which can lead to increased visibility, adoption, and contributions to HAMi.

At the same time, HAMi can be combined with other outstanding CNCF projects (such as Volcano, KubeRay, and Kueue) to provide a one-stop service for AI infrastructure.

Benefit to the Landscape

As AI becomes more and more popular, many kinds of AI computing devices are emerging, with NVIDIA's being the most prominent, and many other device vendors are also actively embracing Kubernetes and the CNCF. How these numerous GPUs, NPUs, and other devices can provide a consistent experience on one platform is particularly important, and this is exactly what HAMi focuses on. With HAMi, managing and operating these GPUs and NPUs on Kubernetes is greatly simplified, and the application layer does not need to be aware of differences in the underlying hardware.

Cloud Native 'Fit'

HAMi is built with cloud-native technology. It currently uses scheduler plugins, webhooks, device plugins, and other Kubernetes mechanisms to manage and schedule heterogeneous AI computing devices. In the future, it will consider adopting DRA (Dynamic Resource Allocation) to optimize its architecture.

Cloud Native 'Integration'

HAMi reuses part of the source code of the NVIDIA device-plugin project to support basic NVIDIA GPU features. On top of this, we add the following NVIDIA GPU extensions:

  1. Device sharing: Each task can allocate a portion of a device instead of the entire device, allowing a device to be shared among multiple tasks.
  2. Device Memory Control: Devices can be allocated a specific device memory size (e.g., 3000M) or a percentage of the whole GPU's memory (e.g., 50%), ensuring it does not exceed the specified boundaries.
  3. Device Type Specification: You can specify the type of device to use or avoid for a particular task by setting annotations, such as nvidia.com/use-gputype or nvidia.com/nouse-gputype.
  4. Device UUID Specification: You can specify the UUID of device to use or avoid for a particular task by setting annotations, such as nvidia.com/use-gpuuuid or nvidia.com/nouse-gpuuuid.
  5. Scheduling enhancements: HAMi provides scheduling enhancements based on kube-scheduler, supporting binpack and spread policies at both the node and GPU device levels.
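The annotation-based device selection described in items 3 and 4 can be sketched like this. The annotation keys come from the list above; the GPU type names and the UUID value are placeholders for illustration only.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-typed-pod
  annotations:
    nvidia.com/use-gputype: "A100,V100"  # schedule only onto these GPU types (example values)
    nvidia.com/nouse-gpuuuid: "GPU-xxxx" # avoid this specific device (placeholder UUID)
spec:
  containers:
    - name: cuda-app
      image: nvidia/cuda:11.8.0-base-ubuntu22.04
      resources:
        limits:
          nvidia.com/gpu: 1
```

The selection lives in annotations rather than the resource request, so existing task YAML only needs these extra metadata lines to steer placement.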

Cloud Native Overlap

We do not think there is direct overlap with other CNCF projects at this time. However, we do touch on some areas that other projects are investigating, such as device plugins and scheduler enhancements.

Volcano also provides the ability to share GPUs. In v1.8, the volcano-vgpu features were contributed to the Volcano repo by a HAMi maintainer. However, after discussions with the Volcano maintainers, and in order to support the independent development of the HAMi community, it was decided to release them in v1.9. This functionality was later transferred to the HAMi project and is now maintained by the HAMi community (repo: https://github.com/Project-HAMi/volcano-vgpu-device-plugin).

Similar projects

Some comparisons between HAMi and similar projects:

(feature comparison image)

Comparison of GPU sharing solutions:

(comparison image)

Landscape

yes

(CNCF landscape screenshot)

HAMi is in the CNCF landscape and also in the CNAI group:

(CNAI group screenshot)

https://landscape.cncf.io/?group=cnai

Business Product or Service to Project separation

N/A

Project presentations

No response

Project champions

No response

Additional information

No response

raravena80 commented 3 weeks ago

TAG-Runtime

dims commented 6 days ago
  • Project repo URL in scope of application lists just the main repo; are the other repos out of scope for donation?
  • Is the k8s-dra-driver fork for convenience, or is it really going to be a fork?

archlitchi commented 5 days ago

All public repos are in scope for donation.

k8s-dra-driver is forked for convenience; we plan to build our own DRA driver.

wawa0210 commented 4 days ago
  • Project repo URL in scope of application lists just the main repo; are the other repos out of scope for donation?
  • Is the k8s-dra-driver fork for convenience, or is it really going to be a fork?

We've been exploring the combination of HAMi with DRA, and it is on our roadmap as well.