BaizeAI / kcover

Apache License 2.0
25 stars 1 forks source link

kcover - Kubernetes Coverage for Fault Awareness and Recovery

Welcome to kcover, a Kubernetes solution designed to enhance the reliability and resilience of large-scale AI workloads by providing fault awareness and robust instant recovery mechanisms.

Features

Getting Started

Prerequisites

Ensure you have Kubernetes and Helm installed on your cluster. kcover is compatible with Kubernetes versions 1.19 and above.

Installation

Install kcover using Helm:

helm repo add baizeai https://baizeai.github.io/charts
helm install kcover baizeai/kcover --namespace kcover-system --create-namespace

Configuration

Configure kcover to monitor specific Kubernetes resources by labeling them:

kubectl label pytorchjobs <job-name> kcover.io/cascading-recovery=true
kubectl label pytorchjobs <job-name> kcover.io/need-recovery=true

Usage

Once installed, kcover will automatically monitor the labeled resources for any signs of failures and perform recovery actions as specified in the configuration.