Welcome to kcover
, a Kubernetes solution designed to enhance the reliability and resilience of large-scale AI workloads by providing fault awareness and robust instant recovery mechanisms.
Ensure you have Kubernetes and Helm installed on your cluster. kcover
is compatible with Kubernetes versions 1.19 and above.
Install kcover
using Helm:
helm repo add baizeai https://baizeai.github.io/charts
helm install kcover baizeai/kcover --namespace kcover-system --create-namespace
Configure kcover
to monitor specific Kubernetes resources by labeling them:
kubectl label pytorchjobs <job-name> kcover.io/cascading-recovery=true
kubectl label pytorchjobs <job-name> kcover.io/need-recovery=true
Once installed, kcover
will automatically monitor the labeled resources for any signs of failures and perform recovery actions as specified in the configuration.