As a DevSecOps engineer, I would like to deploy a secondary Kubernetes cluster in Azure with enhanced node pool features

Executive summary

The current Kubernetes cluster setup lacks the flexibility to switch node pool types, which limits our capacity for resource-intensive workloads. This issue proposes creating a secondary Kubernetes cluster in Azure with a node pool that includes GPU capabilities, in order to optimize computational resources for AI-based projects.

Context

Our existing Kubernetes infrastructure does not support changes to the node pool configuration after initial setup, which restricts our ability to adapt to evolving project needs. The primary requirement for the new cluster is to support heavy AI and machine learning workloads, which need significantly more computational power, including GPUs. By leveraging Istio, which Azure Kubernetes Service (AKS) supports natively, we aim to implement a multi-cluster mesh that simplifies connectivity and management across our clusters.

TODO

[x] #204
[x] #205
- Deploy the cluster within Azure using Terraform.
- Select and configure the appropriate node pool according to identified requirements.
[x] #211
[x] #206
- Install Istio on the new Kubernetes cluster.
- Configure Istio to enable seamless communication and management between the two clusters.
[x] #217
- Create one instance of the Ollama deployment on the cluster that has GPUs.
- Create one deployment of Open WebUI on the cluster that does not have GPUs.
- Connect Open WebUI to Ollama by setting the OLLAMA_BASE_URL environment variable.
[x] #207
- Set up node labels and taints to organize nodes effectively based on their capacities and intended usage.
- Use Kubernetes affinity and anti-affinity rules to ensure optimal allocation and scheduling of workloads.
[x] #208
- Conduct tests to ensure the new cluster and its node pools are configured correctly.
- Run AI-based computational tasks to validate the performance enhancements achieved with the new setup.
[x] #209
- Document the entire setup and configuration process.
- Provide training and support to team members to adapt to the new Kubernetes environment.
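
As a rough illustration of the Ollama, Open WebUI, and node label/taint items above, the following Python sketch builds the two Deployment manifests as plain dictionaries and prints them as JSON. It is not the configuration used in this project. The label and taint key `workload=gpu`, the image names, the ports, and the Ollama URL are all assumptions made for the example, and it uses a plain `nodeSelector` instead of the affinity rules mentioned above to keep it short. The `nvidia.com/gpu` resource assumes the NVIDIA device plugin is running on the GPU nodes.

```python
import json

# Assumed values for illustration only: the label/taint key and value,
# and the URL at which Ollama would be reachable through the mesh.
GPU_LABEL = {"workload": "gpu"}
OLLAMA_URL = "http://ollama.default.svc.cluster.local:11434"


def deployment(name, image, port, env=None, gpu=False):
    """Return a single-replica apps/v1 Deployment manifest as a dict."""
    container = {
        "name": name,
        "image": image,
        "ports": [{"containerPort": port}],
        "env": [{"name": k, "value": v} for k, v in (env or {}).items()],
    }
    pod_spec = {"containers": [container]}
    if gpu:
        # Request one GPU (resource name assumes the NVIDIA device plugin).
        container["resources"] = {"limits": {"nvidia.com/gpu": 1}}
        # Schedule only onto labelled GPU nodes and tolerate their taint.
        pod_spec["nodeSelector"] = dict(GPU_LABEL)
        pod_spec["tolerations"] = [{
            "key": "workload",
            "operator": "Equal",
            "value": "gpu",
            "effect": "NoSchedule",
        }]
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": {"app": name}},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": pod_spec,
            },
        },
    }


# Ollama goes to the GPU cluster; Open WebUI goes to the non-GPU cluster
# and is pointed at Ollama through OLLAMA_BASE_URL.
ollama = deployment("ollama", "ollama/ollama", 11434, gpu=True)
webui = deployment(
    "open-webui",
    "ghcr.io/open-webui/open-webui:main",
    8080,
    env={"OLLAMA_BASE_URL": OLLAMA_URL},
)

if __name__ == "__main__":
    print(json.dumps(ollama, indent=2))
    print(json.dumps(webui, indent=2))
```

Each printed manifest can be saved to its own file and applied with `kubectl apply -f`, since kubectl accepts JSON as well as YAML. The matching node label and `NoSchedule` taint would have to be set on the GPU node pool itself, for example in the Terraform node pool definition.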

References

- Istio multicluster mesh
- Azure Istio service mesh
- AKS GPU workloads

ai-cfia / howard