Known Issues
Kubeflow deployment is currently broken because the current Kubeflow release is incompatible with Kubernetes 1.22. The Kubeflow deployment will be updated to add support once Kubeflow 1.6 is released.
General
Rework of a large portion of the documentation
Updates to NCCL tests
Various bug fixes
Slurm
Update to Slurm 22.05.2
Add Alertmanager integration
Option to share Slurm configuration among nodes via NFS
Enhancements to Slurm re-install/re-build tasks
Kubernetes
Update to Kubernetes 1.24.4
Update to GPU Operator 1.11.1 (GPU driver branch 515)
Changes
Bugs/Enhancements
Update NVIDIA driver role (#1216)
Update Kubespray submodule URL (#1200)
Add Alertmanager to Slurm cluster deployment (#1198)
Fix Slurm configuration GRES syntax (#1196)
Update Pyxis image cache size (#1191)
Updates to documentation (#1188)
Fix Slurm reinstall/rebuild tasks (#1187)
Update MetalLB helm repo (#1185)
Update EPEL GPG key (#1184)
Add option to share Slurm configuration among nodes (#1182)
Update NCCL tests (#1180, #1209)
Fix PATH for NetApp Trident (#1176)
Update default Slurm version to 21.08.8 (#1169, #1171)
Update NVIDIA signing key (#1166, #1167)
Update Ansible (#1165)
Upgrade Steps
If you are upgrading to this version of DeepOps from a previous release, follow the upgrade section of the Slurm or Kubernetes Deployment Guide. In addition, re-run the ./scripts/setup.sh script and add any new variables from the config.example files to your existing config. For a full diff from release 22.04, run git diff 22.04 22.08 -- config.example/. If you encounter a problem, please open a GitHub issue. See the update guide for additional guidance.
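The step of carrying new config.example variables over to an existing config can be partly checked with a small helper. The following is only a sketch, not part of DeepOps: the function name missing_keys is hypothetical, it assumes the config files are YAML with simple top-level "key: value" lines, and it ignores nested keys and commented-out defaults. The file paths in the usage comment are assumptions about a typical checkout.

```shell
# Hypothetical helper (not shipped with DeepOps): print every top-level
# YAML key that appears in an example config file but is missing from
# the corresponding file in your existing config.
#
# Usage:   missing_keys <example_file> <existing_file>
# Example (paths are assumptions about your checkout):
#   missing_keys config.example/group_vars/all.yml config/group_vars/all.yml
missing_keys() {
    example_file=$1
    existing_file=$2
    # awk reads the existing config first and records its top-level keys,
    # then reads the example file and prints any key not recorded.
    # Only lines that start at column 0 with "name:" are considered, so
    # indented (nested) keys and comment lines are skipped.
    awk -F: '
        /^[A-Za-z_][A-Za-z0-9_]*:/ {
            if (FILENAME == ARGV[1]) {
                have[$1] = 1
            } else if (!($1 in have)) {
                print $1
            }
        }
    ' "$existing_file" "$example_file"
}
```

Any key this prints is a candidate to copy into the existing config. The git diff command above remains the complete view, since it also shows changed default values and nested settings that this sketch does not inspect.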
DeepOps 22.08 Release Notes