NVIDIA / deepops

Tools for building GPU clusters
BSD 3-Clause "New" or "Revised" License

[WIP] Bump to latest Kubespray and accommodate Docker deprecation in tests #1253

Closed supertetelman closed 1 year ago

supertetelman commented 1 year ago

Basic update to the newest K8s and Kubespray versions. Docker is now officially unsupported in K8s, so the runtime had to be removed from the tests and documentation.

Please merge https://github.com/NVIDIA/deepops/pull/1250 first.

supertetelman commented 1 year ago

Currently blocked by this issue if someone wants to jump in and debug:


`FAILED - RETRYING: download_container | Download image if required (1 retries left).
fatal: [virtual-gpu01-0 -> virtual-gpu01-0]: FAILED! => changed=true 
  attempts: 4
  cmd:
  - /usr/local/bin/crictl
  - pull
  - quay.io/calico/node:v3.24.5
  delta: '0:00:00.039323'
  end: '2023-03-29 03:03:57.565413'
  msg: non-zero return code
  rc: 1
  start: '2023-03-29 03:03:57.526090'
  stderr: |-
    E0329 03:03:57.563464   15688 remote_image.go:222] "PullImage from image service failed" err="rpc error: code = Unimplemented desc = unknown service runtime.v1alpha2.ImageService" image="quay.io/calico/node:v3.24.5"
    time="2023-03-29T03:03:57Z" level=fatal msg="pulling image: rpc error: code = Unimplemented desc = unknown service runtime.v1alpha2.ImageService"
  stderr_lines: <omitted>
  stdout: ''
  stdout_lines: <omitted>`
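
For anyone picking this up, here is a minimal triage sketch in Python for the `stderr` above. It pulls the gRPC status code and description out of a `crictl` error line and, when the code is `Unimplemented`, reports which CRI service the runtime endpoint did not recognize (here `runtime.v1alpha2.ImageService`). The helper names `parse_rpc_error` and `missing_cri_service` are hypothetical and not part of DeepOps or Kubespray. Reading this as a CRI API mismatch between `crictl` and the container runtime, or as a runtime whose CRI plugin is not serving, is an assumption that has not been confirmed in this thread.

```python
import re

# Matches the gRPC status fragment that appears in the crictl stderr lines,
# e.g. 'rpc error: code = Unimplemented desc = unknown service ...'.
# The description ends at the next double quote or at the end of the line.
_RPC_ERROR = re.compile(
    r'rpc error: code = (?P<code>\w+) desc = (?P<desc>.+?)(?:"|$)'
)


def parse_rpc_error(line):
    """Return {'code': ..., 'desc': ...} for a log line, or None if absent."""
    match = _RPC_ERROR.search(line)
    if match is None:
        return None
    return {"code": match.group("code"), "desc": match.group("desc").strip()}


def missing_cri_service(line):
    """Return the service name the endpoint rejected, or None.

    Only lines with code 'Unimplemented' and a description starting with
    'unknown service ' are treated as a missing-service report.
    """
    err = parse_rpc_error(line)
    if err is None or err["code"] != "Unimplemented":
        return None
    prefix = "unknown service "
    if err["desc"].startswith(prefix):
        return err["desc"][len(prefix):]
    return None


if __name__ == "__main__":
    # Fatal line copied from the task output above.
    sample = (
        'time="2023-03-29T03:03:57Z" level=fatal msg="pulling image: '
        "rpc error: code = Unimplemented desc = unknown service "
        'runtime.v1alpha2.ImageService"'
    )
    print(missing_cri_service(sample))
```

If the reported service carries a version such as `v1alpha2`, comparing it against the CRI versions that the installed runtime actually serves is a reasonable next check.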

supertetelman commented 1 year ago

Finally got this working and tested. It looks like the last small patch fixed the issues with the monitoring stack and the MetalLB stack. I'd like to merge this PR and then open a new PR to bump the GPU Operator and Kubespray versions once again, to the versions that came out this past week.