[WIP] Bump to latest Kubespray and accommodate Docker deprecation in tests

supertetelman commented 1 year ago

Basic update to the newest K8s and Kubespray versions. Docker is no longer supported as a container runtime in K8s (dockershim was removed in v1.24), so the runtime had to be removed from the tests and documentation.

- Remove Docker from all tests
- Add back the local image registry test and dle test
- Bump Kubespray from v2.19 to v2.21 (K8s v1.22 -> K8s v1.25)
- Bump MetalLB from 0.12.1 to 0.13.9 and move towards the new post-install configuration method
- Update docs
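
For context on the MetalLB item above: as far as I understand the 0.13.x releases, the post-install configuration is done with custom resources rather than the older ConfigMap. A minimal sketch of what that can look like for a layer 2 setup is below. The resource names (`example-pool`, `example-l2`), the address range, and the `metallb-system` namespace are placeholder assumptions for illustration, not values taken from this PR.

```yaml
# Hypothetical address pool; replace the range with addresses valid on your network.
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: example-pool          # placeholder name
  namespace: metallb-system   # assumes MetalLB is installed in this namespace
spec:
  addresses:
    - 192.168.1.240-192.168.1.250   # placeholder range
---
# Announce the pool above using layer 2 mode.
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: example-l2            # placeholder name
  namespace: metallb-system
spec:
  ipAddressPools:
    - example-pool
```

These resources would typically be applied only after the MetalLB controller and its CRDs are up, which is presumably why the configuration becomes a separate post-install step.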

Please merge https://github.com/NVIDIA/deepops/pull/1250 first.

supertetelman commented 1 year ago

Currently blocked by the following issue, if someone wants to jump in and debug:


```
FAILED - RETRYING: download_container | Download image if required (1 retries left).
fatal: [virtual-gpu01-0 -> virtual-gpu01-0]: FAILED! => changed=true
  attempts: 4
  cmd:
  - /usr/local/bin/crictl
  - pull
  - quay.io/calico/node:v3.24.5
  delta: '0:00:00.039323'
  end: '2023-03-29 03:03:57.565413'
  msg: non-zero return code
  rc: 1
  start: '2023-03-29 03:03:57.526090'
  stderr: |-
    E0329 03:03:57.563464   15688 remote_image.go:222] "PullImage from image service failed" err="rpc error: code = Unimplemented desc = unknown service runtime.v1alpha2.ImageService" image="quay.io/calico/node:v3.24.5"
    time="2023-03-29T03:03:57Z" level=fatal msg="pulling image: rpc error: code = Unimplemented desc = unknown service runtime.v1alpha2.ImageService"
  stderr_lines: <omitted>
  stdout: ''
  stdout_lines: <omitted>
```

supertetelman commented 1 year ago

Finally got this working and tested. It looks like the last small patch fixed the issues with the monitoring stack and the MetalLB stack. I'd like to merge this PR and then open a new PR to bump the GPU Operator and Kubespray versions again, to the versions that came out this past week.

[WIP] Bump to latest Kubespray and accommodate Docker deprecation in tests #1253