beacon-biosignals / K8sClusterManagers.jl

A Julia cluster manager for Kubernetes

Test using local Kubernetes #23

Closed omus closed 3 years ago

omus commented 3 years ago

Working on getting K8sClusterManagers tested within a local k8s cluster. Having this will allow for improved tests and faster iteration.

Partially addresses: https://github.com/beacon-biosignals/K8sClusterManagers.jl/issues/8

omus commented 3 years ago

I'm working through the details of cluster deployment locally, but I wanted to validate that the minikube GHA works as advertised.

omus commented 3 years ago

Setting up minikube in GHA takes almost 2 minutes. Additionally, the minikube GHA needs to run on Ubuntu, so I should move the cluster tests into a separate job so that the non-cluster tests can run independently.

omus commented 3 years ago

The manager pod was launching successfully but the worker pod was stuck in pending. I've noticed that the minikube GHA defaults to driver: none, so I'll try my luck with driver: docker, which is also supported by the GHA (https://github.com/marketplace/actions/setup-minikube-kubernetes-cluster#optional-input-parameters).

omus commented 3 years ago

When the documentation states GITHUB_ENV entries should be defined as {name}={value} they aren't messing around.

Including a comment? How about an error message:

Error: Unable to process file command 'env' successfully.
Error: Invalid environment variable format '# To point your shell to minikube's docker-daemon, run:'

Using export? No problem, we'll just make your environment variable name exactly "export {name}". That's definitely what you want. Using quotes around your value? Obviously you want to include the quotes in your value.

Typically I add environment variables manually, so I've never noticed this before, but since I don't know which variables minikube docker-env will emit I needed to be more general here.
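For anyone hitting the same thing, here is a minimal sketch of one way to normalize the output before appending it to GITHUB_ENV. The function name `to_github_env` is hypothetical and not necessarily what this PR ends up using; it assumes every variable is emitted in the form `export NAME="value"` and that values contain no embedded double quotes.

```shell
# Hypothetical helper: read `export NAME="value"` lines on stdin and print
# bare NAME=value lines, which is the format GITHUB_ENV accepts.
# Comment lines and blank lines do not match the pattern and are dropped.
to_github_env() {
    sed -n -E 's/^export ([A-Za-z_][A-Za-z0-9_]*)="?([^"]*)"?$/\1=\2/p'
}

# Intended use inside a workflow step (assumes minikube is already running):
#   minikube docker-env | to_github_env >> "$GITHUB_ENV"
```

Only matching lines are printed, so the trailing "# To point your shell to..." comment never reaches GITHUB_ENV.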

omus commented 3 years ago

It appears that the minikube drivers "none" and "docker" have the same issue: the manager pod starts but the worker pod is stuck in pending. I'm assuming the node doesn't have enough resources to run both pods, but I'm attempting to confirm this theory.

Update: Events output from describing the worker pod:

Type     Reason            Age                From               Message
----     ------            ----               ----               -------
Warning  FailedScheduling  0s (x5 over 3m7s)  default-scheduler  0/1 nodes are available: 1 Insufficient cpu, 1 Insufficient memory.

omus commented 3 years ago

Fun fact: the macOS GitHub runners have more CPUs and memory than the Linux ones. Unfortunately the Docker Buildx action is unsupported on macOS. I remember the minikube GHA also stated it was compatible only with Ubuntu.

There are a couple of options open to us still:

  1. Disable the tests that execute on Kubernetes
  2. Use self-hosted runners with more resources available
  3. Use a cloud based Kubernetes service like EKS

Possibly, there are some additional options if I can get minikube to oversubscribe.

omus commented 3 years ago

Oversubscribing seems promising. Looks like I need to push the image to both nodes:

Type     Reason             Age   From               Message
----     ------             ----  ----               -------
Normal   Scheduled          1s    default-scheduler  Successfully assigned default/test-worker-success-kbclk to minikube-m02
Warning  ErrImageNeverPull  0s    kubelet            Container image "k8s-cluster-managers:add85c3" is not present with pull policy of Never
Warning  Failed             0s    kubelet            Error: ErrImageNeverPull

I'll continue looking into this approach.

omus commented 3 years ago

On multi-node clusters you can no longer use minikube docker-env as you'll be greeted with:

❌  Exiting due to ENV_MULTINODE_CONFLICT: The docker-env command is incompatible with multi-node clusters. Use the 'registry' add-on: https://minikube.sigs.k8s.io/docs/handbook/registry/

I did attempt to use the registry addon following the official instructions, but it seemed like overkill for a CI environment where you set up the cluster as often as you push images. Because of this I ended up using minikube ssh and Docker's save/load to transfer the image onto the nodes.
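A rough sketch of that save/load transfer, for reference. The function name `load_image_on_nodes` is hypothetical, and passing `--native-ssh=false` (so that the system ssh client forwards stdin to the node) is an assumption on my part rather than something confirmed above.

```shell
# Hypothetical helper: copy a locally built image onto each minikube node.
# Usage: load_image_on_nodes IMAGE NODE...
load_image_on_nodes() {
    image="$1"
    shift
    for node in "$@"; do
        # Stream the image tarball over SSH into the node's Docker daemon.
        # Assumption: --native-ssh=false is needed for stdin to reach the node.
        docker save "$image" | minikube ssh --node "$node" --native-ssh=false -- docker load
    done
}

# Example (node names as they appear in the events output earlier in this thread):
#   load_image_on_nodes k8s-cluster-managers:add85c3 minikube minikube-m02
```

Because the image is piped straight from `docker save` into `docker load`, no registry or intermediate tarball on disk is needed.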

omus commented 3 years ago

On one of the CI runs the manager saw this error in the events list: MountVolume.SetUp failed for volume "julia-manager-serviceaccount-token-hrtmt" : failed to sync secret cache: timed out waiting for the condition. It didn't seem to impact the run though and the next CI run didn't see this.

omus commented 3 years ago

What seems to be the last remaining CI issue is that the manager is unable to establish a connection to the worker over the network. I've managed to get these tests working on my local multi-node minikube cluster so I believe there's something special about the CI environment I need to adjust for.

omus commented 3 years ago

🎉 And finally we have a functional Kubernetes test that works locally and on CI. I have some refactoring to do but the hard work is over 😌

omus commented 3 years ago

The previous cluster tests on CI failed because the image could not be pulled (https://github.com/beacon-biosignals/K8sClusterManagers.jl/runs/2377192899). That's probably the last thing that needs to be investigated before merging this PR.

codecov-commenter commented 3 years ago

Codecov Report

Merging #23 (a0ff4ba) into main (6344023) will increase coverage by 6.54%. The diff coverage is n/a.


@@            Coverage Diff             @@
##             main      #23      +/-   ##
==========================================
+ Coverage   29.03%   35.57%   +6.54%     
==========================================
  Files           2        2              
  Lines          93      104      +11     
==========================================
+ Hits           27       37      +10     
- Misses         66       67       +1     

Impacted Files        Coverage Δ
src/native_driver.jl  29.34% <0.00%> (+8.36%) :arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 6344023...a0ff4ba. Read the comment docs.

omus commented 3 years ago

I just need to add some documentation on minikube docker-env.

omus commented 3 years ago

This beast is RTM