beacon-biosignals / K8sClusterManagers.jl

A Julia cluster manager for Kubernetes
Other
31 stars 5 forks source link

Add `az_node_selector` pod configuration function #93

Open christopher-dG opened 2 years ago

christopher-dG commented 2 years ago

@vosscodes and I came up with this approach to making sure that jobs stay in the same AZ: you basically just need to make your configure function call the az_node_selector helper and that will ensure that worker nodes are all in the same AZ as the manager node.

It does require some permissions:

Tests and docs to come soon.

Examples:

If you weren't already using a configure function:

using K8sClusterManagers, Distributed
k8s = K8sClusterManager(2; configure=K8sClusterManagers.az_node_selector)
addprocs(k8s)

Or if you already had a configure function, it just needs to call that same az_node_selector:

using K8sClusterManagers, Distributed

function configure(pod)
    pod["spec"]["containers"][1]["requests"]["cpu"] = 2
    # ...
    K8sClusterManagers.az_node_selector(pod)  # mutates the manifest
    return pod
end

k8s = K8sClusterManager(2; configure=configure)
addprocs(k8s)
ericphanson commented 2 years ago

I wonder if this should just be applied by default? e.g. could call it in here: https://github.com/beacon-biosignals/K8sClusterManagers.jl/blob/905c995d0ca6300a188e2e8dc7ef5c87a7548295/src/pod.jl#L193

Feels like you'd pretty much always want this

edit: or maybe it can be refactored into get_current_az() and then it can be a keyword argument that defaults to get_current_az()?

also, should it be written in terms of zone, which seems to be the k8s-generic term for this? e.g. get_current_zone()

omus commented 2 years ago

Feels like you'd pretty much always want this

I agree with this. I'd roll this into the K8sClusterManager constructor. We already retrieve the manager pod information there.

christopher-dG commented 2 years ago

Feels like you'd pretty much always want this

I'm wary of making something like this a default because while it may make sense for all of our use cases, it requires cluster permissions and not all workloads need to send data between pods.

ericphanson commented 2 years ago

What about

function get_zone()
     try
          readchomp(...)
     catch e
          if e ...
             @warn "Could not get zone; defaulting to not adding a zone restriction" maxlog=1 exception=e
             nothing
          else
              rethrow()
          end
    end
end

and K8sManager has zone = get_zone() keyword arg? Then you get a warning (once) but it still works

codecov[bot] commented 2 years ago

Codecov Report

Merging #93 (7668bd8) into main (905c995) will decrease coverage by 3.37%. The diff coverage is 0.00%.

@@            Coverage Diff             @@
##             main      #93      +/-   ##
==========================================
- Coverage   67.05%   63.68%   -3.38%     
==========================================
  Files           4        4              
  Lines         170      179       +9     
==========================================
  Hits          114      114              
- Misses         56       65       +9     
Impacted Files Coverage Δ
src/pod.jl 89.10% <0.00%> (-8.72%) :arrow_down:

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more