flux-framework / flux-operator

Deploy a Flux MiniCluster to Kubernetes with the operator
https://flux-framework.org/flux-operator/
MIT License

allow to use hostNetwork pods #204

Closed aojea closed 1 year ago

aojea commented 1 year ago

This allows running Pods in the host's network namespace. It is a power-user option for tuning the node directly and avoiding the shortcomings of running in a separate network namespace.
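
A minimal sketch of what the flag does at the pod level (the helper name here is illustrative, not the PR's actual code):

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// withHostNetwork flips the generated pod spec into the host's network
// namespace, so containers share the node's interfaces, ports, and hostname.
func withHostNetwork(spec corev1.PodSpec) corev1.PodSpec {
	spec.HostNetwork = true
	return spec
}
```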

aojea commented 1 year ago

/assign @vsoch @alculquicondor

alculquicondor commented 1 year ago

Would this mean that there can only be one pod per node?

Maybe it's worth defining a port for the container. Note that the scheduler wouldn't put two pods on the same node if they use the same port, which is kind of like pod anti-affinity, but it's fast for the scheduler to calculate.
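
A hedged sketch of that suggestion, assuming port 8050 (the broker port mentioned later in this thread): declaring a host port makes the scheduler's node-ports check refuse to co-locate two such pods:

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// brokerContainer declares the broker port as a host port; the scheduler
// will not place two pods requesting the same host port on one node.
func brokerContainer(image string) corev1.Container {
	return corev1.Container{
		Name:  "flux-broker",
		Image: image,
		Ports: []corev1.ContainerPort{{
			// With hostNetwork, containerPort and hostPort must match.
			ContainerPort: 8050,
			HostPort:      8050,
		}},
	}
}
```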

aojea commented 1 year ago

Would this mean that there can only be one pod per node?

You can have many hostNetwork pods on the same node, and we have the anti-affinity rules ... the question is whether multiple Flux agents can run on the same node.

aojea commented 1 year ago

/hold

I need to revisit this. Aldo is right: it seems better that if you set hostNetwork, we enforce that two pods do not run on the same node.
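
One way to enforce that, sketched under the assumption that the pods carry an app label: required (rather than preferred) pod anti-affinity keyed on the node hostname guarantees at most one matching pod per node.

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// hostNetworkAntiAffinity returns required pod anti-affinity keyed on the
// node hostname, so no two pods with the given app label land on one node.
func hostNetworkAntiAffinity(appLabel string) *corev1.Affinity {
	return &corev1.Affinity{
		PodAntiAffinity: &corev1.PodAntiAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{{
				LabelSelector: &metav1.LabelSelector{
					MatchLabels: map[string]string{"app": appLabel},
				},
				TopologyKey: "kubernetes.io/hostname",
			}},
		},
	}
}
```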

vsoch commented 1 year ago

@aojea it looks like you added the attribute to the service container spec, and we probably want it for the MiniCluster here? https://github.com/flux-framework/flux-operator/blob/ddd1c836910d600d941f453800b6a5d06916031e/controllers/flux/job.go#L64

Also, could we please add it to MiniclusterSpec.Network so it's namespaced as a networking attribute?

Finally, can you explain to me what it means to use the host network vs. not? Thank you!
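
For reference, a sketch of the namespacing being asked for here (field names are illustrative, not the merged API):

```go
package sketch

// Network groups networking options so new fields stay namespaced under
// spec.network rather than landing at the top level of the MiniCluster spec.
type Network struct {
	// HostNetwork, when true, runs the pods in the host's network namespace.
	HostNetwork bool `json:"hostNetwork,omitempty"`
}

type MiniClusterSpec struct {
	Network Network `json:"network,omitempty"`
}
```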

vsoch commented 1 year ago

@aojea when you make changes, please also add documentation for this new attribute in docs/getting-started/custom-resource-definition.md. That's where users read about the different spec fields.

aojea commented 1 year ago

Let's draft it

vsoch commented 1 year ago

@aojea if you give me permission I can make these changes to the PR.

aojea commented 1 year ago

@aojea if you give me permission I can make these changes to the PR.

Sure, of course, please go ahead

vsoch commented 1 year ago

@aojea do we need to do this too? https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-s-dns-policy

vsoch commented 1 year ago

I was too nervous to test hostNetwork on my actual host, so I added a test with the variable! I also found https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-s-dns-policy and thought we might need to set the DNSPolicy to ClusterFirstWithHostNet. We can discuss / we will see what happens!
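
A sketch of what setting that policy could look like in the controller (assuming a small helper applied to the generated pod spec):

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// applyDNSPolicy sets the policy the Kubernetes docs recommend for pods that
// use hostNetwork but still need cluster DNS (e.g. to resolve the headless
// service names).
func applyDNSPolicy(spec *corev1.PodSpec) {
	if spec.HostNetwork {
		spec.DNSPolicy = corev1.DNSClusterFirstWithHostNet
	}
}
```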

vsoch commented 1 year ago

Looks like there is some issue with weight (and I'm wondering why this didn't trigger before?). I'm off to bed but will read more about this tomorrow - I put in random values for now to try it out.

aojea commented 1 year ago

ClusterFirstWithHostNet

good catch, yes we do need it

1.6935564789918168e+09 ERROR Reconciler error {"controller": "minicluster", "controllerGroup": "flux-framework.org", "controllerKind": "MiniCluster", "miniCluster": {"name":"flux-sample","namespace":"flux-operator"}, "namespace": "flux-operator", "name": "flux-sample", "reconcileID": "aab05d2d-8a40-4a10-b2a4-bedf682a3e3d", "error": "Job.batch \"flux-sample\" is invalid: [spec.template.spec.affinity.podAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].weight: Invalid value: 0: must be in the range 1-100, spec.template.spec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].weight: Invalid value: 0: must be in the range 1-100]"}

@alculquicondor I thought the API would default it if empty.
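
Judging from the error above, it doesn't: the Weight on preferred (anti-)affinity terms is validated to be in 1-100 and a zero value from an empty struct is rejected. A sketch of an explicit fix (100 is an arbitrary but valid choice; the label selector is illustrative):

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// preferredAntiAffinity sets Weight explicitly; leaving it at the zero value
// fails validation with the "must be in the range 1-100" error above.
func preferredAntiAffinity(appLabel string) []corev1.WeightedPodAffinityTerm {
	return []corev1.WeightedPodAffinityTerm{{
		Weight: 100, // must be 1-100; the API server does not default it
		PodAffinityTerm: corev1.PodAffinityTerm{
			LabelSelector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"app": appLabel},
			},
			TopologyKey: "kubernetes.io/hostname",
		},
	}}
}
```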

vsoch commented 1 year ago

ah interesting - so with hostNetwork added, the pod is actually detecting its hostname as minikube. I wonder if the pod indices / hostnames are just totally gone and each pod is just minikube?

LOGS for Flux Operator Sample
flux-broker: Config file error [bootstrap]: minikube not found in hosts
flux-broker: bootstrap failed
Error: Process completed with exit code 1.

I will need to try this locally after all - will report back.

vsoch commented 1 year ago

okay some updates - we will need to think through how / if this is going to work. First, the worker pods cannot schedule:

Events:
  Type     Reason            Age                    From               Message
  ----     ------            ----                   ----               -------
  Warning  FailedScheduling  2m26s (x2 over 2m28s)  default-scheduler  0/1 nodes are available: 1 node(s) didn't have free ports for the requested pod ports. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
$ kubectl get pods -n flux-operator 
NAME                  READY   STATUS             RESTARTS      AGE
flux-sample-0-ln79t   0/1     CrashLoopBackOff   3 (20s ago)   2m39s
flux-sample-1-rxqfm   0/1     Pending            0             2m39s
flux-sample-2-pvgg2   0/1     Pending            0             2m39s
flux-sample-3-kgbb6   0/1     Pending            0             2m39s

As for the lead broker (and the Flux config), it looks like the hostnames are now just kind-control-plane and (here) minikube:

👋 Hello, I'm kind-control-plane
The main host is flux-sample-0

So the headless service seems to be overridden. We do need each pod to have a unique hostname it can use to identify itself (and its rank), so I'm not sure this strategy will work! Maybe I'll read about other DNS options. In the meantime - any ideas @aojea?

vsoch commented 1 year ago

I made this follow up PR if we want it - https://github.com/flux-framework/flux-operator/pull/205. It's exactly the same as the one here, but without host network. I actually think further testing / thinking about minimalService might be the way to go - that exposes only the broker on port 8050, and the worker pods should still have their proper DNS names.
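
For illustration, a rough sketch of that minimal-service idea as a headless Service exposing only the broker port (the names and labels here are hypothetical):

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// minimalBrokerService is a headless Service exposing only the broker port;
// pods it selects keep their per-pod DNS names via the service subdomain.
func minimalBrokerService(namespace string) *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "flux-service", Namespace: namespace},
		Spec: corev1.ServiceSpec{
			ClusterIP: corev1.ClusterIPNone, // headless
			Selector:  map[string]string{"app": "flux-sample"},
			Ports:     []corev1.ServicePort{{Name: "broker", Port: 8050}},
		},
	}
}
```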

aojea commented 1 year ago

hostNetwork means the pod is the host itself, so it inherits the node name as its hostname. I think it's better for this operator not to use it for now; it adds unnecessary complexity.

vsoch commented 1 year ago

Gotcha! I don't think it would work then, but it's a really cool idea! And actually this might be quite useful for our other work with usernetes - being able to just "use" the host network without a hitch is likely desired for specific workflows.