canonical / k8s-operator

Machine charm for K8s following the operator framework
Apache License 2.0

Control Plane does not go into blocked when no relation to worker #90

Open beliaev-maksim opened 7 months ago

beliaev-maksim commented 7 months ago

Bug Description

If you deploy the juju model and do not relate the control plane with the workers, the worker goes into a blocked state indicating that the control plane should be related. The opposite does not happen: the control plane never goes into blocked when no worker is related.

To Reproduce

-

Environment

rev 47

Relevant log output

maksim@darmbeliaev:~$ juju status --relations 
Model          Controller              Cloud/Region         Version  SLA          Timestamp
canonical-k8s  k8s-machines-contoller  localhost/localhost  3.4.2    unsupported  10:24:19+02:00

App         Version  Status   Scale  Charm       Channel      Rev  Exposed  Message
k8s                  waiting      1  k8s         latest/edge   47  no       Cluster not yet ready
k8s-worker           blocked      2  k8s-worker  latest/edge   47  no       Missing cluster integration

Unit           Workload  Agent  Machine  Public address  Ports  Message
k8s-worker/0*  blocked   idle   4        10.112.13.239          Missing cluster integration
k8s-worker/1   blocked   idle   5        10.112.13.65           Missing cluster integration
k8s/0*         waiting   idle   3        10.112.13.4            Cluster not yet ready

Machine  State    Address        Inst id               Base          AZ  Message
3        started  10.112.13.4    manual:10.112.13.4    ubuntu@22.04      Manually provisioned machine
4        started  10.112.13.239  manual:10.112.13.239  ubuntu@22.04      Manually provisioned machine
5        started  10.112.13.65   manual:10.112.13.65   ubuntu@22.04      Manually provisioned machine

Integration provider  Requirer        Interface       Type  Message
k8s:cluster           k8s:cluster     k8s-cluster     peer  
k8s:cos-tokens        k8s:cos-tokens  cos-k8s-tokens  peer

Additional context

No response

addyess commented 6 months ago

The control-plane charm doesn't require workers to be deployed. Why should the control-plane charm be blocked if not related to the worker?

beliaev-maksim commented 6 months ago

@addyess why does it say "Cluster not yet ready" and become ready right after the relation is added?

addyess commented 6 months ago

"Cluster not yet ready" is a message stating bootstrapping is not yet completed. The charm can change state when hooks initiated by juju change the state. The cluster may not be bootstrapped the last time the k8s/0 unit ran a hook. Perhaps an update-status hook will come along in 5m and the unit will be ready.

If you integrate the charms, that would trigger an event from juju on the k8s unit (which is likely bootstrapped now)
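The hook-driven behaviour described above can be modelled with a small sketch (plain stdlib Python, not the actual charm code): the unit's status is only reassessed when Juju delivers an event, so a cluster that finishes bootstrapping between hooks stays "waiting" until the next hook fires.

```python
# Minimal model of Juju's hook-driven status updates (illustrative only;
# the real charm is built on the ops framework).

class UnitModel:
    def __init__(self):
        self.cluster_bootstrapped = False  # state of the underlying snap
        self.status = "waiting: Cluster not yet ready"

    def on_hook(self, _event_name: str) -> None:
        # Status is reassessed only inside a hook, whatever triggered it:
        # update-status, relation-changed, config-changed, ...
        if self.cluster_bootstrapped:
            self.status = "active"
        else:
            self.status = "waiting: Cluster not yet ready"

unit = UnitModel()
unit.cluster_bootstrapped = True           # the snap finishes bootstrapping...
assert unit.status.startswith("waiting")   # ...but no hook has run yet
unit.on_hook("cluster-relation-changed")   # e.g. integrating the worker
assert unit.status == "active"
```

This is why integrating the charms "fixes" the status: the integration itself delivers the hook that lets the charm notice the cluster is ready.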

beliaev-maksim commented 6 months ago

@addyess what if the user sets the juju config to fire the update-status hook every 5h?

shouldn't we be more deterministic here?

addyess commented 6 months ago

There are no juju hooks triggered from the underlying snap (that would be cool). We can, however, install a systemd service on the machine which kicks the unit occasionally. We've done this kind of thing in charmed-kubernetes before.

beliaev-maksim commented 6 months ago

@addyess abuse juju secrets to grab the status until the cluster is ready ?

addyess commented 6 months ago

@beliaev-maksim i'm not familiar with this "abuse"

addyess commented 6 months ago

here's an example of a systemd service definition which can kick events into the charm
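(The concrete definition isn't reproduced in the thread; as an illustration only, a service of that shape might look like the sketch below. The unit names, paths, and the dispatch invocation are assumptions, not the actual charmed-kubernetes files.)

```ini
# /etc/systemd/system/k8s-node-watcher.service (hypothetical)
[Unit]
Description=Kick a Juju hook on the k8s unit to reassess node status

[Service]
Type=oneshot
# juju-exec (juju-run on older Juju) runs a command in the unit's hook
# context; routing through the charm's dispatch script fires it like a hook.
ExecStart=/usr/bin/juju-exec -u k8s/0 JUJU_DISPATCH_PATH=hooks/update-status ./dispatch

# /etc/systemd/system/k8s-node-watcher.timer (hypothetical) -- separate file
[Timer]
OnUnitActiveSec=5min

[Install]
WantedBy=timers.target
```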

beliaev-maksim commented 6 months ago

@addyess you can set an expiration on the secret. Once the secret expires, juju fires a hook. You bootstrap the snap, set the secret, the secret expires, you check the status, and if the cluster is not yet ready you set the secret again.

```mermaid
flowchart TD
    charm -->|1| snap["bootstrap the snap"]
    snap --> secret["Set juju secret"]
    secret -->|"Secret expires"| juju
    juju -->|"Fires hook"| charm
    charm -->|2| check["Cluster status check"]
    check -->|"not ready"| secret
```

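The loop in the diagram can be simulated with a toy event loop (stdlib Python only; `CharmSim` and its methods are stand-ins for illustration, not the ops secrets API): arm a short-lived secret after bootstrap, and on each secret-expired hook re-check the cluster, re-arming the secret until it reports ready.

```python
# Toy simulation of the proposed secret-expiry retry loop.
# The while-loop below plays Juju's role of delivering secret-expired hooks.

class CharmSim:
    def __init__(self, ready_after_checks: int):
        self._checks_until_ready = ready_after_checks
        self.secret_armed = False
        self.status = "waiting"

    def bootstrap(self) -> None:
        # Step 1: bootstrap the snap, then arm an expiring secret.
        self.secret_armed = True

    def cluster_ready(self) -> bool:
        # Stand-in for querying the real cluster status.
        self._checks_until_ready -= 1
        return self._checks_until_ready <= 0

    def on_secret_expired(self) -> None:
        # Step 2: Juju fires a hook when the secret expires.
        if self.cluster_ready():
            self.status = "active"
            self.secret_armed = False   # ready: stop the loop
        else:
            self.secret_armed = True    # not ready: re-arm and check later

charm = CharmSim(ready_after_checks=3)
charm.bootstrap()
hooks_fired = 0
while charm.secret_armed:               # Juju's side of the cycle
    charm.on_secret_expired()
    hooks_fired += 1

assert charm.status == "active"
assert hooks_fired == 3
```

The appeal of the idea is that each retry is a real Juju hook, so the status converges regardless of how update-status is configured; the cost is the controller round-trip discussed below.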
addyess commented 6 months ago

I believe we'd like to take the systemd approach of triggering juju hook events rather than abusing juju secrets for fun and profit. We discussed this with the sunbeam team and they advocated avoiding a round-trip to the juju controller and keeping it all on the unit.

beliaev-maksim commented 6 months ago

@addyess any reason why the round-trip is bad?

addyess commented 6 months ago

There is a discussion coming soon with other charm developers and the juju team about a per-unit, reschedulable hook that is independent of the update-status hook. Hopefully that will resolve this issue. If not, we discussed with openstack teammates using systemd to watch node status and trigger an event when it changes. That would also be acceptable.