canonical / k8s-operator

Machine charm for K8s following the operator framework
Apache License 2.0

Unit in error after teardown #74

Open beliaev-maksim opened 5 months ago

beliaev-maksim commented 5 months ago

Bug Description

The unit goes into an unrecoverable error state.

To Reproduce

  1. deploy the control plane (CP)
  2. deploy 2 worker units
  3. relate the two applications
  4. scale the CP to 0 units
  5. scale the CP back to 1 unit
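The steps above can be sketched as Juju CLI commands; the application names and channel are assumed from the logs below and may differ in your setup:

```shell
# Hypothetical reproduction, assuming the charm names seen in the logs.
juju deploy k8s --channel latest/edge               # 1. control plane
juju deploy k8s-worker --channel latest/edge -n 2   # 2. two workers
juju integrate k8s k8s-worker                       # 3. relate
juju remove-unit k8s/0                              # 4. scale CP to 0
juju add-unit k8s                                   # 5. scale CP back to 1
```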

Environment

edge

Relevant log output

unit-k8s-1: 17:07:02 ERROR unit.k8s/1.juju-log cluster:1: Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-k8s-1/charm/./src/charm.py", line 596, in <module>
    ops.main.main(K8sCharm)
  File "/var/lib/juju/agents/unit-k8s-1/charm/venv/ops/main.py", line 544, in main
    manager.run()
  File "/var/lib/juju/agents/unit-k8s-1/charm/venv/ops/main.py", line 520, in run
    self._emit()
  File "/var/lib/juju/agents/unit-k8s-1/charm/venv/ops/main.py", line 509, in _emit
    _emit_charm_event(self.charm, self.dispatcher.event_name)
  File "/var/lib/juju/agents/unit-k8s-1/charm/venv/ops/main.py", line 143, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-k8s-1/charm/venv/ops/framework.py", line 352, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-k8s-1/charm/venv/ops/framework.py", line 851, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-k8s-1/charm/venv/ops/framework.py", line 941, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-k8s-1/charm/venv/charms/reconciler.py", line 35, in reconcile
    self.reconcile_function(event)
  File "/var/lib/juju/agents/unit-k8s-1/charm/./src/charm.py", line 476, in _reconcile
    self._revoke_cluster_tokens()
  File "/var/lib/juju/agents/unit-k8s-1/charm/./src/charm.py", line 304, in _revoke_cluster_tokens
    self.distributor.revoke_tokens(
  File "/var/lib/juju/agents/unit-k8s-1/charm/src/token_distributor.py", line 316, in revoke_tokens
    token_strat(hostname, token_type)
  File "/var/lib/juju/agents/unit-k8s-1/charm/src/token_distributor.py", line 177, in _revoke_cluster_token
    self.api_manager.remove_node(name)
  File "/var/lib/juju/agents/unit-k8s-1/charm/lib/charms/k8s/v0/k8sd_api_manager.py", line 621, in remove_node
    self._send_request(endpoint, "POST", EmptyResponse, body)
  File "/var/lib/juju/agents/unit-k8s-1/charm/lib/charms/k8s/v0/k8sd_api_manager.py", line 564, in _send_request
    raise InvalidResponseError(
charms.k8s.v0.k8sd_api_manager.InvalidResponseError: Error status 500
    method=POST
    endpoint=/1.0/k8sd/cluster/remove
    reason=Internal Server Error
    body={"type":"error","status":"","status_code":0,"operation":"","error_code":500,"error":"node \"k8s-worker-2\" is not part of the cluster","metadata":null}

unit-k8s-1: 17:07:02 ERROR juju.worker.uniter.operation hook "cluster-relation-created" (via hook dispatching script: dispatch) failed: exit status 1
unit-k8s-1: 17:07:02 INFO juju.worker.uniter awaiting error resolution for "relation-created" hook
unit-k8s-worker-0: 17:07:03 ERROR unit.k8s-worker/0.juju-log cluster:2: Failed to get labels. Will retry.: b'Unable to connect to the server: tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes-ca")\n'
unit-k8s-worker-1: 17:07:03 ERROR unit.k8s-worker/1.juju-log cluster:2: Failed to get labels. Will retry.: b'Unable to connect to the server: tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes-ca")\n'
unit-k8s-worker-0: 17:07:04 ERROR unit.k8s-worker/0.juju-log cluster:2: Failed to get labels. Will retry.: b'Unable to connect to the server: tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes-ca")\n'
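The 500 body from k8sd is JSON, and the interesting error is "node ... is not part of the cluster", i.e. the node is already gone. One possible mitigation (a sketch, not the charm's actual code; `is_already_removed` is a hypothetical helper) would be to treat that specific error as benign so token revocation stays idempotent:

```python
import json

# Error body copied from the log above; the node was already removed.
BODY = (
    '{"type":"error","status":"","status_code":0,"operation":"",'
    '"error_code":500,"error":"node \\"k8s-worker-2\\" is not part of the cluster",'
    '"metadata":null}'
)


def is_already_removed(body: str) -> bool:
    """Return True when the k8sd error says the node is already gone."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return "is not part of the cluster" in payload.get("error", "")


print(is_already_removed(BODY))  # True
```

With a check like this, the `remove_node` failure in the traceback could be logged and skipped instead of crashing the hook.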

Additional context

No response

eaudetcobello commented 5 months ago

Related: https://github.com/canonical/k8s-operator/issues/75

addyess commented 5 months ago

Sounds like the workers need to remember the first cluster they ever joined (maybe keyed by the cluster name?), so that when that cluster dies (because someone nukes the control plane) they go into a permanently blocked state: Blocked: Awaiting juju destruction of unit.

Even if a new CP unit shows up, too bad: we're not going to try to engage this machine in the new cluster.
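The suggestion above could be sketched like this; `WorkerState`, `next_status`, and the status strings are all hypothetical, not the charm's actual code:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkerState:
    joined_cluster: Optional[str] = None  # set once, on first successful join


def next_status(state: WorkerState, offered_cluster: str) -> str:
    """Decide unit status when a control plane offers a cluster to join."""
    if state.joined_cluster is None:
        # First join: remember the cluster name forever.
        state.joined_cluster = offered_cluster
        return "active"
    if state.joined_cluster != offered_cluster:
        # The original cluster is gone; refuse to engage with the new one.
        return "blocked: Awaiting juju destruction of unit"
    return "active"


state = WorkerState()
print(next_status(state, "cluster-a"))  # active
print(next_status(state, "cluster-b"))  # blocked: Awaiting juju destruction of unit
```

The key design choice is that `joined_cluster` is written exactly once; a rebuilt control plane would present a different cluster identity, so the worker blocks rather than joining with stale certificates (the x509 errors in the log above).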

beliaev-maksim commented 5 months ago

I still see the same issue with the latest revision:

unit-k8s-1: 10:34:55 ERROR unit.k8s/1.juju-log cos-tokens:0: Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-k8s-1/charm/./src/charm.py", line 744, in <module>
    ops.main.main(K8sCharm)
  File "/var/lib/juju/agents/unit-k8s-1/charm/venv/ops/main.py", line 544, in main
    manager.run()
  File "/var/lib/juju/agents/unit-k8s-1/charm/venv/ops/main.py", line 520, in run
    self._emit()
  File "/var/lib/juju/agents/unit-k8s-1/charm/venv/ops/main.py", line 509, in _emit
    _emit_charm_event(self.charm, self.dispatcher.event_name)
  File "/var/lib/juju/agents/unit-k8s-1/charm/venv/ops/main.py", line 143, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-k8s-1/charm/venv/ops/framework.py", line 352, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-k8s-1/charm/venv/ops/framework.py", line 851, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-k8s-1/charm/venv/ops/framework.py", line 941, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-k8s-1/charm/venv/charms/reconciler.py", line 35, in reconcile
    self.reconcile_function(event)
  File "/var/lib/juju/agents/unit-k8s-1/charm/./src/charm.py", line 570, in _reconcile
    self._revoke_cluster_tokens(event)
  File "/var/lib/juju/agents/unit-k8s-1/charm/./src/charm.py", line 369, in _revoke_cluster_tokens
    self.distributor.revoke_tokens(
  File "/var/lib/juju/agents/unit-k8s-1/charm/src/token_distributor.py", line 361, in revoke_tokens
    token_strat(node, ignore_errors)
  File "/var/lib/juju/agents/unit-k8s-1/charm/src/token_distributor.py", line 178, in _revoke_cluster_token
    self.api_manager.remove_node(name)
  File "/var/lib/juju/agents/unit-k8s-1/charm/lib/charms/k8s/v0/k8sd_api_manager.py", line 706, in remove_node
    self._send_request(endpoint, "POST", EmptyResponse, body)
  File "/var/lib/juju/agents/unit-k8s-1/charm/lib/charms/k8s/v0/k8sd_api_manager.py", line 651, in _send_request
    raise InvalidResponseError(
charms.k8s.v0.k8sd_api_manager.InvalidResponseError: Error status 500
    method=POST
    endpoint=/1.0/k8sd/cluster/remove
    reason=Internal Server Error
    body={"type":"error","status":"","status_code":0,"operation":"","error_code":500,"error":"node \"k8s-worker-1\" is not part of the cluster","metadata":null}

maksim@darmbeliaev:~$ juju status
Model          Controller              Cloud/Region         Version  SLA          Timestamp
canonical-k8s  k8s-machines-contoller  localhost/localhost  3.4.2    unsupported  10:36:36+02:00

App         Version  Status  Scale  Charm       Channel      Rev  Exposed  Message
k8s                  error       1  k8s         latest/edge   47  no       hook failed: "cos-tokens-relation-created"
k8s-worker  1.30.0   active      2  k8s-worker  latest/edge   47  no       Ready

Unit           Workload  Agent  Machine  Public address  Ports  Message
k8s-worker/0*  active    idle   4        10.112.13.239          Ready
k8s-worker/1   active    idle   5        10.112.13.65           Ready
k8s/1*         error     idle   6        10.102.2.2             hook failed: "cos-tokens-relation-created"

Machine  State    Address        Inst id               Base          AZ  Message
4        started  10.112.13.239  manual:10.112.13.239  ubuntu@22.04      Manually provisioned machine
5        started  10.112.13.65   manual:10.112.13.65   ubuntu@22.04      Manually provisioned machine
6        started  10.102.2.2     juju-d1d519-6         ubuntu@22.04      Running
beliaev-maksim commented 5 months ago

I will try to reproduce on a local machine.