canonical / k8s-snap

Canonical Kubernetes is an opinionated and CNCF conformant Kubernetes operated by Snaps and Charms, which come together to bring simplified operations and an enhanced security posture on any infrastructure.

Scaling down K8s charm cluster causes k8s to get stuck waiting for k8sd #277

Closed: amc94 closed this issue 6 months ago

amc94 commented 7 months ago

I initially deployed 3 units of k8s and then used juju to remove two of them. This left the last remaining k8s unit stuck in waiting status with the message 'waiting on k8sd'.
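For context, the scale-down can presumably be reproduced with something along these lines (a rough sketch; the charm channel, base, and exact unit numbers are not stated in the report and are placeholders):

# Deploy three k8s units, wait for them to settle, then remove two of them.
juju deploy k8s -n 3            # channel/base as appropriate; not specified in the report
juju status k8s                 # wait until all three units report active/idle
juju remove-unit k8s/1 k8s/2    # unit numbers are placeholders
juju status k8s                 # the remaining unit then gets stuck waiting on k8sd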

unit-k8s-0: 13:16:15 ERROR unit.k8s/0.juju-log Caught ReconcilerError
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-k8s-0/charm/venv/charms/contextual_status.py", line 101, in on_error
    yield
  File "/usr/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/var/lib/juju/agents/unit-k8s-0/charm/./src/charm.py", line 205, in _check_k8sd_ready
    self.api_manager.check_k8sd_ready()
  File "/var/lib/juju/agents/unit-k8s-0/charm/lib/charms/k8s/v0/k8sd_api_manager.py", line 659, in check_k8sd_ready
    self._send_request(endpoint, "GET", EmptyResponse)
  File "/var/lib/juju/agents/unit-k8s-0/charm/lib/charms/k8s/v0/k8sd_api_manager.py", line 559, in _send_request
    raise InvalidResponseError(
charms.k8s.v0.k8sd_api_manager.InvalidResponseError: Error status 503
    method=GET
    endpoint=/cluster/1.0/ready
    reason=Service Unavailable
    body={"type":"error","status":"","status_code":0,"operation":"","error_code":503,"error":"Daemon is not ready yet","metadata":null}

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-k8s-0/charm/venv/charms/reconciler.py", line 35, in reconcile
    self.reconcile_function(event)
  File "/var/lib/juju/agents/unit-k8s-0/charm/./src/charm.py", line 433, in _reconcile
    self._check_k8sd_ready()
  File "/usr/lib/python3.10/contextlib.py", line 78, in inner
    with self._recreate_cm():
  File "/usr/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/var/lib/juju/agents/unit-k8s-0/charm/venv/charms/contextual_status.py", line 106, in on_error
    raise ReconcilerError(msg) from e
charms.contextual_status.ReconcilerError: Found expected exception: Error status 503
    method=GET
    endpoint=/cluster/1.0/ready
    reason=Service Unavailable
    body={"type":"error","status":"","status_code":0,"operation":"","error_code":503,"error":"Daemon is not ready yet","metadata":null}

From systemctl, the k8sd service shows as running:

snap.k8s.k8sd.service loaded active running Service for snap application k8s.k8sd
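The service state above and the k8sd journal below were presumably captured with commands along these lines (assumed invocations; only the journalctl command appears verbatim later in the thread):

systemctl status snap.k8s.k8sd.service --no-pager
journalctl -u snap.k8s.k8sd.service --no-pager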

From the k8sd logs:

Mar 27 13:21:39 vault-1 systemd[1]: snap.k8s.k8sd.service: Main process exited, code=exited, status=1/FAILURE
Mar 27 13:21:39 vault-1 systemd[1]: snap.k8s.k8sd.service: Failed with result 'exit-code'.
Mar 27 13:21:39 vault-1 systemd[1]: snap.k8s.k8sd.service: Consumed 1.175s CPU time.
Mar 27 13:21:39 vault-1 systemd[1]: snap.k8s.k8sd.service: Scheduled restart job, restart counter is at 2065.
Mar 27 13:21:39 vault-1 systemd[1]: Stopped Service for snap application k8s.k8sd.
Mar 27 13:21:39 vault-1 systemd[1]: snap.k8s.k8sd.service: Consumed 1.175s CPU time.
Mar 27 13:21:39 vault-1 systemd[1]: Started Service for snap application k8s.k8sd.
Mar 27 13:21:39 vault-1 k8s.k8sd[901321]: + exec /snap/k8s/174/bin/k8sd --port=6400 --state-dir=/var/snap/k8s/common/var/lib/k8sd/state
Mar 27 13:21:39 vault-1 k8s.k8sd[901321]: time="2024-03-27T13:21:39Z" level=info msg=" - binding control socket" socket=/var/snap/k8s/common/var/lib/k8sd/state/control.socket
Mar 27 13:21:39 vault-1 k8s.k8sd[901321]: time="2024-03-27T13:21:39Z" level=info msg=" - binding https socket" network="[::]:6400"
Mar 27 13:21:39 vault-1 k8s.k8sd[901321]: time="2024-03-27T13:21:39Z" level=info msg="Stopping REST API handler - closing https socket" address="[::]:6400"
Mar 27 13:21:39 vault-1 k8s.k8sd[901321]: time="2024-03-27T13:21:39Z" level=info msg="Received shutdown signal - aborting https socket server startup"
Mar 27 13:21:39 vault-1 k8s.k8sd[901321]: time="2024-03-27T13:21:39Z" level=info msg=" - binding https socket" network="10.244.40.113:6400"
Mar 27 13:22:10 vault-1 k8s.k8sd[901321]: time="2024-03-27T13:22:10Z" level=info msg="Daemon stopped"
Mar 27 13:22:10 vault-1 k8s.k8sd[901321]: Error: Failed to run k8sd: failed to run microcluster: Unable to start daemon: Daemon failed to start: Failed to re-establish cluster connection: context deadline exceeded
bschimke95 commented 7 months ago

Hey @amc94

That happens because the juju leader node is not necessarily the database leader/voter node. Thus, in a two-node setup juju may remove the only dqlite voter node rather than the follower node, which leaves the cluster in a broken state. The obvious solution is to fail over to the follower node when the last voter is removed. We already have a TODO for this at https://github.com/canonical/k8s-snap/blob/main/src/k8s/pkg/client/dqlite/remove.go#L37 and will work on it in the upcoming sprint.
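A quick way to see the mismatch described above is to compare juju's view of leadership with k8sd's own state; a rough sketch, assuming juju 3.x client syntax (the state directory is the one visible in the k8sd logs):

# juju marks its application leader with '*' in the status output:
juju status k8s
# k8sd tracks dqlite voter/follower roles independently, under its own state
# directory on each unit, so the juju leader is not necessarily the dqlite voter:
juju exec --unit k8s/0 -- ls /var/snap/k8s/common/var/lib/k8sd/state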

bschimke95 commented 7 months ago

I implemented a workaround in #289. A proper fix will be delivered by the dqlite team with #293.

bschimke95 commented 7 months ago

Hey @amc94

#289 is merged. Could you test whether this fixes your issue?

amc94 commented 7 months ago

Hi @bschimke95, I'm having difficulties bootstrapping the k8s charm on top of a MAAS cloud. When I try to bootstrap, I see:

unit-k8s-2: 16:10:38 INFO juju Starting unit workers for "k8s/2"
unit-k8s-2: 16:10:38 INFO juju.worker.apicaller [01e3d9] "unit-k8s-2" successfully connected to "10.246.165.170:17070"
unit-k8s-2: 16:10:38 INFO juju.worker.apicaller [01e3d9] password changed for "unit-k8s-2"
unit-k8s-2: 16:10:38 INFO juju.worker.apicaller [01e3d9] "unit-k8s-2" successfully connected to "10.246.165.170:17070"
unit-k8s-2: 16:10:38 INFO juju.worker.upgrader no waiter, upgrader is done
unit-k8s-2: 16:10:38 INFO juju.worker.migrationminion migration migration phase is now: NONE
unit-k8s-2: 16:10:38 INFO juju.worker.logger logger worker started
unit-k8s-2: 16:10:38 DEBUG juju.worker.dependency "metric-sender" manifold worker started at 2024-04-09 16:10:38.55295949 +0000 UTC
unit-k8s-2: 16:10:38 DEBUG juju.worker.dependency "uniter" manifold worker started at 2024-04-09 16:10:38.553329721 +0000 UTC
unit-k8s-2: 16:10:38 DEBUG juju.worker.secretsdrainworker got new secret backend
unit-k8s-2: 16:10:38 DEBUG juju.worker.secretsdrainworker no secrets to drain
unit-k8s-2: 16:10:38 DEBUG juju.worker.meterstatus got meter status change signal from watcher
unit-k8s-2: 16:10:38 DEBUG juju.machinelock acquire machine lock "machine-lock" for meterstatus (meter-status-changed)
unit-k8s-2: 16:10:38 DEBUG juju.machinelock machine lock "machine-lock" acquired for meterstatus (meter-status-changed)
unit-k8s-2: 16:10:38 DEBUG juju.worker.uniter.runner starting jujuc server  {unix @/var/lib/juju/agents/unit-k8s-2/agent.socket <nil>}
unit-k8s-2: 16:10:38 DEBUG juju.worker.apiaddressupdater updating API hostPorts to [[10.246.164.114:17070 127.0.0.1:17070 [::1]:17070] [10.246.165.174:17070 127.0.0.1:17070 [::1]:17070] [10.246.165.170:17070 127.0.0.1:17070 [::1]:17070]]
unit-k8s-2: 16:10:38 ERROR juju.worker.meterstatus error running "meter-status-changed": charm missing from disk
unit-k8s-2: 16:10:38 DEBUG juju.machinelock created rotating log file "/var/log/juju/machine-lock.log" with max size 10 MB and max backups 5
unit-k8s-2: 16:10:38 DEBUG juju.machinelock machine lock "machine-lock" released for meterstatus (meter-status-changed)
unit-k8s-2: 16:10:38 DEBUG juju.worker.uniter starting local juju-exec listener on {unix /var/lib/juju/agents/unit-k8s-2/run.socket <nil>}
unit-k8s-2: 16:10:38 INFO juju.worker.uniter unit "k8s/2" started
unit-k8s-2: 16:10:38 DEBUG juju.worker.uniter juju-exec listener running
unit-k8s-2: 16:10:38 INFO juju.worker.uniter resuming charm install
unit-k8s-2: 16:10:38 DEBUG juju.worker.uniter.operation running operation install ch:amd64/jammy/k8s-27 for k8s/2
unit-k8s-2: 16:10:38 DEBUG juju.worker.uniter.operation no machine lock needed for k8s/2
unit-k8s-2: 16:10:38 DEBUG juju.worker.uniter.operation preparing operation "install ch:amd64/jammy/k8s-27" for k8s/2
unit-k8s-2: 16:10:38 INFO juju.worker.uniter.charm downloading ch:amd64/jammy/k8s-27 from API server
unit-k8s-2: 16:10:38 DEBUG juju.worker.uniter.operation executing operation "install ch:amd64/jammy/k8s-27" for k8s/2
unit-k8s-2: 16:10:38 DEBUG juju.worker.uniter.charm preparing to deploy charm "ch:amd64/jammy/k8s-27"
unit-k8s-2: 16:10:38 DEBUG juju.worker.uniter.charm deploying charm "ch:amd64/jammy/k8s-27"
unit-k8s-2: 16:10:39 DEBUG juju.worker.uniter.charm finishing deploy of charm "ch:amd64/jammy/k8s-27"
unit-k8s-2: 16:10:39 DEBUG juju.worker.uniter.operation committing operation "install ch:amd64/jammy/k8s-27" for k8s/2
unit-k8s-2: 16:10:39 INFO juju.worker.uniter hooks are retried true
unit-k8s-2: 16:10:39 DEBUG juju.worker.uniter.remotestate starting secrets rotation watcher
unit-k8s-2: 16:10:39 DEBUG juju.worker.uniter.remotestate starting secret revisions expiry watcher
unit-k8s-2: 16:10:39 DEBUG juju.worker.uniter.remotestate starting obsolete secret revisions watcher
unit-k8s-2: 16:10:39 DEBUG juju.worker.uniter.secretsrotate got rotate secret changes: []watcher.SecretTriggerChange{}
unit-k8s-2: 16:10:39 DEBUG juju.worker.uniter.remotestate got unit change for k8s/2
unit-k8s-2: 16:10:39 DEBUG juju.worker.uniter.secretrevisionsexpire got revision expiry secret changes: []watcher.SecretTriggerChange{}
unit-k8s-2: 16:10:39 DEBUG juju.worker.uniter.remotestate got update status interval change for k8s/2: ok=true
unit-k8s-2: 16:10:39 DEBUG juju.worker.uniter.remotestate got storage change for k8s/2: [] ok=true
unit-k8s-2: 16:10:39 DEBUG juju.worker.uniter.remotestate got obsolete secret revisions change for k8s: []
unit-k8s-2: 16:10:39 DEBUG juju.worker.uniter.remotestate obsolete secret revisions: map[]
unit-k8s-2: 16:10:39 DEBUG juju.worker.uniter.remotestate deleted secrets: []
unit-k8s-2: 16:10:39 DEBUG juju.worker.uniter.remotestate got application change for k8s/2
unit-k8s-2: 16:10:39 DEBUG juju.worker.uniter.remotestate got secrets change for k8s/2: []
unit-k8s-2: 16:10:39 DEBUG juju.worker.uniter.remotestate got latest secret info: map[string]secrets.SecretRevisionInfo{}
unit-k8s-2: 16:10:39 DEBUG juju.worker.uniter.remotestate deleted secrets: []
unit-k8s-2: 16:10:39 DEBUG juju.worker.uniter.remotestate obsolete secrets: map[]
unit-k8s-2: 16:10:39 DEBUG juju.worker.uniter.remotestate got address change for k8s/2: ok=true, hashes=[02a44cd9aa7c3d65311f6116ab7e00367f885f6d90449058a589404ed6beee9d]
unit-k8s-2: 16:10:39 DEBUG juju.worker.uniter.remotestate got action change for k8s/2: [] ok=true
unit-k8s-2: 16:10:39 DEBUG juju.worker.uniter.remotestate got relations change for k8s/2: ok=true
unit-k8s-2: 16:10:39 DEBUG juju.worker.uniter.remotestate got config change for k8s/2: ok=true, hashes=[ecc8c064fc5d3a506b16fc17be2641816dbc54a7120152758d18e33ea225f67a]
unit-k8s-2: 16:10:39 DEBUG juju.worker.uniter.remotestate got leader settings change for k8s/2: ok=true
unit-k8s-2: 16:10:39 DEBUG juju.worker.uniter.remotestate got upgrade series change
unit-k8s-2: 16:10:39 DEBUG juju.worker.uniter.remotestate no upgrade series in progress, reinitializing local upgrade series state
unit-k8s-2: 16:10:39 DEBUG juju.worker.uniter.remotestate got trust config change for k8s/2: ok=true, hashes=[6e01b14c257801c5ecce384a27f90588133952c5465677d535175b4287ccea48]
unit-k8s-2: 16:10:39 INFO juju.worker.uniter.storage initial storage attachments ready
unit-k8s-2: 16:10:39 INFO juju.worker.uniter found queued "install" hook
unit-k8s-2: 16:10:39 DEBUG juju.worker.uniter.operation running operation run install hook for k8s/2
unit-k8s-2: 16:10:39 DEBUG juju.worker.uniter.operation acquiring machine lock for k8s/2
unit-k8s-2: 16:10:39 DEBUG juju.machinelock acquire machine lock "machine-lock" for k8s/2 uniter (run install hook)
unit-k8s-2: 16:10:39 DEBUG juju.machinelock machine lock "machine-lock" acquired for k8s/2 uniter (run install hook)
unit-k8s-2: 16:10:39 DEBUG juju.worker.uniter.operation preparing operation "run install hook" for k8s/2
unit-k8s-2: 16:10:39 DEBUG juju.worker.uniter.operation executing operation "run install hook" for k8s/2
unit-k8s-2: 16:10:39 DEBUG juju.worker.uniter [AGENT-STATUS] executing: running install hook
unit-k8s-2: 16:10:39 DEBUG juju.worker.uniter.runner starting jujuc server  {unix @/var/lib/juju/agents/unit-k8s-2/agent.socket <nil>}
unit-k8s-2: 16:10:39 DEBUG unit.k8s/2.juju-log ops 2.12.0 up and running.
unit-k8s-2: 16:10:39 INFO unit.k8s/2.juju-log Running legacy hooks/install.
unit-k8s-2: 16:10:39 DEBUG unit.k8s/2.juju-log ops 2.12.0 up and running.
unit-k8s-2: 16:10:39 DEBUG unit.k8s/2.juju-log Charm called itself via hooks/install.
unit-k8s-2: 16:10:39 DEBUG unit.k8s/2.juju-log Legacy hooks/install exited with status 0.
unit-k8s-2: 16:10:39 DEBUG unit.k8s/2.juju-log Using local storage: not a Kubernetes podspec charm
unit-k8s-2: 16:10:39 DEBUG unit.k8s/2.juju-log Initializing SQLite local storage: /var/lib/juju/agents/unit-k8s-2/charm/.unit-state.db.
unit-k8s-2: 16:10:39 DEBUG unit.k8s/2.juju-log Emitting Juju event install.
unit-k8s-2: 16:10:39 INFO unit.k8s/2.juju-log Applying Proxied Environment Settings
unit-k8s-2: 16:10:39 INFO unit.k8s/2.juju-log Ensuring k8s snap version
unit-k8s-2: 16:10:48 INFO unit.k8s/2.juju-log Applying K8s requirements
unit-k8s-2: 16:10:48 INFO unit.k8s/2.juju-log Check if k8ds is ready
unit-k8s-2: 16:10:48 DEBUG unit.k8s/2.juju-log Found expected exception: Error connecting to socket: /var/snap/k8s/common/var/lib/k8sd/state/control.socket
unit-k8s-2: 16:10:48 ERROR unit.k8s/2.juju-log Caught ReconcilerError
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-k8s-2/charm/lib/charms/k8s/v0/k8sd_api_manager.py", line 438, in connect
    self.sock.connect(self.unix_socket)
FileNotFoundError: [Errno 2] No such file or directory

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-k8s-2/charm/venv/charms/contextual_status.py", line 101, in on_error
    yield
  File "/usr/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/var/lib/juju/agents/unit-k8s-2/charm/./src/charm.py", line 208, in _check_k8sd_ready
    self.api_manager.check_k8sd_ready()
  File "/var/lib/juju/agents/unit-k8s-2/charm/lib/charms/k8s/v0/k8sd_api_manager.py", line 659, in check_k8sd_ready
    self._send_request(endpoint, "GET", EmptyResponse)
  File "/var/lib/juju/agents/unit-k8s-2/charm/lib/charms/k8s/v0/k8sd_api_manager.py", line 550, in _send_request
    connection.request(
  File "/usr/lib/python3.10/http/client.py", line 1283, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1329, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1278, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.10/http/client.py", line 1038, in _send_output
    self.send(msg)
  File "/usr/lib/python3.10/http/client.py", line 976, in send
    self.connect()
  File "/var/lib/juju/agents/unit-k8s-2/charm/lib/charms/k8s/v0/k8sd_api_manager.py", line 440, in connect
    raise K8sdConnectionError(f"Error connecting to socket: {self.unix_socket}") from e
charms.k8s.v0.k8sd_api_manager.K8sdConnectionError: Error connecting to socket: /var/snap/k8s/common/var/lib/k8sd/state/control.socket

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-k8s-2/charm/venv/charms/reconciler.py", line 35, in reconcile
    self.reconcile_function(event)
  File "/var/lib/juju/agents/unit-k8s-2/charm/./src/charm.py", line 472, in _reconcile
    self._check_k8sd_ready()
  File "/usr/lib/python3.10/contextlib.py", line 78, in inner
    with self._recreate_cm():
  File "/usr/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/var/lib/juju/agents/unit-k8s-2/charm/venv/charms/contextual_status.py", line 106, in on_error
    raise ReconcilerError(msg) from e
charms.contextual_status.ReconcilerError: Found expected exception: Error connecting to socket: /var/snap/k8s/common/var/lib/k8sd/state/control.socket
unit-k8s-2: 16:10:48 INFO unit.k8s/2.juju-log Status context closed with: [WaitingStatus('Waiting for k8sd')]
unit-k8s-2: 16:10:48 INFO juju.worker.uniter.operation ran "install" hook (via hook dispatching script: dispatch)
unit-k8s-2: 16:10:48 DEBUG juju.worker.uniter.operation committing operation "run install hook" for k8s/2
unit-k8s-2: 16:10:48 DEBUG juju.machinelock created rotating log file "/var/log/juju/machine-lock.log" with max size 10 MB and max backups 5
unit-k8s-2: 16:10:48 DEBUG juju.machinelock machine lock "machine-lock" released for k8s/2 uniter (run install hook)
unit-k8s-2: 16:10:48 DEBUG juju.worker.uniter.operation lock released for k8s/2
unit-k8s-2: 16:10:48 DEBUG juju.worker.uniter.relation unit "k8s/2" (leader=true) entered scope for relation "k8s:cos-tokens"
unit-k8s-2: 16:10:48 DEBUG juju.worker.uniter.relation unit "k8s/2" (leader=true) entered scope for relation "k8s:cluster"
unit-k8s-2: 16:10:48 DEBUG juju.worker.uniter.operation running operation run relation-created (5; app: k8s) hook for k8s/2
unit-k8s-2: 16:10:48 DEBUG juju.worker.uniter.operation acquiring machine lock for k8s/2
unit-k8s-2: 16:10:48 DEBUG juju.machinelock acquire machine lock "machine-lock" for k8s/2 uniter (run relation-created (5; app: k8s) hook)
unit-k8s-2: 16:10:48 DEBUG juju.machinelock machine lock "machine-lock" acquired for k8s/2 uniter (run relation-created (5; app: k8s) hook)
unit-k8s-2: 16:10:48 DEBUG juju.worker.uniter.operation preparing operation "run relation-created (5; app: k8s) hook" for k8s/2
unit-k8s-2: 16:10:48 DEBUG juju.worker.uniter.operation executing operation "run relation-created (5; app: k8s) hook" for k8s/2
unit-k8s-2: 16:10:48 DEBUG juju.worker.uniter [AGENT-STATUS] executing: running cos-tokens-relation-created hook
unit-k8s-2: 16:10:48 DEBUG juju.worker.uniter.runner starting jujuc server  {unix @/var/lib/juju/agents/unit-k8s-2/agent.socket <nil>}
unit-k8s-2: 16:10:49 DEBUG unit.k8s/2.juju-log cos-tokens:5: ops 2.12.0 up and running.
unit-k8s-2: 16:10:49 DEBUG unit.k8s/2.juju-log cos-tokens:5: Emitting Juju event cos_tokens_relation_created.
unit-k8s-2: 16:10:49 INFO unit.k8s/2.juju-log cos-tokens:5: Applying Proxied Environment Settings
unit-k8s-2: 16:10:49 INFO unit.k8s/2.juju-log cos-tokens:5: Ensuring k8s snap version
unit-k8s-2: 16:10:49 WARNING unit.k8s/2.cos-tokens-relation-created snap "k8s" has no updates available
unit-k8s-2: 16:10:49 INFO unit.k8s/2.juju-log cos-tokens:5: Applying K8s requirements
unit-k8s-2: 16:10:49 INFO unit.k8s/2.juju-log cos-tokens:5: Check if k8ds is ready
unit-k8s-2: 16:10:49 ERROR unit.k8s/2.juju-log cos-tokens:5: Invalid response while checking if cluster is bootstrapped: HTTP or Socket error   method=GET
    endpoint=/1.0/k8sd/cluster
unit-k8s-2: 16:10:49 INFO unit.k8s/2.juju-log cos-tokens:5: Using dqlite as datastore
unit-k8s-2: 16:10:49 DEBUG unit.k8s/2.juju-log cos-tokens:5: Found expected exception: Error status 500
    method=POST
    endpoint=/cluster/control
    reason=Internal Server Error
    body={"type":"error","status":"","status_code":0,"operation":"","error_code":500,"error":"Failed to run post-bootstrap actions: failed to unmarshal bootstrap config: failed to unmarshal bootstrap config: invalid character 'c' looking for beginning of value","metadata":null}

unit-k8s-2: 16:10:49 ERROR unit.k8s/2.juju-log cos-tokens:5: Caught ReconcilerError
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-k8s-2/charm/venv/charms/contextual_status.py", line 101, in on_error
    yield
  File "/usr/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/var/lib/juju/agents/unit-k8s-2/charm/./src/charm.py", line 238, in _bootstrap_k8s_snap
    self.api_manager.bootstrap_k8s_snap(payload)
  File "/var/lib/juju/agents/unit-k8s-2/charm/lib/charms/k8s/v0/k8sd_api_manager.py", line 671, in bootstrap_k8s_snap
    self._send_request(endpoint, "POST", EmptyResponse, body)
  File "/var/lib/juju/agents/unit-k8s-2/charm/lib/charms/k8s/v0/k8sd_api_manager.py", line 559, in _send_request
    raise InvalidResponseError(
charms.k8s.v0.k8sd_api_manager.InvalidResponseError: Error status 500
    method=POST
    endpoint=/cluster/control
    reason=Internal Server Error
    body={"type":"error","status":"","status_code":0,"operation":"","error_code":500,"error":"Failed to run post-bootstrap actions: failed to unmarshal bootstrap config: failed to unmarshal bootstrap config: invalid character 'c' looking for beginning of value","metadata":null}

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-k8s-2/charm/venv/charms/reconciler.py", line 35, in reconcile
    self.reconcile_function(event)
  File "/var/lib/juju/agents/unit-k8s-2/charm/./src/charm.py", line 474, in _reconcile
    self._bootstrap_k8s_snap()
  File "/usr/lib/python3.10/contextlib.py", line 78, in inner
    with self._recreate_cm():
  File "/usr/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/var/lib/juju/agents/unit-k8s-2/charm/venv/charms/contextual_status.py", line 106, in on_error
    raise ReconcilerError(msg) from e
charms.contextual_status.ReconcilerError: Found expected exception: Error status 500
    method=POST
    endpoint=/cluster/control
    reason=Internal Server Error
    body={"type":"error","status":"","status_code":0,"operation":"","error_code":500,"error":"Failed to run post-bootstrap actions: failed to unmarshal bootstrap config: failed to unmarshal bootstrap config: invalid character 'c' looking for beginning of value","metadata":null}
unit-k8s-2: 16:10:49 INFO unit.k8s/2.juju-log cos-tokens:5: Status context closed with: [WaitingStatus('Waiting to bootstrap k8s snap')]
unit-k8s-2: 16:10:49 INFO juju.worker.uniter.operation ran "cos-tokens-relation-created" hook (via hook dispatching script: dispatch)
unit-k8s-2: 16:10:50 DEBUG juju.worker.uniter.operation committing operation "run relation-created (5; app: k8s) hook" for k8s/2
unit-k8s-2: 16:10:50 DEBUG juju.machinelock created rotating log file "/var/log/juju/machine-lock.log" with max size 10 MB and max backups 5
unit-k8s-2: 16:10:50 DEBUG juju.machinelock machine lock "machine-lock" released for k8s/2 uniter (run relation-created (5; app: k8s) hook)
unit-k8s-2: 16:10:50 DEBUG juju.worker.uniter.operation lock released for k8s/2
unit-k8s-2: 16:10:50 DEBUG juju.worker.uniter.operation running operation run relation-created (4; app: k8s) hook for k8s/2
unit-k8s-2: 16:10:50 DEBUG juju.worker.uniter.operation acquiring machine lock for k8s/2
unit-k8s-2: 16:10:50 DEBUG juju.machinelock acquire machine lock "machine-lock" for k8s/2 uniter (run relation-created (4; app: k8s) hook)
unit-k8s-2: 16:10:50 DEBUG juju.machinelock machine lock "machine-lock" acquired for k8s/2 uniter (run relation-created (4; app: k8s) hook)
unit-k8s-2: 16:10:50 DEBUG juju.worker.uniter.operation preparing operation "run relation-created (4; app: k8s) hook" for k8s/2
unit-k8s-2: 16:10:50 DEBUG juju.worker.uniter.operation executing operation "run relation-created (4; app: k8s) hook" for k8s/2
unit-k8s-2: 16:10:50 DEBUG juju.worker.uniter [AGENT-STATUS] executing: running cluster-relation-created hook
unit-k8s-2: 16:10:50 DEBUG juju.worker.uniter.runner starting jujuc server  {unix @/var/lib/juju/agents/unit-k8s-2/agent.socket <nil>}
unit-k8s-2: 16:10:50 DEBUG unit.k8s/2.juju-log cluster:4: ops 2.12.0 up and running.
unit-k8s-2: 16:10:50 DEBUG unit.k8s/2.juju-log cluster:4: Emitting Juju event cluster_relation_created.
unit-k8s-2: 16:10:50 INFO unit.k8s/2.juju-log cluster:4: Applying Proxied Environment Settings
unit-k8s-2: 16:10:50 INFO unit.k8s/2.juju-log cluster:4: Ensuring k8s snap version
unit-k8s-2: 16:10:50 WARNING unit.k8s/2.cluster-relation-created snap "k8s" has no updates available
unit-k8s-2: 16:10:50 INFO unit.k8s/2.juju-log cluster:4: Applying K8s requirements
unit-k8s-2: 16:10:50 INFO unit.k8s/2.juju-log cluster:4: Check if k8ds is ready
unit-k8s-2: 16:10:50 ERROR unit.k8s/2.juju-log cluster:4: Invalid response while checking if cluster is bootstrapped: Error status 500
    method=GET
    endpoint=/1.0/k8sd/cluster
    reason=Internal Server Error
    body={"type":"error","status":"","status_code":0,"operation":"","error_code":500,"error":"failed to create k8s client: failed to build Kubernetes REST config: stat /etc/kubernetes/admin.conf: no such file or directory","metadata":null}

unit-k8s-2: 16:10:50 INFO unit.k8s/2.juju-log cluster:4: Using dqlite as datastore
unit-k8s-2: 16:10:50 DEBUG unit.k8s/2.juju-log cluster:4: Found expected exception: Error status 500
    method=POST
    endpoint=/cluster/control
    reason=Internal Server Error
    body={"type":"error","status":"","status_code":0,"operation":"","error_code":500,"error":"Failed to initialize local remote entry: A remote with name \"solqa-lab1-server-10\" already exists","metadata":null}

unit-k8s-2: 16:10:50 ERROR unit.k8s/2.juju-log cluster:4: Caught ReconcilerError
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-k8s-2/charm/venv/charms/contextual_status.py", line 101, in on_error
    yield
  File "/usr/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/var/lib/juju/agents/unit-k8s-2/charm/./src/charm.py", line 238, in _bootstrap_k8s_snap
    self.api_manager.bootstrap_k8s_snap(payload)
  File "/var/lib/juju/agents/unit-k8s-2/charm/lib/charms/k8s/v0/k8sd_api_manager.py", line 671, in bootstrap_k8s_snap
    self._send_request(endpoint, "POST", EmptyResponse, body)
  File "/var/lib/juju/agents/unit-k8s-2/charm/lib/charms/k8s/v0/k8sd_api_manager.py", line 559, in _send_request
    raise InvalidResponseError(
charms.k8s.v0.k8sd_api_manager.InvalidResponseError: Error status 500
    method=POST
    endpoint=/cluster/control
    reason=Internal Server Error
    body={"type":"error","status":"","status_code":0,"operation":"","error_code":500,"error":"Failed to initialize local remote entry: A remote with name \"solqa-lab1-server-10\" already exists","metadata":null}

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-k8s-2/charm/venv/charms/reconciler.py", line 35, in reconcile
    self.reconcile_function(event)
  File "/var/lib/juju/agents/unit-k8s-2/charm/./src/charm.py", line 474, in _reconcile
    self._bootstrap_k8s_snap()
  File "/usr/lib/python3.10/contextlib.py", line 78, in inner
    with self._recreate_cm():
  File "/usr/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/var/lib/juju/agents/unit-k8s-2/charm/venv/charms/contextual_status.py", line 106, in on_error
    raise ReconcilerError(msg) from e
charms.contextual_status.ReconcilerError: Found expected exception: Error status 500
    method=POST
    endpoint=/cluster/control
    reason=Internal Server Error
    body={"type":"error","status":"","status_code":0,"operation":"","error_code":500,"error":"Failed to initialize local remote entry: A remote with name \"solqa-lab1-server-10\" already exists","metadata":null}

The remote solqa-lab1-server-10 is the node the k8s charm was deployed to.

The following is the journal output for the k8s snap:

ubuntu@solqa-lab1-server-10:~$ journalctl -u snap.k8s.k8sd.service
Apr 09 16:10:48 solqa-lab1-server-10 systemd[1]: Started Service for snap application k8s.k8sd.
Apr 09 16:10:48 solqa-lab1-server-10 k8s.k8sd[6024]: + exec /snap/k8s/213/bin/k8sd --state-dir=/var/snap/k8s/common/var/lib/k8sd/state
Apr 09 16:10:49 solqa-lab1-server-10 k8s.k8sd[6024]: time="2024-04-09T16:10:49Z" level=warning msg="Failed to parse new remotes from truststore"
Apr 09 16:10:49 solqa-lab1-server-10 k8s.k8sd[6024]: time="2024-04-09T16:10:49Z" level=warning msg="microcluster database is uninitialized"
Apr 09 16:10:49 solqa-lab1-server-10 k8s.k8sd[6024]: time="2024-04-09T16:10:49Z" level=warning msg="Failed to parse new remotes from truststore"
Apr 09 16:10:49 solqa-lab1-server-10 k8s.k8sd[6024]: time="2024-04-09T16:10:49Z" level=info msg=" - binding control socket" socket=/var/snap/k8s/common/var/lib/k8sd/state/control.socket
Apr 09 16:10:49 solqa-lab1-server-10 k8s.k8sd[6024]: 2024/04/09 16:10:49 http: panic serving @: runtime error: invalid memory address or nil pointer dereference
Apr 09 16:10:49 solqa-lab1-server-10 k8s.k8sd[6024]: goroutine 35 [running]:
Apr 09 16:10:49 solqa-lab1-server-10 k8s.k8sd[6024]: net/http.(*conn).serve.func1()
Apr 09 16:10:49 solqa-lab1-server-10 k8s.k8sd[6024]:         /snap/go/10584/src/net/http/server.go:1868 +0xb9
Apr 09 16:10:49 solqa-lab1-server-10 k8s.k8sd[6024]: panic({0x21fe7a0?, 0x3a978b0?})
Apr 09 16:10:49 solqa-lab1-server-10 k8s.k8sd[6024]:         /snap/go/10584/src/runtime/panic.go:920 +0x270
Apr 09 16:10:49 solqa-lab1-server-10 k8s.k8sd[6024]: github.com/canonical/go-dqlite/app.(*App).clientOptions(...)
Apr 09 16:10:49 solqa-lab1-server-10 k8s.k8sd[6024]:         /root/go/pkg/mod/github.com/canonical/go-dqlite@v1.21.0/app/app.go:724
Apr 09 16:10:49 solqa-lab1-server-10 k8s.k8sd[6024]: github.com/canonical/go-dqlite/app.(*App).Leader(0x28f14d0?, {0x28f1540?, 0xc00027c000?})
Apr 09 16:10:49 solqa-lab1-server-10 k8s.k8sd[6024]:         /root/go/pkg/mod/github.com/canonical/go-dqlite@v1.21.0/app/app.go:488 +0x22
Apr 09 16:10:49 solqa-lab1-server-10 k8s.k8sd[6024]: github.com/canonical/microcluster/internal/db.(*DB).Leader(...)
Apr 09 16:10:49 solqa-lab1-server-10 k8s.k8sd[6024]:         /root/go/pkg/mod/github.com/canonical/microcluster@v0.0.0-20240402231055-e8e97999a2eb/internal/db/dqlite.go:176
Apr 09 16:10:49 solqa-lab1-server-10 k8s.k8sd[6024]: github.com/canonical/microcluster/internal/state.(*State).Leader(0xc00024b960)
Apr 09 16:10:49 solqa-lab1-server-10 k8s.k8sd[6024]:         /root/go/pkg/mod/github.com/canonical/microcluster@v0.0.0-20240402231055-e8e97999a2eb/internal/state/state.go:125 +0xa6
Apr 09 16:10:49 solqa-lab1-server-10 k8s.k8sd[6024]: github.com/canonical/k8s/pkg/k8sd/api/impl.GetClusterMembers({0x28f14d0, 0xc000402ff0}, 0x7f8966947108?)
Apr 09 16:10:49 solqa-lab1-server-10 k8s.k8sd[6024]:         /build/k8s/parts/k8s-binaries/build/pkg/k8sd/api/impl/k8sd.go:17 +0x35
Apr 09 16:10:49 solqa-lab1-server-10 k8s.k8sd[6024]: github.com/canonical/k8s/pkg/k8sd/api.(*Endpoints).getClusterStatus(0x412885?, 0xc00024b960, 0x20a3120?)
Apr 09 16:10:49 solqa-lab1-server-10 k8s.k8sd[6024]:         /build/k8s/parts/k8s-binaries/build/pkg/k8sd/api/cluster.go:18 +0x5c
Apr 09 16:10:49 solqa-lab1-server-10 k8s.k8sd[6024]: github.com/canonical/microcluster/internal/rest.handleAPIRequest({0xc00012d640?, 0xc00012d650?, 0x20?, 0x31?}, 0x28cb760?, {0x2284ce0?, 0x3afff08?}, 0xc00070>
Apr 09 16:10:49 solqa-lab1-server-10 k8s.k8sd[6024]:         /root/go/pkg/mod/github.com/canonical/microcluster@v0.0.0-20240402231055-e8e97999a2eb/internal/rest/rest.go:64 +0x1c8
Apr 09 16:10:49 solqa-lab1-server-10 k8s.k8sd[6024]: github.com/canonical/microcluster/internal/rest.HandleEndpoint.func1({0x28e23a0, 0xc000712000}, 0xc00070c200)
Apr 09 16:10:49 solqa-lab1-server-10 k8s.k8sd[6024]:         /root/go/pkg/mod/github.com/canonical/microcluster@v0.0.0-20240402231055-e8e97999a2eb/internal/rest/rest.go:212 +0x64f
Apr 09 16:10:49 solqa-lab1-server-10 k8s.k8sd[6024]: net/http.HandlerFunc.ServeHTTP(0xc00070c100?, {0x28e23a0?, 0xc000712000?}, 0x7f891fcc6cc0?)
Apr 09 16:10:49 solqa-lab1-server-10 k8s.k8sd[6024]:         /snap/go/10584/src/net/http/server.go:2136 +0x29
Apr 09 16:10:49 solqa-lab1-server-10 k8s.k8sd[6024]: github.com/gorilla/mux.(*Router).ServeHTTP(0xc000001740, {0x28e23a0, 0xc000712000}, 0xc00070c000)
Apr 09 16:10:49 solqa-lab1-server-10 k8s.k8sd[6024]:         /root/go/pkg/mod/github.com/gorilla/mux@v1.8.1/mux.go:212 +0x1c5
Apr 09 16:10:49 solqa-lab1-server-10 k8s.k8sd[6024]: net/http.serverHandler.ServeHTTP({0xc000700060?}, {0x28e23a0?, 0xc000712000?}, 0x6?)
Apr 09 16:10:49 solqa-lab1-server-10 k8s.k8sd[6024]:         /snap/go/10584/src/net/http/server.go:2938 +0x8e
Apr 09 16:10:49 solqa-lab1-server-10 k8s.k8sd[6024]: net/http.(*conn).serve(0xc00020ad80, {0x28f1498, 0xc000700000})
Apr 09 16:10:49 solqa-lab1-server-10 k8s.k8sd[6024]:         /snap/go/10584/src/net/http/server.go:2009 +0x5f4
Apr 09 16:10:49 solqa-lab1-server-10 k8s.k8sd[6024]: created by net/http.(*Server).Serve in goroutine 52
Apr 09 16:10:49 solqa-lab1-server-10 k8s.k8sd[6024]:         /snap/go/10584/src/net/http/server.go:3086 +0x5cb
Apr 09 16:10:49 solqa-lab1-server-10 k8s.k8sd[6024]: time="2024-04-09T16:10:49Z" level=info msg=" - binding https socket" network="10.246.166.252:6400"
Apr 09 16:10:49 solqa-lab1-server-10 k8s.k8sd[6024]: time="2024-04-09T16:10:49Z" level=info msg="Preparing statements for Go project \"k8s-binaries\""
bschimke95 commented 7 months ago

This looks like the issue we see in https://github.com/canonical/k8s-snap/issues/321. I'll investigate.

bschimke95 commented 6 months ago

@amc94, this should be fixed with https://github.com/canonical/k8s-snap/pull/353. Please try again.

amc94 commented 6 months ago

It's fixed, thank you.