canonical / postgresql-k8s-operator

A Charmed Operator for running PostgreSQL on Kubernetes
https://charmhub.io/postgresql-k8s
Apache License 2.0

Pod eviction results in empty pgbackrest.conf #269

Closed phvalguima closed 5 months ago

phvalguima commented 1 year ago

I am currently running the latest Juju (v4.0-beta, checked out directly from the repo, but this was also seen with Juju 2.9) and PostgreSQL from channel 14/stable.

Environment Setup

A single-node MicroK8s from the 1.27/stable channel, installed with classic confinement.

juju deploy postgresql-k8s -n3 --channel=14/stable
juju deploy s3-integrator
juju config s3-integrator bucket="test" endpoint="localhost"
juju run s3-integrator/0 sync-s3-credentials access-key=blabla secret-key=bleble
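The S3 relation between the two applications is assumed to be in place as well (not shown above); a sketch of that step, relying on endpoint auto-matching:

juju relate postgresql-k8s s3-integrator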

That deployment should render a pgbackrest.conf file:

$ juju ssh --container postgresql postgresql-k8s/1 cat /etc/pgbackrest.conf
[global]
backup-standby=y
repo1-retention-full=9999999
repo1-type=s3
repo1-path=
repo1-s3-region=None
repo1-s3-endpoint=localhost
repo1-s3-bucket=test
repo1-s3-uri-style=host
repo1-s3-key=blabla
repo1-s3-key-secret=bleble
start-fast=y

[patroni-postgresql-k8s]
pg1-path=/var/lib/postgresql/data/pgdata
pg1-user=backup

The deployment looks like this:

Model  Controller          Cloud/Region        Version    SLA          Timestamp
test   test-k8s-localhost  test-k8s/localhost  4.0-beta1  unsupported  16:05:10+02:00

App             Version  Status  Scale  Charm           Channel    Rev  Address         Exposed  Message
postgresql-k8s  14.7     active      3  postgresql-k8s  14/stable   73  10.152.183.170  no       
s3-integrator            active      1  s3-integrator   stable      13  10.152.183.52   no       

Unit               Workload  Agent  Address       Ports  Message
postgresql-k8s/0*  active    idle   10.1.166.153         Primary
postgresql-k8s/1   active    idle   10.1.166.151         
postgresql-k8s/2   active    idle   10.1.166.152         
s3-integrator/0*   active    idle   10.1.166.154  

Reproducer

To reproduce it, first add a new node to the cluster; in this case, a VM from LXD was used:

lxc launch ubuntu:22.04 microk8s --vm
lxc exec microk8s -- bash

Then, install MicroK8s v1.27 inside the VM and join it to the cluster, following: https://microk8s.io/docs/clustering
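A minimal sketch of that clustering step; the join address and token below are placeholders, the real ones come from the add-node output:

# inside the new VM
snap install microk8s --classic --channel=1.27/stable
# on the existing node, print a join command
microk8s add-node
# back on the new VM, run the join command that add-node printed, e.g.
microk8s join <existing-node-ip>:25000/<token>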

Confirm the new node is Ready: kubectl get nodes

Then, confirm all the pods are still scheduled on the same original node, with: kubectl get po -n <model-name> -o wide

Select one of the pods to be evicted and build an eviction request, following: https://kubernetes.io/docs/concepts/scheduling-eviction/api-eviction/#calling-the-eviction-api

# Get the token from ~/.kube/config
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: ...
    server: ...
  name: microk8s-cluster
contexts:
- context:
    cluster: microk8s-cluster
    user: admin
  name: microk8s
current-context: microk8s
kind: Config
preferences: {}
users:
- name: admin
  user:
    token: YYYYYYYYYYYYYY <<<---------- this is the token
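Assuming the admin user is the first entry in the kubeconfig, the token and API server address can also be pulled out with kubectl (illustrative one-liners):

# or simply: grep token ~/.kube/config
TOKEN=$(kubectl config view --raw -o jsonpath='{.users[0].user.token}')
SERVER=$(kubectl config view --raw -o jsonpath='{.clusters[0].cluster.server}')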

Generate a JSON file with the pod name; in my case:

$ cat evict.json 
{
  "apiVersion": "policy/v1",
  "kind": "Eviction",
  "metadata": {
    "name": "postgresql-k8s-2",
    "namespace": "test"
  }
}

Call the k8s API with curl, using the token and the other details above:

curl -k -v -H "Authorization: Bearer YYYYYYYYYYYYYY" -H 'Content-type: application/json' https://.../api/v1/namespaces/<model-name>/pods/<pod-name>/eviction -d @evict.json
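For reference, draining the node that hosts the pod should go through the same Eviction API (the node name is a placeholder):

kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data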

The eviction reschedules the pod onto the new node and leaves it with a wrong pgbackrest.conf:

$ juju ssh --container postgresql postgresql-k8s/2 cat /etc/pgbackrest.conf
[global]
repo-path=/var/lib/pgbackrest

#[main]
#db-path=/var/lib/postgresql/9.4/main

Early Conclusions

  1. This was focused on pgbackrest.conf, but we should confirm other config files are correctly set
  2. This is caused because the pod rootfs is recreated on eviction and we do not store the config on the storage device itself
  3. Logs show that Juju actually reruns stop / upgrade-charm / start for the new pod, keeping postgresql-k8s/2 in juju status

Full logs from postgresql/2: https://pastebin.ubuntu.com/p/CqjbhxwZvh/

github-actions[bot] commented 1 year ago

https://warthogs.atlassian.net/browse/DPE-2671

phvalguima commented 1 year ago

I am not sure which path we should follow. I see some options:

  1. Juju issues an event whenever it detects a pod eviction, e.g. a new type of event or a peer-changed to all the peers
  2. There is no way to differentiate between an eviction and a new install from the charm perspective, except for the (very likely) IP change. However, the IP change is not 100% guaranteed: a pod can be rescheduled to the same node and pick the same IP by coincidence
  3. We could have a k8s-integrator charm that relates to all the units, listens to Kubernetes events (independently of Juju), and emits a -changed event to all the units it is related to

Another alternative is to try forbidding evictions altogether.

We can define a PodDisruptionBudget as follows:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: <name>
  namespace: <model-name>
spec:
  maxUnavailable: 0 ### <<<<-------- the relevant part
  selector:
    matchLabels:
      app.juju.is/created-by: <app-name>
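Applying and checking the budget, assuming the manifest above was saved as psql-pdb.yaml and <model-name> was filled in as test:

kubectl apply -f psql-pdb.yaml -n test
kubectl get pdb -n test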

After applying the above and rerunning the same curl command, I now get:

$ curl -k -H "Authorization: Bearer YYYYYYYYYYYYYYYYY" -H 'Content-type: application/json' https://.../api/v1/namespaces/test/pods/postgresql-k8s-2/eviction -d @evict.json
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "Cannot evict pod as it would violate the pod's disruption budget.",
  "reason": "TooManyRequests",
  "details": {
    "causes": [
      {
        "reason": "DisruptionBudget",
        "message": "The disruption budget psql-pdb needs 3 healthy pods and has 3 currently"
      }
    ]
  },
  "code": 429
}

That is the standard way of telling k8s that the pods should not be "disrupted". I think it has another advantage besides avoiding eviction: we cannot guarantee that the volume will be available if the pod moves between nodes, so this also gives more assurance that we will not cause an unforeseen data copy in the event of an eviction.

Removing a unit via Juju still works fine with the PDB in place:

$ juju remove-unit postgresql-k8s --num-units 1
scaling down to 2 units
$ juju status
Model  Controller          Cloud/Region        Version    SLA          Timestamp
test   test-k8s-localhost  test-k8s/localhost  4.0-beta1  unsupported  18:00:07+02:00

App             Version  Status  Scale  Charm           Channel    Rev  Address        Exposed  Message
postgresql-k8s  14.7     active    3/2  postgresql-k8s  14/stable   73  10.152.183.82  no       
s3-integrator            active      1  s3-integrator   stable      13  10.152.183.52  no       

Unit               Workload  Agent      Address       Ports  Message
postgresql-k8s/0*  active    executing  10.1.166.153         (config-changed) Primary
postgresql-k8s/1   active    executing  10.1.166.151         (config-changed) 
postgresql-k8s/2   active    executing  10.1.166.155         

Eventually this renders:

$ juju status
Model  Controller          Cloud/Region        Version    SLA          Timestamp
test   test-k8s-localhost  test-k8s/localhost  4.0-beta1  unsupported  18:03:07+02:00

App             Version  Status  Scale  Charm           Channel    Rev  Address        Exposed  Message
postgresql-k8s  14.7     active      2  postgresql-k8s  14/stable   73  10.152.183.82  no       
s3-integrator            active      1  s3-integrator   stable      13  10.152.183.52  no       

Unit               Workload  Agent  Address       Ports  Message
postgresql-k8s/0*  active    idle   10.1.166.153         
postgresql-k8s/1   active    idle   10.1.166.151         
s3-integrator/0*   active    idle   10.1.166.154         
AmberCharitos commented 7 months ago

This also affected the commercial system's production postgresql-k8s deployment (channel=14/edge, rev=198).

dragomirp commented 5 months ago

Latest 14/edge (rev. 241, 242) should fix this.