canonical / charmed-openstack-upgrader

Automatic upgrade tool for Charmed OpenStack
Apache License 2.0

Functional test fail, race condition on verify_workload_upgrade step #512

Closed jneo8 closed 1 month ago

jneo8 commented 1 month ago

action: https://github.com/canonical/charmed-openstack-upgrader/actions/runs/10068970041/job/27835338985

In test_upgrade, cou upgrade failed to upgrade mysql.

jneo8 commented 1 month ago

The error message:

$ /snap/bin/cou upgrade --no-backup --no-archive --auto-approve
/snap/charmed-openstack-upgrader/x1/lib/python3.10/site-packages/paramiko/pkey.py:100: CryptographyDeprecationWarning: TripleDES has been moved to cryptography.hazmat.decrepit.ciphers.algorithms.TripleDES and will be removed from this module in 48.0.0.
  "cipher": algorithms.TripleDES,
/snap/charmed-openstack-upgrader/x1/lib/python3.10/site-packages/paramiko/transport.py:259: CryptographyDeprecationWarning: TripleDES has been moved to cryptography.hazmat.decrepit.ciphers.algorithms.TripleDES and will be removed from this module in 48.0.0.
  "class": algorithms.TripleDES,
Full execution log: '/home/ubuntu/.local/share/cou/log/cou-20240724080718.log'  
Connected to 'zaza-29a3a17b6d5d' ✔
Analyzing cloud... ✔
Generating upgrade plan... \2024-07-24 08:07:19 [WARNING] Not changing the install repository of app designate-bind: None already set to cloud:focal-victoria
2024-07-24 08:07:19 [WARNING] There is no ceph-mon application. Is this a valid OpenStack cloud?
Generating upgrade plan... ✔
Upgrade cloud from 'ussuri' to 'victoria'
        Verify that all OpenStack applications are in idle state
        Control Plane principal(s) upgrade plan
                Upgrade plan for 'designate-bind' to 'victoria'
                        Upgrade software packages of 'designate-bind' from the current APT repositories
                                Ψ Upgrade software packages on unit 'designate-bind/0'
                        Upgrade 'designate-bind' from 'ussuri/stable' to the new channel: 'victoria/stable'
                        Wait for up to 300s for app 'designate-bind' to reach the idle state
                        Verify that the workload of 'designate-bind' has been upgraded on units: designate-bind/0
                Upgrade plan for 'mysql-innodb-cluster' to 'victoria'
                        Upgrade software packages of 'mysql-innodb-cluster' from the current APT repositories
                                Ψ Upgrade software packages on unit 'mysql-innodb-cluster/0'
                                Ψ Upgrade software packages on unit 'mysql-innodb-cluster/1'
                                Ψ Upgrade software packages on unit 'mysql-innodb-cluster/2'
                        Change charm config of 'mysql-innodb-cluster' 'source' to 'cloud:focal-victoria'
                        Wait for up to 2400s for app 'mysql-innodb-cluster' to reach the idle state
                        Verify that the workload of 'mysql-innodb-cluster' has been upgraded on units: mysql-innodb-cluster/0, mysql-innodb-cluster/1, mysql-innodb-cluster/2

Running cloud upgrade...
Verify that all OpenStack applications are in idle state ✔
Verify that the workload of 'designate-bind' has been upgraded on units: designate-bind/0 ✖
2024-07-24 08:08:18 [ERROR] Unit(s) 'designate-bind/0' did not complete the upgrade to victoria. Some local processes may still be executing; you may try re-running COU in a few minutes.
2024-07-24 08:08:18 [ERROR] See the known issues at https://canonical-charmed-openstack-upgrader.readthedocs-hosted.com/en/stable/reference/known-issues/

It failed at this step: "Verify that the workload of 'designate-bind' has been upgraded on units: designate-bind/0". The verify logic fires too quickly after the step "Wait for up to 300s for app 'designate-bind' to reach the idle state".

I think the _verify_workload_upgrade function needs retry logic to avoid this race condition.
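
A minimal sketch of what such retry logic could look like, assuming the verification can be expressed as an async callable that returns True once the workload reports the target release. The names verify_with_retry, check, attempts and delay are hypothetical and are not taken from the COU code base.

```python
import asyncio
from typing import Awaitable, Callable


async def verify_with_retry(
    check: Callable[[], Awaitable[bool]],
    attempts: int = 5,
    delay: float = 10.0,
) -> bool:
    """Call `check` until it returns True or the attempts are used up.

    `check` stands in for the real verification (hypothetical here).
    """
    for attempt in range(1, attempts + 1):
        if await check():
            return True
        # Sleep between attempts, but not after the final one.
        if attempt < attempts:
            await asyncio.sleep(delay)
    return False


# Demo with a stand-in check that only succeeds on its third call,
# imitating a unit that needs a little time after reaching idle.
calls = {"count": 0}


async def fake_check() -> bool:
    calls["count"] += 1
    return calls["count"] >= 3


result = asyncio.run(verify_with_retry(fake_check, attempts=5, delay=0.01))
print(result, calls["count"])
```

A retry like this only helps if the unit eventually reports the expected version. If the reported version never matches, the step still fails after the last attempt, just later.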

jneo8 commented 1 month ago

It may also be that the upgrade of designate-bind itself failed, which is why the verify step is failing.

samuelallan72 commented 1 month ago

I did some testing and reproduced it locally:

Output from cou:

2024-07-25 15:42:19 [INFO] Running: Wait for up to 300s for app 'designate-bind' to reach the idle state
2024-07-25 15:42:19 [DEBUG] running step: PostUpgradeStep(Wait for up to 300s for app 'designate-bind' to reach the idle state)
Wait for up to 300s for app 'designate-bind' to reach the idle state -2024-07-25 15:42:50 [DEBUG] running all sub-steps of Wait for up to 300s for app 'designate-bind' to reach the idle state
 step sequentially
2024-07-25 15:42:50 [DEBUG] running sub-step Verify that the workload of 'designate-bind' has been upgraded on units: designate-bind/0
 of Upgrade plan for 'designate-bind' to 'victoria'
        Upgrade software packages of 'designate-bind' from the current APT repositories
                Ψ Upgrade software packages on unit 'designate-bind/0'
        Upgrade 'designate-bind' from 'ussuri/stable' to the new channel: 'victoria/stable'
        Wait for up to 300s for app 'designate-bind' to reach the idle state
        Verify that the workload of 'designate-bind' has been upgraded on units: designate-bind/0
 step
2024-07-25 15:42:50 [INFO] Running: Verify that the workload of 'designate-bind' has been upgraded on units: designate-bind/0
2024-07-25 15:42:50 [DEBUG] running step: PostUpgradeStep(Verify that the workload of 'designate-bind' has been upgraded on units: designate-bind/0)
2024-07-25 15:42:50 [DEBUG] Running 'JUJU_DISPATCH_PATH=hooks/update-status ./dispatch' on 'designate-bind/0'
Verify that the workload of 'designate-bind' has been upgraded on units: designate-bind/0 /2024-07-25 15:42:51 [DEBUG] results: {'return-code': 127, 'stderr': '/tmp/juju-exec2335603103/script.sh: line 1: ./dispatch: No such file or directory\n'}
2024-07-25 15:42:51 [DEBUG] Running 'hooks/update-status' on 'designate-bind/0'
Verify that the workload of 'designate-bind' has been upgraded on units: designate-bind/0 -2024-07-25 15:42:54 [DEBUG] results: {'return-code': 0, 'stdout': 'active\n'}
Verify that the workload of 'designate-bind' has been upgraded on units: designate-bind/0 ✖
2024-07-25 15:42:55 [ERROR] Unit(s) 'designate-bind/0' did not complete the upgrade to victoria. Some local processes may still be executing; you may try re-running COU in a few minutes.
2024-07-25 15:42:55 [ERROR] See the known issues at https://canonical-charmed-openstack-upgrader.readthedocs-hosted.com/en/stable/reference/known-issues/

juju status before upgrade:

❯ juju status
Model              Controller   Cloud/Region             Version  SLA          Timestamp
zaza-13e2694c3b83  serverstack  serverstack/serverstack  3.4.3    unsupported  15:40:23+09:30

App                   Version  Status  Scale  Charm                 Channel        Rev  Exposed  Message
designate-bind        9.18.28  active      1  designate-bind        ussuri/stable   98  no       Unit is ready
mysql-innodb-cluster  8.0.37   active      3  mysql-innodb-cluster  8.0/stable     133  no       Unit is ready: Mode: R/W, Cluster is ONLINE and can tolerate up to ONE failure.

Unit                     Workload  Agent  Machine  Public address  Ports  Message
designate-bind/0*        active    idle   0        10.5.0.41              Unit is ready
mysql-innodb-cluster/0*  active    idle   1        10.5.2.132             Unit is ready: Mode: R/W, Cluster is ONLINE and can tolerate up to ONE failure.
mysql-innodb-cluster/1   active    idle   2        10.5.2.127             Unit is ready: Mode: R/O, Cluster is ONLINE and can tolerate up to ONE failure.
mysql-innodb-cluster/2   active    idle   3        10.5.1.15              Unit is ready: Mode: R/O, Cluster is ONLINE and can tolerate up to ONE failure.

juju status after upgrade:

❯ juju status
Model              Controller   Cloud/Region             Version  SLA          Timestamp
zaza-13e2694c3b83  serverstack  serverstack/serverstack  3.4.3    unsupported  16:34:58+09:30

App                   Version  Status  Scale  Charm                 Channel          Rev  Exposed  Message
designate-bind        9.18.28  active      1  designate-bind        victoria/stable   98  no       Unit is ready
mysql-innodb-cluster  8.0.37   active      3  mysql-innodb-cluster  8.0/stable       133  no       Unit is ready: Mode: R/W, Cluster is ONLINE and can tolerate up to ONE failure.

Unit                     Workload  Agent  Machine  Public address  Ports  Message
designate-bind/0*        active    idle   0        10.5.0.41              Unit is ready
mysql-innodb-cluster/0*  active    idle   1        10.5.2.132             Unit is ready: Mode: R/W, Cluster is ONLINE and can tolerate up to ONE failure.
mysql-innodb-cluster/1   active    idle   2        10.5.2.127             Unit is ready: Mode: R/O, Cluster is ONLINE and can tolerate up to ONE failure.
mysql-innodb-cluster/2   active    idle   3        10.5.1.15              Unit is ready: Mode: R/O, Cluster is ONLINE and can tolerate up to ONE failure.

I think the reason is that the workload version (9.18.28 both before and after the upgrade) is outside the version range that openstack_lookup lists for victoria:

service       ,ussuri-lower_version,ussuri-upper_version,victoria-lower_version,victoria-upper_version
designate-bind,9.16.1              ,9.18.12             ,9.16.1                ,9.18.12

Perhaps bind9 has had more recent patch releases, so the installed version is no longer recognised as compatible?
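
To make the mismatch concrete, here is a small sketch that checks the reported workload version against the lookup row quoted above. It assumes plain dotted numeric versions and inclusive bounds. How COU actually compares versions may differ, and parse_version and in_range are hypothetical helpers written for this illustration.

```python
import csv
import io

# The lookup row quoted above, with the column padding removed.
LOOKUP_CSV = """service,ussuri-lower_version,ussuri-upper_version,victoria-lower_version,victoria-upper_version
designate-bind,9.16.1,9.18.12,9.16.1,9.18.12
"""


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '9.18.28' into (9, 18, 28) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def in_range(version: str, lower: str, upper: str) -> bool:
    """Inclusive range check (the inclusiveness is an assumption)."""
    return parse_version(lower) <= parse_version(version) <= parse_version(upper)


row = next(csv.DictReader(io.StringIO(LOOKUP_CSV)))

# 9.18.28 is the version juju status reports before and after the upgrade.
in_victoria = in_range(
    "9.18.28", row["victoria-lower_version"], row["victoria-upper_version"]
)
print(in_victoria)  # False: 9.18.28 is above the 9.18.12 upper bound
```

Whether the upper bound is treated as inclusive or exclusive does not matter for this case, since 9.18.28 is above 9.18.12 either way.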

samuelallan72 commented 1 month ago

❯ rmadison bind9 | grep focal
 bind9 | 1:9.16.1-0ubuntu2            | focal           | source, amd64, arm64, armhf, i386, ppc64el, riscv64, s390x
 bind9 | 1:9.18.28-0ubuntu0.20.04.1   | focal-security  | source, amd64, arm64, armhf, i386, ppc64el, riscv64, s390x
 bind9 | 1:9.18.28-0ubuntu0.20.04.1   | focal-updates   | source, amd64, arm64, armhf, i386, ppc64el, riscv64, s390x
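
The rmadison output uses full Debian package versions (epoch, upstream version, revision), while the lookup table holds only upstream versions. The sketch below strips the epoch and revision so the two can be compared per pocket. The helper names are hypothetical, the parsing handles only the simple numeric versions seen here, and the 9.18.12 bound is copied from the lookup row earlier in the thread.

```python
RMADISON_OUTPUT = """\
 bind9 | 1:9.16.1-0ubuntu2            | focal           | source, amd64, arm64, armhf, i386, ppc64el, riscv64, s390x
 bind9 | 1:9.18.28-0ubuntu0.20.04.1   | focal-security  | source, amd64, arm64, armhf, i386, ppc64el, riscv64, s390x
 bind9 | 1:9.18.28-0ubuntu0.20.04.1   | focal-updates   | source, amd64, arm64, armhf, i386, ppc64el, riscv64, s390x
"""

LOOKUP_UPPER = "9.18.12"  # upper bound from the lookup row quoted earlier


def upstream_version(debian_version: str) -> str:
    """Strip the epoch ('1:') and the Debian revision ('-0ubuntu2')."""
    if ":" in debian_version:
        debian_version = debian_version.split(":", 1)[1]
    if "-" in debian_version:
        debian_version = debian_version.rsplit("-", 1)[0]
    return debian_version


def as_tuple(version: str) -> tuple[int, ...]:
    """Turn '9.18.28' into (9, 18, 28) for numeric comparison."""
    return tuple(int(part) for part in version.split("."))


# Map each pocket (focal, focal-updates, ...) to its upstream bind9 version.
versions = {}
for line in RMADISON_OUTPUT.splitlines():
    _package, version, pocket, _archs = [f.strip() for f in line.split("|")]
    versions[pocket] = upstream_version(version)

# Pockets whose bind9 is newer than the lookup's upper bound.
newer = [p for p, v in versions.items() if as_tuple(v) > as_tuple(LOOKUP_UPPER)]
print(versions)
print(newer)
```

With the output above, only the original focal pocket still falls within the lookup range. The focal-security and focal-updates pockets both ship 9.18.28.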