Most of the time COU is unable to complete the nova-compute upgrade because `ceilometer-agent-compute` is down after the upgrade

valexby commented 1 month ago

Hi,

Because of the known bug LP1947585, ceilometer-agent-compute is often down after a nova-compute release upgrade. COU fails to complete the upgrade in that case, because the final juju resume action fails.

Given that the bugfix for nova-compute wasn't backported to Ussuri-Wallaby and the nova bug hasn't had any activity for a year, maybe we could make COU start ceilometer-agent after the nova release upgrade if it is down. That is the natural thing for a human operator upgrading a cloud to do.

jneo8 commented 1 month ago

Based on LP1947585, the workaround is to run `sudo systemctl restart ceilometer-agent` if the service is not active.
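A minimal sketch of that "restart only if not active" check, in Python. This is not COU code: the function name `restart_if_inactive` and the injectable `run` callable are made up for illustration, and the exact systemd unit name is left as a parameter. It relies on `systemctl is-active --quiet`, which exits with 0 only when the unit is active.

```python
import subprocess
from typing import Callable, Sequence

# A runner takes a command and returns its exit code (injectable for testing).
Runner = Callable[[Sequence[str]], int]


def _run(cmd: Sequence[str]) -> int:
    """Run a command on the local machine and return its exit code."""
    return subprocess.run(list(cmd), check=False).returncode


def restart_if_inactive(service: str, run: Runner = _run) -> bool:
    """Restart `service` only when systemd reports it as not active.

    Returns True if a restart was issued, False if the service was
    already active. Raises RuntimeError if the restart command fails.
    """
    # `systemctl is-active --quiet` exits 0 when the unit is active.
    if run(["systemctl", "is-active", "--quiet", service]) == 0:
        return False
    exit_code = run(["sudo", "systemctl", "restart", service])
    if exit_code != 0:
        raise RuntimeError(f"restart of {service} failed with exit code {exit_code}")
    return True
```

On a hypervisor this would be called as `restart_if_inactive("ceilometer-agent-compute")`, assuming the unit name matches the service named in the issue title.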

Pjack commented 1 month ago

Comment from @valexby

Unless this is fixed on the COU side or backported on the nova-compute side, managed solutions will hit this issue roughly 500 nodes * 4 OpenStack releases = ~2000 times during future upgrades.

aieri commented 1 month ago

It looks to me like LP#1947585 was backported all the way back to Ussuri, but somehow the fix isn't working for older releases. I think we should just add the workaround within COU.

jneo8 commented 1 month ago

Implementation discussion

There is a limitation in the implementation: how can COU know whether a ceilometer-agent unit is related to nova-compute as a subordinate?

The subordinate information is lost when we transform the original juju status data into COU's Application class. This creates an awkward situation: I am not able to confirm whether a ceilometer-agent unit is on the same machine.
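For reference, here is a rough sketch of how the subordinate information could be read from the status data before it is dropped. It assumes the dictionary layout of `juju status --format json`, where each principal unit carries a `subordinates` mapping keyed by subordinate unit name. The helper name `subordinate_units` is hypothetical and is not part of COU.

```python
from __future__ import annotations

from typing import Any


def subordinate_units(
    status: dict[str, Any], principal_app: str, subordinate_app: str
) -> dict[str, list[str]]:
    """Map each unit of `principal_app` to its subordinates from `subordinate_app`.

    `status` is assumed to be the parsed output of `juju status --format json`.
    Units without a matching subordinate are left out of the result.
    """
    app = status.get("applications", {}).get(principal_app, {})
    units = app.get("units") or {}
    result: dict[str, list[str]] = {}
    for unit_name, unit in units.items():
        # Subordinate unit names look like "<application>/<number>".
        matches = sorted(
            name
            for name in (unit.get("subordinates") or {})
            if name.split("/")[0] == subordinate_app
        )
        if matches:
            result[unit_name] = matches
    return result
```

Calling `subordinate_units(status, "nova-compute", "ceilometer-agent")` would then tell us which nova-compute units have a ceilometer-agent unit on the same machine, since a subordinate always runs alongside its principal unit.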

There are two options:

1. Follow LP1947585's implementation: put the restart-service logic in the nova-compute application's post-upgrade step. Before that, we need to add the subordinate information to COU's Application class.
2. Create a new application for ceilometer-agent and include this subordinate application's upgrade step in the hypervisor. The restart logic would live in this new application.
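To make option 1 more concrete, a sketch of what the post-upgrade planning could look like once units carry their subordinates. Everything here is hypothetical: `Unit`, `PostUpgradeStep` and `plan_agent_restarts` are illustrative stand-ins, not COU's real classes, and the default service name `ceilometer-agent-compute` is taken from the issue title.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Unit:
    """Illustrative stand-in for a unit that keeps its subordinate names."""

    name: str
    machine: str
    subordinates: list[str] = field(default_factory=list)


@dataclass
class PostUpgradeStep:
    """A shell command to run on one unit after its upgrade."""

    unit: str
    description: str
    command: str


def plan_agent_restarts(
    units: list[Unit],
    subordinate_app: str = "ceilometer-agent",
    service: str = "ceilometer-agent-compute",
) -> list[PostUpgradeStep]:
    """Plan a conditional restart on every unit that hosts the subordinate."""
    steps = []
    for unit in units:
        has_agent = any(
            sub.split("/")[0] == subordinate_app for sub in unit.subordinates
        )
        if not has_agent:
            continue  # nothing to restart on this unit
        steps.append(
            PostUpgradeStep(
                unit=unit.name,
                description=f"Restart {service} on {unit.name} if it is not active",
                # Restart only when the service is not already active.
                command=(
                    f"systemctl is-active --quiet {service} "
                    f"|| sudo systemctl restart {service}"
                ),
            )
        )
    return steps
```

The point of the design is that the restart stays attached to the nova-compute units, as in the charm fix, and units without the subordinate get no extra step.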

For now I prefer option 1 because:

- Option 1 follows the behavior in the charm (restarting the service in nova-compute); see the fix on nova-compute.
- Option 2 has a lot of difficulties to overcome. The hypervisor still can't handle the subordinate application because machine information is missing from the juju status data. This requires some refactoring of the hypervisor step first.

Any feedback is welcome. I will start the implementation next Tuesday (6/11).

samuelallan72 commented 4 weeks ago

> The subordinate information is missing when we transform the original juju status data into COU's Application class.

Is anything stopping us from adding the subordinate information?

Your argument around going for option 1 makes sense. :+1: It seems a little strange to me that logic for controlling the services is spread over the machine charm and the subordinate, but that's how it is I guess. :thinking:

canonical / charmed-openstack-upgrader

Most of the time COU is unable to complete the nova-compute upgrade because `ceilometer-agent-compute` is down after the upgrade #427
