re-run of the command "deepsea stage run ceph.maintenance.upgrade" reboots even though no reboot is required anymore

SUSE / DeepSea

A collection of Salt files for deploying, managing and automating Ceph.

GNU General Public License v3.0


Issue #1270

Open Martin-Weiss opened 6 years ago

Martin-Weiss commented 6 years ago

Description of Issue/Question

When "deepsea stage run ceph.maintenance.upgrade" runs into an issue and is later re-run, the nodes in the cluster are rebooted again, even though no reboot is required because they were already rebooted during the previous run.

"deepsea stage run ceph.maintenance.upgrade" should check whether a node actually needs a reboot and reboot it only in that case. If no reboot is needed, none should be initiated.

Versions Report

0.8.5

jschmid1 commented 6 years ago

> "deepsea stage run ceph.maintenance.upgrade" should check whether a node actually needs a reboot and reboot it only in that case. If no reboot is needed, none should be initiated.

It does. It uses the same mechanism as stage.0 to check whether a node needs a reboot.

Martin-Weiss commented 6 years ago

> It does. It uses the same mechanism as stage.0 to check whether a node needs a reboot.

Then this seems to be broken. Did you test running the upgrade orchestration multiple times? In our case it reboots all the servers over and over again.

jschmid1 commented 6 years ago

Can you post the output of:

salt 'the_node_in_question' grains.get kernelrelease

and

salt 'the_node_in_question' cmd.run 'rpm -q --last kernel-default | head -1'

Martin-Weiss commented 6 years ago

salt '*' grains.get kernelrelease

   4.4.132-94.33-default

salt '*' cmd.run 'rpm -q --last kernel-default | head -1'

   kernel-default-4.4.132-94.33.1.x86_64         Wed Aug 15 08:21:14 2018

Could it be the ".1." causing the problem?

jschmid1 commented 6 years ago

Yes, it seems that way.
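
To illustrate the suspected mismatch, here is a minimal Python sketch using the two values posted above. It is not DeepSea's actual check: the function names are hypothetical, and the "tolerant" rule (accepting extra dot-separated release components such as ".1") is an assumption made only to show how the comparison could differ.

```python
# Illustrative only: not DeepSea's implementation. All names here are hypothetical.

def running_version(kernelrelease: str) -> str:
    """Drop the flavor suffix: '4.4.132-94.33-default' -> '4.4.132-94.33'."""
    return kernelrelease.rsplit("-", 1)[0]


def installed_version(rpm_line: str, package: str = "kernel-default") -> str:
    """Extract version-release from the first field of an 'rpm -q --last' line."""
    name = rpm_line.split()[0]
    prefix = package + "-"
    if not name.startswith(prefix):
        raise ValueError(f"unexpected package name: {name!r}")
    # Remove the package prefix, then the trailing architecture ('.x86_64').
    return name[len(prefix):].rsplit(".", 1)[0]


def naive_needs_reboot(kernelrelease: str, rpm_line: str) -> bool:
    """Exact comparison: any extra release component looks like a new kernel."""
    return installed_version(rpm_line) != running_version(kernelrelease)


def tolerant_needs_reboot(kernelrelease: str, rpm_line: str) -> bool:
    """Assumed rule: extra dot-separated components after the running version are accepted."""
    running = running_version(kernelrelease)
    installed = installed_version(rpm_line)
    return not (installed == running or installed.startswith(running + "."))


# Values taken from the outputs posted in this thread.
GRAIN = "4.4.132-94.33-default"
RPM_LINE = "kernel-default-4.4.132-94.33.1.x86_64         Wed Aug 15 08:21:14 2018"

print(naive_needs_reboot(GRAIN, RPM_LINE))     # True: the ".1" makes the strings differ
print(tolerant_needs_reboot(GRAIN, RPM_LINE))  # False: same kernel, no reboot needed
```

With an exact comparison, "4.4.132-94.33.1" never equals "4.4.132-94.33", so every run would conclude that a reboot is pending, which matches the behavior reported here.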

Martin-Weiss commented 6 years ago

On 15.08.2018 at 16:53, Joshua Schmid notifications@github.com wrote:

> Yes, it seems that way.

OK, can this be fixed? It seems this might also affect stage.0 for all deployments.


jschmid1 commented 6 years ago

We have an alternative implementation in master and are currently evaluating whether it is worth backporting.

jschmid1 commented 6 years ago

> We have an alternative implementation in master and are currently evaluating whether it is worth backporting.

This should be backported. Included in this commit: https://github.com/SUSE/DeepSea/commit/bb032916e22a73d7afe216f6fd114eb6b80326cc#diff-e87ed67149dac86de0d18a896ad7b87d

Increased priority.