canonical / charmed-openstack-upgrader

Automatic upgrade tool for Charmed Openstack
Apache License 2.0

Check running ceph versions #584

Closed samuelallan72 closed 3 weeks ago

samuelallan72 commented 1 month ago

Add a pre and post upgrade warning if running ceph versions are mismatched.

We can't error on these cases, because mismatched versions are expected if the upgrade is being resumed, but we can warn the user.

Fixes: #401
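For context, one way to detect such a mismatch is by parsing the JSON output of the `ceph versions` command (runnable from any ceph-mon unit), which maps daemon types to counts of running versions. A minimal sketch with hypothetical helper names; COU's actual implementation may differ:

```python
import json


def mismatched_ceph_versions(versions: dict) -> dict:
    """Return the daemon types reporting more than one running ceph version.

    `versions` is the parsed JSON output of `ceph versions`, which maps
    daemon types ("mon", "osd", "overall", ...) to a mapping of
    version-string -> daemon count.
    """
    return {
        daemon: sorted(running)
        for daemon, running in versions.items()
        if daemon != "overall" and len(running) > 1
    }


def warn_on_mismatch(raw_json: str) -> None:
    """Log a warning (rather than raising) when running versions differ."""
    mismatched = mismatched_ceph_versions(json.loads(raw_json))
    if mismatched:
        print(f"WARNING: mismatched running ceph versions: {mismatched}")
```

Since mismatched versions are expected mid-upgrade (e.g. when an upgrade is being resumed), this only warns rather than erroring, matching the PR description.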

samuelallan72 commented 1 month ago

This is currently a work in progress, but I'm interested in reviews on the approach before continuing. This is untested code that implements a basic warning before and after an upgrade when ceph-mon reports running versions of ceph that are mismatched. See also https://github.com/canonical/charmed-openstack-upgrader/issues/401#issuecomment-2401401387 .

samuelallan72 commented 1 month ago

Tests added, some bugs fixed, and properly ready for review now. :)

samuelallan72 commented 1 month ago

@gabrielcocenza thanks for your comments. There are a lot of factors to checking this, and I don't think there's a perfect solution unfortunately. I agree that false-positive warnings aren't ideal - it was a tradeoff to avoid too much extra complexity. We can definitely revisit it.

IMO this check should be a post upgrade check of ceph-osd.

Some questions to consider around just doing a post upgrade check of ceph-osd:

Maybe we could have a meeting to discuss where we want to go with this? It's not clear what we should go with, because it looks like we can go as simple or as complex as we like here.

gabrielcocenza commented 1 month ago

What about a pre-upgrade check to check everything is consistent before proceeding with the upgrade (eg. if cou was killed before all the ceph-osd upgrade steps completed)?

Hum, good point. If we add a pre-check on ceph-mon to enforce integrity, we can catch this situation. Moreover, we can halt the upgrade to force matching ceph versions, rather than just giving a warning message.

Ensuring that ceph versions are equal before upgrading ceph-mon guarantees that the cloud is in a consistent state. Checking after ceph-osd will also ensure consistency. Anything between the ceph-mon and ceph-osd upgrades shouldn't be checked.

What about the case where we have multiple ceph-osd applications related to ceph-mon?

Yeah, that is a corner case, but we can add a post-upgrade check when the upgrade targets the whole cloud or the data-plane. This way the check wouldn't be added if the upgrade targets only the control-plane or hypervisors.

This isn't specific to ceph-osd - what about checking for ceph-radosgw versions for example?

Right, that is why we should check before ceph-mon and after ceph-osd. radosgw shouldn't have different versions in those scenarios.

jneo8 commented 1 month ago

Hum, good point. If we add a pre-check on ceph-mon to enforce integrity, we can catch this situation. Moreover, we can halt the upgrade to force matching ceph versions, rather than just giving a warning message.

Ensuring that ceph versions are equal before upgrading ceph-mon guarantees that the cloud is in a consistent state. Checking after ceph-osd will also ensure consistency. Anything between the ceph-mon and ceph-osd upgrades shouldn't be checked.

Agree that we should consider the context. The problem is kind of similar to this issue: https://github.com/canonical/charmed-openstack-upgrader/issues/561

IMO, COU doesn't have a mechanism to target the correct state in the middle of an upgrade. Right now it relies on the engineer's personal experience. Adding this mechanism now would be a huge effort. I agree that too many warnings may annoy the user, but it's better than nothing. I suggest we discuss how to detect the state first, then move on to your proposal.

For this PR, I suggest we merge it first, because I can see that cloud operators would choose to see more information, including warnings, instead of nothing.

gabrielcocenza commented 1 month ago

What would be the worst case scenario? Some ceph-mon units upgrade and some don't? Then I still believe the check is useful. Instead of raising a mismatch error, it will say that you have different ceph versions. Once all ceph-mon units are on the same version, COU would go ahead to the next applications.

jneo8 commented 1 month ago

What would be the worst case scenario? Some ceph-mon units upgrade and some don't? Then I still believe the check is useful. Instead of raising a mismatch error, it will say that you have different ceph versions. Once all ceph-mon units are on the same version, COU would go ahead to the next applications.

Is the check you mention here the last step in the ceph-osd upgrade, or the verification check before any upgrade?

Pjack commented 1 month ago

Do these two solutions conflict with each other, so that we have to choose only one?

Solution A (Samuel's PR): Add a check in the sanity check to provide an earlier warning message to the user. These warnings may occasionally be false alarms, but from my perspective, this isn’t a major issue. The operators can choose to ignore it and continue the upgrade if they are confident it's not an issue.

Solution B (Samuel+Gabriel's proposal): Add a check in the pre-check/post-check to enforce consistency between ceph-mon and ceph-osd units. We need further discussion to determine if this is a reasonable constraint. For example, do we want to enforce consistency every time, whether it's caused by manual intervention or a bug somewhere? Personally, I vote 'yes', but operators may be frustrated.

If there is no strong dependency between these two solutions, why not implement Solution A first and then Solution B later? It’s better to have something than nothing.

gabrielcocenza commented 1 month ago

Is the check you mention here the last step in the ceph-osd upgrade, or the verification check before any upgrade?

I see two potential places to implement:

1 - Pre-upgrade step on ceph-mon. ceph-mon is the first application that should be upgraded, and regardless of the state of the cloud, before upgrading ceph-mon all ceph applications should be at the same ceph version. I don't see side effects in this implementation. If for some reason a ceph-mon unit doesn't upgrade, it will raise a mismatch error and manual intervention is required, as happens with any other application. After ceph-mon is upgraded you'll have a cloud that is halfway through an upgrade, and similar checks start producing false positives.

2 - General post-upgrade check that happens when upgrading the whole cloud or the data-plane. This requires a little more business logic and should happen as the last step of upgrading a cloud.

Implementing both would be double protection. E.g. a user upgrades from Ussuri -> Victoria. We would check on ceph-mon before starting the upgrade to Victoria, and then check again at the end, when all data-plane apps are upgraded to Victoria.

Having both would be nice, but having just option 1 is enough IMO. I don't see the need for major refactors, and the check is more precise, without false positives.
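Option 1 above could be sketched as a pre-upgrade step on ceph-mon that halts (raises) instead of warning. The names here are hypothetical illustrations, not COU's actual API; it assumes the parsed JSON output of `ceph versions`, whose "overall" key aggregates all daemon types:

```python
class MismatchedCephVersionsError(Exception):
    """Raised when the cloud reports more than one running ceph version."""


def verify_single_ceph_version(versions: dict) -> None:
    """Pre-upgrade check for ceph-mon: halt unless every ceph daemon in
    the cloud runs the same version.

    `versions` is the parsed JSON output of `ceph versions`; a single
    entry under "overall" means the whole cloud is consistent.
    """
    overall = versions.get("overall", {})
    if len(overall) > 1:
        raise MismatchedCephVersionsError(
            "Refusing to upgrade ceph-mon: the cloud is running multiple "
            f"ceph versions: {sorted(overall)}. Fix this before upgrading."
        )
```

Because ceph-mon is upgraded first, this check runs while the cloud is still expected to be consistent, which is why it avoids the false positives discussed above.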

samuelallan72 commented 1 month ago

Thanks for the discussion above, this has been helpful!

@gabrielcocenza

Pre-upgrade step on ceph-mon. Ceph-mon is the first application that should be upgraded...

General post-upgrade check that happens when upgrading the whole cloud or data-plane.

Agreed, these make sense and I think are doable without much refactoring.

I'll plan to make the following changes to this PR:

For later on, it would be nice to have a stand-alone method to check the cloud (eg. cou verify|lint|check, for the case where we want to check cloud consistency or suitability for upgrade without actually planning an upgrade). Then we could add a check there to warn on mismatched running ceph versions. But here I do see the downsides of a pre-plan warning, because of the high chance of false positives.

jneo8 commented 1 month ago

Thanks for clarifying, @gabrielcocenza @samuelallan72.

The only concern I have is that we're missing a case:

What happens if Ceph is already running on different versions? Should we allow the user to upgrade any applications in this scenario?

While the Pre-upgrade step on ceph-mon halts the Ceph upgrade, it doesn’t address other applications. We can't assume the user has the necessary information from the previous General post-upgrade check. If they’re unaware that Ceph is running on different versions, they might proceed with the upgrade without a warning.

Proposal:

We introduce a prompt in such cases, requiring the user to type yes to continue if the Ceph versions are inconsistent.

This wouldn't be a warning but would still provide the user with crucial information and seek explicit confirmation before proceeding.
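A minimal sketch of such a prompt, assuming the mismatch information is already collected; the `input_fn` parameter is a hypothetical hook for testability, not part of COU:

```python
def confirm_continue_despite_mismatch(mismatched: dict, input_fn=input) -> bool:
    """Ask for explicit confirmation when ceph versions are inconsistent.

    `mismatched` maps daemon types to the list of running versions.
    Returns True only if the user answers "yes" (case-insensitive);
    any other answer aborts the upgrade.
    """
    if not mismatched:
        return True  # nothing inconsistent, no prompt needed
    print(f"Ceph is running mismatched versions: {mismatched}")
    answer = input_fn("Type 'yes' to continue the upgrade anyway: ")
    return answer.strip().lower() == "yes"
```

Unlike a plain warning, this forces the user to acknowledge the inconsistency before any application is upgraded.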

samuelallan72 commented 1 month ago

Good point, I'll hold off on making changes right now.

samuelallan72 commented 4 weeks ago

@gabrielcocenza @jneo8 what do you think about going back to basics here - the original issue was that the ceph-osd running version wasn't updating (it sounds like the ceph-osd daemon wasn't being restarted). So adding a check to ceph-osd after the upgrade may be the neatest option. Thoughts?

Of course this means that cou won't re-run the verification if cou crashes and is restarted, but then we're back to the discussion here: how to implement a pre-plan check without too many false positives.

gabrielcocenza commented 4 weeks ago

What happens if Ceph is already running on different versions? Should we allow the user to upgrade any applications in this scenario? While the Pre-upgrade step on ceph-mon halts the Ceph upgrade, it doesn’t address other applications. We can't assume the user has the necessary information from the previous General post-upgrade check. If they’re unaware that Ceph is running on different versions, they might proceed with the upgrade without a warning.

I didn't understand this. If we halt on ceph-mon, how is the user going to proceed?

ceph-mon is the first application to upgrade, so if the user finds that ceph is running different versions, COU will block upgrades until this is fixed. What is the issue here?

gabrielcocenza commented 4 weeks ago

So adding a check to ceph-osd after the upgrade may be the neatest option. Thoughts?

It's not just on ceph-osd that I found this issue; it can also happen on ceph-radosgw.

I'm not following the concerns about halting the upgrade on ceph-mon if we have different ceph versions.