confluentinc / cp-ansible

Ansible playbooks for the Confluent Platform
Apache License 2.0
41 stars 405 forks source link

Suggestion: Support upgrade in the desired state playbooks #591

Closed erikgb closed 3 years ago

erikgb commented 3 years ago

The current imperative approach to upgrade Kafka (upgrade_*.yml and upgrade_kafka_broker_log_format.yml) does not fit very well in our GitOps setup. I would prefer if the upgrade could be handled in a more declarative manner - by the desired state playbooks.

First a little bit about our GitOps setup:

What I would like to see, is that to upgrade a Kafka environment, I ideally just had to do two things:

  1. Prepare the environment inventory for the new cp-ansible version.
  2. Change our import_playbook tasks to refer to the desired state playbooks in the new cp-ansible version submodule (or Ansible collection).

Looking at the current upgrade playbooks in cp-ansible and the Kafka upgrade documentation, I just see a few delicate matters:

Disclaimer: I am not a Kafka expert, so this might be more complicated than I imagine. 😄

My suggestion:

domenicbove commented 3 years ago

@erikgb As always, you have great suggestions! I've been thinking the same thing- what makes the upgrade different than just a rolling reconfiguration?

Is your idea to get rid of the upgrade playbooks altogether? Technically, right now, you could do an upgrade by just rerunning the install over and over (because we have rolling reconfiguration). Like so:

Step 1 Base installation, set these vars:

confluent_package_version: 6.0.2
kafka_broker_custom_properties:
  log.message.format.version: 2.6
  inter.broker.protocol.version: 2.6

Step 2: Upgrade packages, set these vars

confluent_package_version: 6.1.0
kafka_broker_custom_properties:
  log.message.format.version: 2.6
  inter.broker.protocol.version: 2.6

Step 3: Update the log message prop

confluent_package_version: 6.1.0
kafka_broker_custom_properties:
  log.message.format.version: 2.6
  inter.broker.protocol.version: 2.7

When clients are ready update log.message.format.version

confluent_package_version: 6.1.0
kafka_broker_custom_properties:
  log.message.format.version: 2.7
  inter.broker.protocol.version: 2.7
erikgb commented 3 years ago

@domenicbove Yes, I think as long as an upgrade does not require any special handling, I think it should be included in the (normal) desired state playbooks. New major releases, like the forthcoming elimination of Zookeeper, might require some additional "migration" playbooks to be run. But Ansible being a desired state framework, I think as much as possible should be declaratively expressed!

What you say might be teoretically possible is extremely interesting for us! I am going to try it out as soon as possible! 👍 Taking control of the confluent_package_version, kafka_broker_custom_properties.log.message.format.version and kafka_broker_custom_properties.inter.broker.protocol.version variables in our inventory makes a lot of sense to us. I will let you know how it goes, and try to descibe what I did. Might be valuable input to some future updated upgrade documentation....

I would also like to see some tests of an upgrade "run" - at least to verify upgrade from the previous version. But as long as https://github.com/confluentinc/cp-ansible/issues/551 remains open, which has been bugging me too, I do not see how I can eventually contribute to that. Such a test might give valuable input to a "migration guide" for cp-ansible versions.

erikgb commented 3 years ago

@domenicbove We have now successfully upgraded from CP 6.0 to 6.1 in all our clusters, just using the desired state playbooks. Here is a copy of our "upgrade notes" taken along the way. It might represent valid input to the eventual desired-state-upgrade.

  1. Use variables to lock to old version

    confluent_package_version: 6.0.2
    confluent_repo_version: 6.0
    kafka_broker_custom_properties:
      log.message.format.version: 2.6
      inter.broker.protocol.version: 2.6
  2. Switch to new version branch of cp-ansible

  3. Issue with initial serial deployment strategy. The task 'Get Controller Broker ID' is failing because of missing {{kafka_broker.zookeeper_tls_client_config_file}} file. Had to run once with parallel deployment strategy.

  4. Use variables to upgrade to new version. Set certs_updated to force a (rolling) restart. If not set, the packages are updated, but no restart of services.

    confluent_package_version: 6.1.0
    confluent_repo_version: 6.1
    certs_updated: true

    New package repo for new version might require us to set allow_failure: true on the check mode job.

  5. Use variables to upgrade to new inter broker protocol version. Unset/remove certs_updated to avoid forcing a (rolling) restart. A restart of brokers will be triggered by the change in protocol version.

    kafka_broker_custom_properties:
      inter.broker.protocol.version: 2.7
    certs_updated: false

Next (and last) step is to set log.message.format.version: 2.7, but we want to see how the upgrade affected our clients first.

domenicbove commented 3 years ago

Hey all of the issues sound like they can be handled with small code changes!

To avoid the need for " Set certs_updated to force restart", we should put a handler on the yum/apt tasks... need to think about the tar install. could do a similar thing like set another force restart var.

A customer recently brought up the same complaint about the upgrades not being idempotent so I appreciate the testing! As far as gitops goes this is far superior. It does make the upgrades more of an involved process... but sometimes it feels better to be involved in the upgrades, then to let a long script run and run. Im questioning what to do about the current upgrade playbooks.

erikgb commented 3 years ago

Hey all of the issues sound like they can be handled with small code changes!

@domenicbove That sounds good! What do you think is the best way to approach the necessary changes? Since it might involve adding new variables, I would prefer if you could suggest the changes, and I can review and test. WDYT?

A customer recently brought up the same complaint about the upgrades not being idempotent so I appreciate the testing! As far as gitops goes this is far superior. It does make the upgrades more of an involved process... but sometimes it feels better to be involved in the upgrades, then to let a long script run and run. Im questioning what to do about the current upgrade playbooks.

I think it should be possible to retain the upgrade playbooks with same/similiar functionality as now, but I would prefer to reuse as much as possible from the desired state upgrade path. That will allow both approaches to be supported - at least for some time.

Another update: We are now trying to upgrade our 5.4 clusters to 6.1, and that faced me with an additional challenge. The list of packages have changed between these two versions, and there is currently no way to override the package list variables in your inventory - since they are defined as role variables and not role variable defaults. Is that intentionally? Since a user eventually can override the variables confluent_package_version and confluent_repo_version, it might make sense to override the packages lists as well. At least if you know what you are doing.... WDYT?

domenicbove commented 3 years ago

@erikgb You are right that the package lists are not configurable, but can't you just switch cp-ansible branches to gain the correct list?

Another aspect of the package upgrades is potentially removing packages before installing the new ones: https://github.com/confluentinc/cp-ansible/blob/6.1.0-post/upgrade_kafka_broker.yml#L231

The confluent-kafka package changed names at some point. I could see a task file packages.yml within each component role that first queries what packages are currently installed, and then properly installs the packages (as needed) and makes sure to trigger the restart handler as needed.

Also now that i think of it if you are switching one tar version to another then the override.conf file will be updated- triggering the restart handler.

erikgb commented 3 years ago

@domenicbove What I would like to do, is to switch to a newer version (branch) of cp-ansible before I go for the newer version of Confluent Platform (controlled by my inventory). In some cases, like when upgrading from 5.4 to 6.1, that would not possible because of the package lists. Of course, it is a bit risky to have different versions of cp-ansible and CP, but that is just a temporary situation. That we could, and probably should, avoid....

domenicbove commented 3 years ago

Well you aren't the only one requesting that we make that list configurable. But in your use case you could also use --skip-tags package... Would that work?

I don't see too big of a deal making that list configurable though to appease peoples requests. I initially did not want to let people mess with it. Another idea is to keep lists that match each confluent_package_version.

erikgb commented 3 years ago

Well you aren't the only one requesting that we make that list configurable. But in your use case you could also use --skip-tags package... Would that work?

I would like to avoid additional command-line options for running the playbooks. It is hard enough already to manage this "beast" with GitOps. 😄 IMHO most things should be possible to configure in the inventory. I think that is the cleanest solution.

I don't see too big of a deal making that list configurable though to appease peoples requests. I initially did not want to let people mess with it. Another idea is to keep lists that match each confluent_package_version.

If you do not want to expose the package lists as variables that can be overridden, maintaining the CP version (or package repo version) to package lists could be a working option. But I would not perfer that - if I were a maintainer in this project.... May I ask why you seem relucutant to allow users to override this list? As written in my comment above:

.... At least if you know what you are doing....

domenicbove commented 3 years ago

@erikgb I just created this pr for you: https://github.com/confluentinc/cp-ansible/pull/607/files

I tested a 5.4.0 -> 6.1.0 upgrade by

  1. First installing 5.4.0
  2. Checking out to 6.1.x and ran the install with these vars

        deployment_strategy: rolling
        kafka_broker_custom_properties:
          log.message.format.version: 2.4
          inter.broker.protocol.version: 2.4
    
        kafka_broker_packages:
          - confluent-common
          - confluent-rest-utils
          - confluent-metadata-service
          - confluent-server
          - confluent-rebalancer
          - confluent-security
    
        confluent_package_version: 5.4.0
        confluent_repo_version: 5.4
        kafka_broker_rest_proxy_enabled: false
  3. Then I removed these vars and ran the install to update the packages. It also turned on kafka's rest api

        kafka_broker_packages:
          - confluent-common
          - confluent-rest-utils
          - confluent-metadata-service
          - confluent-server
          - confluent-rebalancer
          - confluent-security
    
        confluent_package_version: 5.4.0
        confluent_repo_version: 5.4
        kafka_broker_rest_proxy_enabled: false
  4. Finally I updated inter.broker.protocol.version and ran the install
        kafka_broker_custom_properties:
          log.message.format.version: 2.4
          inter.broker.protocol.version: 2.7

Thinking about it more, i think step 2 and 3 can be combined. You can update the packages, but leave those protocols at 2.4 all with one rolling restart.

LMK what you think of the PR.

The one thing to mention is when you checkout to 6.1.x and run the install, you get the 6.1.x properties/logic not the 5.4.0 (even if you pin the packages to a 5.4.0 list). So your config files will get updated. I think bc of this its probably best to combine step 2 and 3. IE update packages to 6.1.0 and update configs, but lock the protocol versions to 5.4.0 in same rolling restart

erikgb commented 3 years ago

Thanks a lot, @domenicbove! Best support ever! 😄 I will review the PR over the weekend and try out a few variations!

erikgb commented 3 years ago

Resolved by https://github.com/confluentinc/cp-ansible/pull/611