confluentinc / cp-ansible

Ansible playbooks for the Confluent Platform
Apache License 2.0

Play must fail if URP health check fails #1041

Closed by erikgb 2 years ago

erikgb commented 2 years ago

Describe the issue

A mistake in our inventory caused the URP (under-replicated partitions) health check to fail after a broker restart. Although the task itself failed, the logic in the tasks that follow appears to be wrong: the failure was ignored and the play was allowed to continue. Luckily we were watching the pipeline while it ran and managed to cancel the play. Otherwise, I suspect this bug would have brought down the whole cluster, one broker at a time.

This is the log for the incident:

RUNNING HANDLER [confluent.platform.kafka_broker : Restart Kafka] **************
changed: [REDACTED]
RUNNING HANDLER [confluent.platform.kafka_broker : Startup Delay] **************
ok: [REDACTED]
TASK [confluent.platform.kafka_broker : Encrypt secrets] ***********************
skipping: [REDACTED]
TASK [confluent.platform.kafka_broker : Start kafka] ***************************
ok: [REDACTED]
TASK [confluent.platform.kafka_broker : Wait for health checks to complete] ****
included: /home/ansible/.ansible/collections/ansible_collections/confluent/platform/roles/kafka_broker/tasks/health_check.yml for REDACTED
TASK [confluent.platform.kafka_broker : Get Topics with UnderReplicatedPartitions] ***
FAILED - RETRYING: [REDACTED]: Get Topics with UnderReplicatedPartitions (15 retries left).
FAILED - RETRYING: [REDACTED]: Get Topics with UnderReplicatedPartitions (14 retries left).
FAILED - RETRYING: [REDACTED]: Get Topics with UnderReplicatedPartitions (13 retries left).
FAILED - RETRYING: [REDACTED]: Get Topics with UnderReplicatedPartitions (12 retries left).
FAILED - RETRYING: [REDACTED]: Get Topics with UnderReplicatedPartitions (11 retries left).
FAILED - RETRYING: [REDACTED]: Get Topics with UnderReplicatedPartitions (10 retries left).
FAILED - RETRYING: [REDACTED]: Get Topics with UnderReplicatedPartitions (9 retries left).
FAILED - RETRYING: [REDACTED]: Get Topics with UnderReplicatedPartitions (8 retries left).
FAILED - RETRYING: [REDACTED]: Get Topics with UnderReplicatedPartitions (7 retries left).
FAILED - RETRYING: [REDACTED]: Get Topics with UnderReplicatedPartitions (6 retries left).
FAILED - RETRYING: [REDACTED]: Get Topics with UnderReplicatedPartitions (5 retries left).
FAILED - RETRYING: [REDACTED]: Get Topics with UnderReplicatedPartitions (4 retries left).
FAILED - RETRYING: [REDACTED]: Get Topics with UnderReplicatedPartitions (3 retries left).
FAILED - RETRYING: [REDACTED]: Get Topics with UnderReplicatedPartitions (2 retries left).
FAILED - RETRYING: [REDACTED]: Get Topics with UnderReplicatedPartitions (1 retries left).
fatal: [REDACTED]: FAILED! => changed=false 
  attempts: 15
  cmd: |-
    /usr/bin/kafka-topics --bootstrap-server REDACTED:9091  --describe --under-replicated-partitions --command-config /etc/kafka/client.properties
  delta: '0:00:01.884496'
  end: '2022-05-09 11:15:21.421418'
  msg: ''
  rc: 0
  start: '2022-05-09 11:15:19.536922'
  stderr: ''
  stderr_lines: <omitted>
  stdout: |2-
            Topic: app_strm_topic_with_zk_acls      Partition: 0    Leader: none    Replicas: 1,2   Isr: 2
            Topic: app_strm_system_test_connect_smoketest_status    Partition: 0    Leader: none    Replicas: 2,3   Isr: 3
            Topic: app_strm_system_test_connect_smoketest_status    Partition: 1    Leader: none    Replicas: 3,1   Isr: 3
            Topic: app_strm_system_test_connect_smoketest_status    Partition: 2    Leader: none    Replicas: 1,2   Isr: 2
            Topic: app_strm_system_test_connect_smoketest_bookmarks Partition: 0    Leader: none    Replicas: 2,3   Isr: 3
            Topic: app_strm_system_test_connect_smoketest_configs   Partition: 0    Leader: none    Replicas: 1,2   Isr: 2
            Topic: app_strm_system_test_connect_smoketest_offsets   Partition: 0    Leader: none    Replicas: 3,1   Isr: 3
            Topic: app_strm_system_test_connect_smoketest_offsets   Partition: 1    Leader: none    Replicas: 1,2   Isr: 2
            Topic: app_strm_system_test_connect_smoketest_offsets   Partition: 2    Leader: none    Replicas: 2,3   Isr: 3
            Topic: app_strm_system_test_topic_A     Partition: 0    Leader: none    Replicas: 3,2   Isr: 3
            Topic: app_strm_system_test_no_one_authorized   Partition: 0    Leader: none    Replicas: 1,3   Isr: 3
            Topic: operations_quality_simple_kafka_client_test_topic_avro_v2        Partition: 0    Leader: none    Replicas: 2,1   Isr: 2
            Topic: app_strm_system_test_topic_B     Partition: 0    Leader: none    Replicas: 2,1   Isr: 2
            Topic: app_strm_system_test_stream_output       Partition: 0    Leader: none    Replicas: 1,2   Isr: 2
            Topic: app_strm_demo_topic      Partition: 0    Leader: none    Replicas: 1,2   Isr: 2
            Topic: __consumer_offsets       Partition: 0    Leader: none    Replicas: 2,3,1 Isr: 3
            Topic: __consumer_offsets       Partition: 1    Leader: none    Replicas: 3,1,2 Isr: 3
            Topic: __consumer_offsets       Partition: 2    Leader: none    Replicas: 1,2,3 Isr: 3
            Topic: __consumer_offsets       Partition: 3    Leader: none    Replicas: 2,1,3 Isr: 3
            Topic: __consumer_offsets       Partition: 4    Leader: none    Replicas: 3,2,1 Isr: 3
            Topic: __consumer_offsets       Partition: 5    Leader: none    Replicas: 1,3,2 Isr: 3
            Topic: __consumer_offsets       Partition: 6    Leader: none    Replicas: 2,3,1 Isr: 3
            Topic: __consumer_offsets       Partition: 7    Leader: none    Replicas: 3,1,2 Isr: 3
            Topic: __consumer_offsets       Partition: 8    Leader: none    Replicas: 1,2,3 Isr: 3
            Topic: __consumer_offsets       Partition: 9    Leader: none    Replicas: 2,1,3 Isr: 3
            Topic: __consumer_offsets       Partition: 10   Leader: none    Replicas: 3,2,1 Isr: 3
            Topic: __consumer_offsets       Partition: 11   Leader: none    Replicas: 1,3,2 Isr: 3
            Topic: __consumer_offsets       Partition: 12   Leader: none    Replicas: 2,3,1 Isr: 3
            Topic: __consumer_offsets       Partition: 13   Leader: none    Replicas: 3,1,2 Isr: 3
            Topic: __consumer_offsets       Partition: 14   Leader: none    Replicas: 1,2,3 Isr: 3
            Topic: __consumer_offsets       Partition: 15   Leader: none    Replicas: 2,1,3 Isr: 3
            Topic: __consumer_offsets       Partition: 16   Leader: none    Replicas: 3,2,1 Isr: 3
            Topic: __consumer_offsets       Partition: 17   Leader: none    Replicas: 1,3,2 Isr: 3
            Topic: __consumer_offsets       Partition: 18   Leader: none    Replicas: 2,3,1 Isr: 3
            Topic: __consumer_offsets       Partition: 19   Leader: none    Replicas: 3,1,2 Isr: 3
            Topic: __consumer_offsets       Partition: 20   Leader: none    Replicas: 1,2,3 Isr: 3
            Topic: __consumer_offsets       Partition: 21   Leader: none    Replicas: 2,1,3 Isr: 3
            Topic: __consumer_offsets       Partition: 22   Leader: none    Replicas: 3,2,1 Isr: 3
            Topic: __consumer_offsets       Partition: 23   Leader: none    Replicas: 1,3,2 Isr: 3
            Topic: __consumer_offsets       Partition: 24   Leader: none    Replicas: 2,3,1 Isr: 3
            Topic: __consumer_offsets       Partition: 25   Leader: none    Replicas: 3,1,2 Isr: 3
            Topic: __consumer_offsets       Partition: 26   Leader: none    Replicas: 1,2,3 Isr: 3
            Topic: __consumer_offsets       Partition: 27   Leader: none    Replicas: 2,1,3 Isr: 3
            Topic: __consumer_offsets       Partition: 28   Leader: none    Replicas: 3,2,1 Isr: 3
            Topic: __consumer_offsets       Partition: 29   Leader: none    Replicas: 1,3,2 Isr: 3
            Topic: __consumer_offsets       Partition: 30   Leader: none    Replicas: 2,3,1 Isr: 3
            Topic: __consumer_offsets       Partition: 31   Leader: none    Replicas: 3,1,2 Isr: 3
            Topic: __consumer_offsets       Partition: 32   Leader: none    Replicas: 1,2,3 Isr: 3
            Topic: __consumer_offsets       Partition: 33   Leader: none    Replicas: 2,1,3 Isr: 3
            Topic: __consumer_offsets       Partition: 34   Leader: none    Replicas: 3,2,1 Isr: 3
            Topic: __consumer_offsets       Partition: 35   Leader: none    Replicas: 1,3,2 Isr: 3
            Topic: __consumer_offsets       Partition: 36   Leader: none    Replicas: 2,3,1 Isr: 3
            Topic: __consumer_offsets       Partition: 37   Leader: none    Replicas: 3,1,2 Isr: 3
            Topic: __consumer_offsets       Partition: 38   Leader: none    Replicas: 1,2,3 Isr: 3
            Topic: __consumer_offsets       Partition: 39   Leader: none    Replicas: 2,1,3 Isr: 3
            Topic: __consumer_offsets       Partition: 40   Leader: none    Replicas: 3,2,1 Isr: 3
            Topic: __consumer_offsets       Partition: 41   Leader: none    Replicas: 1,3,2 Isr: 3
            Topic: __consumer_offsets       Partition: 42   Leader: none    Replicas: 2,3,1 Isr: 3
            Topic: __consumer_offsets       Partition: 43   Leader: none    Replicas: 3,1,2 Isr: 3
            Topic: __consumer_offsets       Partition: 44   Leader: none    Replicas: 1,2,3 Isr: 3
            Topic: __consumer_offsets       Partition: 45   Leader: none    Replicas: 2,1,3 Isr: 3
            Topic: __consumer_offsets       Partition: 46   Leader: none    Replicas: 3,2,1 Isr: 3
            Topic: __consumer_offsets       Partition: 47   Leader: none    Replicas: 1,3,2 Isr: 3
            Topic: __consumer_offsets       Partition: 48   Leader: none    Replicas: 2,3,1 Isr: 3
            Topic: __consumer_offsets       Partition: 49   Leader: none    Replicas: 3,1,2 Isr: 3
  stdout_lines: <omitted>
...ignoring
TASK [confluent.platform.kafka_broker : Get Topics with UnderReplicatedPartitions with Secrets Protection enabled] ***
skipping: [REDACTED]
TASK [confluent.platform.kafka_broker : Wait for Metadata Service to start] ****
skipping: [REDACTED]
TASK [confluent.platform.kafka_broker : Wait for Embedded Rest Proxy to start] ***
skipping: [REDACTED]
TASK [confluent.platform.kafka_broker : Fetch Files for Debugging Failure] *****
skipping: [REDACTED]
TASK [confluent.platform.kafka_broker : Fail Provisioning] *********************
skipping: [REDACTED]

To Reproduce

Introduce an error in the inventory that causes the URP health check to fail. Observe that the play continues with the next broker host in the cluster, and that the Fail Provisioning task is skipped.

Expected behaviour

The play should fail on the first broker that does not pass the URP health check, by running the Fail Provisioning task.
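As an illustration of the expected behaviour, here is a minimal sketch of a health check whose ignored failure is still turned into a hard failure afterwards. It is not the actual cp-ansible source: the variable name `urp_check`, the use of `inventory_hostname` as the bootstrap host, the `delay` value and the failure message are assumptions. The command line, port and retry count are taken from the log above.

```yaml
# Hypothetical sketch, not the cp-ansible implementation.
- name: Get Topics with UnderReplicatedPartitions
  ansible.builtin.command: >-
    /usr/bin/kafka-topics
    --bootstrap-server {{ inventory_hostname }}:9091
    --describe --under-replicated-partitions
    --command-config /etc/kafka/client.properties
  register: urp_check            # assumed variable name
  # Healthy only if the command succeeded AND listed no partitions.
  until: urp_check.rc == 0 and urp_check.stdout | length == 0
  retries: 15
  delay: 5                       # assumed; the log does not show the delay
  changed_when: false
  ignore_errors: true            # failure is recorded but not fatal yet

- name: Fail Provisioning
  ansible.builtin.fail:
    msg: "URP health check did not pass on {{ inventory_hostname }}."
  # The registered result stays marked as failed even when the error is
  # ignored, so this condition is true after the retries are exhausted.
  when: urp_check is failed
```

Note that in the log the command returns `rc: 0` with a non-empty `stdout`, so the check has to look at the output and not only at the return code.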

Inventory File

Can provide details on request, but I expect it to be less relevant for this bug. Cluster: 3 zookeepers and 3 brokers running on dedicated hosts, serial deployment strategy.
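For context, a serial deployment at the play level might look roughly like the sketch below. The play name is made up and the host group name is an assumption; only the role name is taken from the log above.

```yaml
# Hypothetical play, shown only to illustrate the serial strategy.
- name: Kafka Broker rolling restart
  hosts: kafka_broker            # assumed inventory group name
  serial: 1                      # handle one broker at a time
  any_errors_fatal: true         # a failed task ends the play for all hosts
  tasks:
    - name: Apply the broker role
      ansible.builtin.import_role:
        name: confluent.platform.kafka_broker
```

With `serial: 1`, a host that really fails stops the play before the next broker is touched. Neither `serial` nor `any_errors_fatal` helps, however, when the failing task is marked with `ignore_errors`, because Ansible then does not count the host as failed. That matches the `...ignoring` line in the log.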

Environment (please complete the following information):

ansible@28609a018d8d:~$ ansible --version
/usr/local/lib/python3.10/site-packages/paramiko/transport.py:236: CryptographyDeprecationWarning: Blowfish has been deprecated
  "class": algorithms.Blowfish,
ansible [core 2.12.5]
  config file = None
  configured module search path = ['/home/ansible/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/local/lib/python3.10/site-packages/ansible
  ansible collection location = /home/ansible/.ansible/collections:/usr/share/ansible/collections
  executable location = /usr/local/bin/ansible
  python version = 3.10.4 (main, Apr 20 2022, 18:21:23) [GCC 10.2.1 20210110]

Additional context

I suspect the issue is related to this logical expression. Why are we ignoring the failure and delaying the point at which the play errors out? Would it be possible to use a handler to error out instead, to avoid duplicating the code?
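One way to keep the failure handling in a single place is `block`/`rescue`, which is a different technique from the handler suggested above. The sketch below is hypothetical and not taken from the role: the task names and the included file name mirror the log, while the `debug` placeholder and the message text are assumptions.

```yaml
# Hypothetical sketch using block/rescue instead of a handler.
- name: Wait for health checks to complete
  block:
    - name: Run health checks
      ansible.builtin.include_tasks: health_check.yml
  rescue:
    # Runs only if a task inside the block fails without being ignored.
    - name: Fetch Files for Debugging Failure
      ansible.builtin.debug:
        msg: "Placeholder: collect logs and configuration files here."

    - name: Fail Provisioning
      ansible.builtin.fail:
        msg: "Health checks failed on {{ inventory_hostname }}."
```

This only works if the tasks in `health_check.yml` do not set `ignore_errors`, since an ignored error never triggers the `rescue` section.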

CC: @nsharma-git @domenicbove

nsharma-git commented 2 years ago

This seems to be a duplicate of https://github.com/confluentinc/cp-ansible/issues/954. Please let me know if that's not the case.