Seagate / cortx-ha

CORTX ha (High-Availability) is responsible for ensuring that the CORTX solution remains available in case of hardware component or software service failures. It handles the failover/failback control flow for affected services and stabilizes them across the CORTX cluster.
https://github.com/Seagate/cortx
GNU Affero General Public License v3.0

CORTX-33787: [v0.9.0][2.0.0-880] Kafka errors UNKNOWN_TOPIC_OR_PART during build deployment #721

Closed Madhura-08 closed 2 years ago

Madhura-08 commented 2 years ago

Problem: HA mini provisioning fails because it cannot connect to Kafka. This happens because the Kafka and HA pods are deployed simultaneously. In that situation HA needs to reconnect/retry, but it does not. As a result, the Kafka topics, Consul keys, and other keys are not created, so the HA pod is running but not functional.

Solution: To reconnect/retry, the init container needs to be restarted, because mini provisioning is executed as part of the init container. For the init container to restart, a proper failure code must be returned to the caller. To achieve this, the exception needs to be re-raised to the caller, where returning the error code is already handled.

Signed-off-by: Madhura Mande <madhura.mande@seagate.com>

Problem Statement

https://jts.seagate.com/browse/CORTX-33787

Design

HA and the third-party Kafka pods are now deployed simultaneously. HA connects to Kafka during its init stage (mini provisioning) to create topics. When HA tried to connect, Kafka was running but not yet ready to serve requests, so HA failed at the mini provisioning stage and did not create the Consul keys. To recover, the init container needs to be restarted so that the Kafka connection is retried. An init container is normally meant to run only once and is restarted only if it fails, and a failure is propagated to the caller as a return code. On the HA side, the exception was not re-raised, so the return code was always 0 and the init container was never restarted. The fix is therefore to re-raise the exception and handle the return code properly.
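The difference between swallowing and re-raising the exception can be sketched as follows. This is a minimal illustration, not the actual cortx-ha code: `KafkaNotReadyError`, `create_topics`, `config_swallowing`, `config_reraising`, and `main` are hypothetical names, and the Kafka call is replaced by a stub that fails when Kafka is not ready.

```python
import sys


class KafkaNotReadyError(Exception):
    """Hypothetical error standing in for a failed Kafka connection."""


def create_topics(kafka_ready: bool) -> None:
    # Hypothetical stand-in for the topic-creation step of mini provisioning.
    if not kafka_ready:
        raise KafkaNotReadyError("UNKNOWN_TOPIC_OR_PART")


def config_swallowing(kafka_ready: bool) -> None:
    # Behaviour described as the bug: the error is logged and then dropped,
    # so the caller never learns that the step failed.
    try:
        create_topics(kafka_ready)
    except Exception as err:
        print(f"config failed: {err}", file=sys.stderr)


def config_reraising(kafka_ready: bool) -> None:
    # Behaviour described as the fix: log the error, then re-raise it.
    try:
        create_topics(kafka_ready)
    except Exception as err:
        print(f"config failed: {err}", file=sys.stderr)
        raise


def main(step, kafka_ready: bool) -> int:
    # The caller maps any exception to a non-zero return code.
    try:
        step(kafka_ready)
    except Exception:
        return 1
    return 0
```

With Kafka not ready, `main(config_swallowing, False)` returns 0 while `main(config_reraising, False)` returns 1. If a script passes that value to `sys.exit`, the init container exits with a non-zero code, which is what allows the container runtime to restart it and retry the Kafka connection.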

Coding

Testing

Review Checklist

Documentation

Checklist for Author

mssawant commented 2 years ago

@Madhura-08, the following rules are followed for Hare commit messages:

CORTX-33787: [v0.9.0][2.0.0-880] Kafka errors UNKNOWN_TOPIC_OR_PART during build deployment

- re-raise the exception in order to properly propagate the script return code
  to caller

Signed-off-by: Madhura Mande <madhura.mande@seagate.com>
  1. Keep the summary line short (80 columns).
  2. It's good to describe the problem a bit. From the commit message I cannot tell what the problem was or why we are implementing the fix.
  3. It's good to set the solution apart with a Solution tag.

Please consider the following commit message format:

CORTX-33787: ha deployment fails due to kafka errors

<Describe the problem, e.g. unknown topic exception not handled>

Solution:
<Describe the solution, mainly how solution fixes the problem>

Signed-off-by: Madhura Mande <madhura.mande@seagate.com>
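The mechanical parts of these rules could be checked with a small script along the following lines. This is only a sketch: `check_commit_message` is a hypothetical helper, not an existing Hare or CORTX tool, and treating `Signed-off-by:` as required is an assumption based on the suggested format rather than on one of the numbered rules. Whether the problem is described well enough still needs a human reviewer.

```python
def check_commit_message(message: str, max_summary_cols: int = 80) -> list:
    """Return a list of problems found in a commit message (hypothetical helper)."""
    problems = []
    lines = message.strip().splitlines()
    summary = lines[0] if lines else ""

    # Rule 1: keep the summary line short.
    if len(summary) > max_summary_cols:
        problems.append("summary line longer than %d columns" % max_summary_cols)

    # Rule 3: the solution is set apart with a "Solution:" tag on its own line.
    if not any(line.strip() == "Solution:" for line in lines):
        problems.append("missing 'Solution:' tag")

    # Assumption: a Signed-off-by trailer is expected, as in the suggested format.
    if not any(line.startswith("Signed-off-by:") for line in lines):
        problems.append("missing 'Signed-off-by:' trailer")

    return problems
```

Run against the original commit message, this would flag the summary line, which exceeds 80 columns, and the missing Solution tag.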