…uring build deployment

Problem: HA mini provisioning fails because of the Kafka connection. The Kafka and HA pods are deployed simultaneously, so HA has to retry the connection, but no retry happens. As a result, the Kafka topics, Consul keys, and other keys are not created, and the HA pod is running but not functional.

Solution: Mini provisioning runs as part of the init container, so a retry requires the init container to restart. The init container restarts only if a proper failure code is returned to the caller. The fix re-raises the exception to the caller, which already handles returning the error code.

Signed-off-by: Madhura Mande <madhura.mande@seagate.com>

Problem Statement

https://jts.seagate.com/browse/CORTX-33787

Design

HA and the third-party Kafka pods are now deployed simultaneously. HA connects to Kafka during its init stage (mini provisioning) to create topics. When HA tries to connect, Kafka is running but not yet ready to serve, so HA fails at the mini provisioning stage and does not create the Consul keys. The init container needs to be restarted so that the Kafka connection is retried. An init container is normally executed only once and is restarted only if a failure occurs, and the failure is propagated through the return code. On the HA side, the exception was not re-raised, so the return code was always 0 and the init container was never restarted. The fix is to re-raise the exception and handle the return code properly.
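The failure propagation described above can be sketched roughly as follows. This is a minimal illustration, not the actual cortx-ha code: `KafkaNotReadyError`, `create_topics`, `mini_provision`, `main`, and the `kafka_ready` flag are all hypothetical names introduced here, and the real topic-creation call is replaced by a placeholder.

```python
import logging

log = logging.getLogger("ha_mini_provision_sketch")


class KafkaNotReadyError(Exception):
    """Hypothetical error standing in for a failed Kafka connection."""


def create_topics(kafka_ready: bool) -> None:
    # Placeholder for the real topic-creation step (assumption).
    if not kafka_ready:
        raise KafkaNotReadyError("Kafka is running but not ready to serve")


def mini_provision(kafka_ready: bool) -> None:
    try:
        create_topics(kafka_ready)
    except Exception as err:
        log.error("Mini provisioning failed: %s", err)
        # Re-raise instead of swallowing the error, so the caller
        # can translate the failure into a non-zero return code.
        raise


def main(kafka_ready: bool) -> int:
    # The caller maps any exception to a non-zero return code.
    # A real entry point would pass this value to sys.exit().
    try:
        mini_provision(kafka_ready)
    except Exception:
        return 1
    return 0
```

If the `raise` in `mini_provision` were missing, `main` would return 0 even when topic creation failed, which matches the behavior described above where the init container never restarted.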

Coding

[x] Coding conventions are followed and code is consistent

Testing

[ ] Unit and System Tests are added
[ ] Test Cases cover Happy Path, Non-Happy Path and Scalability
[x] Testing was performed with RPM https://jts.seagate.com/secure/attachment/532015/CORTX-33787_test_results.txt

Review Checklist

[x] PR is self reviewed
[x] JIRA number/GitHub Issue added to PR
[x] Jira and state/status is updated and JIRA is updated with PR link
[ ] Check if the description is clear and explained
[ ] Is there a change in filename/package/module or signature? [Y/N]:
[ ] If yes for above point, is a notification sent to all other cortx components? [Y/N]
[ ] Side effects on other features (deployment/upgrade)? [Y/N]
[ ] Dependencies on other component(s)? [Y/N]
If yes for above point, post link to the corresponding PR.

Review Checklist

[ ] Is perfline test run and the report with and without the changes updated in the PR? [Y/N]:

Documentation

Checklist for Author

[ ] Changes done to WIKI / Confluence page / Quick Start Guide

@Madhura-08, the following rules are followed in Hare commit messages.

CORTX-33787: [v0.9.0][2.0.0-880] Kafka errors UNKNOWN_TOPIC_OR_PART during build deployment

- re-raise the exception in order to properly propagate the script return code
  to caller

Signed-off-by: Madhura Mande <madhura.mande@seagate.com>

Keep the summary line short (80 cols).
It's good to describe the problem a bit. From the commit message I cannot tell what the problem was or why the fix is being implemented.
It's good to separate the solution with a Solution tag.

Please consider the following commit message format:

CORTX-33787: ha deployment fails due to kafka errors

<Describe the problem, e.g. unknown topic exception not handled>

Solution:
<Describe the solution, mainly how solution fixes the problem>

Signed-off-by: Madhura Mande <madhura.mande@seagate.com>

Seagate / cortx-ha

CORTX-33787: [v0.9.0][2.0.0-880] Kafka errors UNKNOWN_TOPIC_OR_PART d… #721
