banzaicloud / banzai-charts

Curated list of Banzai Cloud Helm charts used by the Pipeline Platform
Apache License 2.0
367 stars 278 forks source link

[Cadence][Bug]Template default to secondary cluster which will not work #1275

Closed longquanzheng closed 3 years ago

longquanzheng commented 3 years ago

Describe the bug

According to conversions with several Cadence users, what is applied into k8s by default is


clusterMetadata:
  enableGlobalDomain: {{ default .Env.ENABLE_GLOBAL_DOMAIN "false" }}
  failoverVersionIncrement: 10
  masterClusterName: "primary"
  currentClusterName: "secondary"
  clusterInformation:
    primary:
      enabled: true
      initialFailoverVersion: 0
      rpcName: "cadence-frontend"
      rpcAddress: {{ default .Env.PRIMARY_FRONTEND_SERVICE "cadence" }}:7933
    secondary:
        enabled: true
        initialFailoverVersion: 2
        rpcName: "cadence-frontend"
        rpcAddress: {{ default .Env.SECONDARY_FRONTEND_SERVICE "cadence-secondary" }}:7933

This will not work because only master cluster can register domain. Moreover, by default we expect to see Cadence enable global domain with a single cluster, which should be like:

clusterMetadata:
  enableGlobalDomain: true
  failoverVersionIncrement: 10
  masterClusterName: "primary"
  currentClusterName: "primary"
  clusterInformation:
    primary:
      enabled: true
      initialFailoverVersion: 0
      rpcName: "cadence-frontend"
      rpcAddress: {{ default .Env.PRIMARY_FRONTEND_SERVICE "cadence" }}:7933

Steps to reproduce the issue: Apply the current helmcart of latest master commit https://github.com/banzaicloud/banzai-charts/commit/335e56b03e222835a76c03156a9e46a009d17f79 into K8s.

Expected behavior

by default we expect to see Cadence enable global domain with a single cluster, which should be like:

clusterMetadata:
  enableGlobalDomain: true
  failoverVersionIncrement: 10
  masterClusterName: "primary"
  currentClusterName: "primary"
  clusterInformation:
    primary:
      enabled: true
      initialFailoverVersion: 0
      rpcName: "cadence-frontend"
      rpcAddress: {{ default .Env.PRIMARY_FRONTEND_SERVICE "cadence" }}:7933

Screenshots

Screen Shot 2021-07-30 at 9 54 31 AM

Additional context https://uber-cadence.slack.com/archives/CL22WDF70/p1627655351007800?thread_ts=1627468874.236400&cid=CL22WDF70

pregnor commented 3 years ago

The reason why you see the enableGlobalDomain: false setting and subsequently the secondary cluster being set up, is because even though I changed the default enableGlobalDomain value to true in #1269 (based on the request #1267), that change has not been released into any chart versions yet.

My understanding was that no release (no version) of Cadence server contains the changing of the default value of enableGlobalDomain yet (merge commit of the server-side default change is not part of any tags yet, the change was made on June 25th, the last tag (0.21.3) was released on June 17th). I expected the change to make it to the next upcoming release (0.21.4, 0.22.0 or 1.0.0, whichever happens first) and I wanted to release the chart's default enableGlobalDomain value change in sync with the server-side default value change to avoid differing (default) behavior between the server and the chart.

If we should change the chart default value immediately we can do that with simply releasing a new chart version, it just wasn't clear to me in #1267 the expectation was an immediate release (because intuitively I would have synced it with the server-side change).

longquanzheng commented 3 years ago

I see. Yeah I wasn’t clear on that issue about that. And usually we should wait on the commit to be released since it’s mentioned there.

I confirm that In this case, you can go ahead to release the chart without the commit being released. That’s because the actually functioning change in the commit is to change the CLI to provide global domain flag. And the server just emit a warn if not creating global domains.

Usually Cadence project would like to release server minor version at least every two months. However, they told me the long running stress tests identified some new issues and they are working on fixing. So I am not sure when this release would come . though I have seen they submitted some bug fixes recently.

pregnor commented 3 years ago

I found another aspect of the issue (why the [lack of] environment variable is not correctly handled in template conditions) and traced this part back to #1244 (chart 0.18.0), where I incorrectly ported the then new Cadence server dockerize templates to the Helm chart templates.

The conditions were set on a raw string template value evaluating the dockerize environment, but this results in an always true condition, because a non-empty raw string value always evaluates to true, without evaluating the embedded template (which can only be evaluated by dockerize at Cadence server docker container start).

I'm going to fix this incorrect Helm templating for 0.18, 0.19 and 0.20 chart versions in a new backport patch version and also correct the fallback value of the proposed 0.21 chart when the values.yaml doesn't define the clusterMetadata (legacy, backward-compatible case with old values.yaml configurations).

pregnor commented 3 years ago

I created a short-term hotfix backports for the previous versions to correct the default clusterMetadata value in the chart.

These releases are already out with versions 0.18.1, 0.19.1, 0.20.3. Those should work out of the box with global domains turned off and using only a single Cadence cluster named primary. That was the fastest thing I could do.

I'm still waiting for reviews for the chart 0.21 PRs, that is going to include turning on global domains and also making clusterMetadata configurable from the chart's values.yaml.

pregnor commented 3 years ago

The long-term solution of making the clusterMetadata section chart-configurable has just been released in the chart version 0.21.0.