argoproj-labs / terraform-provider-argocd

Terraform provider for Argo CD
Mozilla Public License 2.0

ENHANCE_YOUR_CALM and too_many_pings (again) #326

Closed: jcogilvie closed this issue 2 weeks ago

jcogilvie commented 1 year ago

Terraform Version, ArgoCD Provider Version and ArgoCD Version

Terraform version: 1.4.6
ArgoCD provider version: 5.6.0
ArgoCD version: 2.6.7

Affected Resource(s)

argocd_application

Output

module.this_app[0].argocd_application.this: Modifying... [id=crm-pushback:argocd]
module.this_app[0].argocd_application.this: Still modifying... [id=crm-pushback:argocd, 10s elapsed]
module.this_app[0].argocd_application.this: Still modifying... [id=crm-pushback:argocd, 20s elapsed]
module.this_app[0].argocd_application.this: Still modifying... [id=crm-pushback:argocd, 30s elapsed]
module.this_app[0].argocd_application.this: Still modifying... [id=crm-pushback:argocd, 40s elapsed]
module.this_app[0].argocd_application.this: Still modifying... [id=crm-pushback:argocd, 50s elapsed]
module.this_app[0].argocd_application.this: Still modifying... [id=crm-pushback:argocd, 1m0s elapsed]

│ Error: failed to update application crm-pushback
│ 
│   with module.this_app[0].argocd_application.this,
│   on .terraform/modules/this_app/main.tf line 117, in resource "argocd_application" "this":
│  117: resource "argocd_application" "this" {
│ 
│ rpc error: code = Unavailable desc = closing transport due to: connection
│ error: desc = "error reading from server: EOF", received prior goaway:
│ code: ENHANCE_YOUR_CALM, debug data: "too_many_pings"

Steps to Reproduce

  1. terraform apply

Expected Behavior

Update is applied

Actual Behavior

Failed with above error

Important Factoids

Public Argo CD endpoint to an EKS cluster.

jcogilvie commented 1 year ago

For a sample size of one, I had some better luck with this after setting grpc_web = true on the provider config. I'll see if it recurs, but further validation would be helpful.
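
A minimal sketch of a provider block with that flag set (the server address and auth token here are placeholders):

provider "argocd" {
  server_addr = "argocd.example.com:443" # placeholder endpoint
  auth_token  = var.argocd_auth_token    # placeholder credential

  # Route gRPC calls over gRPC-Web; in this one case it avoided the keepalive
  # GOAWAY (too_many_pings), though further validation would help.
  grpc_web = true
}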

onematchfox commented 1 year ago

@jcogilvie you mentioned here that the issue is repeatable. Any chance you can share that config?

jcogilvie commented 1 year ago

Well, there's a lot of terraform machinery around how it's actually configured, but I can give you a generalized lay of the terraform land, plus the app manifest that ends up being applied. I hope that's close enough.

Here's the (minimized) manifest:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: crm-pushback
  namespace: argocd
spec:
  destination:
    namespace: crm-pushback
    server: https://kubernetes.default.svc
  project: crm-pushback
  revisionHistoryLimit: 10
  sources:
    - chart: mycompany-api-service
      helm:
        releaseName: api
        values: |
          enabled: true
          otherValues: here
      repoURL: https://mycompany.helm.repo/artifactory/default-helm/
      targetRevision: ~> 2.2.0
    - chart: mycompany-consumer
      helm:
        releaseName: first-query-complete-receiver
        values: |
          enabled: true
          otherValues: here
      repoURL: https://mycompany.helm.repo/artifactory/default-helm/
      targetRevision: ~> 2.2.0
    - chart: mycompany-consumer
      helm:
        releaseName: first-status-poller
        values: |
          enabled: true
          otherValues: here
      repoURL: https://mycompany.helm.repo/artifactory/default-helm/
      targetRevision: ~> 2.2.0
    - chart: mycompany-consumer
      helm:
        releaseName: second-query-complete-receiver
        values: |
          enabled: true
          otherValues: here
      repoURL: https://mycompany.helm.repo/artifactory/default-helm/
      targetRevision: ~> 2.2.0
    - chart: mycompany-consumer
      helm:
        releaseName: second-status-poller
        values: |
          enabled: true
          otherValues: here
      repoURL: https://mycompany.helm.repo/artifactory/default-helm/
      targetRevision: ~> 2.2.0
    - chart: mycompany-cronjob
      helm:
        releaseName: syncqueries
        values: |
          enabled: true
          otherValues: here
      repoURL: https://mycompany.helm.repo/artifactory/default-helm/
      targetRevision: ~> 2.2.0
  syncPolicy:
    automated: {}
    retry:
      backoff:
        duration: 30s
        factor: 2
        maxDuration: 2m
      limit: 5

It's built through this tf module:

resource "argocd_repository" "this" {
  repo    = data.github_repository.this.http_clone_url
  project = argocd_project.this.metadata[0].name

  lifecycle {
    # these get populated upstream by argo
    ignore_changes = [githubapp_id, githubapp_installation_id]
  }
}

locals {
  helm_repo_url = "https://mycompany.helm.repo/artifactory/default-helm/"

  multiple_sources = [for source in var.services : {
    repo_url        = local.helm_repo_url
    chart           = source.source_chart
    path            = source.local_chart_path != null ? source.local_chart_path : ""
    target_revision = source.local_chart_path != null ? var.target_infra_revision : source.source_chart_version
    helm = {
      release_name = source.name
      values       = source.helm_values
    }
  }]

  sources     = local.multiple_sources
  sources_map = { for source in local.sources : source.helm.release_name => source }
}

resource "argocd_project" "this" {
  metadata {
    name        = var.service_name
    namespace   = "argocd"
    labels      = {}
    annotations = {}
  }

  spec {
    description = var.description

    source_namespaces = [var.namespace]
    source_repos      = [data.github_repository.this.html_url, local.helm_repo_url]

    destination {
      server    = var.destination_cluster
      namespace = var.namespace
    }

    role {
      name        = "owner"
      description = "Owner access to ${var.service_name}.  Note most operations should be done through terraform."
      policies = [
         ...
      ]
      groups = [
        ...
      ]
    }

  }
}

locals {
  sync_policy = var.automatic_sync_enabled ? {
    automated = {
      allowEmpty = false
      prune      = var.sync_policy_enable_prune
      selfHeal   = var.sync_policy_enable_self_heal
    }
  } : {}
}

resource "argocd_application" "this" {
  count = var.use_raw_manifest ? 0 : 1

  wait = var.wait_for_sync

  metadata {
    name      = var.service_name
    namespace = "argocd"
    labels    = {} # var.tags -- tags fail validation because they contain '/'
  }

  spec {
    project = argocd_project.this.metadata[0].name

    destination {
      server    = var.destination_cluster
      namespace = var.namespace
    }

    dynamic "source" {
      for_each = local.sources_map
      content {
        repo_url        = source.value.repo_url
        path            = source.value.path
        chart           = source.value.chart
        target_revision = source.value.target_revision
        helm {
          release_name = source.value.helm.release_name
          values       = source.value.helm.values
        }
      }
    }

    sync_policy {

      dynamic "automated" {
        for_each = var.automatic_sync_enabled ? {
          automated_sync_enabled = true
        } : {}

        content {
          allow_empty = false
          prune       = var.sync_policy_enable_prune
          self_heal   = var.sync_policy_enable_self_heal
        }
      }

      retry {
        limit = var.sync_retry_limit
        backoff {
          duration     = var.sync_retry_backoff_base_duration
          max_duration = var.sync_retry_backoff_max_duration
          factor       = var.sync_retry_backoff_factor
        }
      }
    }
  }
}

jcogilvie commented 1 year ago

Note that in this specific case, creation doesn't hit too_many_pings, but any kind of update does (e.g., updating the image in one of the sources).

Making the app sufficiently bigger can cause too_many_pings on create as well. One of my apps has around 30 sources, which was just too much for the provider (maybe for the CLI too?), so I had to skip the provider and go straight to a kubernetes_manifest resource, which was somewhat disappointing (though quick).

amedinagar commented 1 year ago

Bumping this; I'm experiencing the same problem when deploying, and sometimes it happens randomly. @jcogilvie, how do you skip the provider using kubernetes_manifest? Thanks!

peturgq commented 1 year ago

Bump, I'm also experiencing this on provider version 5.6.0. This happens for me on creation of argocd_cluster.

edit: Upgrading the provider to 6.0.3 does not seem to resolve the issue.

jcogilvie commented 1 year ago

@amedinagar I used a kubernetes_manifest resource with argo's declarative configuration.

There are a few gotchas:

1. Make sure you add finalizers.
2. You'll probably want a wait statement similar to this:

wait {
  fields = {
    "status.sync.status" = "Synced"
  }
}

3. The kubernetes_manifest resource has issues with the Argo CD CRDs as of Argo 2.8, when the schema changed to introduce a field with x-kubernetes-preserve-unknown-fields on it. So my CRDs are presently stuck on Argo 2.6.7.
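
Putting that together, a trimmed sketch of the kubernetes_manifest equivalent of the app above (only the first source is shown; the rest follow the same shape):

resource "kubernetes_manifest" "this" {
  manifest = {
    apiVersion = "argoproj.io/v1alpha1"
    kind       = "Application"
    metadata = {
      name      = "crm-pushback"
      namespace = "argocd"
      # Gotcha 1: Argo CD's resources finalizer, so deleting the Terraform
      # resource also cascades deletion of the app's child resources.
      finalizers = ["resources-finalizer.argocd.argoproj.io"]
    }
    spec = {
      project = "crm-pushback"
      destination = {
        server    = "https://kubernetes.default.svc"
        namespace = "crm-pushback"
      }
      sources = [
        {
          chart          = "mycompany-api-service"
          repoURL        = "https://mycompany.helm.repo/artifactory/default-helm/"
          targetRevision = "~> 2.2.0"
          helm = {
            releaseName = "api"
            values      = "enabled: true\notherValues: here\n"
          }
        },
        # ...remaining sources elided
      ]
      syncPolicy = {
        automated = {}
      }
    }
  }

  # Gotcha 2: block the apply until Argo CD reports the app as synced.
  wait {
    fields = {
      "status.sync.status" = "Synced"
    }
  }
}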

onematchfox commented 1 year ago

Will revisit once GRPCKeepAliveEnforcementMinimum is made configurable in the underlying argocd module. Related to https://github.com/argoproj/argo-cd/issues/15656

renperez-cpi commented 1 year ago

This is also happening to me when I use the Argo CD CLI to do the app sync.

jcogilvie commented 1 year ago

@onematchfox looks like the upstream PR has been merged making the keepalive time configurable.

onematchfox commented 1 year ago

> @onematchfox looks like the upstream PR has been merged making the keepalive time configurable.

Yeah, I see that. Although we will need to wait for this to actually be released (at a glance, the PR was merged into main, so it will only land in the 2.9 release; feel free to correct me if I'm wrong), and then it will take some consideration as to how we implement it here, given that we need to support older versions as well.

jcogilvie commented 12 months ago

Looks like 2.9 is released. What kind of consideration are we talking about here? How tightly is the client library coupled to the API?

Perusing the upstream PR, it looks like the server and the API client both expect an environment variable to be set (via the common package). So, if I'm understanding correctly, the new env var is something we can set in the client process, and it will simply be ignored if we need to use an older client library version.

Given the implementation, I actually wonder whether setting it in a newer client would also fix the issue when running against an older server version.
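
For anyone who wants to experiment with the server-side knob once it ships, a hypothetical sketch of setting it through the official Argo CD Helm chart (both the ARGOCD_GRPC_KEEP_ALIVE_MIN name and the server.env value path are assumptions, not confirmed in this thread):

resource "helm_release" "argocd" {
  name       = "argocd"
  namespace  = "argocd"
  repository = "https://argoproj.github.io/argo-helm"
  chart      = "argo-cd"

  values = [yamlencode({
    server = {
      env = [
        {
          # Assumed variable name from the upstream change; lowering the
          # enforcement minimum makes argocd-server tolerate more frequent
          # client keepalive pings instead of replying with too_many_pings.
          name  = "ARGOCD_GRPC_KEEP_ALIVE_MIN"
          value = "5s"
        }
      ]
    }
  })]
}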

danielkza commented 9 months ago

I'm still seeing this frequently when connecting to Argo 2.9.2. What's the status on moving to the new library?

jcogilvie commented 9 months ago

Any chance this gets looked at soon, @onematchfox? Is there anything we can do to help?

donovanrost commented 9 months ago

I am also experiencing this issue, but when adding a new cluster. I'm happy to provide any additional information to help resolve this. Some additional details: I'm using provider version 6.0.3.

Argo CD information:

{
  "Version": "v2.9.6+ba62a0a",
  "BuildDate": "2024-02-05T11:24:01Z",
  "GitCommit": "ba62a0a86d19f71a65ec2b510a39ea55497e1580",
  "GitTreeState": "clean",
  "GoVersion": "go1.20.13",
  "Compiler": "gc",
  "Platform": "linux/amd64",
  "KustomizeVersion": "(devel) unknown",
  "HelmVersion": "v3.14.0+g3fc9f4b",
  "KubectlVersion": "v0.24.17",
  "JsonnetVersion": "v0.20.0"
}

donovanrost commented 8 months ago

After switching from the Bitnami chart to the official Argo CD Helm chart and updating to version 2.10, this issue has gone away for me.

blakepettersson commented 8 months ago

Actually, it seems like this only got released with 2.10. I wonder whether it's enough just to run a 2.10 server, as @donovanrost seems to have done.

jcogilvie commented 7 months ago

I have a strong suspicion that my case was somehow related to me having an entirely-too-large helm repo index file (~80 megs).

onematchfox commented 7 months ago

Hey folks, sorry for the lack of response here. As of v6.1.0 the provider now imports v2.9.9 of argoproj/argo-cd. I do suspect that this issue is mostly server-side, so you may need to update your Argo instance to v2.10 as @donovanrost suggested. But if that doesn't work, then we're certainly open to PRs upgrading the deps in this provider to v2.10, since the changes to the client-side code didn't land in 2.9.
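
For anyone picking this up, pinning to that release looks something like this (the registry namespace is an assumption; earlier releases were published under oboukili/argocd):

terraform {
  required_providers {
    argocd = {
      source  = "argoproj-labs/argocd" # assumed namespace for the 6.x releases
      version = ">= 6.1.0"             # first release importing argoproj/argo-cd v2.9.9
    }
  }
}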

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.