argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

App sync fails with ComparisonError rpc error: code = DeadlineExceeded desc = context deadline exceeded #3864

Open papanito opened 4 years ago

papanito commented 4 years ago

Checklist:

Describe the bug

I've added several apps to my Argo CD instance, but all of them fail to sync with the same error:

ComparisonError  rpc error: code = DeadlineExceeded desc = context deadline exceeded

To Reproduce

Here is the manifest of one of the demo projects that fails:

project: default
source:
  repoURL: 'https://github.com/argoproj/argocd-example-apps.git'
  path: kustomize-guestbook
  targetRevision: HEAD
destination:
  server: 'https://kubernetes.default.svc'
  namespace: guestbook

Remarks

Not sure if this matters however...

Expected behavior

Synchronization completes without errors.

Version

argocd: v1.6.1+159674e
  BuildDate: 2020-06-19T00:39:46Z
  GitCommit: 159674ee844a378fb98fe297006bf7b83a6e32d2
  GitTreeState: clean
  GoVersion: go1.14.1
  Compiler: gc
  Platform: linux/amd64
argocd-server: v1.6.1+159674e
  BuildDate: 2020-06-19T00:41:05Z
  GitCommit: 159674ee844a378fb98fe297006bf7b83a6e32d2
  GitTreeState: clean
  GoVersion: go1.14.1
  Compiler: gc
  Platform: linux/amd64
  Ksonnet Version: v0.13.1
  Kustomize Version: {Version:kustomize/v3.6.1 GitCommit:c97fa946d576eb6ed559f17f2ac43b3b5a8d5dbd BuildDate:2020-05-27T20:47:35Z GoOs:linux GoArch:amd64}
  Helm Version: version.BuildInfo{Version:"v3.2.0", GitCommit:"e11b7ce3b12db2941e90399e874513fbd24bcb71", GitTreeState:"clean", GoVersion:"go1.13.10"}
  Kubectl Version: v1.14.0

Logs

argocd app get demo
Name:               demo
Project:            default
Server:             https://kubernetes.default.svc
Namespace:          guestbook
URL:                https://argocd.mycompany.com/applications/demo
Repo:               https://github.com/argoproj/argocd-example-apps.git
Target:             HEAD
Path:               kustomize-guestbook
SyncWindow:         Sync Allowed
Sync Policy:        <none>
Sync Status:        Unknown
Health Status:      Healthy

CONDITION        MESSAGE                                                              LAST TRANSITION
ComparisonError  rpc error: code = DeadlineExceeded desc = context deadline exceeded  2020-06-27 19:06:51 +0200 CEST

argocd-server

time="2020-07-01T09:18:45Z" level=info msg="Requested app 'demo' refresh"
time="2020-07-01T09:18:45Z" level=warning msg="finished unary call with code DeadlineExceeded" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" grpc.code=DeadlineExceeded grpc.method=RevisionMetadata grpc.service=application.ApplicationService grpc.start_time="2020-07-01T09:15:41Z" grpc.time_ms=183821.61 span.kind=server system=grpc

argocd-application-controller

I suspect the issue is caused by the i/o timeout:

2020/07/01 09:18:05 cache: Set key="app|resources-tree|demo|1.0.0" failed: read tcp 10.42.67.206:56566->10.43.73.252:6379: i/o timeout
time="2020-07-01T09:18:05Z" level=error msg="Failed to cache app resources: read tcp 10.42.67.206:56566->10.43.73.252:6379: i/o timeout" application=demo dedup_ms=0 diff_ms=1 git_ms=460 health_ms=0 live_ms=0 settings_ms=0 sync_ms=0
time="2020-07-01T09:18:05Z" level=info msg="Update successful" application=demo
time="2020-07-01T09:18:05Z" level=info msg="Reconciliation completed" application=demo dedup_ms=0 dest-namespace=guestbook dest-server="https://kubernetes.default.svc" diff_ms=1 fields.level=2 git_ms=460 health_ms=0 live_ms=0 settings_ms=0 sync_ms=0 time_ms=12570
time="2020-07-01T09:18:07Z" level=info msg="Refreshing app status (normal refresh requested), level (2)" application=demo
time="2020-07-01T09:18:07Z" level=info msg="Comparing app state (cluster: https://kubernetes.default.svc, namespace: guestbook)" application=demo
time="2020-07-01T09:18:08Z" level=info msg="getRepoObjs stats" application=demo build_options_ms=0 helm_ms=0 plugins_ms=0 repo_ms=0 time_ms=542 unmarshal_ms=542 version_ms=0
2020/07/01 09:18:20 cache: Set key="app|resources-tree|demo|1.0.0" failed: read tcp 10.42.67.206:56652->10.43.73.252:6379: i/o timeout
time="2020-07-01T09:18:20Z" level=error msg="Failed to cache app resources: read tcp 10.42.67.206:56652->10.43.73.252:6379: i/o timeout" application=demo dedup_ms=0 diff_ms=2 git_ms=543 health_ms=0 live_ms=0 settings_ms=0 sync_ms=0
time="2020-07-01T09:18:20Z" level=info msg="Update successful" application=demo
time="2020-07-01T09:18:20Z" level=info msg="Reconciliation completed" application=demo dedup_ms=0 dest-namespace=guestbook dest-server="https://kubernetes.default.svc" diff_ms=2 fields.level=2 git_ms=543 health_ms=0 live_ms=0 settings_ms=0 sync_ms=0 time_ms=12636
time="2020-07-01T09:18:22Z" level=info msg="Refreshing app status (normal refresh requested), level (2)" application=demo
time="2020-07-01T09:18:22Z" level=info msg="Comparing app state (cluster: https://kubernetes.default.svc, namespace: guestbook)" application=demo
time="2020-07-01T09:18:22Z" level=info msg="getRepoObjs stats" application=demo build_options_ms=0 helm_ms=0 plugins_ms=0 repo_ms=0 time_ms=633 unmarshal_ms=633 version_ms=0
2020/07/01 09:18:34 cache: Set key="app|resources-tree|demo|1.0.0" failed: read tcp 10.42.67.206:56756->10.43.73.252:6379: i/o timeout
time="2020-07-01T09:18:34Z" level=error msg="Failed to cache app resources: read tcp 10.42.67.206:56756->10.43.73.252:6379: i/o timeout" application=demo dedup_ms=0 diff_ms=2 git_ms=633 health_ms=0 live_ms=0 settings_ms=0 sync_ms=0
time="2020-07-01T09:18:35Z" level=info msg="Update successful" application=demo
time="2020-07-01T09:18:35Z" level=info msg="Reconciliation completed" application=demo dedup_ms=0 dest-namespace=guestbook dest-server="https://kubernetes.default.svc" diff_ms=2 fields.level=2 git_ms=633 health_ms=0 live_ms=0 settings_ms=0 sync_ms=0 time_ms=12756
time="2020-07-01T09:18:38Z" level=info msg="Refreshing app status (normal refresh requested), level (2)" application=demo
time="2020-07-01T09:18:38Z" level=info msg="Comparing app state (cluster: https://kubernetes.default.svc, namespace: guestbook)" application=demo
time="2020-07-01T09:18:38Z" level=info msg="getRepoObjs stats" application=demo build_options_ms=0 helm_ms=0 plugins_ms=0 repo_ms=0 time_ms=384 unmarshal_ms=384 version_ms=0
2020/07/01 09:18:50 cache: Set key="app|resources-tree|demo|1.0.0" failed: read tcp 10.42.67.206:56842->10.43.73.252:6379: i/o timeout
time="2020-07-01T09:18:50Z" level=error msg="Failed to cache app resources: read tcp 10.42.67.206:56842->10.43.73.252:6379: i/o timeout" application=demo dedup_ms=0 diff_ms=2 git_ms=385 health_ms=0 live_ms=0 settings_ms=0 sync_ms=0

For convenience, I've attached the logs from argocd-server and argocd-application-controller

WaldoFR commented 4 years ago

Have you tried increasing status-processors, operation-processors, and repo-server-timeout-seconds in your application-controller deployment, as was done in this issue: https://github.com/argoproj/argo-cd/issues/1641?

papanito commented 4 years ago

@WaldoFR thanks for the hint. I've added the following:

- --status-processors
- "20"
- --operation-processors
- "10"
- --repo-server-timeout-seconds
- "180"

I am surprised, as I only have 3 apps so far and sync them manually for now; I would expect syncing a single app to work without any timeout.

WaldoFR commented 4 years ago

Do you still face the synchronization problems?

papanito commented 4 years ago

No, it seems better now; apps show OutOfSync, which is fine. What helped was increasing the timeout to 180.

papanito commented 4 years ago

So it only worked temporarily. At first I got the status OutOfSync, which was correct, and after a sync I got the message Sync OK. But eventually the same problem comes back: the status shows Unknown and I can see the error again. Also, when I try to check the app diff I get Unable to load data: ComparisonError: rpc error: code = DeadlineExceeded desc = context deadline exceeded

(Screenshot attached.)

WaldoFR commented 4 years ago

OK, we face the same problem (in v1.5.3), and increasing the number of argocd-repo-server replicas and changing --repo-server-timeout-seconds does not help either.

time="2020-06-29T14:03:41Z" level=error msg="finished unary call with code Unknown" error="Get \"https://private-repo.com/projet.deploy/project/project.git/info/refs?service=git-upload-pack\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" grpc.code=Unknown grpc.method=GenerateManifest grpc.request.deadline="2020-06-29T14:04:26Z" grpc.service=repository.RepoServerService grpc.start_time="2020-06-29T14:03:26Z" grpc.time_ms=15002.009 span.kind=server system=grpc

It seems other people are facing this issue too (e.g. https://github.com/argoproj/argo-cd/issues/3547#issuecomment-649593167).

Related to this issue (https://github.com/argoproj/argo-cd/issues/2957): may I ask if you are currently deploying Helm 3 charts on your cluster? It seems we face this problem only on clusters with Helm 3 charts, but we can't confirm it's related.

Regards,

EDIT: Additional information: sometimes repositories that are not syncing do synchronize for a few minutes.

papanito commented 4 years ago

I have a single cluster, and yes, I have some things like cert-manager deployed with Helm 3.

papanito commented 4 years ago

If I switch back to v1.5.0 I don't have the issue and sync works fine. I've updated the description with the new info.

papanito commented 4 years ago

It does not seem to be the version after all; I now have the same issue with v1.5.0.

papanito commented 4 years ago

argocd-redis.log

papanito commented 4 years ago

So in my case I get 10.42.67.206:56842->10.43.73.252:6379: i/o timeout, where the destination IP is the ClusterIP of the Redis service:

kubectl -n argocd describe service argocd-redis
Name:              argocd-redis
Namespace:         argocd
Labels:            app.kubernetes.io/component=redis
                   app.kubernetes.io/instance=argocd
                   app.kubernetes.io/managed-by=Helm
                   app.kubernetes.io/name=argocd-redis
                   app.kubernetes.io/part-of=argocd
                   helm.sh/chart=argo-cd-2.5.0
Annotations:       Selector:  app.kubernetes.io/name=argocd-redis
Type:              ClusterIP
IP:                10.43.73.252
Port:              <unset>  5379/TCP
TargetPort:        5379/TCP
Endpoints:         10.42.112.147:5379
Session Affinity:  None
Events:            <none>
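The reachability of the Redis service can be checked independently of Argo CD with a plain TCP probe. A minimal sketch (Python, hypothetical helper; the IP and ports are taken from the output above; note the service exposes port 5379 while the i/o timeouts reference 6379, which is worth double-checking):

```python
import socket

def check_tcp(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Probe the Redis ClusterIP from inside the cluster (values from the
# service description above; hypothetical usage):
# print(check_tcp("10.43.73.252", 6379))  # port seen in the i/o timeouts
# print(check_tcp("10.43.73.252", 5379))  # port the Service actually exposes
```

Running this from a debug pod in the argocd namespace would distinguish a network or DNS problem from a misconfigured service port.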

papanito commented 4 years ago

I completely removed argocd and installed it in a new namespace. So far it seems good and I don't have the issue mentioned. Still not clear why I had it in the first place.

WaldoFR commented 4 years ago

Did you proceed to auto-sync the same deployments as in your previous Argo CD setup?

papanito commented 4 years ago

@WaldoFR no, at the moment I don't use auto-sync.

WaldoFR commented 4 years ago

In my case I found that all commands were stopped at exactly 15s. Browsing the code, I found the part below, which would explain why synchronization stops after 15 seconds:

// Returns a HTTP client object suitable for go-git to use using the following
// pattern:
// - If insecure is true, always returns a client with certificate verification
//   turned off.
// - If one or more custom certificates are stored for the repository, returns
//   a client with those certificates in the list of root CAs used to verify
//   the server's certificate.
// - Otherwise (and on non-fatal errors), a default HTTP client is returned.
func GetRepoHTTPClient(repoURL string, insecure bool, creds Creds) *http.Client {
    // Default HTTP client
    var customHTTPClient = &http.Client{
        // 15 second timeout
        Timeout: 15 * time.Second,
        // don't follow redirect
        CheckRedirect: func(req *http.Request, via []*http.Request) error {
            return http.ErrUseLastResponse
        },
    }
    // ... (remainder of function elided)

The git client does not seem to honor the timeout specified in --repo-server-timeout-seconds, so even if the gRPC deadline is extended, the git fetch is not. Should this option also be applied to the git client?
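For illustration, the pattern of falling back to the hardcoded 15-second default unless an override is provided could look like the following sketch (Python; GIT_HTTP_TIMEOUT_SECONDS is a hypothetical variable, not an actual Argo CD setting):

```python
import os

DEFAULT_GIT_HTTP_TIMEOUT = 15  # seconds, matching the hardcoded Go default above

def git_http_timeout() -> int:
    """Read the git HTTP timeout from the environment, falling back to 15s.

    GIT_HTTP_TIMEOUT_SECONDS is hypothetical; Argo CD does not expose it.
    """
    raw = os.environ.get("GIT_HTTP_TIMEOUT_SECONDS", "")
    try:
        value = int(raw)
    except ValueError:
        return DEFAULT_GIT_HTTP_TIMEOUT
    # Guard against zero or negative overrides.
    return value if value > 0 else DEFAULT_GIT_HTTP_TIMEOUT
```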

Regards,

Reference: https://github.com/argoproj/argo-cd/blob/e92e0fa4090a0c324b7a7dbf3d7a2e16ab281d85/util/git/client.go#L112

dmitrievav commented 4 years ago

I have the same issue with Helm 3, Argo CD 1.5.3, and cert-manager 0.16.1:

error="application refresh deadline exceeded"

dmitrievav commented 4 years ago

apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-application-controller
spec:
  template:
    spec:
      containers:
      - name: argocd-application-controller
        command:
        - argocd-application-controller
        - --status-processors
        - "20"
        - --operation-processors
        - "50"
        - --repo-server-timeout-seconds
        - "300"
argocd/argocd-server-5b9466d7f-x62db[argocd-server]: time="2020-08-25T16:04:33Z" level=error msg="finished unary call with code Unknown" error="application refresh deadline exceeded" grpc.code=Unknown grpc.method=ResourceTree grpc.service=application.ApplicationService grpc.start_time="2020-08-25T16:03:33Z" grpc.time_ms=60003.48 span.kind=server system=grpc

and I have only a dozen applications in the control repo...

cemsusal commented 4 years ago

I have the same issue while adding new HTTPS repositories from Bitbucket and GitLab.

I get the following error message:

Unable to connect HTTPS repository: Get "https://bitbucket.org/xxx/yyy/info/refs?service=git-upload-pack": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

I enabled the --insecure flag for argocd-server.

These are the settings applied to the argocd-application-controller:

- command:
        - argocd-application-controller
        - --status-processors
        - "20"
        - --operation-processors
        - "50"
        - --repo-server-timeout-seconds
        - "420"
        - --app-resync
        - "600"

I also tried to scale up argocd-repo-server (5 replicas) and it did not work.

git clone inside argocd-repo-server is working fine.

I installed the following Argo CD versions and had the same issue with all of them:

My cluster has 3 nodes and 1 master.

This is happening for manual and auto sync policies while creating a new app.

cemsusal commented 4 years ago

My argocd-repo-server logs:

time="2020-08-27T09:25:05Z" level=info msg="Initializing GnuPG keyring at /app/config/gpg/keys"
time="2020-08-27T09:25:05Z" level=info msg="gpg --no-permission-warning --logger-fd 1 --batch --generate-key /tmp/gpg-key-recipe563578480" dir= execID=mX2zE
time="2020-08-27T09:25:06Z" level=info msg="Populating GnuPG keyring with keys from /app/config/gpg/source"
time="2020-08-27T09:25:06Z" level=info msg="gpg --no-permission-warning --list-public-keys" dir= execID=iYECJ
time="2020-08-27T09:25:06Z" level=info msg="gpg --no-permission-warning -a --export 22088F9D0E59A460" dir= execID=k0Ftj
time="2020-08-27T09:25:06Z" level=info msg="gpg-wrapper.sh --no-permission-warning --list-secret-keys 22088F9D0E59A460" dir= execID=grdLd
time="2020-08-27T09:25:06Z" level=info msg="Loaded 0 (and removed 0) keys from keyring"
time="2020-08-27T09:25:06Z" level=info msg="argocd-repo-server v1.7.1+da5fa74 serving on [::]:8081"

I have got a bunch of no permission warning.

jannfis commented 4 years ago

I have got a bunch of no permission warning.

Those are not warnings. It's just a log of a command that was executed, which happens to have been run with the parameter --no-permission-warning.

jannfis commented 4 years ago

This is happening for manual and auto sync policies while creating a new app.

Hi @cemsusal, is it happening when you create an app or when you add the repository? For the latter, the initial connection attempt is made by argocd-server (not the repo server). Is the argocd-server pod allowed to reach Internet hosts, or are you perhaps behind an HTTP proxy?

cemsusal commented 4 years ago

This happens both when creating an app and when adding a repository from Bitbucket or GitLab (I did not try GitHub).

ARGOCD_GPG_ENABLED is now set to false, and the issue still persists.

We are not behind a proxy, and argocd-server can access the internet.

cemsusal commented 4 years ago

This is a log from our repo-server:

time="2020-08-27T12:10:54Z" level=error msg="finished unary call with code Unknown" error="Get \"https://zzz:***@bitbucket.org/xxx/yyy.git/info/r efs?service=git-upload-pack\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" grpc.code=Unknown grpc.method=GenerateManifest grpc.req uest.deadline="2020-08-27T12:17:19Z" grpc.service=repository.RepoServerService grpc.start_time="2020-08-27T12:10:39Z" grpc.time_ms=15001.042 span.kind=server syst em=grpc

This confirms that the repo server is failing to fetch.

cemsusal commented 4 years ago

@jannfis this is the output of the git commands:

argocd@argocd-repo-server-6ddb994d9-s49gf:/tmp/https:__xxx:yyy@bitbucket.org_project$ git remote -v
origin  https://ttt:zzz@bitbucket.org/xxx/yyy.git (fetch)
origin  https://ttt:zzz@bitbucket.org/xxx/yyy.git (push)
argocd@argocd-repo-server-6ddb994d9-s49gf:/tmp/https:__xxx:yyy@bitbucket.org_project$ git pull origin master
From https://bitbucket.org/xxx/yyy
 * branch            master     -> FETCH_HEAD
Already up to date.

cemsusal commented 4 years ago

I have only one application on top of a fresh install of v1.7.1. The issue still persists.

cemsusal commented 4 years ago

Installing Argo CD in another namespace did not resolve the issue for me.

dmitrievav commented 4 years ago

It may depend on https://github.com/argoproj/argo-cd/issues/3998, as the new version of cert-manager has a huge manifest file (>1 MB).

onelapahead commented 3 years ago

We're experiencing something similar in v1.7.6.

Our application controller started spewing this log this morning:

patch: {"status":{"conditions":[{"lastTransitionTime":"2020-12-03T12:33:40Z","message":"rpc error: code = Unknown desc = Get \"https://github.com/Bandwidth/some-repo/info/refs?service=git-upload-pack\": dial tcp 140.82.113.3:443: i/o timeout (Client.Timeout exceeded while awaiting headers)","type":"ComparisonError"}]}}

The reason is that our network is saturated, so packets are being dropped, connections are timing out, etc.

So while this kind of behavior is expected during a network issue, I'm concerned about the impact it's having on application syncs.

What bothers me is that the repo server hasn't thrown a single error, and everything it needs is still in its cache. So why is the app controller timing out on syncs and putting a bunch of apps in the Unknown state when it has everything it needs to do its job?

What is the point of the app controller hitting the Git repo with git-upload-pack?

kintarowins commented 3 years ago

same problem with v1.8.3+0f9c684

WaldoFR commented 3 years ago

Again, the git timeout might be caused by slowness unrelated to Argo CD, for instance the version control server itself, or third-party services it uses for authentication.

To troubleshoot, ensure that a git clone from your node, using the same credentials you configured, does not take more than 15 seconds.
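One way to check that budget without a full clone is to time the initial info/refs request, which is the same endpoint the failing log lines above reference. A hedged sketch (Python; the repository URL is a placeholder, and only the HTTP round trip is measured, not the full fetch):

```python
import time
import urllib.request

def time_info_refs(repo_url: str, timeout: float = 15.0) -> float:
    """Time the GET of <repo>/info/refs?service=git-upload-pack, the request
    a git HTTP client issues before fetching. Raises on HTTP errors or if
    the server does not answer within `timeout` seconds."""
    url = repo_url.rstrip("/") + "/info/refs?service=git-upload-pack"
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()
    return time.monotonic() - start

# Example with a placeholder URL; substitute your own repository:
# print(time_info_refs("https://github.com/argoproj/argocd-example-apps.git"))
```

If this regularly approaches or exceeds 15 seconds, the hardcoded client timeout discussed earlier in the thread would explain the failures.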

willemm commented 2 years ago

What if the git timeout is definitely caused by slowness not (directly) related to Argo CD?

To be more precise, what if the slowness is caused by rate limiting? Then argocd will start retrying all the calls that are rate-limited for more than 15 seconds, thus hammering the git server with more and more requests, thereby indirectly causing the rate limiting to increase.

It would certainly be very useful if it were possible to change that timeout setting, in that situation.

(PS: Sorry if this comes across emotionally, I just had a bad day trying to figure out why our production applications suddenly weren't updating any more, and why azuregit was suddenly refusing to talk to argocd.)
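The retry storm described above is the standard motivation for exponential backoff with jitter; a generic sketch of that policy follows (this is illustrative, not Argo CD's actual retry logic):

```python
import random

def backoff_delays(base: float = 1.0, cap: float = 60.0,
                   attempts: int = 5, jitter: bool = False) -> list:
    """Return exponentially growing retry delays, capped, optionally jittered."""
    delays = []
    for n in range(attempts):
        delay = min(cap, base * (2 ** n))  # 1s, 2s, 4s, ... up to the cap
        if jitter:
            delay = random.uniform(0, delay)  # "full jitter" variant
        delays.append(delay)
    return delays
```

With jitter enabled, concurrent clients spread their retries instead of hammering a rate-limited server in lockstep.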

mnasruul commented 2 years ago

Thanks a lot, I followed this and it fixed my issue, after three days of googling trying to fix it.

christophercutajar commented 2 years ago

We're experiencing the same timeout issues after deploying an ApplicationSet which creates 9 applications.

level=info msg="Normalized app spec: {\"status\":{\"conditions\":[{\"lastTransitionTime\":\"2022-01-24T19:20:45Z\",\"message\":\"rpc error: code = Unknown desc = `helm dependency build` failed timeout after 1m30s\",\"type\":\"ComparisonError\"}]}}" application=app

ArgoCD version deployed: v2.2.2+03b17e0
ArgoCD ApplicationSet version: v0.3.0

Controller settings:

controller:
    args:
      # -- define the application controller `--status-processors`
      statusProcessors: "60"
      # -- define the application controller `--operation-processors`
      operationProcessors: "40"
      # -- define the application controller `--repo-server-timeout-seconds`
      repoServerTimeoutSeconds: "360"

GitHub API rate limits are 5000 API calls per hour, or ~83 calls per minute. I'm not knowledgeable enough about the Argo CD engine to say that we're hitting those limits, but it seems quite plausible that they are being reached, as I'm able to create the Applications one by one!

UPDATE: I've also added appResyncPeriod together with the above configs and re-deployed. Being patient was key now! 🔑 After about 10 minutes, none of my 9 applications were in an Unknown state anymore.

      # -- define the application controller `--app-resync`
      appResyncPeriod: "300"

UPDATE 2: At the second refresh, the repo-server started throwing timeouts again!

level=error msg="finished unary call with code Unknown" error="Manifest generation error (cached): `helm dependency build` failed timeout after 1m30s" grpc.code=Unknown grpc.method=GenerateManifest grpc.request.deadline="2022-01-25T14:18:48Z" grpc.service=repository.RepoServerService grpc.start_time="2022-01-25T14:12:48Z" grpc.time_ms=330.102 span.kind=server system=grpc

tooptoop4 commented 2 years ago

@mnasruul what was the fix?

mnasruul commented 2 years ago

@mnasruul what was the fix?

Follow this: https://github.com/argoproj/argo-cd/issues/3864#issuecomment-681846012

pp185214 commented 2 years ago

Maybe this helps somebody as well. We have Argo CD on an AKS cluster and faced a similar issue. Following https://argo-cd.readthedocs.io/en/release-1.8/operator-manual/high_availability/ we tried setting --repo-server-timeout-seconds on argocd-application-controller as a workaround, and the issue has disappeared for now.