argoproj / applicationset

The ApplicationSet controller manages multiple Argo CD Applications as a single ApplicationSet unit, supporting deployments to large numbers of clusters, deployments of large monorepos, and enabling secure Application self-service.
https://argocd-applicationset.readthedocs.io/
Apache License 2.0

argocd-applicationset-controller Pod loses ssh-known-hosts ConfigMap mount #338

Open alexanderdalloz opened 3 years ago

alexanderdalloz commented 3 years ago

Today, for the second time and on a different cluster than before, we hit the case that the argocd-applicationset-controller v0.1.0 Pod has lost its ConfigMap mount of the ssh-known-hosts.

The Pod logs the following in such a case:

time="2021-08-11T09:40:48Z" level=info msg="git fetch origin --tags --force" dir="/tmp/git@git.local:7999_opsrepos_myapplication" execID=98vW5
time="2021-08-11T09:40:48Z" level=error msg="`git fetch origin --tags --force` failed exit status 128: error: cannot fork() for ssh -i /dev/shm/169189448 -o StrictHostKeyChecking=yes -o UserKnownHostsFile=/app/config/ssh/ssh_known_hosts: Resource temporarily unavailable\nfatal: unable to fork" execID=98vW5
time="2021-08-11T09:40:48Z" level=info msg=Trace args="[git fetch origin --tags --force]" dir="/tmp/git@git.local:7999_opsrepos_myapplication" operation_name="exec git" time_ms=2.6243960000000004
time="2021-08-11T09:40:48Z" level=error msg="error generating params" error="Error during fetching repo: `git fetch origin --tags --force` failed exit status 128: error: cannot fork() for ssh -i /dev/shm/169189448 -o StrictHostKeyChecking=yes -o UserKnownHostsFile=/app/config/ssh/ssh_known_hosts: Resource temporarily unavailable\nfatal: unable to fork" generator="&{0xc0004dd820}"
2021-08-11T09:40:48.983Z        ERROR   controller-runtime.manager.controller.applicationset    Reconciler error        {"reconciler group": "argoproj.io", "reconciler kind": "ApplicationSet", "name": "myapplication", "namespace": "argocd", "error": "Error during fetching repo: `git fetch origin --tags --force` failed exit status 128: error: cannot fork() for ssh -i /dev/shm/169189448 -o StrictHostKeyChecking=yes -o UserKnownHostsFile=/app/config/ssh/ssh_known_hosts: Resource temporarily unavailable\nfatal: unable to fork", "errorVerbose": "`git fetch origin --tags --force` failed exit status 128: error: cannot fork() for ssh -i /dev/shm/169189448 -o StrictHostKeyChecking=yes -o UserKnownHostsFile=/app/config/ssh/ssh_known_hosts: Resource temporarily unavailable\nfatal: unable to fork\nError during fetching repo\ngithub.com/argoproj-labs/applicationset/pkg/services.checkoutRepo\n\t/workspace/pkg/services/repo_service.go:159\ngithub.com/argoproj-labs/applicationset/pkg/services.(*argoCDService).GetDirectories\n\t/workspace/pkg/services/repo_service.go:84\ngithub.com/argoproj-labs/applicationset/pkg/generators.(*GitGenerator).generateParamsForGitDirectories\n\t/workspace/pkg/generators/git.go:78\ngithub.com/argoproj-labs/applicationset/pkg/generators.(*GitGenerator).GenerateParams\n\t/workspace/pkg/generators/git.go:62\ngithub.com/argoproj-labs/applicationset/pkg/controllers.(*ApplicationSetReconciler).generateApplications\n\t/workspace/pkg/controllers/applicationset_controller.go:349\ngithub.com/argoproj-labs/applicationset/pkg/controllers.(*ApplicationSetReconciler).Reconcile\n\t/workspace/pkg/controllers/applicationset_controller.go:93\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:198\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:99\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1371"}
github.com/go-logr/zapr.(*zapLogger).Error
        /go/pkg/mod/github.com/go-logr/zapr@v0.2.0/zapr.go:132
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:267
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:235
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:198
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1
        /go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:185
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
        /go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
        /go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
        /go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext
        /go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:185
k8s.io/apimachinery/pkg/util/wait.UntilWithContext
        /go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:99

We see no mass occurrence of this type of defect for other applications. All other Argo CD Pods on the same cluster do not have the issue at the same time; i.e. the argocd-repo-server, which has the same ConfigMap mounted, runs just fine. It has been running for 15 days and has 18k log lines available by now.

$ oc logs argocd-repo-server-8548544d4f-nsc6p | grep -c "Resource temporarily unavailable"
0
OpenShift cluster version:
Server Version: 4.6.29
Kubernetes Version: v1.19.0+d856161
jgwest commented 3 years ago

Hi @alexanderdalloz, I think the error message isn't related to ssh_known_hosts; it just looks like it is because of the error output format. The actual error message section is this: `failed exit status 128: error: cannot fork() for ssh` (more info here)

When the git command runs, it creates a new ssh process in order to connect to the Git repository via SSH, and that is what this error message is related to. It is saying that for whatever reason it is unable to create a new process.
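To illustrate what happens under the hood, the controller's fetch is roughly equivalent to something like the command below (the key path and SSH options are copied from your log line; this is only an approximation, not the controller's actual code):

# git itself forks an ssh child using the SSH command it is given,
# roughly equivalent to:
$ GIT_SSH_COMMAND='ssh -i /dev/shm/169189448 -o StrictHostKeyChecking=yes -o UserKnownHostsFile=/app/config/ssh/ssh_known_hosts' \
    git fetch origin --tags --force
# "cannot fork() for ssh" means the fork of that ssh child failed -- the
# ssh_known_hosts file was never even read at that point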

A few thoughts:

alexanderdalloz commented 3 years ago

Hi @jgwest, please let me answer your questions:

jgwest commented 3 years ago

OK, everything there sounds correct, and it doesn't seem like anything you describe above would cause the issue.

Is it possible that some other container/OS process on the worker node is hogging all the available processes? I'm not a Kubernetes/OpenShift admin, so I don't know if there is an official way of checking this beyond just SSHing into the underlying worker node and running `sudo ps -ef | wc -l` to count the number of processes :thinking: . A rough sketch of what I mean is below.
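Something along these lines might work, but take it with a grain of salt since, as said, I'm not an OpenShift admin (it assumes a CRI-O node with cgroup v1; the paths and names are illustrative):

# on OpenShift, one way onto the worker node without SSH
$ oc debug node/<worker-node-name>
sh-4.4# chroot /host

# total number of processes on the node
sh-4.4# ps -ef | wc -l

# PID usage of the applicationset-controller container itself
sh-4.4# crictl ps --name applicationset-controller -q
sh-4.4# cat /sys/fs/cgroup/pids/kubepods.slice/<pod-slice>/crio-<container-id>.scope/pids.current
sh-4.4# cat /sys/fs/cgroup/pids/kubepods.slice/<pod-slice>/crio-<container-id>.scope/pids.max

If pids.current sits at pids.max for that container, the fork failures inside the Pod would be explained even though the node as a whole still has headroom.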

Beyond this, I'm not sure what else to suggest. It's the operating system that is blocking the fork, and thus it's either OpenShift/Kubernetes, or the actual underlying Linux Kernel/OS that's returning that error, and it doesn't seem like the ApplicationSet controller is contributing to the issue (at least as far as you have described thus far).

alexanderdalloz commented 3 years ago

Hi @jgwest, while we try to find out which load is hogging the node resources, we would like to implement a detection mechanism to automatically restart the applicationset-controller Pod in such a case, ideally not on the same node. applicationset-controller --help lists

  -metrics-addr string
        The address the metric endpoint binds to. (default ":8080")

as a possible parameter. Would that allow for a liveness probe? Or -probe-addr string?
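What we have in mind is roughly the following, assuming the metrics endpoint really serves plain HTTP on :8080 (the namespace, container name and path below are from our default install and may differ):

# attach a liveness probe to the controller via a strategic merge patch;
# port/path follow the "-metrics-addr" default of ":8080" and are unverified
$ oc -n argocd patch deployment argocd-applicationset-controller -p '
spec:
  template:
    spec:
      containers:
      - name: argocd-applicationset-controller
        livenessProbe:
          httpGet:
            path: /metrics
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 30
          failureThreshold: 3
'

One doubt we have: when the Pod runs into the fork problem, the Go process itself probably still answers HTTP (serving metrics needs no fork), so an httpGet probe might not catch exactly this failure mode, whereas an exec probe has to spawn a process inside the container and would therefore fail in that situation.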

jgwest commented 3 years ago

Those values are exposed, but my understanding is they came from the kubebuilder template and haven't had a lot of love since then, so they aren't guaranteed to actually work as expected :smile: . For example: https://github.com/argoproj-labs/applicationset/issues/123

If you do have success using them, I'd be interested to hear.

alexanderdalloz commented 3 years ago

Hi @jgwest, I hope we can revive activity on this issue, as it persists with ApplicationSet Controller v0.2.0. Do you have a hint on how to debug the situation?

To give an example, here is the process table from within the applicationset-controller Pod, which accumulates zombie processes up to the defined limit:

1000680000@argocd-applicationset-controller-55b7874c46-s4v9x:/$ ps axuwwww
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
1000680+       1  0.9  0.0 770824 100520 ?       Ssl  07:05   3:52 applicationset-controller --loglevel warn
1000680+     344  0.0  0.0      0     0 ?        Z    07:05   0:00 [sh] <defunct>
1000680+     368  0.0  0.0      0     0 ?        Z    07:05   0:00 [sh] <defunct>
1000680+     409  0.0  0.0      0     0 ?        Z    07:05   0:00 [sh] <defunct>
1000680+     412  0.0  0.0      0     0 ?        Z    07:05   0:00 [sh] <defunct>
1000680+     415  0.0  0.0      0     0 ?        Z    07:05   0:00 [sh] <defunct>
1000680+     418  0.0  0.0      0     0 ?        Z    07:05   0:00 [sh] <defunct>
1000680+     421  0.0  0.0      0     0 ?        Z    07:05   0:00 [sh] <defunct>
1000680+     424  0.0  0.0      0     0 ?        Z    07:05   0:00 [sh] <defunct>
1000680+     427  0.0  0.0      0     0 ?        Z    07:05   0:00 [sh] <defunct>
1000680+     430  0.0  0.0      0     0 ?        Z    07:05   0:00 [sh] <defunct>
1000680+     433  0.0  0.0      0     0 ?        Z    07:05   0:00 [sh] <defunct>
1000680+     436  0.0  0.0      0     0 ?        Z    07:05   0:00 [sh] <defunct>
1000680+     439  0.0  0.0      0     0 ?        Z    07:05   0:00 [sh] <defunct>
1000680+     442  0.0  0.0      0     0 ?        Z    07:05   0:00 [sh] <defunct>
1000680+     461  0.0  0.0      0     0 ?        Z    07:05   0:00 [sh] <defunct>
1000680+     464  0.0  0.0      0     0 ?        Z    07:05   0:00 [sh] <defunct>
1000680+     467  0.0  0.0      0     0 ?        Z    07:05   0:00 [sh] <defunct>
1000680+     470  0.0  0.0      0     0 ?        Z    07:05   0:00 [sh] <defunct>
1000680+     473  0.0  0.0      0     0 ?        Z    07:06   0:00 [sh] <defunct>
1000680+     476  0.0  0.0      0     0 ?        Z    07:06   0:00 [sh] <defunct>
1000680+     487  0.0  0.0      0     0 ?        Z    07:06   0:00 [sh] <defunct>
1000680+     490  0.0  0.0      0     0 ?        Z    07:06   0:00 [sh] <defunct>
1000680+     493  0.0  0.0      0     0 ?        Z    07:06   0:00 [sh] <defunct>
1000680+     496  0.0  0.0      0     0 ?        Z    07:06   0:00 [sh] <defunct>
1000680+     499  0.0  0.0      0     0 ?        Z    07:06   0:00 [sh] <defunct>
1000680+     502  0.0  0.0      0     0 ?        Z    07:06   0:00 [sh] <defunct>
1000680+     505  0.0  0.0      0     0 ?        Z    07:06   0:00 [sh] <defunct>
1000680+     508  0.0  0.0      0     0 ?        Z    07:06   0:00 [sh] <defunct>
1000680+     686  0.0  0.0      0     0 ?        Z    07:07   0:00 [sh] <defunct>
1000680+     689  0.0  0.0      0     0 ?        Z    07:07   0:00 [sh] <defunct>
1000680+     772  0.0  0.0      0     0 ?        Z    07:08   0:00 [sh] <defunct>
1000680+     775  0.0  0.0      0     0 ?        Z    07:08   0:00 [sh] <defunct>
1000680+    1280  0.0  0.0      0     0 ?        Z    07:11   0:00 [sh] <defunct>
1000680+    1283  0.0  0.0      0     0 ?        Z    07:11   0:00 [sh] <defunct>
1000680+    1603  0.0  0.0   5776  1972 pts/0    Ss   07:14   0:00 sh -i -c TERM=xterm sh
1000680+    1609  0.0  0.0   5776  1920 pts/0    S    07:14   0:00 sh
1000680+    1622  0.0  0.0   7612  4416 pts/0    S+   07:14   0:00 /bin/bash
1000680+    1972  0.0  0.0      0     0 ?        Z    07:16   0:00 [sh] <defunct>
1000680+    1975  0.0  0.0      0     0 ?        Z    07:16   0:00 [sh] <defunct>
1000680+    3072  0.0  0.0      0     0 ?        Z    07:27   0:00 [sh] <defunct>
1000680+    3075  0.0  0.0      0     0 ?        Z    07:27   0:00 [sh] <defunct>
1000680+    4821  0.0  0.0      0     0 ?        Z    07:44   0:00 [sh] <defunct>
1000680+    4824  0.0  0.0      0     0 ?        Z    07:44   0:00 [sh] <defunct>
1000680+    7050  0.0  0.0      0     0 ?        Z    08:01   0:00 [sh] <defunct>
1000680+    7053  0.0  0.0      0     0 ?        Z    08:01   0:00 [sh] <defunct>
1000680+    9100  0.0  0.0      0     0 ?        Z    08:17   0:00 [sh] <defunct>
1000680+    9103  0.0  0.0      0     0 ?        Z    08:17   0:00 [sh] <defunct>
1000680+   10798  0.0  0.0      0     0 ?        Z    08:34   0:00 [sh] <defunct>
1000680+   10801  0.0  0.0      0     0 ?        Z    08:34   0:00 [sh] <defunct>
1000680+   12395  0.0  0.0      0     0 ?        Z    08:51   0:00 [sh] <defunct>
1000680+   12398  0.0  0.0      0     0 ?        Z    08:51   0:00 [sh] <defunct>
1000680+   14992  0.0  0.0      0     0 ?        Z    09:07   0:00 [sh] <defunct>
1000680+   14995  0.0  0.0      0     0 ?        Z    09:07   0:00 [sh] <defunct>
1000680+   16652  0.0  0.0      0     0 ?        Z    09:24   0:00 [sh] <defunct>
1000680+   16655  0.0  0.0      0     0 ?        Z    09:24   0:00 [sh] <defunct>
1000680+   18516  0.0  0.0      0     0 ?        Z    09:41   0:00 [sh] <defunct>
1000680+   18527  0.0  0.0      0     0 ?        Z    09:41   0:00 [sh] <defunct>
1000680+   20639  0.0  0.0      0     0 ?        Z    09:57   0:00 [sh] <defunct>
1000680+   20642  0.0  0.0      0     0 ?        Z    09:57   0:00 [sh] <defunct>
1000680+   22363  0.0  0.0      0     0 ?        Z    10:14   0:00 [sh] <defunct>
1000680+   22366  0.0  0.0      0     0 ?        Z    10:14   0:00 [sh] <defunct>
1000680+   24213  0.0  0.0      0     0 ?        Z    10:31   0:00 [sh] <defunct>
1000680+   24216  0.0  0.0      0     0 ?        Z    10:31   0:00 [sh] <defunct>
1000680+   25872  0.0  0.0      0     0 ?        Z    10:48   0:00 [sh] <defunct>
1000680+   25875  0.0  0.0      0     0 ?        Z    10:48   0:00 [sh] <defunct>
1000680+   27696  0.0  0.0      0     0 ?        Z    11:04   0:00 [sh] <defunct>
1000680+   27699  0.0  0.0      0     0 ?        Z    11:04   0:00 [sh] <defunct>
1000680+   29260  0.0  0.0      0     0 ?        Z    11:21   0:00 [sh] <defunct>
1000680+   29263  0.0  0.0      0     0 ?        Z    11:21   0:00 [sh] <defunct>
1000680+   31056  0.0  0.0      0     0 ?        Z    11:38   0:00 [sh] <defunct>
1000680+   31059  0.0  0.0      0     0 ?        Z    11:38   0:00 [sh] <defunct>
1000680+   32839  0.0  0.0      0     0 ?        Z    11:54   0:00 [sh] <defunct>
1000680+   32842  0.0  0.0      0     0 ?        Z    11:54   0:00 [sh] <defunct>
1000680+   34811  0.0  0.0      0     0 ?        Z    12:11   0:00 [sh] <defunct>
1000680+   34814  0.0  0.0      0     0 ?        Z    12:11   0:00 [sh] <defunct>
1000680+   36391  0.0  0.0      0     0 ?        Z    12:28   0:00 [sh] <defunct>
1000680+   36394  0.0  0.0      0     0 ?        Z    12:28   0:00 [sh] <defunct>
1000680+   37163  0.0  0.0      0     0 ?        Z    12:32   0:00 [sh] <defunct>
1000680+   37174  0.0  0.0      0     0 ?        Z    12:32   0:00 [sh] <defunct>
1000680+   38408  0.0  0.0      0     0 ?        Z    12:44   0:00 [sh] <defunct>
1000680+   38411  0.0  0.0      0     0 ?        Z    12:44   0:00 [sh] <defunct>
1000680+   40174  0.0  0.0      0     0 ?        Z    13:01   0:00 [sh] <defunct>
1000680+   40177  0.0  0.0      0     0 ?        Z    13:01   0:00 [sh] <defunct>
1000680+   41909  0.0  0.0      0     0 ?        Z    13:18   0:00 [sh] <defunct>
1000680+   41912  0.0  0.0      0     0 ?        Z    13:18   0:00 [sh] <defunct>
1000680+   43791  0.0  0.0      0     0 ?        Z    13:34   0:00 [sh] <defunct>
1000680+   43794  0.0  0.0      0     0 ?        Z    13:34   0:00 [sh] <defunct>
1000680+   45531  0.0  0.0      0     0 ?        Z    13:51   0:00 [sh] <defunct>
1000680+   45534  0.0  0.0      0     0 ?        Z    13:51   0:00 [sh] <defunct>
1000680+   46854  0.0  0.0   7612  4620 pts/1    Ss   14:01   0:00 bash
1000680+   47650  0.0  0.0      0     0 ?        Z    14:08   0:00 [sh] <defunct>
1000680+   47653  0.0  0.0      0     0 ?        Z    14:08   0:00 [sh] <defunct>
1000680+   47817  0.0  0.0   9984  3820 pts/1    R+   14:10   0:00 ps axuwwww
1000680000@argocd-applicationset-controller-55b7874c46-s4v9x:/$

PPID of the zombies is PID 1.

The limit:

sh-4.4# crio config | egrep 'pids_limit'
pids_limit = 1024

We access our Bitbucket-hosted git repos via SSH.
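For the detection mechanism mentioned above, something like the following is what we are considering as a stopgap (the label selector and the threshold are just examples from our setup and would need to be adapted):

# count zombie (defunct) processes inside the controller Pod
$ oc -n argocd exec deploy/argocd-applicationset-controller -- ps -eo stat | grep -c '^Z'

# recycle the Pod well before the CRI-O pids_limit of 1024 is reached;
# the label selector depends on the install manifests in use
$ ZOMBIES=$(oc -n argocd exec deploy/argocd-applicationset-controller -- ps -eo stat | grep -c '^Z')
$ [ "$ZOMBIES" -gt 900 ] && oc -n argocd delete pod -l app.kubernetes.io/name=argocd-applicationset-controller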

alexanderdalloz commented 3 years ago

@jgwest we are a step further in debugging the situation. We are able to correlate the occurrence of the defunct processes with the occurrence of the log messages

level=error msg="error generating application from params" error="Error during fetching repo: `git fetch origin master --tags --force` failed exit status 128:

with 100% accuracy. The git fetch errors result from different situations where the git repo branch is not fetchable, for instance when we have provisioned an application for a team via ApplicationSet, including generating the GitOps repo for the Helm chart, but the team has not yet populated their repo branch.

Am I correct that you fully rely on the https://github.com/go-git/go-git/ code and that the behaviour would have to be fixed there?