alexanderdalloz opened this issue 3 years ago
Hi @alexanderdalloz, I think the error message isn't related to ssh_known_hosts; it just looks like it is, based on the error output format. The actual error message is this part: failed exit status 128: error: cannot fork() for ssh
(more info here)
When the git command runs, it creates a new ssh process in order to connect to the Git repository via SSH, and that is what this error message is about. It is saying that, for whatever reason, the operating system is unable to create a new process.
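To make that failure path concrete, here is a minimal sketch (not the ApplicationSet controller's actual code; the working directory is a made-up placeholder) of how a fork failure inside git surfaces when Go code shells out to git over SSH:

package main

import (
    "bytes"
    "fmt"
    "os/exec"
)

func main() {
    // Hypothetical clone directory; the real controller manages its own checkouts.
    cmd := exec.Command("git", "fetch", "origin", "master", "--tags", "--force")
    cmd.Dir = "/tmp/some-clone"
    var stderr bytes.Buffer
    cmd.Stderr = &stderr

    // git fork()s an ssh child to reach the remote. If the kernel (or the
    // container's pids cgroup) refuses the fork, git exits with status 128 and
    // prints "error: cannot fork() for ssh" on stderr, which is the text that
    // then gets wrapped into the controller's log message.
    if err := cmd.Run(); err != nil {
        fmt.Printf("git fetch failed: %v: %s\n", err, stderr.String())
    }
}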
A few thoughts:
When you run ps -ef, how many git processes does it show? (I'm hoping there aren't a bunch of leftover git processes.)
Hi @jgwest, please let me answer your questions:
Yes, SCCs are in place: the OpenShift default ones plus a few custom ones; no SCC specific to Argo CD or the ApplicationSet controller.
$ oc get pods argocd-applicationset-controller-6dc5c45f9b-cnlzx -o yaml | grep -A7 security
    securityContext:
      capabilities:
        drop:
        - KILL
        - MKNOD
        - SETGID
        - SETUID
      runAsUser: 1000860000
--
  securityContext:
    fsGroup: 1000860000
    seLinuxOptions:
      level: s0:c29,c24
  serviceAccount: argocd-applicationset-controller
  serviceAccountName: argocd-applicationset-controller
  terminationGracePeriodSeconds: 30
  tolerations:
On a worker node:
sh-4.4# cat /proc/sys/kernel/pid_max
4194304
sh-4.4# grep -ni pid /etc/kubernetes/kubelet.conf
53: "SupportPodPidsLimit": true
There is no PID limit set as a kubelet parameter in the process list.
By chance I have seen an argocd-applicationset-controller Pod in Terminating state and a new one starting. Checking some clusters with a bit more activity, I found one where such an automated restart must have happened:
$ oc get pods
NAME READY STATUS RESTARTS AGE
argocd-application-controller-0 1/1 Running 0 15d
argocd-applicationset-controller-745c57dcb4-5wg5x 1/1 Running 0 104s
argocd-dex-server-596f45b989-wsqdf 1/1 Running 0 15d
argocd-redis-68dd9cbdb5-4n94c 1/1 Running 0 13d
argocd-repo-server-5c77996f94-l5rgr 1/1 Running 0 15d
argocd-server-7886dd956b-lq44z 1/1 Running 0 15d
Moments later, even on the same cluster:
$ oc get pods | grep controller
argocd-application-controller-0 1/1 Running 0 15d
argocd-applicationset-controller-5648f488b4-2zjp8 1/1 Running 0 2m51s
argocd-applicationset-controller-666dd8bb58-lg986 1/1 Running 0 38s
The older of the two ApplicationSet controller Pods then got terminated. Creating a new Pod and terminating the previous one happens frequently on that cluster (to my surprise).
On the cluster with the frequent Pod recreations (<5 min), as well as on another active cluster, I do not see many git processes; observing with
watch ps -ef
shows only an occasional single git process, which is quickly gone.
On the cluster where the first occurrence of a non-functional applicationset-controller Pod happened some time ago, the Pod still lives and operates:
argocd-applicationset-controller-58cc5559b9-2b6nc 1/1 Running 0 26h
On the cluster where the applicationset-controller Pod barely lives for 5 minutes, I spotted in the Pod log an error resulting from a dev team having produced an invalid Kubernetes object name. Would that cause frequent Pod recreations?
OK, everything there sounds correct, and it doesn't seem like anything you describe above would cause the issue.
Is it possible that some other container/OS process on the worker node is hogging all the available processes? I'm not a Kubernetes/OpenShift admin, so I don't know if there is an official way of checking this beyond just SSHing into the underlying worker node and running sudo ps -ef | wc -l to count the number of processes :thinking:.
Beyond this, I'm not sure what else to suggest. It's the operating system that is blocking the fork, and thus it's either OpenShift/Kubernetes, or the actual underlying Linux Kernel/OS that's returning that error, and it doesn't seem like the ApplicationSet controller is contributing to the issue (at least as far as you have described thus far).
Hi @jgwest, while trying to find out which load is hogging the node resources, we would like to implement a detection mechanism that automatically restarts the applicationset-controller Pod in such a case, ideally not on the same node.
applicationset-controller --help states
-metrics-addr string
      The address the metric endpoint binds to. (default ":8080")
as a possible parameter. Would that allow for a liveness probe? Or -probe-addr string?
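For what it's worth, independent of those flags, a crude detection could simply count defunct children inside the container. The following is a hypothetical sketch (not an existing tool or ApplicationSet feature) of a check that an exec-type liveness probe could run; the threshold is arbitrary:

package main

import (
    "fmt"
    "os"
    "strings"
)

func main() {
    const limit = 100 // arbitrary threshold for this sketch

    entries, err := os.ReadDir("/proc")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }

    zombies := 0
    for _, e := range entries {
        // Only purely numeric directory names under /proc are processes.
        if !e.IsDir() || strings.Trim(e.Name(), "0123456789") != "" {
            continue
        }
        data, err := os.ReadFile("/proc/" + e.Name() + "/stat")
        if err != nil {
            continue // process may have exited in the meantime
        }
        // /proc/<pid>/stat is "pid (comm) state ..."; the state letter follows
        // the closing parenthesis. 'Z' means zombie/defunct.
        if i := strings.LastIndexByte(string(data), ')'); i >= 0 {
            fields := strings.Fields(string(data[i+1:]))
            if len(fields) > 0 && fields[0] == "Z" {
                zombies++
            }
        }
    }

    fmt.Printf("defunct processes: %d\n", zombies)
    if zombies > limit {
        os.Exit(1) // non-zero exit makes the probe fail so kubelet restarts the Pod
    }
}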
Those values are exposed, but my understanding is they came from the kubebuilder template and haven't had a lot of love since then, so they aren't guaranteed to actually work as expected :smile: . For example: https://github.com/argoproj-labs/applicationset/issues/123
If you do have success using them, I'd be interested to hear.
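For reference, in the generic kubebuilder/controller-runtime scaffold those flags are typically wired up roughly as in the sketch below; this is a sketch of the template, not a claim about what this ApplicationSet version actually registers behind -probe-addr:

package main

import (
    "flag"
    "os"

    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/healthz"
)

func main() {
    var metricsAddr, probeAddr string
    flag.StringVar(&metricsAddr, "metrics-addr", ":8080", "The address the metric endpoint binds to.")
    flag.StringVar(&probeAddr, "probe-addr", ":8081", "The address the probe endpoint binds to.")
    flag.Parse()

    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
        MetricsBindAddress:     metricsAddr,
        HealthProbeBindAddress: probeAddr,
    })
    if err != nil {
        os.Exit(1)
    }

    // /healthz and /readyz only respond if these checks are registered;
    // if the scaffold stops here, the flags exist but the probes do nothing.
    if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
        os.Exit(1)
    }
    if err := mgr.AddReadyzCheck("readyz", healthz.Ping); err != nil {
        os.Exit(1)
    }

    if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
        os.Exit(1)
    }
}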
Hi @jgwest, I hope we can revive activity on this issue, as the problem persists with ApplicationSet controller v0.2.0. Do you have a hint on how to debug the situation?
Here is example output of the process table from within the applicationset-controller Pod, which accumulates zombie processes up to the defined limit:
1000680000@argocd-applicationset-controller-55b7874c46-s4v9x:/$ ps axuwwww
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
1000680+ 1 0.9 0.0 770824 100520 ? Ssl 07:05 3:52 applicationset-controller --loglevel warn
1000680+ 344 0.0 0.0 0 0 ? Z 07:05 0:00 [sh] <defunct>
1000680+ 368 0.0 0.0 0 0 ? Z 07:05 0:00 [sh] <defunct>
1000680+ 409 0.0 0.0 0 0 ? Z 07:05 0:00 [sh] <defunct>
1000680+ 412 0.0 0.0 0 0 ? Z 07:05 0:00 [sh] <defunct>
1000680+ 415 0.0 0.0 0 0 ? Z 07:05 0:00 [sh] <defunct>
1000680+ 418 0.0 0.0 0 0 ? Z 07:05 0:00 [sh] <defunct>
1000680+ 421 0.0 0.0 0 0 ? Z 07:05 0:00 [sh] <defunct>
1000680+ 424 0.0 0.0 0 0 ? Z 07:05 0:00 [sh] <defunct>
1000680+ 427 0.0 0.0 0 0 ? Z 07:05 0:00 [sh] <defunct>
1000680+ 430 0.0 0.0 0 0 ? Z 07:05 0:00 [sh] <defunct>
1000680+ 433 0.0 0.0 0 0 ? Z 07:05 0:00 [sh] <defunct>
1000680+ 436 0.0 0.0 0 0 ? Z 07:05 0:00 [sh] <defunct>
1000680+ 439 0.0 0.0 0 0 ? Z 07:05 0:00 [sh] <defunct>
1000680+ 442 0.0 0.0 0 0 ? Z 07:05 0:00 [sh] <defunct>
1000680+ 461 0.0 0.0 0 0 ? Z 07:05 0:00 [sh] <defunct>
1000680+ 464 0.0 0.0 0 0 ? Z 07:05 0:00 [sh] <defunct>
1000680+ 467 0.0 0.0 0 0 ? Z 07:05 0:00 [sh] <defunct>
1000680+ 470 0.0 0.0 0 0 ? Z 07:05 0:00 [sh] <defunct>
1000680+ 473 0.0 0.0 0 0 ? Z 07:06 0:00 [sh] <defunct>
1000680+ 476 0.0 0.0 0 0 ? Z 07:06 0:00 [sh] <defunct>
1000680+ 487 0.0 0.0 0 0 ? Z 07:06 0:00 [sh] <defunct>
1000680+ 490 0.0 0.0 0 0 ? Z 07:06 0:00 [sh] <defunct>
1000680+ 493 0.0 0.0 0 0 ? Z 07:06 0:00 [sh] <defunct>
1000680+ 496 0.0 0.0 0 0 ? Z 07:06 0:00 [sh] <defunct>
1000680+ 499 0.0 0.0 0 0 ? Z 07:06 0:00 [sh] <defunct>
1000680+ 502 0.0 0.0 0 0 ? Z 07:06 0:00 [sh] <defunct>
1000680+ 505 0.0 0.0 0 0 ? Z 07:06 0:00 [sh] <defunct>
1000680+ 508 0.0 0.0 0 0 ? Z 07:06 0:00 [sh] <defunct>
1000680+ 686 0.0 0.0 0 0 ? Z 07:07 0:00 [sh] <defunct>
1000680+ 689 0.0 0.0 0 0 ? Z 07:07 0:00 [sh] <defunct>
1000680+ 772 0.0 0.0 0 0 ? Z 07:08 0:00 [sh] <defunct>
1000680+ 775 0.0 0.0 0 0 ? Z 07:08 0:00 [sh] <defunct>
1000680+ 1280 0.0 0.0 0 0 ? Z 07:11 0:00 [sh] <defunct>
1000680+ 1283 0.0 0.0 0 0 ? Z 07:11 0:00 [sh] <defunct>
1000680+ 1603 0.0 0.0 5776 1972 pts/0 Ss 07:14 0:00 sh -i -c TERM=xterm sh
1000680+ 1609 0.0 0.0 5776 1920 pts/0 S 07:14 0:00 sh
1000680+ 1622 0.0 0.0 7612 4416 pts/0 S+ 07:14 0:00 /bin/bash
1000680+ 1972 0.0 0.0 0 0 ? Z 07:16 0:00 [sh] <defunct>
1000680+ 1975 0.0 0.0 0 0 ? Z 07:16 0:00 [sh] <defunct>
1000680+ 3072 0.0 0.0 0 0 ? Z 07:27 0:00 [sh] <defunct>
1000680+ 3075 0.0 0.0 0 0 ? Z 07:27 0:00 [sh] <defunct>
1000680+ 4821 0.0 0.0 0 0 ? Z 07:44 0:00 [sh] <defunct>
1000680+ 4824 0.0 0.0 0 0 ? Z 07:44 0:00 [sh] <defunct>
1000680+ 7050 0.0 0.0 0 0 ? Z 08:01 0:00 [sh] <defunct>
1000680+ 7053 0.0 0.0 0 0 ? Z 08:01 0:00 [sh] <defunct>
1000680+ 9100 0.0 0.0 0 0 ? Z 08:17 0:00 [sh] <defunct>
1000680+ 9103 0.0 0.0 0 0 ? Z 08:17 0:00 [sh] <defunct>
1000680+ 10798 0.0 0.0 0 0 ? Z 08:34 0:00 [sh] <defunct>
1000680+ 10801 0.0 0.0 0 0 ? Z 08:34 0:00 [sh] <defunct>
1000680+ 12395 0.0 0.0 0 0 ? Z 08:51 0:00 [sh] <defunct>
1000680+ 12398 0.0 0.0 0 0 ? Z 08:51 0:00 [sh] <defunct>
1000680+ 14992 0.0 0.0 0 0 ? Z 09:07 0:00 [sh] <defunct>
1000680+ 14995 0.0 0.0 0 0 ? Z 09:07 0:00 [sh] <defunct>
1000680+ 16652 0.0 0.0 0 0 ? Z 09:24 0:00 [sh] <defunct>
1000680+ 16655 0.0 0.0 0 0 ? Z 09:24 0:00 [sh] <defunct>
1000680+ 18516 0.0 0.0 0 0 ? Z 09:41 0:00 [sh] <defunct>
1000680+ 18527 0.0 0.0 0 0 ? Z 09:41 0:00 [sh] <defunct>
1000680+ 20639 0.0 0.0 0 0 ? Z 09:57 0:00 [sh] <defunct>
1000680+ 20642 0.0 0.0 0 0 ? Z 09:57 0:00 [sh] <defunct>
1000680+ 22363 0.0 0.0 0 0 ? Z 10:14 0:00 [sh] <defunct>
1000680+ 22366 0.0 0.0 0 0 ? Z 10:14 0:00 [sh] <defunct>
1000680+ 24213 0.0 0.0 0 0 ? Z 10:31 0:00 [sh] <defunct>
1000680+ 24216 0.0 0.0 0 0 ? Z 10:31 0:00 [sh] <defunct>
1000680+ 25872 0.0 0.0 0 0 ? Z 10:48 0:00 [sh] <defunct>
1000680+ 25875 0.0 0.0 0 0 ? Z 10:48 0:00 [sh] <defunct>
1000680+ 27696 0.0 0.0 0 0 ? Z 11:04 0:00 [sh] <defunct>
1000680+ 27699 0.0 0.0 0 0 ? Z 11:04 0:00 [sh] <defunct>
1000680+ 29260 0.0 0.0 0 0 ? Z 11:21 0:00 [sh] <defunct>
1000680+ 29263 0.0 0.0 0 0 ? Z 11:21 0:00 [sh] <defunct>
1000680+ 31056 0.0 0.0 0 0 ? Z 11:38 0:00 [sh] <defunct>
1000680+ 31059 0.0 0.0 0 0 ? Z 11:38 0:00 [sh] <defunct>
1000680+ 32839 0.0 0.0 0 0 ? Z 11:54 0:00 [sh] <defunct>
1000680+ 32842 0.0 0.0 0 0 ? Z 11:54 0:00 [sh] <defunct>
1000680+ 34811 0.0 0.0 0 0 ? Z 12:11 0:00 [sh] <defunct>
1000680+ 34814 0.0 0.0 0 0 ? Z 12:11 0:00 [sh] <defunct>
1000680+ 36391 0.0 0.0 0 0 ? Z 12:28 0:00 [sh] <defunct>
1000680+ 36394 0.0 0.0 0 0 ? Z 12:28 0:00 [sh] <defunct>
1000680+ 37163 0.0 0.0 0 0 ? Z 12:32 0:00 [sh] <defunct>
1000680+ 37174 0.0 0.0 0 0 ? Z 12:32 0:00 [sh] <defunct>
1000680+ 38408 0.0 0.0 0 0 ? Z 12:44 0:00 [sh] <defunct>
1000680+ 38411 0.0 0.0 0 0 ? Z 12:44 0:00 [sh] <defunct>
1000680+ 40174 0.0 0.0 0 0 ? Z 13:01 0:00 [sh] <defunct>
1000680+ 40177 0.0 0.0 0 0 ? Z 13:01 0:00 [sh] <defunct>
1000680+ 41909 0.0 0.0 0 0 ? Z 13:18 0:00 [sh] <defunct>
1000680+ 41912 0.0 0.0 0 0 ? Z 13:18 0:00 [sh] <defunct>
1000680+ 43791 0.0 0.0 0 0 ? Z 13:34 0:00 [sh] <defunct>
1000680+ 43794 0.0 0.0 0 0 ? Z 13:34 0:00 [sh] <defunct>
1000680+ 45531 0.0 0.0 0 0 ? Z 13:51 0:00 [sh] <defunct>
1000680+ 45534 0.0 0.0 0 0 ? Z 13:51 0:00 [sh] <defunct>
1000680+ 46854 0.0 0.0 7612 4620 pts/1 Ss 14:01 0:00 bash
1000680+ 47650 0.0 0.0 0 0 ? Z 14:08 0:00 [sh] <defunct>
1000680+ 47653 0.0 0.0 0 0 ? Z 14:08 0:00 [sh] <defunct>
1000680+ 47817 0.0 0.0 9984 3820 pts/1 R+ 14:10 0:00 ps axuwwww
1000680000@argocd-applicationset-controller-55b7874c46-s4v9x:/$
The PPID of the zombies is 1, i.e. the applicationset-controller process itself.
The limit:
sh-4.4# crio config | egrep 'pids_limit'
pids_limit = 1024
We access our Bitbucket-hosted Git repos via SSH.
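The PPID-1 detail fits the classic zombie mechanism: inside the container the controller itself is PID 1, and any child it starts but never waits on stays <defunct> until it is reaped. A purely illustrative Go sketch of that mechanism (not the controller's code):

package main

import (
    "fmt"
    "os/exec"
    "time"
)

func main() {
    // Spawn a short-lived shell, the kind of child that shows up as
    // [sh] <defunct> in the ps output above.
    cmd := exec.Command("sh", "-c", "exit 1")
    if err := cmd.Start(); err != nil {
        panic(err)
    }

    // The shell exits almost immediately, but because Wait() is never called
    // here, the kernel keeps its exit status around and ps shows the process
    // as a zombie parented to us (PID 1 inside a container).
    fmt.Println("run ps axuwww in another shell during the next 60 seconds")
    time.Sleep(60 * time.Second)

    // Calling Wait reaps the child and the <defunct> entry disappears.
    _ = cmd.Wait()
}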
@jgwest, we are a step further in debugging the situation. We can correlate the occurrence of the defunct processes 100% with the occurrence of the log message
level=error msg="error generating application from params" error="Error during fetching repo:
git fetch origin master --tags --force failed exit status 128:
The git fetch errors result from various situations where the Git repo branch is not fetchable, for instance when we have provisioned the application for a team via ApplicationSet (including generating the GitOps repo for the Helm chart) while the team has not yet populated their repo branch.
Am I correct that you fully rely on the https://github.com/go-git/go-git/ code and that the behaviour would have to be fixed there?
Today we faced, for the second time and on a different cluster than before, the case that the argocd-applicationset-controller v0.1.0 Pod has lost its ConfigMap mount of the ssh-known-hosts.
The following is logged by the Pod in such a case:
No mass event of this type of defect was seen for other applications. All other Argo CD Pods on the same cluster do not have that issue at the same time; e.g. the argocd-repo-server, which has the same ConfigMap mounted, runs just fine. It has been running for 15 days and has 18k log lines available by now.