alexanderdalloz opened this issue 3 years ago
Hi @alexanderdalloz, I think the error message isn't related to ssh_known_hosts; it just looks like it is, based on the error output format. The actual error message is this part: failed exit status 128: error: cannot fork() for ssh
(more info here)
When the git command runs, it creates a new ssh process in order to connect to the Git repository via SSH, and that is what this error message is about. It is saying that, for whatever reason, the operating system is unable to create a new process.
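To make that failure path concrete, here is a minimal sketch (not the ApplicationSet controller's actual code; the working directory is a made-up placeholder) of how a fork failure inside git surfaces when Go code shells out to git over SSH:

package main

import (
    "bytes"
    "fmt"
    "os/exec"
)

func main() {
    // Hypothetical clone directory; the real controller manages its own checkouts.
    cmd := exec.Command("git", "fetch", "origin", "master", "--tags", "--force")
    cmd.Dir = "/tmp/some-clone"
    var stderr bytes.Buffer
    cmd.Stderr = &stderr

    // git fork()s an ssh child to reach the remote. If the kernel (or the
    // container's pids cgroup) refuses the fork, git exits with status 128 and
    // prints "error: cannot fork() for ssh" on stderr, which is the text that
    // then gets wrapped into the controller's log message.
    if err := cmd.Run(); err != nil {
        fmt.Printf("git fetch failed: %v: %s\n", err, stderr.String())
    }
}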
A few thoughts:
When you run ps -ef, how many git processes does it show? (I'm hoping there aren't a bunch of leftover git processes.)
Hi @jgwest, please let me answer your questions:
Yes, SCCs are in place: the OpenShift default ones plus a few custom ones; no SCC specific to Argo CD or the ApplicationSet controller.
$ oc get pods argocd-applicationset-controller-6dc5c45f9b-cnlzx -o yaml | grep -A7 security
    securityContext:
      capabilities:
        drop:
        - KILL
        - MKNOD
        - SETGID
        - SETUID
      runAsUser: 1000860000
--
  securityContext:
    fsGroup: 1000860000
    seLinuxOptions:
      level: s0:c29,c24
  serviceAccount: argocd-applicationset-controller
  serviceAccountName: argocd-applicationset-controller
  terminationGracePeriodSeconds: 30
  tolerations:
On a worker node:
sh-4.4# cat /proc/sys/kernel/pid_max
4194304
sh-4.4# grep -ni pid /etc/kubernetes/kubelet.conf
53: "SupportPodPidsLimit": true
There is no PID limit set as a kubelet parameter in the process list.
By chance I have seen an argocd-applicationset-controller Pod in Terminating state and a new one starting. Checking some clusters with a bit more activity, I found one where such an automated restart must have happened:
$ oc get pods
NAME READY STATUS RESTARTS AGE
argocd-application-controller-0 1/1 Running 0 15d
argocd-applicationset-controller-745c57dcb4-5wg5x 1/1 Running 0 104s
argocd-dex-server-596f45b989-wsqdf 1/1 Running 0 15d
argocd-redis-68dd9cbdb5-4n94c 1/1 Running 0 13d
argocd-repo-server-5c77996f94-l5rgr 1/1 Running 0 15d
argocd-server-7886dd956b-lq44z 1/1 Running 0 15d
Moments later, even on the same cluster:
$ oc get pods | grep controller
argocd-application-controller-0 1/1 Running 0 15d
argocd-applicationset-controller-5648f488b4-2zjp8 1/1 Running 0 2m51s
argocd-applicationset-controller-666dd8bb58-lg986 1/1 Running 0 38s
The older of the two ApplicationSet controller Pods then got terminated. Creating a new Pod and terminating the previous one happens frequently on that cluster (to my surprise).
On the cluster with the frequent Pod recreations (<5 min), as well as on another active cluster, I do not see many git processes; observing with
watch ps -ef
shows only an occasional single git process, which is quickly gone.
On the cluster where the first occurrence of a non-functional applicationset-controller Pod happened some time ago, the Pod still lives and operates:
argocd-applicationset-controller-58cc5559b9-2b6nc 1/1 Running 0 26h
On the cluster where the applicationset-controller Pod barely lives for 5 minutes, I spotted in the Pod log an error resulting from a dev team having produced an invalid Kubernetes object name. Would that cause frequent Pod recreations?
OK, everything there sounds correct, and it doesn't seem like anything you describe above would cause the issue.
Is it possible that some other container/OS process on the worker node is hogging all the available processes? I'm not a Kubernetes/OpenShift admin, so I don't know if there is an official way of checking this beyond just SSHing into the underlying worker node and running sudo ps -ef | wc -l to count the number of processes :thinking:.
Beyond this, I'm not sure what else to suggest. It's the operating system that is blocking the fork, and thus it's either OpenShift/Kubernetes, or the actual underlying Linux Kernel/OS that's returning that error, and it doesn't seem like the ApplicationSet controller is contributing to the issue (at least as far as you have described thus far).
Hi @jgwest, while trying to find out which load is hogging the node resources, we would like to implement a detection mechanism that automatically restarts the applicationset-controller Pod in such a case, ideally not on the same node.
applicationset-controller --help states
-metrics-addr string
      The address the metric endpoint binds to. (default ":8080")
as a possible parameter. Would that allow for a liveness probe? Or -probe-addr string?
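For what it's worth, independent of those flags, a crude detection could simply count defunct children inside the container. The following is a hypothetical sketch (not an existing tool or ApplicationSet feature) of a check that an exec-type liveness probe could run; the threshold is arbitrary:

package main

import (
    "fmt"
    "os"
    "strings"
)

func main() {
    const limit = 100 // arbitrary threshold for this sketch

    entries, err := os.ReadDir("/proc")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }

    zombies := 0
    for _, e := range entries {
        // Only purely numeric directory names under /proc are processes.
        if !e.IsDir() || strings.Trim(e.Name(), "0123456789") != "" {
            continue
        }
        data, err := os.ReadFile("/proc/" + e.Name() + "/stat")
        if err != nil {
            continue // process may have exited in the meantime
        }
        // /proc/<pid>/stat is "pid (comm) state ..."; the state letter follows
        // the closing parenthesis. 'Z' means zombie/defunct.
        if i := strings.LastIndexByte(string(data), ')'); i >= 0 {
            fields := strings.Fields(string(data[i+1:]))
            if len(fields) > 0 && fields[0] == "Z" {
                zombies++
            }
        }
    }

    fmt.Printf("defunct processes: %d\n", zombies)
    if zombies > limit {
        os.Exit(1) // non-zero exit makes the probe fail so kubelet restarts the Pod
    }
}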
Those values are exposed, but my understanding is they came from the kubebuilder template and haven't had a lot of love since then, so they aren't guaranteed to actually work as expected :smile: . For example: https://github.com/argoproj-labs/applicationset/issues/123
If you do have success using them, I'd be interested to hear.
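For reference, in the generic kubebuilder/controller-runtime scaffold those flags are typically wired up roughly as in the sketch below; this is a sketch of the template, not a claim about what this ApplicationSet version actually registers behind -probe-addr:

package main

import (
    "flag"
    "os"

    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/healthz"
)

func main() {
    var metricsAddr, probeAddr string
    flag.StringVar(&metricsAddr, "metrics-addr", ":8080", "The address the metric endpoint binds to.")
    flag.StringVar(&probeAddr, "probe-addr", ":8081", "The address the probe endpoint binds to.")
    flag.Parse()

    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
        MetricsBindAddress:     metricsAddr,
        HealthProbeBindAddress: probeAddr,
    })
    if err != nil {
        os.Exit(1)
    }

    // /healthz and /readyz only respond if these checks are registered;
    // if the scaffold stops here, the flags exist but the probes do nothing.
    if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
        os.Exit(1)
    }
    if err := mgr.AddReadyzCheck("readyz", healthz.Ping); err != nil {
        os.Exit(1)
    }

    if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
        os.Exit(1)
    }
}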
Hi @jgwest, I hope we can revive activity on this issue, as the problem persists with ApplicationSet controller v0.2.0. Do you have a hint on how to debug the situation?
Here is example output of the process table from within the applicationset-controller Pod, which accumulates zombie processes up to the defined limit:
1000680000@argocd-applicationset-controller-55b7874c46-s4v9x:/$ ps axuwwww
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
1000680+ 1 0.9 0.0 770824 100520 ? Ssl 07:05 3:52 applicationset-controller --loglevel warn
1000680+ 344 0.0 0.0 0 0 ? Z 07:05 0:00 [sh] <defunct>
1000680+ 368 0.0 0.0 0 0 ? Z 07:05 0:00 [sh] <defunct>
1000680+ 409 0.0 0.0 0 0 ? Z 07:05 0:00 [sh] <defunct>
1000680+ 412 0.0 0.0 0 0 ? Z 07:05 0:00 [sh] <defunct>
1000680+ 415 0.0 0.0 0 0 ? Z 07:05 0:00 [sh] <defunct>
1000680+ 418 0.0 0.0 0 0 ? Z 07:05 0:00 [sh] <defunct>
1000680+ 421 0.0 0.0 0 0 ? Z 07:05 0:00 [sh] <defunct>
1000680+ 424 0.0 0.0 0 0 ? Z 07:05 0:00 [sh] <defunct>
1000680+ 427 0.0 0.0 0 0 ? Z 07:05 0:00 [sh] <defunct>
1000680+ 430 0.0 0.0 0 0 ? Z 07:05 0:00 [sh] <defunct>
1000680+ 433 0.0 0.0 0 0 ? Z 07:05 0:00 [sh] <defunct>
1000680+ 436 0.0 0.0 0 0 ? Z 07:05 0:00 [sh] <defunct>
1000680+ 439 0.0 0.0 0 0 ? Z 07:05 0:00 [sh] <defunct>
1000680+ 442 0.0 0.0 0 0 ? Z 07:05 0:00 [sh] <defunct>
1000680+ 461 0.0 0.0 0 0 ? Z 07:05 0:00 [sh] <defunct>
1000680+ 464 0.0 0.0 0 0 ? Z 07:05 0:00 [sh] <defunct>
1000680+ 467 0.0 0.0 0 0 ? Z 07:05 0:00 [sh] <defunct>
1000680+ 470 0.0 0.0 0 0 ? Z 07:05 0:00 [sh] <defunct>
1000680+ 473 0.0 0.0 0 0 ? Z 07:06 0:00 [sh] <defunct>
1000680+ 476 0.0 0.0 0 0 ? Z 07:06 0:00 [sh] <defunct>
1000680+ 487 0.0 0.0 0 0 ? Z 07:06 0:00 [sh] <defunct>
1000680+ 490 0.0 0.0 0 0 ? Z 07:06 0:00 [sh] <defunct>
1000680+ 493 0.0 0.0 0 0 ? Z 07:06 0:00 [sh] <defunct>
1000680+ 496 0.0 0.0 0 0 ? Z 07:06 0:00 [sh] <defunct>
1000680+ 499 0.0 0.0 0 0 ? Z 07:06 0:00 [sh] <defunct>
1000680+ 502 0.0 0.0 0 0 ? Z 07:06 0:00 [sh] <defunct>
1000680+ 505 0.0 0.0 0 0 ? Z 07:06 0:00 [sh] <defunct>
1000680+ 508 0.0 0.0 0 0 ? Z 07:06 0:00 [sh] <defunct>
1000680+ 686 0.0 0.0 0 0 ? Z 07:07 0:00 [sh] <defunct>
1000680+ 689 0.0 0.0 0 0 ? Z 07:07 0:00 [sh] <defunct>
1000680+ 772 0.0 0.0 0 0 ? Z 07:08 0:00 [sh] <defunct>
1000680+ 775 0.0 0.0 0 0 ? Z 07:08 0:00 [sh] <defunct>
1000680+ 1280 0.0 0.0 0 0 ? Z 07:11 0:00 [sh] <defunct>
1000680+ 1283 0.0 0.0 0 0 ? Z 07:11 0:00 [sh] <defunct>
1000680+ 1603 0.0 0.0 5776 1972 pts/0 Ss 07:14 0:00 sh -i -c TERM=xterm sh
1000680+ 1609 0.0 0.0 5776 1920 pts/0 S 07:14 0:00 sh
1000680+ 1622 0.0 0.0 7612 4416 pts/0 S+ 07:14 0:00 /bin/bash
1000680+ 1972 0.0 0.0 0 0 ? Z 07:16 0:00 [sh] <defunct>
1000680+ 1975 0.0 0.0 0 0 ? Z 07:16 0:00 [sh] <defunct>
1000680+ 3072 0.0 0.0 0 0 ? Z 07:27 0:00 [sh] <defunct>
1000680+ 3075 0.0 0.0 0 0 ? Z 07:27 0:00 [sh] <defunct>
1000680+ 4821 0.0 0.0 0 0 ? Z 07:44 0:00 [sh] <defunct>
1000680+ 4824 0.0 0.0 0 0 ? Z 07:44 0:00 [sh] <defunct>
1000680+ 7050 0.0 0.0 0 0 ? Z 08:01 0:00 [sh] <defunct>
1000680+ 7053 0.0 0.0 0 0 ? Z 08:01 0:00 [sh] <defunct>
1000680+ 9100 0.0 0.0 0 0 ? Z 08:17 0:00 [sh] <defunct>
1000680+ 9103 0.0 0.0 0 0 ? Z 08:17 0:00 [sh] <defunct>
1000680+ 10798 0.0 0.0 0 0 ? Z 08:34 0:00 [sh] <defunct>
1000680+ 10801 0.0 0.0 0 0 ? Z 08:34 0:00 [sh] <defunct>
1000680+ 12395 0.0 0.0 0 0 ? Z 08:51 0:00 [sh] <defunct>
1000680+ 12398 0.0 0.0 0 0 ? Z 08:51 0:00 [sh] <defunct>
1000680+ 14992 0.0 0.0 0 0 ? Z 09:07 0:00 [sh] <defunct>
1000680+ 14995 0.0 0.0 0 0 ? Z 09:07 0:00 [sh] <defunct>
1000680+ 16652 0.0 0.0 0 0 ? Z 09:24 0:00 [sh] <defunct>
1000680+ 16655 0.0 0.0 0 0 ? Z 09:24 0:00 [sh] <defunct>
1000680+ 18516 0.0 0.0 0 0 ? Z 09:41 0:00 [sh] <defunct>
1000680+ 18527 0.0 0.0 0 0 ? Z 09:41 0:00 [sh] <defunct>
1000680+ 20639 0.0 0.0 0 0 ? Z 09:57 0:00 [sh] <defunct>
1000680+ 20642 0.0 0.0 0 0 ? Z 09:57 0:00 [sh] <defunct>
1000680+ 22363 0.0 0.0 0 0 ? Z 10:14 0:00 [sh] <defunct>
1000680+ 22366 0.0 0.0 0 0 ? Z 10:14 0:00 [sh] <defunct>
1000680+ 24213 0.0 0.0 0 0 ? Z 10:31 0:00 [sh] <defunct>
1000680+ 24216 0.0 0.0 0 0 ? Z 10:31 0:00 [sh] <defunct>
1000680+ 25872 0.0 0.0 0 0 ? Z 10:48 0:00 [sh] <defunct>
1000680+ 25875 0.0 0.0 0 0 ? Z 10:48 0:00 [sh] <defunct>
1000680+ 27696 0.0 0.0 0 0 ? Z 11:04 0:00 [sh] <defunct>
1000680+ 27699 0.0 0.0 0 0 ? Z 11:04 0:00 [sh] <defunct>
1000680+ 29260 0.0 0.0 0 0 ? Z 11:21 0:00 [sh] <defunct>
1000680+ 29263 0.0 0.0 0 0 ? Z 11:21 0:00 [sh] <defunct>
1000680+ 31056 0.0 0.0 0 0 ? Z 11:38 0:00 [sh] <defunct>
1000680+ 31059 0.0 0.0 0 0 ? Z 11:38 0:00 [sh] <defunct>
1000680+ 32839 0.0 0.0 0 0 ? Z 11:54 0:00 [sh] <defunct>
1000680+ 32842 0.0 0.0 0 0 ? Z 11:54 0:00 [sh] <defunct>
1000680+ 34811 0.0 0.0 0 0 ? Z 12:11 0:00 [sh] <defunct>
1000680+ 34814 0.0 0.0 0 0 ? Z 12:11 0:00 [sh] <defunct>
1000680+ 36391 0.0 0.0 0 0 ? Z 12:28 0:00 [sh] <defunct>
1000680+ 36394 0.0 0.0 0 0 ? Z 12:28 0:00 [sh] <defunct>
1000680+ 37163 0.0 0.0 0 0 ? Z 12:32 0:00 [sh] <defunct>
1000680+ 37174 0.0 0.0 0 0 ? Z 12:32 0:00 [sh] <defunct>
1000680+ 38408 0.0 0.0 0 0 ? Z 12:44 0:00 [sh] <defunct>
1000680+ 38411 0.0 0.0 0 0 ? Z 12:44 0:00 [sh] <defunct>
1000680+ 40174 0.0 0.0 0 0 ? Z 13:01 0:00 [sh] <defunct>
1000680+ 40177 0.0 0.0 0 0 ? Z 13:01 0:00 [sh] <defunct>
1000680+ 41909 0.0 0.0 0 0 ? Z 13:18 0:00 [sh] <defunct>
1000680+ 41912 0.0 0.0 0 0 ? Z 13:18 0:00 [sh] <defunct>
1000680+ 43791 0.0 0.0 0 0 ? Z 13:34 0:00 [sh] <defunct>
1000680+ 43794 0.0 0.0 0 0 ? Z 13:34 0:00 [sh] <defunct>
1000680+ 45531 0.0 0.0 0 0 ? Z 13:51 0:00 [sh] <defunct>
1000680+ 45534 0.0 0.0 0 0 ? Z 13:51 0:00 [sh] <defunct>
1000680+ 46854 0.0 0.0 7612 4620 pts/1 Ss 14:01 0:00 bash
1000680+ 47650 0.0 0.0 0 0 ? Z 14:08 0:00 [sh] <defunct>
1000680+ 47653 0.0 0.0 0 0 ? Z 14:08 0:00 [sh] <defunct>
1000680+ 47817 0.0 0.0 9984 3820 pts/1 R+ 14:10 0:00 ps axuwwww
1000680000@argocd-applicationset-controller-55b7874c46-s4v9x:/$
The PPID of the zombies is 1, i.e. the applicationset-controller process itself.
The limit:
sh-4.4# crio config | egrep 'pids_limit'
pids_limit = 1024
We access our Bitbucket-hosted Git repos via SSH.
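The PPID-1 detail fits the classic zombie mechanism: inside the container the controller itself is PID 1, and any child it starts but never waits on stays <defunct> until it is reaped. A purely illustrative Go sketch of that mechanism (not the controller's code):

package main

import (
    "fmt"
    "os/exec"
    "time"
)

func main() {
    // Spawn a short-lived shell, the kind of child that shows up as
    // [sh] <defunct> in the ps output above.
    cmd := exec.Command("sh", "-c", "exit 1")
    if err := cmd.Start(); err != nil {
        panic(err)
    }

    // The shell exits almost immediately, but because Wait() is never called
    // here, the kernel keeps its exit status around and ps shows the process
    // as a zombie parented to us (PID 1 inside a container).
    fmt.Println("run ps axuwww in another shell during the next 60 seconds")
    time.Sleep(60 * time.Second)

    // Calling Wait reaps the child and the <defunct> entry disappears.
    _ = cmd.Wait()
}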
@jgwest, we are a step further in debugging the situation. We can correlate the occurrence of the defunct processes 100% with the occurrence of the log message
level=error msg="error generating application from params" error="Error during fetching repo:
git fetch origin master --tags --force failed exit status 128:
The git fetch errors result from various situations where the Git repo branch is not fetchable, for instance when we have provisioned the application for a team via ApplicationSet (including generating the GitOps repo for the Helm chart) while the team has not yet populated their repo branch.
Am I correct that you fully rely on the https://github.com/go-git/go-git/ code and that the behaviour would have to be fixed there?
Today we faced, for the second time and on a different cluster than before, the case that the argocd-applicationset-controller v0.1.0 Pod has lost its ConfigMap mount of the ssh-known-hosts.
The following is logged by the Pod in such a case:
No mass event of this type of defect was seen for other applications. All other Argo CD Pods on the same cluster do not have that issue at the same time; e.g. the argocd-repo-server, which has the same ConfigMap mounted, runs just fine. It has been running for 15 days and has 18k log lines available by now.