argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

argocd-cmp-server does not terminate bash sub-processes from argocd-vault-plugin #13026

Closed · krausemi closed this issue 1 year ago

krausemi commented 1 year ago

Introduction

We are using Argo CD in combination with the argocd-vault-plugin (https://github.com/argoproj-labs/argocd-vault-plugin). The plugin has been installed via the sidecar container approach and works as expected. However, it looks like the argocd-cmp-server does not properly terminate the bash commands it executes. They're still there... as zombies.

Bug description

After the defined generate process for Helm charts (avp-helm) has been executed, the spawned bash sub-processes are stuck in the "defunct" state instead of being terminated.

(Screenshot: ps output in the avp-helm container showing the defunct bash sub-processes)

The number of zombie processes increases rapidly, and after a few hours the process limit of the underlying node is reached. Once that limit is reached, the node itself becomes unusable.
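To illustrate, the zombies can be counted from outside the pod roughly like this (namespace and pod name are placeholders for our environment, and the grep pattern is just one way to filter the defunct entries):

kubectl exec -n <namespace> <argocd-repo-server pod> -c avp-helm -- \
  bash -c "ps -ef | grep '[d]efunct' | wc -l"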

Logs

time="2023-02-01T14:35:34Z" level=info msg="sh -c find . -name 'Chart.yaml' && find . -name 'values.yaml'" dir=/tmp/_cmp_server/98b7b460-f3d6-46e0-af3c-8f1585a50f70 execID=4e02c
time="2023-02-01T14:35:34Z" level=info msg="finished streaming call with code OK" grpc.code=OK grpc.method=MatchRepository grpc.service=plugin.ConfigManagementPluginService grpc.start_time="2023-02-01T14:35:34Z" grpc.time_ms=10.849 span.kind=server system=grpc
time="2023-02-01T14:35:34Z" level=info msg="Generating manifests with no request-level timeout"
time="2023-02-01T14:35:34Z" level=info msg="bash -c helm template $ARGOCD_APP_NAME -n $ARGOCD_APP_NAMESPACE ${ARGOCD_ENV_HELM_ARGS} ${ARGOCD_ENV_HELM_OPTIONS} -f <(echo \"$ARGOCD_ENV_HELM_VALUES\") . |\nargocd-vault-plugin generate -s infrastructure:argocd-vault-plugin-credentials -\n" dir=/tmp/_cmp_server/7e16abb9-c649-49ef-af46-15e5e3d3a716 execID=562f5
time="2023-02-01T14:35:34Z" level=info msg="finished streaming call with code OK" grpc.code=OK grpc.method=GenerateManifest grpc.service=plugin.ConfigManagementPluginService grpc.start_time="2023-02-01T14:35:34Z" grpc.time_ms=78.883 span.kind=server system=grpc

Using the --verbose-sensitive-output parameter did not produce any more output than the logs above (or I did something wrong :D).

Installation setup

Used Dockerfile for image creation

ARG ARGOCD_VERSION=2.5.7
ARG AVP_VERSION=1.13.1

FROM registry.access.redhat.com/ubi8 as download

RUN mkdir /custom-tools/ && \
    cd /custom-tools/ && \
    curl -L https://github.com/argoproj-labs/argocd-vault-plugin/releases/download/v1.13.1/argocd-vault-plugin_1.13.1_linux_amd64 -o argocd-vault-plugin && \
    chmod +x argocd-vault-plugin

FROM quay.io/argoproj/argocd:v${ARGOCD_VERSION} as target

COPY certs.crt /etc/ssl/certs/
COPY --from=download /custom-tools/argocd-vault-plugin /usr/local/bin/
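For completeness, the image is built roughly like this (the registry path and tag are placeholders):

docker build \
  --build-arg ARGOCD_VERSION=2.5.7 \
  --build-arg AVP_VERSION=1.13.1 \
  -t '<internal-registry>/path/to/image/argocd-vault-plugin-sidecar:<internal tag>' .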

Used values for argocd-vault-plugin sidecar installation

extraContainers:
  - name: avp-helm
    command: [/var/run/argocd/argocd-cmp-server]
    image: <internal-registry>/path/to/image/argocd-vault-plugin-sidecar:<internal tag>
    securityContext:
      runAsNonRoot: true
      runAsUser: 999
    volumeMounts:
      - mountPath: /var/run/argocd
        name: var-files
      - mountPath: /home/argocd/cmp-server/plugins
        name: plugins
      - mountPath: /tmp
        name: tmp-dir
      - mountPath: /home/argocd/cmp-server/config/plugin.yaml
        subPath: avp-helm.yaml
        name: cmp-plugin
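For context, the cmp-plugin volume referenced above is backed by a ConfigMap that carries the plugin definition under the avp-helm.yaml key; a rough sketch (the ConfigMap name and the exact volume wiring are specific to our setup):

apiVersion: v1
kind: ConfigMap
metadata:
  name: cmp-plugin
data:
  avp-helm.yaml: |
    # the ConfigManagementPlugin definition shown in the next section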

Used config for the avp-helm-sidecar-container

apiVersion: argoproj.io/v1alpha1
kind: ConfigManagementPlugin
metadata:
  name: argocd-vault-plugin-helm
spec:
  allowConcurrency: true
  # Note: this command is run _before_ any Helm templating is done, therefore the logic is to check
  # if this looks like a Helm chart
  discover:
    find:
      command:
        - sh
        - "-c"
        - "find . -name 'Chart.yaml' && find . -name 'values.yaml'"
  generate:
    command:
      - bash
      - "-c"
      - |
        helm template $ARGOCD_APP_NAME -n $ARGOCD_APP_NAMESPACE ${ARGOCD_ENV_HELM_ARGS} ${ARGOCD_ENV_HELM_OPTIONS} -f <(echo "$ARGOCD_ENV_HELM_VALUES") . |
        argocd-vault-plugin generate -s infrastructure:argocd-vault-plugin-credentials -
  lockRepo: false

ArgoCD example application

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: dex
spec:
  destination:
    namespace: infrastructure
    server: "https://kubernetes.default.svc"
  project: default
  source:
    repoURL: <internal-registry>/path/to/chart
    chart: dex
    targetRevision: 0.9.0
    plugin:
      env:
        - name: HELM_VALUES
          value: |
            replicaCount: 1
            image:
              repository: <internal-registry>/path/to/image/dex
              pullPolicy: IfNotPresent
              tag: ""

            imagePullSecrets:
              - name: container-pullsecret
            config:
              staticClients:
                - id: grafana-client
                  secret: "<path:cluster/data/testcluster/dex#grafana-client>"
                  name: 'Grafana'
                  redirectURIs:
                    - https://grafana.domain/login/generic_oauth
        - name: HELM_OPTIONS
          value: '--include-crds'
  syncPolicy:
    automated:
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - ApplyOutOfSyncOnly=true

How to reproduce

  1. Deploy Argo CD together with the argocd-vault-plugin using the configuration above.
  2. Deploy an application that contains AVP placeholders, like the example above.
  3. Connect to the avp-helm container and list the running processes by executing: kubectl exec -it -n <namespace> <pod name> -c <container name (in my case avp-helm)> -- bash -c "ps -ef | head"
  4. Deploy some more Helm charts with AVP placeholders or update existing applications so that the avp-helm sidecar has something to do.
  5. Check the running sub-processes in state "defunct" and count them, e.g. by using wc -l after listing.
  6. Wait a little and watch the number grow rapidly. The more apps are deployed, the more zombie processes accumulate in a short amount of time.

Expected behavior

I would expect the sub-processes spawned by the argocd-cmp-server within the avp-helm sidecar container to be terminated (reaped) after execution instead of being left behind as zombies.

Workaround

As a temporary workaround, we implemented a CronJob that restarts the argocd-repo-server pod (which contains the avp-helm sidecar container) on a daily basis. This kills the accumulated zombie processes, so the node does not reach its process limit.
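A minimal sketch of what such a CronJob can look like (namespace, schedule, image and the ServiceAccount, which needs RBAC permissions to patch the Deployment, are placeholders rather than our exact manifest):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: restart-argocd-repo-server
  namespace: argocd
spec:
  schedule: "0 3 * * *"   # once a day
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: repo-server-restarter   # needs permission to patch the Deployment
          restartPolicy: OnFailure
          containers:
            - name: kubectl
              image: bitnami/kubectl
              command:
                - /bin/sh
                - -c
                - kubectl rollout restart deployment argocd-repo-server -n argocd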

crenshaw-dev commented 1 year ago

Could be related, but I think this one only applies when requests time out: https://github.com/argoproj/argo-cd/issues/9180

gczuczy commented 1 year ago

As far as I can see, the sidecar uses the cmp server directly as its entrypoint:

command: [/var/run/argocd/argocd-cmp-server]

If the entrypoint is an init process (such as tini), it takes care of reaping the leftover child processes. That's the whole point of tini.

@krausemi Could you please try adding tini to your sidecar container, either setting it as the ENTRYPOINT in the Dockerfile or using it as the command for the sidecar, and then move the argocd-cmp-server to the args section? This way tini will run as PID 1 and will reap the zombies, as that is its intended purpose.
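For example, after the change you can verify that tini really runs as PID 1 inside the sidecar (namespace, pod and container names are placeholders):

kubectl exec -n <namespace> <pod name> -c avp-helm -- ps -p 1 -o pid,comm
# should list tini as PID 1; defunct children should then no longer accumulate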

krausemi commented 1 year ago

Thanks for the hint @gczuczy.

I'll give it a try and post the results here.

gczuczy commented 1 year ago

> Thanks for the hint @gczuczy.
>
> I'll give it a try and post the results here.

Here's a good read on the responsibilities of a PID 1 init process: https://github.com/krallin/tini/issues/8

krausemi commented 1 year ago

Okay, I've tried the solution with tini and it works as it should - there are no more zombie processes. :) I'll also update the parent issue within the argocd-vault-plugin project, so that they can adapt their documentation.

Thank you for your support!


What did I change to make it work?

Since the argocd container image already ships with tini, I just set it as the ENTRYPOINT in the Dockerfile and passed "/var/run/argocd/argocd-cmp-server" as args via the extraContainers parameters in the Argo CD values file.

Adapted Dockerfile for image creation

ARG ARGOCD_VERSION=2.5.7
ARG AVP_VERSION=1.13.1

FROM registry.access.redhat.com/ubi8 as download

# Re-declare the ARG so it is available inside this build stage
ARG AVP_VERSION

RUN mkdir /custom-tools/ && \
    cd /custom-tools/ && \
    curl -L https://github.com/argoproj-labs/argocd-vault-plugin/releases/download/v${AVP_VERSION}/argocd-vault-plugin_${AVP_VERSION}_linux_amd64 -o argocd-vault-plugin && \
    chmod +x argocd-vault-plugin

FROM quay.io/argoproj/argocd:v${ARGOCD_VERSION} as target

COPY certs.crt /etc/ssl/certs/
COPY --from=download /custom-tools/argocd-vault-plugin /usr/local/bin/

ENTRYPOINT [ "/usr/bin/tini" ]

Adapted values for argocd-vault-plugin sidecar installation

extraContainers:
  - name: avp-helm
    args:
      - /var/run/argocd/argocd-cmp-server
    image: <internal-registry>/path/to/image/argocd-vault-plugin-sidecar:<internal tag>
    securityContext:
      runAsNonRoot: true
      runAsUser: 999
    volumeMounts:
      - mountPath: /var/run/argocd
        name: var-files
      - mountPath: /home/argocd/cmp-server/plugins
        name: plugins
      - mountPath: /tmp
        name: tmp-dir
      - mountPath: /home/argocd/cmp-server/config/plugin.yaml
        subPath: avp-helm.yaml
        name: cmp-plugin