akuity / kargo

Application lifecycle orchestration
https://kargo.akuity.io/
Apache License 2.0

kargo-controller creates zombie [git] processes #2926

Open moro-drake opened 4 days ago

moro-drake commented 4 days ago

Description

We run Kargo in an OpenShift cluster. OpenShift runs it as the following user: runAsUser: 1001910000 (ps output for this user is attached as screenshots). Since updating to 1.0.3, we have noticed that it creates zombie [git] processes. These processes slowly accumulate and eventually make the controller unusable (fork can no longer create new processes). A Warehouse set to discover new tags using the NewestTag strategy speeds this up (you can see a zombie spawn on each refresh of the Warehouse in the UI). With a 1m interval and a discovery limit of 20, kargo-controller was dead within half a day (we had about 44 active Warehouses with a NewestTag subscription).

Screenshots

[Three screenshots of ps output; images not included]

Steps to Reproduce

  1. Create a Kargo project with a Stage and a Warehouse that subscribes to a Git repository
  2. Set the Warehouse spec like this:
    spec:
      freightCreationPolicy: Automatic
      interval: 1m0s
      subscriptions:
      - git:
          branch: main
          commitSelectionStrategy: NewestTag
          discoveryLimit: 20
  3. Open a terminal on the node running kargo-controller and run ps auxf | grep 'defunct' | wc -l
  4. In the UI, refresh the Warehouse that has the NewestTag subscription
  5. Run ps auxf | grep 'defunct' | wc -l again. The count will have increased because a new zombie has spawned.
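
The zombie count in steps 3 and 5 can also be taken without ps, which may help if the node or container image lacks it. The following is a minimal Go sketch, not part of Kargo, that assumes a Linux /proc filesystem; the helper names parseState and countZombies are hypothetical. It counts processes whose state field in /proc/<pid>/stat is Z (zombie).

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseState extracts the state field from the contents of /proc/<pid>/stat.
// The command name is wrapped in parentheses and may itself contain spaces or
// parentheses, so the state is taken as the first field after the last ')'.
func parseState(stat string) (string, bool) {
	i := strings.LastIndex(stat, ")")
	if i < 0 {
		return "", false
	}
	fields := strings.Fields(stat[i+1:])
	if len(fields) == 0 {
		return "", false
	}
	return fields[0], true
}

// countZombies scans procRoot (normally "/proc") and counts processes in
// state "Z". Entries that are not numeric PID directories are skipped.
func countZombies(procRoot string) (int, error) {
	entries, err := os.ReadDir(procRoot)
	if err != nil {
		return 0, err
	}
	n := 0
	for _, e := range entries {
		if _, err := strconv.Atoi(e.Name()); err != nil {
			continue // not a PID directory
		}
		data, err := os.ReadFile(procRoot + "/" + e.Name() + "/stat")
		if err != nil {
			continue // the process may have exited in the meantime
		}
		if state, ok := parseState(string(data)); ok && state == "Z" {
			n++
		}
	}
	return n, nil
}

func main() {
	n, err := countZombies("/proc")
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	fmt.Println(n)
}
```

Unlike the ps pipeline, this count cannot include the grep process itself, so the two numbers may differ by one.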

Version

Kargo v1.0.3

Logs

time="2024-11-14T03:06:08Z" level=error msg="Reconciler error" Warehouse="{release-warehouse dex}" controller=warehouse controllerGroup=kargo.akuity.io controllerKind=Warehouse error="error discovering artifacts: error discovering commits: failed to clone git repo \"https://repo.git\": error cloning repo \"https://repo.git\" into \"/tmp/repo-223330204/repo\": error executing cmd [/usr/bin/git clone --no-tags --branch main --single-branch https://repo.git /tmp/repo-223330204/repo]: Cloning into '/tmp/repo-223330204/repo'...\nerror: cannot fork() for remote-https: Resource temporarily unavailable\n" name=release-warehouse namespace=dex reconcileID="\"f2e1b0c6-3ad5-4d40-9965-fdda5c8b1a7d\""

hiddeco commented 3 hours ago

This is an interesting issue, as we appear to use Exec everywhere we call git, which in turn uses cmd.CombinedOutput(). That call makes sure to .Wait() on the process (a missing .Wait() is normally the reason for zombie processes).

Needs a more thorough investigation.
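
For reference, the general os/exec behaviour described above can be reproduced in isolation. This is a minimal sketch, assuming Linux and a `true` binary on the PATH; it is not Kargo's Exec helper, and the names stateOf, waitForState, and startWithoutWait are hypothetical. It shows that a child started with Start() stays a zombie until Wait() is called, whereas CombinedOutput() waits internally.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
	"time"
)

// stateOf returns the process state letter from /proc/<pid>/stat (Linux only).
func stateOf(pid int) (string, error) {
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/stat", pid))
	if err != nil {
		return "", err
	}
	s := string(data)
	i := strings.LastIndex(s, ")")
	if i < 0 {
		return "", fmt.Errorf("unexpected stat format")
	}
	fields := strings.Fields(s[i+1:])
	if len(fields) == 0 {
		return "", fmt.Errorf("unexpected stat format")
	}
	return fields[0], nil
}

// waitForState polls every 50ms, up to the given number of attempts, until
// the process reports the wanted state.
func waitForState(pid int, want string, attempts int) bool {
	for i := 0; i < attempts; i++ {
		if s, err := stateOf(pid); err == nil && s == want {
			return true
		}
		time.Sleep(50 * time.Millisecond)
	}
	return false
}

// startWithoutWait starts a short-lived command and deliberately does not
// wait on it, so the exited child is left unreaped.
func startWithoutWait() (*exec.Cmd, error) {
	cmd := exec.Command("true")
	if err := cmd.Start(); err != nil {
		return nil, err
	}
	return cmd, nil
}

func main() {
	// CombinedOutput runs the command and waits for it, so the child is reaped.
	out, err := exec.Command("echo", "done").CombinedOutput()
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	fmt.Printf("CombinedOutput returned %q\n", out)

	cmd, err := startWithoutWait()
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	fmt.Println("zombie before Wait:", waitForState(cmd.Process.Pid, "Z", 40))

	// Wait collects the exit status, which removes the zombie entry.
	if err := cmd.Wait(); err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	fmt.Println("exited after Wait:", cmd.ProcessState.Exited())
}
```

Wait() only reaps the direct child, so this sketch says nothing about helper processes that git itself spawns, such as the remote-https helper named in the log above.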