elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.

The elastic agent watch command creates a zombie process each time it restarts #2190

Open cmacknz opened 1 year ago

cmacknz commented 1 year ago

Before we fixed https://github.com/elastic/beats/issues/34178 we had reports of an excessive number of zombie processes being left on a machine. In one case a single macOS machine was left with hundreds of zombie processes. Example attached below.

ps-aux.txt

It seems like there is a correlation between the Beats restarting themselves and zombie processes being left behind. We should investigate this and see if there is anything we should be doing to ensure these processes are properly cleaned up.

The path the Beats follow to exit in this case is here.

cmacknz commented 1 year ago

@blakerouse does anything come to mind on what could cause this? Clearly the Beats restarting all the time was causing the unusual number of zombies, but the exited processes should not have been hanging around as zombies.

cmacknz commented 1 year ago

We should always be calling Wait here to prevent this. Clearly there is some interaction we are missing here, though. In 8.6.1 there shouldn't be as many zombie processes since we reduced the number of times the Beats exit, but leaving them behind is still possible because we haven't changed anything in the process handling on the agent side.

Here's the call stack involved in the agent that leads to calling Wait():

https://github.com/elastic/elastic-agent/blob/fefe64fff4369bddd806fdccf691ef6213adab42/pkg/component/runtime/command.go#L334-L344

https://github.com/elastic/elastic-agent/blob/fefe64fff4369bddd806fdccf691ef6213adab42/pkg/component/runtime/command.go#L378-L389

https://github.com/elastic/elastic-agent/blob/fefe64fff4369bddd806fdccf691ef6213adab42/pkg/core/process/process.go#L114-L123
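
For context, a minimal sketch of the reaping pattern that call chain implements (this is plain os/exec usage, not the agent's actual code; /bin/true just stands in for a Beat sub-process):

package main

import (
	"log"
	"os/exec"
)

func main() {
	cmd := exec.Command("/bin/true") // stand-in for a Beat sub-process
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}
	// Wait collects the exit status so the kernel can drop the process
	// table entry; without it the exited child stays <defunct> until
	// this parent process itself exits.
	if err := cmd.Wait(); err != nil {
		log.Printf("child exited: %v", err)
	}
}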

lucabelluccini commented 11 months ago

Other users are hitting this issue.

cmacknz commented 10 months ago

Do we have logs or diagnostics available when this is happening?

cmacknz commented 10 months ago

I reproduced this without even trying, just by installing a standalone agent with the default policy on a fresh Ubuntu 22.04 VM.

craig_mackenzie@cmackenzie-ubuntu-test:~/elastic-agent-8.10.2-linux-x86_64$ ps aux | grep elastic
root        1860  1.5  1.3 1506628 53864 ?       Ssl  19:24   0:01 /opt/Elastic/Agent/elastic-agent
root        1866  0.0  0.0      0     0 ?        Zs   19:24   0:00 [elastic-agent] <defunct>
root        1874  0.4  2.8 1482120 115692 ?      Sl   19:24   0:00 /opt/Elastic/Agent/data/elastic-agent-b0c688/components/metricbeat -E setup.ilm.enabled=false -E setup.template.enabled=false -E management.enabled=true -E management.restart_on_output_change=true -E logging.level=info -E logging.to_stderr=true -E gc_percent=${METRICBEAT_GOGC:100} -E metricbeat.config.modules.enabled=false -E http.enabled=true -E http.host=unix:///opt/Elastic/Agent/data/tmp/iThI_df0cBKC6YUNGGlKscMkOfz3FBH3.sock -E path.data=/opt/Elastic/Agent/data/elastic-agent-b0c688/run/system/metrics-default
root        1884  0.3  2.7 1387768 108976 ?      Sl   19:24   0:00 /opt/Elastic/Agent/data/elastic-agent-b0c688/components/filebeat -E setup.ilm.enabled=false -E setup.template.enabled=false -E management.enabled=true -E management.restart_on_output_change=true -E logging.level=info -E logging.to_stderr=true -E gc_percent=${FILEBEAT_GOGC:100} -E filebeat.config.modules.enabled=false -E http.enabled=true -E http.host=unix:///opt/Elastic/Agent/data/tmp/xTEtpJ7117ppc6OYvJCaYHbDW8mLjXGe.sock -E path.data=/opt/Elastic/Agent/data/elastic-agent-b0c688/run/filestream-monitoring
root        1890  0.3  2.8 1408388 113792 ?      Sl   19:24   0:00 /opt/Elastic/Agent/data/elastic-agent-b0c688/components/metricbeat -E setup.ilm.enabled=false -E setup.template.enabled=false -E management.enabled=true -E management.restart_on_output_change=true -E logging.level=info -E logging.to_stderr=true -E gc_percent=${METRICBEAT_GOGC:100} -E metricbeat.config.modules.enabled=false -E http.enabled=true -E http.host=unix:///opt/Elastic/Agent/data/tmp/Hk6rvk9TDibMPcDvpl0jkLE-qDsHWVYL.sock -E path.data=/opt/Elastic/Agent/data/elastic-agent-b0c688/run/beat/metrics-monitoring
root        1902  0.2  2.8 1481864 115096 ?      Sl   19:24   0:00 /opt/Elastic/Agent/data/elastic-agent-b0c688/components/metricbeat -E setup.ilm.enabled=false -E setup.template.enabled=false -E management.enabled=true -E management.restart_on_output_change=true -E logging.level=info -E logging.to_stderr=true -E gc_percent=${METRICBEAT_GOGC:100} -E metricbeat.config.modules.enabled=false -E http.enabled=true -E http.host=unix:///opt/Elastic/Agent/data/tmp/akSPbdqgaHaTY0_J01-dsfYK6JpMz2zn.sock -E path.data=/opt/Elastic/Agent/data/elastic-agent-b0c688/run/http/metrics-monitoring
craig_m+    1930  0.0  0.0   7004  2304 pts/0    S+   19:26   0:00 grep --color=auto elastic

elastic-agent-diagnostics-2023-10-05T19-56-15Z-00.zip

Diagnostics attached. I think the defunct process might be the upgrade watcher. I see it exiting as expected on startup:

{"log.level":"info","@timestamp":"2023-10-05T19:24:29.052Z","log.origin":{"file.name":"cmd/watch.go","file.line":67},"message":"Upgrade Watcher started","process.pid":1866,"agent.version":"8.10.2","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-10-05T19:24:29.052Z","log.origin":{"file.name":"cmd/watch.go","file.line":75},"message":"update marker not present at '/opt/Elastic/Agent/data'","ecs.version":"1.6.0"}

At startup we launch the elastic-agent watch command as a sub-process:

https://github.com/elastic/elastic-agent/blob/ac2ef57b335ebca28216240508bd773d53009c7a/internal/pkg/agent/cmd/run.go#L214-L218

The way we launch it is:

https://github.com/elastic/elastic-agent/blob/ac2ef57b335ebca28216240508bd773d53009c7a/internal/pkg/agent/application/upgrade/rollback.go#L109-L136

We call os.Process.Release, but the exec.Cmd.Start documentation says we need to call Wait(). Perhaps this is what is causing the zombies.
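
A rough sketch of one possible fix, assuming the watcher is launched with os/exec roughly like this (startWatcher and agentBinary are placeholders, not the real InvokeWatcher code):

package sketch

import (
	"log"
	"os/exec"
)

// startWatcher is a sketch only; agentBinary is a placeholder path to the
// elastic-agent executable.
func startWatcher(agentBinary string) error {
	cmd := exec.Command(agentBinary, "watch")
	if err := cmd.Start(); err != nil {
		return err
	}
	// os.Process.Release only drops our reference to the child; it does not
	// reap it, so its exit status is never collected while we keep running.
	// Reaping asynchronously means a watcher that exits right away (no
	// upgrade marker found) does not linger as <defunct>.
	go func() {
		if err := cmd.Wait(); err != nil {
			log.Printf("watcher exited: %v", err)
		}
	}()
	return nil
}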

cmacknz commented 10 months ago

I can reproduce this by installing the agent in an ubuntu:22.04 container as well. Each time I restart the agent I get another defunct process:

root@55df25f47e9b:/elastic-agent-8.10.2-linux-arm64# elastic-agent restart
root@55df25f47e9b:/elastic-agent-8.10.2-linux-arm64# ps aux | grep elastic-agent
root      2815  2.2  0.3 1653072 52056 pts/0   Sl   20:18   0:02 elastic-agent
root      2828  0.0  0.0      0     0 ?        Zs   20:18   0:00 [elastic-agent] <defunct>
root      2913  1.5  0.0      0     0 ?        Zs   20:20   0:00 [elastic-agent] <defunct>
root      2925  9.5  0.7 1476232 107968 pts/0  Sl   20:20   0:00 /opt/Elastic/Agent/data/elastic-agent-b0c688/components/metricbeat -E setup.ilm.enabled=false -E setup.template.enabled=false -E management.enabled=true -E management.restart_on_output_change=true -E logging.level=info -E logging.to_stderr=true -E gc_percent=${METRICBEAT_GOGC:100} -E metricbeat.config.modules.enabled=false -E http.enabled=true -E http.host=unix:///opt/Elastic/Agent/data/tmp/iThI_df0cBKC6YUNGGlKscMkOfz3FBH3.sock -E path.data=/opt/Elastic/Agent/data/elastic-agent-b0c688/run/system/metrics-default
...
root@55df25f47e9b:/elastic-agent-8.10.2-linux-arm64# elastic-agent restart
root@55df25f47e9b:/elastic-agent-8.10.2-linux-arm64# ps aux | grep elastic-agent
root      2815  2.3  0.3 1726096 49660 pts/0   Sl   20:18   0:03 elastic-agent
root      2828  0.0  0.0      0     0 ?        Zs   20:18   0:00 [elastic-agent] <defunct>
root      2913  0.0  0.0      0     0 ?        Zs   20:20   0:00 [elastic-agent] <defunct>
root      2987  3.0  0.0      0     0 ?        Zs   20:20   0:00 [elastic-agent] <defunct>
...

cmacknz commented 10 months ago

Confirmed this is the watcher: building the agent with the small patch below, which stops starting the watcher at startup, stops the creation of zombie processes:

diff --git a/internal/pkg/agent/cmd/run.go b/internal/pkg/agent/cmd/run.go
index 91d470ac0..3c39c98dc 100644
--- a/internal/pkg/agent/cmd/run.go
+++ b/internal/pkg/agent/cmd/run.go
@@ -212,10 +212,10 @@ func run(override cfgOverrider, testingMode bool, fleetInitTimeout time.Duration
        }

        // initiate agent watcher
-       if err := upgrade.InvokeWatcher(l); err != nil {
-               // we should not fail because watcher is not working
-               l.Error(errors.New(err, "failed to invoke rollback watcher"))
-       }
+       // if err := upgrade.InvokeWatcher(l); err != nil {
+       //      // we should not fail because watcher is not working
+       //      l.Error(errors.New(err, "failed to invoke rollback watcher"))
+       // }

        if allowEmptyPgp, _ := release.PGP(); allowEmptyPgp {
                l.Info("Elastic Agent has been built with security disabled. Elastic Agent will not verify signatures of upgrade artifact.")

We need to start the watcher, so this isn't a real solution. We'll probably need to wait on it. I also haven't explored whether we get a zombie process when the upgrade watcher is run across upgrades, since it needs to outlive its parent process and there's no way we could wait on it in that situation.
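
One clarifying point about the upgrade case, illustrated by the hypothetical sketch below (Setsid is my assumption for how a long-lived watcher might be detached, not necessarily what rollback.go does): if the parent truly exits, the orphaned watcher is reparented to PID 1 and reaped there when it finishes, but a zombie persists whenever the parent keeps running (or re-execs in place, keeping the same PID) without ever calling Wait.

package sketch

import (
	"os/exec"
	"syscall"
)

// launchDetachedWatcher is hypothetical; agentBinary is a placeholder path.
func launchDetachedWatcher(agentBinary string) error {
	cmd := exec.Command(agentBinary, "watch")
	// Start the watcher in its own session so it can outlive the parent
	// during an upgrade.
	cmd.SysProcAttr = &syscall.SysProcAttr{Setsid: true}
	if err := cmd.Start(); err != nil {
		return err
	}
	// If this process exits shortly afterwards (the upgrade case), the
	// watcher is reparented to PID 1 and reaped there: no zombie.
	// If this process keeps running and never calls Wait (the restart
	// case), an exited watcher stays <defunct> until we exit, which is
	// the behavior reported in this issue.
	return nil
}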

Nacho1973 commented 10 months ago

Hi Elastic team! We have multiple clusters for several environments, versions 8.8.1 and 8.10.1.

We have detected that on all Red Hat machines that have an elastic-agent installed, versions 8.8.1 or 8.10.1, several processes appear in a defunct state.

For example, our hosts with elastic-agent version 8.8.1:

[elastic@g99eksa738 ~]$ ps -fea | grep -i defunc
root     1629    1351    0 Jun15 ?     00:00:00 [elastic-agent]
root     1130324 1351    0 Sep19 ?     00:00:00 [elastic-agent]
root     1130372 1351    0 Sep19 ?     00:00:00 [elastic-agent]
elastic  1625525 1625493 0 08:52 pts/0 00:00:00 grep --color=auto -i defunc

After one restart we still have one defunct process:

[elastic@g99eksa738 ~]$ sudo systemctl restart elastic-agent
[elastic@g99eksa738 ~]$ ps -fea | grep -i defunc
root     1625543 1625534 3 08:53 ?     00:00:00 [elastic-agent]
elastic  1625608 1625493 0 08:53 pts/0 00:00:00 grep --color=auto -i defunc

Same at version 8.10.1:

root     604199  604139  0 Jun13 ?     00:00:00 [elastic-agent]
root     2548351 604139  0 Sep19 ?     00:00:00 [elastic-agent]
root     2548423 604139  0 Sep19 ?     00:00:00 [elastic-agent]
elastic  3048770 3048735 0 09:02 pts/0 00:00:00 grep --color=auto defunct
[elastic@g99tnsa723 ~]$ sudo systemctl restart elastic-agent
[elastic@g99tnsa723 ~]$ ps -fea | grep defunct
root     3048801 3048787 2 09:02 ?     00:00:00 [elastic-agent]
elastic  3048814 3048735 0 09:02 pts/0 00:00:00 grep --color=auto defunct

All hosts are ESXi VMs, most of them Red Hat 8.5. All elastic-agents are Fleet Server managed.

Is there any solution for this issue? Thanks and best regards.

cmacknz commented 10 months ago

You will get a defunct process every time the Elastic Agent restarts (and possibly each time it upgrades) as shown in https://github.com/elastic/elastic-agent/issues/2190#issuecomment-1749587919.

This is caused by a bug in the Elastic Agent and requires a code change to fix. Restarting the system or stopping and starting the Elastic Agent service will likely clear these defunct processes, but they will just come back the next time the agent restarts.

pchila commented 3 months ago

@cmacknz what about launching the watcher by forking twice? I didn't try to implement that in Go, but that's how a process daemonizes on Linux... In short, we would have 3 processes involved in launching the watcher (a rough Go sketch of this follows below):

  1. elastic-agent main process: launches child process 2 and waits for it
  2. elastic-agent "child" process: its only job is to call setsid() to become a process group leader, then fork again to launch process 3, and then exit (it will be reaped by the main process)
  3. elastic-agent "grandchild" process: this is the actual watcher process and (in theory) it should be reaped properly by the init/systemd process (PID 1), not leaving a zombie process behind...

I think we could experiment with it and check if we don't leave zombie processes this way...

Edit: forgot about the setsid() call needed to enter a new session
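
A rough Go sketch of that idea; since Go cannot call fork() directly, the intermediate step would have to re-exec the agent's own binary under an invented subcommand ("watch-spawner" here is purely illustrative, and agentBinary is a placeholder path):

package sketch

import (
	"os"
	"os/exec"
	"syscall"
)

// Step 1, in the running agent: start the intermediate process and reap it
// immediately.
func spawnWatcherViaIntermediate(agentBinary string) error {
	spawner := exec.Command(agentBinary, "watch-spawner")
	spawner.SysProcAttr = &syscall.SysProcAttr{Setsid: true} // new session, as in step 2 above
	if err := spawner.Start(); err != nil {
		return err
	}
	return spawner.Wait() // the intermediate exits right away, so this reaps it immediately
}

// Step 2, inside the hypothetical "watch-spawner" subcommand: start the real
// watcher and exit without waiting. The watcher is orphaned, reparented to
// PID 1, and reaped there when it exits, leaving no zombie in the agent's tree.
func runWatchSpawner(agentBinary string) {
	watcher := exec.Command(agentBinary, "watch")
	if err := watcher.Start(); err != nil {
		os.Exit(1)
	}
	os.Exit(0)
}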

cmacknz commented 3 months ago

Seems like it might be more challenging than expected:

We can not use the fork syscall in Go's runtime, because a child process doesn't inherit threads and goroutines in that case. The library uses a simple trick: it runs its own copy with a mark, a predefined environment variable. Availability of the variable for the process means execution in the child's copy. So if the mark is not set, the library executes the parent's operations and runs its own copy with the mark set, and if the mark is set, the library executes the child's operations.
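
For reference, a minimal sketch of that mark trick (the variable name and the runChildWork callback are invented for illustration):

package sketch

import (
	"os"
	"os/exec"
)

const mark = "_AGENT_CHILD_MARK" // invented name; any unlikely-to-collide variable works

// maybeReexec runs the child-side work if the mark is present, otherwise it
// re-executes this binary with the mark set and returns in the parent.
func maybeReexec(runChildWork func() error) error {
	if os.Getenv(mark) != "" {
		// We are the re-exec'ed copy: do the child-side work here.
		return runChildWork()
	}
	self, err := os.Executable()
	if err != nil {
		return err
	}
	cmd := exec.Command(self, os.Args[1:]...)
	cmd.Env = append(os.Environ(), mark+"=1")
	if err := cmd.Start(); err != nil {
		return err
	}
	// The parent still has a child to reap eventually, which is exactly the
	// problem discussed in this issue.
	go func() { _ = cmd.Wait() }()
	return nil
}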

zez3 commented 3 months ago

Please see the solutions in https://segmentfault.com/a/1190000041466423/en and https://stackoverflow.com/questions/59410139/run-command-in-golang-and-detach-it-from-process.

Both solutions use cmd.Wait(). I hope it helps.

blakerouse commented 3 months ago

I think what we need to determine here is whether there is only a zombie because of re-exec, or if there is always a zombie in the case where the watcher is started and the parent exits.

I believe it's only because of the re-exec process, where we re-use the same process space, so the parent -> child reference still exists.

catalindevops commented 3 months ago

Are all versions affected by this? I also noticed this happening in my environments.

cmacknz commented 3 months ago

Yes, as far as I can tell this behavior has likely been here since the agent was released. We have a child process that outlives its parent process's ability to call Wait on it as part of our upgrade design.

zez3 commented 3 months ago

Are all versions affected by this? I also noticed this happening in my environments.

Just the Linux OS.

elasticmachine commented 3 months ago

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

cmacknz commented 1 month ago

This will be less frequent in 8.15.0: there will no longer be a defunct process every time the Elastic Agent process restarts. There will likely still be a defunct process each time the Elastic Agent is upgraded, because of the way our upgrade logic is implemented. Fixing that properly is more complex than just calling Wait, unfortunately.