elastic / apm-server

https://www.elastic.co/guide/en/apm/guide/current/index.html

Long delay between EA received input and loaded #14337

Open carsonip opened 1 week ago

carsonip commented 1 week ago

APM Server version (apm-server version): 8.14.3

Description of the problem including expected versus actual behavior:

In EA-managed apm-server, there are observations of a delay between the logs "received input from elastic-agent" and "loaded input config" on the order of days. This implies that 2 servers are actually running in the apm-server process during this long period while the old server is stopping. The fact that 2 TBS gc goroutines may cause #14305 actually makes #14305 further evidence that 2 servers are running at the same time. Additionally, since only 1 reload can happen at a time, a long reload will stall other input updates.

Edit: as explained in the comments below, there are a few parts to this problem.

Provide logs (if relevant):

"Oct 7, 2024 @ 19:46:30.592"    "loaded input config"
"Oct 7, 2024 @ 19:46:30.577"    "tail sampler aborted"
"Oct 7, 2024 @ 19:46:30.576"    "LSM aggregator stopped"
"Oct 7, 2024 @ 19:46:30.576"    "stopping aggregator"
"Oct 7, 2024 @ 19:46:30.576"    "Server stopped"
"Sep 30, 2024 @ 23:41:27.228"   "Path /intake/v3/rum/events added to request handler"
"Sep 30, 2024 @ 23:41:27.228"   "Path /config/v1/rum/agents added to request handler"
"Sep 30, 2024 @ 23:41:27.228"   "Path /intake/v2/rum/events added to request handler"
"Sep 30, 2024 @ 23:41:27.228"   "RUM endpoints enabled!"
"Sep 30, 2024 @ 23:41:27.228"   "Starting apm-server [ceaf482859c86da0ba6de99005ab2fffae7551c6 built 2024-07-08 17:22:03 +0000 UTC]. Hit CTRL-C to stop it."
"Sep 30, 2024 @ 23:41:27.224"   "stopping apm-server... waiting maximum of 30s for queues to drain"
"Sep 30, 2024 @ 23:41:27.224"   "Listening on: [::]:8200"
"Sep 30, 2024 @ 23:41:27.224"   "Stop listening on: 0.0.0.0:8200"
"Sep 30, 2024 @ 23:41:27.222"   "received input from elastic-agent"
carsonip commented 1 week ago

I'm fairly convinced that it is due to slow httpServer and grpcServer shutdown, as I managed to reproduce something similar with an artificial delay:

diff --git a/internal/beatcmd/beat.go b/internal/beatcmd/beat.go
index 8a62e1c89..6d19f67fb 100644
--- a/internal/beatcmd/beat.go
+++ b/internal/beatcmd/beat.go
@@ -365,7 +365,7 @@ func (b *Beat) Run(ctx context.Context) error {
        return err
    }

-   if b.Manager.Enabled() {
+   if b.Manager.Enabled() || true {
        reloader, err := NewReloader(b.Info, b.newRunner)
        if err != nil {
            return err
@@ -377,6 +377,37 @@ func (b *Beat) Run(ctx context.Context) error {
            return fmt.Errorf("failed to start manager: %w", err)
        }
        defer b.Manager.Stop()
+
+       g.Go(func() error {
+           for {
+               in := config.MustNewConfigFrom(map[string]interface{}{
+                   "apm-server": map[string]interface{}{
+                       "rum.enabled": true,
+                       "host":        "0.0.0.0:8200",
+                       "sampling.tail": map[string]interface{}{
+                           "enabled": true,
+                           "policies": []map[string]interface{}{
+                               {"sampling_rate": 0.1},
+                           },
+                           "storage_gc_interval": "2s",
+                       },
+                   },
+               })
+               out := config.MustNewConfigFrom(map[string]interface{}{
+                   "elasticsearch": map[string]interface{}{
+                       "host":     []string{"localhost:9200"},
+                       "username": "admin",
+                       "password": "changeme",
+                   },
+               })
+               if err := reloader.reload(in, out, nil); err != nil {
+                   logp.Err("reload error")
+               }
+               time.Sleep(time.Minute)
+           }
+
+           return nil
+       })
    } else {
        if !b.Config.Output.IsSet() {
            return errors.New("no output defined, please define one under the output section")
diff --git a/internal/beater/server.go b/internal/beater/server.go
index 397e21e50..8bfec6c05 100644
--- a/internal/beater/server.go
+++ b/internal/beater/server.go
@@ -21,6 +21,7 @@ import (
    "context"
    "net"
    "net/http"
+   "time"

    "go.elastic.co/apm/module/apmgorilla/v2"
    "go.elastic.co/apm/v2"
@@ -227,6 +228,7 @@ func (s server) run(ctx context.Context) error {
        // See https://github.com/elastic/gmux/issues/13
        s.httpServer.stop()
        s.grpcServer.GracefulStop()
+       time.Sleep(5 * time.Minute)
        return nil
    })
    if err := g.Wait(); err != http.ErrServerClosed {
diff --git a/x-pack/apm-server/sampling/processor.go b/x-pack/apm-server/sampling/processor.go
index 82dc2df59..48546a542 100644
--- a/x-pack/apm-server/sampling/processor.go
+++ b/x-pack/apm-server/sampling/processor.go
@@ -394,8 +394,10 @@ func (p *Processor) Run() error {
        for {
            select {
            case <-p.stopping:
+               p.logger.Error("gc stopping")
                return nil
            case <-ticker.C:
+               p.logger.Error("gc tick")
                const discardRatio = 0.5
                var err error
                for err == nil {

With this change, "gc tick" is initially logged every 2 seconds, but after the first reload the "gc tick" frequency doubles, indicating that 2 gc goroutines are running. This will cause #14305 when gc is called while another gc is already running.

The reason this delay causes 2 apm-server processors (e.g. TBS) to run concurrently is that, when a reload is triggered by the EA "received input", the server's context is canceled. The server is actually a wrapper over the actual gmux server and the processors; let's call it wrappedServer. The shutdown sequence is that the gmux server shuts down first, and only after the gmux server has shut down are the processors' .Stop() methods called. But in the reloader, a new wrappedServer is already running while the old wrappedServer is still stopping (yes, it has stopped listening, but its processors are still running). This is a plausible explanation of the observed logs.
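
To make the ordering concrete, here is a minimal runnable Go sketch of that sequence. The wrappedServer type, the timings, and the wiring are illustrative assumptions, not the actual apm-server code: on reload the old server's context is canceled, but the new server starts immediately, so both sets of processors overlap until the old listener finishes draining.

// Sketch of the hot-reload ordering described above (names and timings are
// illustrative, not the actual apm-server implementation).
package main

import (
	"context"
	"log"
	"time"
)

type wrappedServer struct {
	name string
	done chan struct{}
}

// Run mirrors the shutdown sequence: on context cancellation the gmux
// (HTTP/gRPC) listener is stopped first, and only after that returns are
// the processors (TBS, aggregators) stopped.
func (s *wrappedServer) Run(ctx context.Context) {
	log.Printf("%s: processors started", s.name)
	<-ctx.Done()
	log.Printf("%s: stop listening", s.name)
	time.Sleep(3 * time.Second) // stand-in for a slow connection drain
	log.Printf("%s: processors stopped", s.name)
	close(s.done)
}

func main() {
	oldCtx, cancelOld := context.WithCancel(context.Background())
	oldSrv := &wrappedServer{name: "old", done: make(chan struct{})}
	go oldSrv.Run(oldCtx)
	time.Sleep(time.Second)

	// A reload cancels the old server's context but starts the new server
	// immediately, without waiting for the old server's done channel, so
	// both sets of processors run concurrently for a while.
	cancelOld()
	newCtx, cancelNew := context.WithCancel(context.Background())
	newSrv := &wrappedServer{name: "new", done: make(chan struct{})}
	go newSrv.Run(newCtx)

	<-oldSrv.done
	cancelNew()
	<-newSrv.done
}

Running this prints the new server's "processors started" before the old server's "processors stopped", mirroring the overlap suggested by the logs above.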

TLDR: during a hot reload, there is a period of time when the old server and the new server run concurrently. We need to limit this reload time, as well as ensure the processors are fine with running concurrently (e.g. having 2 TBS processors running at the same time during a hot reload).
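
If we go the route of tolerating a short overlap, the TBS storage GC would need a guard so that only one GC runs at a time. A hypothetical sketch (not an existing apm-server change; the function name and wiring are assumptions):

// Hypothetical guard: skip a GC run if another run (possibly from the old
// server) is still in flight, instead of letting two GCs race.
package sampling

import "sync"

var gcMu sync.Mutex

// runValueLogGC is illustrative; the real processor loops over the storage
// GC inside Processor.Run on a ticker.
func runValueLogGC(gc func(discardRatio float64) error) {
	if !gcMu.TryLock() {
		return // another GC is already running, try again on the next tick
	}
	defer gcMu.Unlock()

	const discardRatio = 0.5
	for gc(discardRatio) == nil {
	}
}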

carsonip commented 1 week ago

To fix this bug, a few changes are required:

  1. httpServer and grpcServer need to respect a shutdown timeout; they cannot just wait indefinitely for connections to terminate (see the sketch after this list).
  2. But even with that, e.g. a shutdown timeout of 10s, it is still possible in the current design for 2 wrapped servers to run in parallel during those 10s. This is fine for the most part, e.g. for the aggregation processor (since badger runs in memory) and the bulk indexer, but it will be fatal for the TBS processor, and possibly for some libbeat metric registry code (e.g. in newFinalBatchProcessor, which may or may not be broken now, as well as TBS metrics). TBS indexing might actually be ok, but GC definitely cannot run concurrently.
  3. Alternatively, we can make the new wrappedServer wait for the old wrappedServer to shut down completely, but that may imply some downtime, especially when there's a shutdown timeout > 0. This is the safest option.
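
For change (1), a minimal sketch of what a bounded shutdown could look like. The helper name and the 10s timeout are assumptions for illustration; the actual apm-server change is in the PR referenced below.

// Sketch of a bounded HTTP/gRPC shutdown: stop waiting for connections to
// drain once the timeout expires, then force-close what remains.
package shutdownsketch

import (
	"context"
	"net/http"
	"time"

	"google.golang.org/grpc"
)

func shutdownWithTimeout(httpSrv *http.Server, grpcSrv *grpc.Server, timeout time.Duration) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	// http.Server.Shutdown returns when the context expires even if
	// connections are still open; force-close the remainder.
	if err := httpSrv.Shutdown(ctx); err != nil {
		_ = httpSrv.Close()
	}

	// grpc.Server.GracefulStop has no built-in timeout, so fall back to
	// Stop when the deadline passes.
	done := make(chan struct{})
	go func() {
		grpcSrv.GracefulStop()
		close(done)
	}()
	select {
	case <-done:
	case <-ctx.Done():
		grpcSrv.Stop()
	}
}

// Example: shutdownWithTimeout(httpSrv, grpcSrv, 10*time.Second)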
carsonip commented 1 week ago

https://github.com/elastic/apm-server/pull/14339 is merged, but keeping this issue open, as we want to double check if all the processors are fine with concurrent runs.