carsonip opened 1 week ago
I'm fairly convinced that it is due to slow httpServer and grpcServer shutdown, as I managed to reproduce something similar with an artificial delay:
diff --git a/internal/beatcmd/beat.go b/internal/beatcmd/beat.go
index 8a62e1c89..6d19f67fb 100644
--- a/internal/beatcmd/beat.go
+++ b/internal/beatcmd/beat.go
@@ -365,7 +365,7 @@ func (b *Beat) Run(ctx context.Context) error {
         return err
     }
 
-    if b.Manager.Enabled() {
+    if b.Manager.Enabled() || true {
         reloader, err := NewReloader(b.Info, b.newRunner)
         if err != nil {
             return err
@@ -377,6 +377,37 @@ func (b *Beat) Run(ctx context.Context) error {
             return fmt.Errorf("failed to start manager: %w", err)
         }
         defer b.Manager.Stop()
+
+        g.Go(func() error {
+            for {
+                in := config.MustNewConfigFrom(map[string]interface{}{
+                    "apm-server": map[string]interface{}{
+                        "rum.enabled": true,
+                        "host":        "0.0.0.0:8200",
+                        "sampling.tail": map[string]interface{}{
+                            "enabled": true,
+                            "policies": []map[string]interface{}{
+                                {"sampling_rate": 0.1},
+                            },
+                            "storage_gc_interval": "2s",
+                        },
+                    },
+                })
+                out := config.MustNewConfigFrom(map[string]interface{}{
+                    "elasticsearch": map[string]interface{}{
+                        "host":     []string{"localhost:9200"},
+                        "username": "admin",
+                        "password": "changeme",
+                    },
+                })
+                if err := reloader.reload(in, out, nil); err != nil {
+                    logp.Err("reload error")
+                }
+                time.Sleep(time.Minute)
+            }
+
+            return nil
+        })
     } else {
         if !b.Config.Output.IsSet() {
             return errors.New("no output defined, please define one under the output section")
diff --git a/internal/beater/server.go b/internal/beater/server.go
index 397e21e50..8bfec6c05 100644
--- a/internal/beater/server.go
+++ b/internal/beater/server.go
@@ -21,6 +21,7 @@ import (
     "context"
     "net"
     "net/http"
+    "time"
 
     "go.elastic.co/apm/module/apmgorilla/v2"
     "go.elastic.co/apm/v2"
@@ -227,6 +228,7 @@ func (s server) run(ctx context.Context) error {
         // See https://github.com/elastic/gmux/issues/13
         s.httpServer.stop()
         s.grpcServer.GracefulStop()
+        time.Sleep(5 * time.Minute)
         return nil
     })
     if err := g.Wait(); err != http.ErrServerClosed {
diff --git a/x-pack/apm-server/sampling/processor.go b/x-pack/apm-server/sampling/processor.go
index 82dc2df59..48546a542 100644
--- a/x-pack/apm-server/sampling/processor.go
+++ b/x-pack/apm-server/sampling/processor.go
@@ -394,8 +394,10 @@ func (p *Processor) Run() error {
         for {
             select {
             case <-p.stopping:
+                p.logger.Error("gc stopping")
                 return nil
             case <-ticker.C:
+                p.logger.Error("gc tick")
                 const discardRatio = 0.5
                 var err error
                 for err == nil {
With this change, "gc tick" is initially logged every 2 seconds, but after the first reload the "gc tick" frequency doubles, indicating that 2 gc goroutines are running. This causes #14305 when a gc is triggered while another gc is still running.
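For context on why two concurrent gc goroutines are a problem: the TBS storage gc bottoms out in Badger's RunValueLogGC, and Badger rejects a value log GC that is requested while another one is in flight. Below is a standalone sketch, not apm-server code, of two gc loops colliding on one DB; the demo path and loop shape are illustrative, assuming the badger/v2 API:

```go
package main

import (
	"errors"
	"log"
	"time"

	badger "github.com/dgraph-io/badger/v2"
)

// runGCLoop mimics the processor's storage gc goroutine: on every tick it
// calls RunValueLogGC until there is nothing left to rewrite.
func runGCLoop(db *badger.DB, name string, stop <-chan struct{}) {
	ticker := time.NewTicker(2 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			const discardRatio = 0.5
			var err error
			for err == nil {
				err = db.RunValueLogGC(discardRatio)
			}
			// Badger returns ErrRejected when a value log GC is requested
			// while another is in flight, which is what happens once two
			// loops share the same DB after a hot reload.
			if errors.Is(err, badger.ErrRejected) {
				log.Printf("%s: gc rejected, concurrent gc detected", name)
			}
		}
	}
}

func main() {
	db, err := badger.Open(badger.DefaultOptions("/tmp/badger-gc-demo"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	stop := make(chan struct{})
	go runGCLoop(db, "old", stop) // the processor that should have stopped
	go runGCLoop(db, "new", stop) // the processor started by the reload
	time.Sleep(30 * time.Second)
	close(stop)
}
```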
The reason this delay causes 2 apm-server processors (e.g. TBS) to run concurrently is that when a reload is triggered by the EA "received input", the server's context is canceled. The server is actually a wrapper over the underlying gmux server and the processors; call it wrappedServer. The shutdown sequence is that the gmux server shuts down first, and only after the gmux server has shut down are the processors' .Stop() methods called. But in the reloader, a new wrappedServer is already running while the old wrappedServer is still stopping (yes, it stops listening, but its processors are still running). This is a plausible explanation of the observed logs.
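A stripped-down model of that race, with purely illustrative names (runner stands in for the wrappedServer; none of these are the real apm-server types):

```go
package main

// runner stands in for the reloader's runnable unit (the wrappedServer).
type runner struct {
	name string
	done chan struct{} // closed only once Stop has fully completed
}

// Run serves traffic and runs the background processors (TBS, etc.).
func (r *runner) Run() { /* listen on gmux server, run processors */ }

// Stop mirrors the shutdown order described above: stop listening first,
// then stop the processors. The second step can be slow, so done may stay
// open long after the listener is gone.
func (r *runner) Stop() {
	// 1. gmux server stops accepting connections (fast)
	// 2. processors are stopped (potentially very slow)
	close(r.done)
}

// reload swaps runners. It only signals the old runner to stop and never
// waits on old.done before starting the new one, so the old processors
// keep running while the new ones start: the overlap window.
func reload(old *runner) *runner {
	if old != nil {
		go old.Stop()
	}
	next := &runner{name: "new", done: make(chan struct{})}
	go next.Run()
	return next
}

func main() {
	r := reload(nil) // initial start
	r = reload(r)    // hot reload: the old runner is still stopping here
	_ = r
}
```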
TLDR: during a hot reload, there is a period of time where the old server and the new server run concurrently. We need to limit this reload time, as well as ensure the processors are fine with running concurrently (e.g. with 2 TBS processors running at the same time during a hot reload).
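One way to bound the overlap, sketched below by reusing the runner type from the sketch above (this is the general approach, not necessarily what the merged fix does; the 30s budget is an arbitrary illustrative value), is to make the reload wait for the old runner to finish stopping before starting the new one:

```go
import (
	"context"
	"errors"
	"time"
)

// reloadSequential removes the overlap: it waits, with an upper bound,
// for the old runner to finish stopping before the new one starts.
func reloadSequential(ctx context.Context, old *runner) (*runner, error) {
	if old != nil {
		go old.Stop()
		select {
		case <-old.done: // old processors fully stopped
		case <-time.After(30 * time.Second):
			return nil, errors.New("timed out waiting for old runner to stop")
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
	next := &runner{name: "new", done: make(chan struct{})}
	go next.Run()
	return next, nil
}
```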
Fixing this bug requires a few changes. https://github.com/elastic/apm-server/pull/14339 is merged, but this issue is being kept open, as we want to double-check that all the processors are fine with concurrent runs.
APM Server version (apm-server version): 8.14.3

Description of the problem including expected versus actual behavior:
In EA-managed apm-server, there are observations that the delay between the logs "received input from elastic-agent" and "loaded input config" can be on the order of days. This implies that 2 servers are actually running in the apm-server process during this long period while the old server is stopping. The fact that 2 TBS gc goroutines may cause #14305 actually makes #14305 itself evidence that 2 servers are running at the same time. Additionally, as only 1 reload can happen at a time, a long reload will stall other input updates.
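Why a long reload stalls other updates: reloads are serialized, so any input update that arrives while a slow stop is in progress has to wait its turn. A sketch, again with hypothetical names and reusing the runner type from the earlier sketch:

```go
import "sync"

// serialReloader serializes reloads with a mutex. If stopping the current
// runner takes days (as the logs suggest), every later input update from
// Elastic Agent queues behind the lock and is not applied until then.
type serialReloader struct {
	mu      sync.Mutex
	current *runner
}

func (r *serialReloader) Reload() {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.current != nil {
		go r.current.Stop()
		<-r.current.done // blocks for as long as the old server takes to stop
	}
	r.current = &runner{name: "new", done: make(chan struct{})}
	go r.current.Run()
}
```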
Edit: as explained below, there are a few parts to this problem: