elastic / beats

:tropical_fish: Beats - Lightweight shippers for Elasticsearch & Logstash
https://www.elastic.co/products/beats
Other
108 stars 4.93k forks source link

Unnecessary extra reload on integration policy change under EA managed mode #41751

Open carsonip opened 2 weeks ago

carsonip commented 2 weeks ago

APM Server version (apm-server version): 8.15.0+

Description of the problem including expected versus actual behavior:

APM server reloader reloads runner (internal/beater/beater.go) 2 times instead of 1 when integration policy is changed.

This is likely due to nil check being earlier than the unchanged check in EA apm tracing PR: https://github.com/elastic/beats/pull/40030/files#diff-dd024b66f2ed585cbbfce1d8a0aeba55183ab0d04bed2a67f7738f62b5658e63R896-R908 . This causes apm tracing config nil -> nil to trigger a reload.

When this is fixed, we should also remove the workaround in https://github.com/elastic/apm-server/pull/14574/files#diff-e91b3485079439409a6365ec6aa74c2c009623463528bc9d541ca6494df8cf35R350

Steps to reproduce:

  1. Start EA managed apm-server e.g. on ESS
  2. Update apm integration policy
  3. Observe that every loaded input config log line is accompanied by a loaded apm tracing config even though apm tracing config has not changed. This implies 2 reloads are done instead of 1.

Provide logs (if relevant):

apm-server log lines from new to old. Everything is triggered by BeatV2Manager.unitListen, followed by loaded apm tracing config, then loaded input config.

"@timestamp",message
"Nov 8, 2024 @ 09:55:56.291","no longer blocking ingestion as all precondition checks are now satisfied"
"Nov 8, 2024 @ 09:55:56.289","Starting apm-server [f4b8e24ec4828569d47ef590282506d43367885c built 2024-10-08 15:25:30 +0000 UTC]. Hit CTRL-C to stop it."
"Nov 8, 2024 @ 09:55:56.289","update routine done"
"Nov 8, 2024 @ 09:55:56.289","CORS related setting `apm-server.rum.allow_origins` allows all origins. Consider more restrictive setting for production use."
"Nov 8, 2024 @ 09:55:56.289","Path /intake/v2/events added to request handler"
"Nov 8, 2024 @ 09:55:56.289","Path /intake/v3/rum/events added to request handler"
"Nov 8, 2024 @ 09:55:56.289","RUM endpoints enabled!"
"Nov 8, 2024 @ 09:55:56.289","SSL enabled."
"Nov 8, 2024 @ 09:55:56.289","Path /v1/logs added to request handler"
"Nov 8, 2024 @ 09:55:56.289","Path /v1/metrics added to request handler"
"Nov 8, 2024 @ 09:55:56.289","loaded input config"
"Nov 8, 2024 @ 09:55:56.289","Path /v1/traces added to request handler"
"Nov 8, 2024 @ 09:55:56.288","Path / added to request handler"
"Nov 8, 2024 @ 09:55:56.288","running final aggregation"
"Nov 8, 2024 @ 09:55:56.288","Path /config/v1/rum/agents added to request handler"
"Nov 8, 2024 @ 09:55:56.288","Path /intake/v2/rum/events added to request handler"
"Nov 8, 2024 @ 09:55:56.288","LSM aggregator stopped"
"Nov 8, 2024 @ 09:55:56.288","Path /config/v1/agents added to request handler"
"Nov 8, 2024 @ 09:55:56.287","Stop listening on: 0.0.0.0:8200"
"Nov 8, 2024 @ 09:55:56.287","SSL enabled."
"Nov 8, 2024 @ 09:55:56.287","RUM endpoints enabled!"
"Nov 8, 2024 @ 09:55:56.287","Server stopped"
"Nov 8, 2024 @ 09:55:56.287","stopping aggregator"
"Nov 8, 2024 @ 09:55:56.287","CORS related setting `apm-server.rum.allow_origins` allows all origins. Consider more restrictive setting for production use."
"Nov 8, 2024 @ 09:55:56.287","Path /v1/metrics added to request handler"
"Nov 8, 2024 @ 09:55:56.287","Path /v1/logs added to request handler"
"Nov 8, 2024 @ 09:55:56.287","Starting apm-server [f4b8e24ec4828569d47ef590282506d43367885c built 2024-10-08 15:25:30 +0000 UTC]. Hit CTRL-C to stop it."
"Nov 8, 2024 @ 09:55:56.286","Path /config/v1/rum/agents added to request handler"
"Nov 8, 2024 @ 09:55:56.286","Path /intake/v2/events added to request handler"
"Nov 8, 2024 @ 09:55:56.286","Aggregation.Transactions.MaxGroups set to 5000 based on 1.0gb of memory"
"Nov 8, 2024 @ 09:55:56.286","blocking ingestion until all preconditions are satisfied"
"Nov 8, 2024 @ 09:55:56.286","Path /v1/traces added to request handler"
"Nov 8, 2024 @ 09:55:56.286","docappender.MaxRequests set to 11 based on 1.0gb of memory"
"Nov 8, 2024 @ 09:55:56.286","Path /intake/v3/rum/events added to request handler"
"Nov 8, 2024 @ 09:55:56.286","Path /intake/v2/rum/events added to request handler"
"Nov 8, 2024 @ 09:55:56.286","docappender.DocumentBufferSize set to 819 based on 1.0gb of memory"
"Nov 8, 2024 @ 09:55:56.286","Aggregation.ServiceTransactions.MaxGroups for service aggregation set to 1000 based on 1.0gb of memory"
"Nov 8, 2024 @ 09:55:56.286","Path / added to request handler"
"Nov 8, 2024 @ 09:55:56.286","MaxConcurrentDecoders set to 102 based on 80 percent of 1.0gb of memory"
"Nov 8, 2024 @ 09:55:56.286","Aggregation.ServiceDestinations.MaxGroups set to 10000 based on 1.0gb of memory"
"Nov 8, 2024 @ 09:55:56.286","Aggregation.MaxServices set to 1000 based on 1.0gb of memory"
"Nov 8, 2024 @ 09:55:56.286","Path /config/v1/agents added to request handler"
"Nov 8, 2024 @ 09:55:56.284","failed to populate sourcemap metadata: failed to run initial search query: fetcher unavailable: server shutting down"
"Nov 8, 2024 @ 09:55:56.284","docappender.MaxRequests set to 11 based on 1.0gb of memory"
"Nov 8, 2024 @ 09:55:56.283","docappender.DocumentBufferSize set to 819 based on 1.0gb of memory"
"Nov 8, 2024 @ 09:55:56.283","MaxConcurrentDecoders set to 102 based on 80 percent of 1.0gb of memory"
"Nov 8, 2024 @ 09:55:56.283","Aggregation.ServiceTransactions.MaxGroups for service aggregation set to 1000 based on 1.0gb of memory"
"Nov 8, 2024 @ 09:55:56.283","blocking ingestion until all preconditions are satisfied"
"Nov 8, 2024 @ 09:55:56.283","Aggregation.MaxServices set to 1000 based on 1.0gb of memory"
"Nov 8, 2024 @ 09:55:56.283","Aggregation.Transactions.MaxGroups set to 5000 based on 1.0gb of memory"
"Nov 8, 2024 @ 09:55:56.283","Aggregation.ServiceDestinations.MaxGroups set to 10000 based on 1.0gb of memory"
"Nov 8, 2024 @ 09:55:56.283","precondition failed: context canceled"
"Nov 8, 2024 @ 09:55:56.282","Listening on: [::]:8200"
"Nov 8, 2024 @ 09:55:56.282","using output.elasticsearch for fetching agent config"
"Nov 8, 2024 @ 09:55:56.282","Falling back to elasticsearch output for API Key usage"
"Nov 8, 2024 @ 09:55:56.282","using apm-server.agent.config.elasticsearch for fetching agent config"
"Nov 8, 2024 @ 09:55:56.282","stopping apm-server... waiting maximum of 30s for queues to drain"
"Nov 8, 2024 @ 09:55:56.281","update routine done"
"Nov 8, 2024 @ 09:55:56.280","LSM aggregator stopped"
"Nov 8, 2024 @ 09:55:56.280","running final aggregation"
"Nov 8, 2024 @ 09:55:56.280","Falling back to elasticsearch output for API Key usage"
"Nov 8, 2024 @ 09:55:56.280","Stop listening on: 0.0.0.0:8200"
"Nov 8, 2024 @ 09:55:56.280","stopping apm-server... waiting maximum of 30s for queues to drain"
"Nov 8, 2024 @ 09:55:56.280","received input from elastic-agent"
"Nov 8, 2024 @ 09:55:56.280","Listening on: [::]:8200"
"Nov 8, 2024 @ 09:55:56.280","Server stopped"
"Nov 8, 2024 @ 09:55:56.280","stopping aggregator"
"Nov 8, 2024 @ 09:55:56.280","loaded apm tracing config"
"Nov 8, 2024 @ 09:55:56.279","using output.elasticsearch for fetching agent config"
"Nov 8, 2024 @ 09:55:56.279","using apm-server.agent.config.elasticsearch for fetching agent config"
"Nov 8, 2024 @ 09:55:55.278","BeatV2Manager.unitListen UnitChanged.ID(apm-es-containerhost-elastic-cloud-apm), UnitChanged.Type(modified), UnitChanged.Trigger(2): modified/config_change_triggered"
simitt commented 2 weeks ago

@carsonip can you please assess the impact of this in your description and then add an according impact label for this. While I agree the reload is unnecessary, how much of an optimization is it vs. what's the negative consequence.

carsonip commented 2 weeks ago

IIUC the bug sits in beats. It is better seen as a regression from supporting apm tracing config reload in https://github.com/elastic/beats/pull/40030 . As for the impact, I don't think it will create any actual problems, other than some extra API calls e.g. for agentcfg and sourcemap fetcher as part of the initialization process. If any new connection is unlucky enough to use the new runner that gets stopped immediately, these connections will need to be drained. Also, although I haven't studied how likely it is relevant, but it is possible that a reload that quick will trigger some race conditions (a bug in itself), e.g. if runners from 2 consecutive quick restarts race to modify a global monitoring registry.

TLDR: not high impact, but as this is a recent regression, it would be good to optimize this.

botelastic[bot] commented 17 hours ago

This issue doesn't have a Team:<team> label.