@juliaElastic opened this issue 4 months ago
I believe that I have identified the issue. Below shows the flow and where the issue exists:

- The container starts with an `elastic-agent.yml` file that has all the cloud settings present.
- During enrollment, `elastic-agent.yml` is overwritten with a file that only has `fleet.enabled: true` and any of these settings from the original `elastic-agent.yml`: https://github.com/elastic/elastic-agent/blob/main/internal/pkg/agent/cmd/enroll_cmd.go#L1056
- `run` then starts by reading the contents of the rewritten `elastic-agent.yml`, hiding the fact that the container is actually running from a different configuration.

It is possible for steps 5 and 6 to get crossed based on timing, which can very rarely cause APM tracing to work, as @simitt noted once.
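The overwrite described above can be pictured with a before/after sketch. This is illustrative only; the keys shown are the ones discussed elsewhere in this thread, and the exact contents written by the allocator and by enrollment will differ:

```yaml
# Before enrollment: elastic-agent.yml as written by the cloud allocator
# (keys illustrative, based on the snippets later in this thread)
fleet.enabled: true
agent.monitoring:
  enabled: true
  traces: true
  apm:
    hosts:
      - "https://<apm host url>:443"
    secret_token: "<secret token>"

# After enrollment: the file is replaced and only a minimal set of settings survives
fleet.enabled: true
# ...the agent.monitoring.apm section from the original file is gone
```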
There are a few possible solutions:

1. Update `getPersistentConfig` to handle the tracing options.
2. Change the `ELASTIC_AGENT_CLOUD` environment variable handling to not overwrite the `elastic-agent.yml`.
3. Don't overwrite the `elastic-agent.yml` if it already contains `fleet.enabled: true`.

Looking for input on the proper solution from @elastic/elastic-agent-control-plane.
Moved this to the Elastic Agent repository because it is an Elastic Agent issue, not a Fleet Server or APM issue.
Additional option:

4. If I'm understanding the cloud startup process correctly, we can use the existing `fleet-setup.yml` file that containers use: https://github.com/elastic/elastic-agent/blob/66c24833957bd67aedbc45b3c39d09414c81de47/internal/pkg/agent/cmd/container.go#L201-L214, which uses the `setupConfig` struct. That would require https://github.com/elastic/elastic-agent/issues/5199 to be implemented, and a change to the allocator on cloud to write `fleet-setup.yml` instead of `elastic-agent.yml`.
Thanks for the investigation, @blakerouse, and thanks for the additional option, @michel-laterman. I'm adding this issue to our next sprint, which starts Monday, and assigning it to @michel-laterman, looking at other priorities and capacity. So let's pick it up then.
> Additional option:
> 4. If I'm understanding the cloud startup process correctly, we can use the existing `fleet-setup.yml` file that containers use, which uses the `setupConfig` struct. That would require #5199 to be implemented, and a change to the allocator on cloud to write `fleet-setup.yml` instead of `elastic-agent.yml`.
That could work if `fleet-setup.yml` options are persistent after enrollment. Looking at the code it's not clear if that is the case; I don't believe it is.

I lean towards option 3, but I don't know if we want that behavior. It seems okay to me, but might not be what we want for the product (i.e. allowing all options to be used locally before the policy is applied on top).
> Don't overwrite the `elastic-agent.yml` if it already contains `fleet.enabled: true`.

Interestingly, we already attempted to do this in https://github.com/elastic/elastic-agent/pull/4166; see the `shouldSkipReplace` calls.
This feels along the lines of what users want to happen for Fleet-managed agents: their initial configuration turns something on, and if Fleet doesn't set it (or also turns it on) it stays enabled. There is no window where a feature is briefly disabled during the transition from standalone to Fleet during enrollment.
Discussed this in today's team meeting and @cmacknz suggested an alternative approach that might solve the problem for APM Server in ESS. To be clear, this issue here is still pointing to a larger issue we have with config replacement so it needs to be tackled per the options being discussed in the last few comments but it might not be as high priority if we can implement @cmacknz's proposed approach for APM Server in ESS.
That approach is basically to set the APM tracing configuration as part of the APM Server policy in ESS. I'm going to test it using the overrides API and verify that the config is passed all the way down to the APM Server component by looking at the diagnostic. cc: @simitt
I tested this today and it should work.
I created a new policy (because I couldn't override the preset Elastic Cloud agent policy in ESS even with the policy override API), enrolled an Agent with it, took a diagnostic, and observed that `agent.monitoring.traces` was not set:
```shell
$ cat diag-before/pre-config.yaml | yq '.agent.monitoring'
{
  "enabled": true,
  "logs": true,
  "metrics": true,
  "namespace": "default",
  "use_output": "default"
}
```
Then, using the policy override API, I set `agent.monitoring.traces: true` in this policy, took another diagnostic, and observed that `agent.monitoring.traces` was indeed set:
```shell
$ cat diag-after/pre-config.yaml | yq '.agent.monitoring'
{
  "enabled": true,
  "logs": true,
  "metrics": true,
  "namespace": "default",
  "traces": true,
  "use_output": "default"
}
```
So I believe the solution would be to update the `policy-elastic-agent-on-cloud` policy in ESS to enable traces as part of monitoring. I'm not sure if the right way to do that is to add `tracing` to this array; @kpollich, could you advise?

cc: @simitt
The agent policy config schema would also need to be updated to allow `traces` to be set in that array here:
Without this change, the config including monitoring_enabled: ['traces']
would be considered invalid and Kibana would throw an error on startup.
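Once that schema change lands, a minimal `kibana.yml` preconfiguration fragment enabling trace monitoring might look like the sketch below. The field values are illustrative, modeled on the fuller example later in this thread:

```yaml
xpack.fleet.agentPolicies:
  - name: Elastic Cloud agent policy
    id: policy-elastic-agent-on-cloud
    is_managed: true
    namespace: default
    monitoring_enabled: ['logs', 'metrics', 'traces']
```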
Thanks @kpollich. Here is the PR to allow `traces` to be added to the `monitoring_enabled` array: https://github.com/elastic/kibana/pull/189908
@ycombinator thanks for looking into alternatives here. We are very eager to finally get APM tracing enabled for apm-server; however, I am a bit worried about building a snowflake solution here, and would prefer a general fix to avoid any conflicts with future changes that might not consider this specific solution.

We also won't be able to turn it on by default unless EA supports sampling (https://github.com/elastic/elastic-agent/issues/5211). Will this be supported in 8.16.0?
> I am a bit worried about building a snowflake solution here, and would prefer a general fix, to avoid any conflicts with future changes that might not consider this specific solution.
Hi @simitt, sorry I wasn't clearer in https://github.com/elastic/elastic-agent/issues/5204#issuecomment-2263395538, but the proposed approach is not a snowflake or workaround solution. @cmacknz can keep me honest but the thinking here is that since the APM Server in ESS is part of a Fleet-managed EA, it makes sense for any configuration for that APM Server to come from the Fleet-managed policy (as opposed to from the EA policy that's locally on disk).
> We also won't be able to turn it on by default unless EA supports sampling https://github.com/elastic/elastic-agent/issues/5211. Will this be supported in 8.16.0?
Yes, as of now 8.16.0 is the target for this issue. We have allocated it to a sprint that will finish before the 8.16.0 feature freeze.
Thanks for the clarifications and timelines!
I tested on ESS QA with an 8.16.0-SNAPSHOT deployment and I see that the "Elastic Cloud Agent policy" now has `agent.monitoring.enabled: true` and `agent.monitoring.traces: true`.

However, I'm not 100% sure if this is sufficient. @juliaElastic, do we also need to inject the following section under `agent.monitoring`?
```yaml
apm:
  hosts:
    - <apm host url>
  environment: <apm environment>
  secret_token: <secret token>
```
If so, where would one get the `<apm host url>`, `<apm environment>`, and `<secret token>` from? And where/how would the above section be injected under https://github.com/elastic/cloud-assets/blob/4e9cf8979f57fd08db8a7ebb2b476b852fbd72bf/stackpack/kibana/config/kibana.yml#L250-L252?
@ycombinator Yeah, it's needed I think. There is logic in the cloud repo that sets these values; the issue is they are not applied in the agent: https://github.com/elastic/cloud/blob/master/scala-services/runner/src/main/scala/no/found/runner/allocation/stateless/ApmDockerContainer.scala#L434
Thanks @juliaElastic.
> There is logic in the cloud repo that sets these values...
In that case, why am I not seeing them in the "Elastic Cloud Agent policy" (see the screenshot in https://github.com/elastic/elastic-agent/issues/5204#issuecomment-2319333019)? Do I need to do some extra configuration elsewhere to have these values show up in the policy?
> ... the issue is they are not applied in the agent.
Yes, I see the temporary workaround in the Agent code: https://github.com/elastic/elastic-agent/blob/fd477ec3aa00afc77ae86e7fa9a4ce978a6958b6/internal/pkg/agent/application/apm_config_modifier.go#L111-L114
Once we can confirm that Agent is able to receive the values from Fleet (for that I need to know the answer to the questions above), we can work on making the necessary changes in Agent to remove the temporary workaround.
@ycombinator AFAIK these APM configs are not added to the agent policy, but written directly to `elastic-agent.yml` in cloud; see this issue. @AlexP-Elastic should know the details of how this works in cloud.
@juliaElastic Right, that's what we are trying to move away from, and have the APM configuration be part of the Fleet-managed Agent policy instead 🙂. See https://github.com/elastic/elastic-agent/issues/5204#issuecomment-2263395538.

So is there some way we can make that happen? I was able to get `agent.monitoring.traces: true` to show up in the Fleet-managed policy (see the screenshot in https://github.com/elastic/elastic-agent/issues/5204#issuecomment-2319333019), but now I'm wondering how I can get the following section to also show up under `agent.monitoring`:
```yaml
apm:
  hosts:
    - <apm host url>
  environment: <apm environment>
  secret_token: <secret token>
```
I see, it seems an APM config is already there in the cloud policy based on a conditional, so maybe we just need to tweak the condition to enable it on all internal ESS clusters: https://github.com/elastic/cloud-assets/blob/4e9cf8979f57fd08db8a7ebb2b476b852fbd72bf/stackpack/kibana/config/kibana.yml#L315-L343
To be able to send the APM data, the following would be added:
```yaml
xpack.fleet.agentPolicies:
  # Test policy
  - name: Elastic Cloud agent policy
    description: Default agent policy for agents hosted on Elastic Cloud
    id: test-policy
    is_default: false
    is_managed: true
    overrides:
      agent.monitoring:
        apm:
          hosts:
            - <apm host url>
          environment: <apm environment>
          secret_token: <secret token>
    is_default_fleet_server: false
    namespace: default
    monitoring_enabled: ['traces']
```
@juliaElastic @ycombinator We need to figure out where we want those traces to be sent; it seems it's not clear. Do we want them sent to the APM server running on the deployment (so accessible to the user), or is it for our usage and we want to send them to a monitoring cluster?

If it's to a monitoring cluster, it should probably be set with something like this: https://github.com/elastic/cloud/blob/a8739e38c6184e57fc5a5477f3b795f8fa71b79a/scala-services/runner/src/main/scala/no/found/runner/allocation/stateless/KibanaDockerContainer.scala#L569
> We need to figure out where we want those traces to be sent; it seems it's not clear. Do we want them sent to the APM server running on the deployment (so accessible to the user), or is it for our usage and we want to send them to a monitoring cluster?
@simitt is probably the best person to answer this question.
The APM data should be sent to the internal Elastic cluster, for support engineers and developers to leverage the data for troubleshooting.
Thanks @simitt.
@juliaElastic @nchaulet Given this, do you know what the policy should use for the values of `<apm host url>`, `<apm environment>`, and `<secret token>`? And, more importantly, from where/how should it get those values to inject them into the policy?
BTW, since I've been testing on ESS QA and feature freeze for ms-113.0 is coming up, I'm going to revert my Cloud PRs (specifically https://github.com/elastic/cloud/pull/130605, https://github.com/elastic/cloud/pull/130769, and https://github.com/elastic/cloud-assets/pull/1569) for now, so these changes don't accidentally get released to production. Once ms-113.0 has been released, I will reintroduce these (and any other changes based on the answer to the previous comment) and resume testing in ESS QA.
I'm not sure how to reference the internal Elastic cluster in kibana config. @simitt Could you help with that?
I think @AlexP-Elastic might be able to provide the details here, as ES and Kibana are already shipping tracing data to cloud regional clusters.
> I'm not sure how to reference the internal Elastic cluster in kibana config
The Elastic cluster is injected by the control plane into the templated config files in the stack pack: https://github.com/elastic/cloud/blob/master/scala-services/runner/src/main/scala/no/found/runner/allocation/stateless/KibanaDockerContainer.scala#L561-L589 for regional values, and another for global values.

So, assuming I'm understanding correctly and you want some Kibana code to inject them into the APM policy, easiest would be if you could reference `elastic.apm.serverUrl` / `elastic.apm.secretToken` / `elastic.apm.environment` from the YAML.

(There's an additional complication in getting the fields that are needed to bypass IP filtering.)
> so assuming I'm understanding correctly and you want some Kibana code to inject them into the APM policy, easiest would be if you could reference elastic.apm.serverUrl / elastic.apm.secretToken / elastic.apm.environment from the YAML
Are these values already available to reference in `kibana.yml`, or is there a code change needed to make them work?
These values are already in the Kibana YAML (for versions of Kibana that support them). You can't see them in the stack pack you linked because they are injected by the control plane infrastructure explicitly (not via the templating we use for defaults).
I tested locally setting monitoring in a preconfigured policy with overrides, as Nicolas suggested here: https://github.com/elastic/elastic-agent/issues/5204#issuecomment-2321621585. It seems to work, so I'm going to add this to the cloud config.
@AlexP-Elastic I tested the change in the latest 8.16-SNAPSHOT and it seems the substitutions of `elastic.apm.serverUrl` / `elastic.apm.secretToken` / `elastic.apm.environment` didn't work. How can we fix this?

Instance: https://staging.found.no/deployments/edc7201fc9a5448982955f7017e02106
```yaml
agent:
  monitoring:
    enabled: true
    traces: true
    apm:
      hosts:
        - ''
      environment: null
      secret_token: null
```
Here are extracts from the Kibana YAML file for that cluster (assuming I'm correctly understanding what you were doing: reading entries from the Kibana YAML and using them to fill in fields in a policy dynamically?):
```yaml
elastic.apm.serverUrl: "https://27111427665b4b01a36414f98516aa81.apm.eastus2.staging.azure.foundit.no:443"
elastic.apm.secretToken: "REDACTED"
elastic.apm.serviceNodeName: "edc7201fc9a5448982955f7017e02106-kibana-instance-0000000000"
elastic.apm.globalLabels.deploymentId: "edc7201fc9a5448982955f7017e02106"
elastic.apm.globalLabels.deploymentName: "jb test 8.16 sept 10"
elastic.apm.globalLabels.organizationId: "922053753"
```
It seems like if this (reading fields from YAML) works locally on a laptop, it should work everywhere?

One thing I don't fully understand is what the goal is here. The elastic-agent YAML local to the APM container already has the right config:
```yaml
agent.monitoring.apm.hosts:
  - "https://27111427665b4b01a36414f98516aa81.apm.eastus2.staging.azure.foundit.no:443"
agent.monitoring.apm.secret_token: "REDACTED"
agent.monitoring.apm.global_labels.serviceNodeName: "edc7201fc9a5448982955f7017e02106-apm-instance-0000000000"
agent.monitoring.apm.global_labels.deploymentId: "edc7201fc9a5448982955f7017e02106"
agent.monitoring.apm.global_labels.deploymentName: "jb test 8.16 sept 10"
agent.monitoring.apm.global_labels.organizationId: "922053753"
```
Is it not preferable for the agent to inject this config into APM?
Oh no, I just realized I was guilty of totally not understanding what you were doing and giving your last PR a distracted LGTM instead of reading it :( Sorry about that.

What I thought you were proposing was to have Kibana inject the fields into the policy (and that the PR https://github.com/elastic/cloud-assets/pull/1573/files was just adding some placeholder fields). What it actually does:
```yaml
agent.monitoring:
  apm:
    hosts:
      - "{{ elastic.apm.serverUrl }}"
    environment: {{ elastic.apm.environment }}
    secret_token: {{ elastic.apm.secretToken }}
```
This doesn't work at all because `elastic.apm.serverUrl` isn't an actual variable in the templating (it's part of the rendered YAML, which cannot be used to render the same YAML :) ).
Hang on, I need to think about this for a moment, now that I know you are trying to do it all via templates and not via code in Kibana.
@AlexP-Elastic No worries, we could inject the fields from kibana too, I'm just not sure where to take these values from in kibana.
This is one option: https://github.com/elastic/cloud/pull/131470/files. I think this is preferable to writing code inside Kibana, if we have to do it using the existing settings.

I think my preferred architectural solution would be for the APM container to take the values injected into `elastic-agent.yaml` and merge them with the policy... but presumably that is not easy to do?
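Under the first option, the control plane renders concrete values into the preconfigured policy override rather than having the template reference its own rendered output. A sketch of what the resulting `kibana.yml` fragment might look like (all values are placeholders, structure borrowed from the earlier example in this thread):

```yaml
xpack.fleet.agentPolicies:
  - name: Elastic Cloud agent policy
    id: policy-elastic-agent-on-cloud
    is_managed: true
    monitoring_enabled: ['traces']
    overrides:
      agent.monitoring:
        apm:
          hosts:
            - "https://<deployment-id>.apm.<region>.example:443"  # injected by the control plane
          environment: "<apm environment>"  # injected by the control plane
          secret_token: "<secret token>"    # injected by the control plane
```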
> I think my preferred architectural solution would be for the APM container to take the values injected into elastic-agent.yaml and merge them with the policy ... but presumably that is not easy to do?
Yeah, I'm not sure if we could change the preconfiguration or call the Kibana Fleet API from the APM container to modify the cloud agent policy.
@juliaElastic Is this blocked because the root cause fix here would involve making changes to APM itself, or because we are waiting on https://github.com/elastic/cloud/pull/131470?
I'm waiting to see if https://github.com/elastic/cloud/pull/131470 works, otherwise we will need to get some help from the APM team to see if we can add the config from the APM container.
I'm just testing https://github.com/elastic/cloud/pull/131470 now. I think we (control plane) are happy to go forward with this as the plan, so once it's working we'll get it merged (hopefully by the end of the week) and you can follow the `kibana.yml` in that PR to make the changes to cloud-assets (which is what ESS runs).
We discovered an issue with the default APM config injected by cloud to internal ESS clusters: https://elasticco.atlassian.net/browse/CP-3464
It seems this APM config is not applied on newly created clusters, and fleet-server traces are not being sent to https://overview.elastic-cloud.com/app/r/s/JIEzg. It is not clear if the issue is on the elastic-agent or fleet-server side.
Related doc: https://github.com/elastic/kibana/blob/main/x-pack/plugins/fleet/dev_docs/apm_tracing.md
Originally posted by @juliaElastic in https://github.com/elastic/fleet-server/issues/3328#issuecomment-2097894094