elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.

APM config from local configuration before enrollment is lost #5204

Open juliaElastic opened 4 months ago

juliaElastic commented 4 months ago

We discovered an issue with the default APM config injected by cloud into internal ESS clusters: https://elasticco.atlassian.net/browse/CP-3464

It seems this APM config is not applied on newly created clusters, and fleet-server traces are not being sent to https://overview.elastic-cloud.com/app/r/s/JIEzg

It is not clear if the issue is on elastic-agent or fleet-server side.

Related doc: https://github.com/elastic/kibana/blob/main/x-pack/plugins/fleet/dev_docs/apm_tracing.md

Originally posted by @juliaElastic in https://github.com/elastic/fleet-server/issues/3328#issuecomment-2097894094

blakerouse commented 1 month ago

I believe that I have identified the issue. Below shows the flow and where the issue exists:

  1. container is started with a custom elastic-agent.yml file that has all the cloud settings present.
  2. container performs enrollment
  3. container writes fleet.yml
  4. container moves/replaces the elastic-agent.yml with a file that only has fleet.enabled: true and any of these settings from the original elastic-agent.yml - https://github.com/elastic/elastic-agent/blob/main/internal/pkg/agent/cmd/enroll_cmd.go#L1056
  5. container performs run reading the contents of elastic-agent.yml
  6. ESS comes along and overwrites the elastic-agent.yml hiding the fact that the container is actually running from a different configuration

It is possible for steps 5 and 6 to get crossed depending on timing, which can very rarely cause APM tracing to work, as @simitt noted once.

There are a few possible solutions:

  1. Update the getPersistentConfig to handle the tracing options.
    • Pro: same path we take today
    • Con: always have to keep updating this code for new options (it seems some options from cloud are already in the file and being ignored)
  2. Use the ELASTIC_AGENT_CLOUD environment variable to not overwrite the elastic-agent.yml.
    • Pro: cloud specific and can add any setting they want
    • Con: cloud specific (haha)
  3. Don't overwrite the elastic-agent.yml if it already contains the fleet.enabled: true
    • Pro: works for all users that want to provide a custom configuration on Fleet
    • Con: we might not want users to do that

Looking for input on the proper solution from @elastic/elastic-agent-control-plane

blakerouse commented 1 month ago

Moved this to the Elastic Agent repository because it is an Elastic Agent issue, not a Fleet Server or APM issue.

michel-laterman commented 1 month ago

Additional option:

  4. If I'm understanding the cloud startup process correctly, we can use the existing fleet-setup.yml file that containers use: https://github.com/elastic/elastic-agent/blob/66c24833957bd67aedbc45b3c39d09414c81de47/internal/pkg/agent/cmd/container.go#L201-L214

This uses the setupConfig struct, which would require https://github.com/elastic/elastic-agent/issues/5199 to be implemented, and would mean changing the allocator on cloud to write fleet-setup.yml instead of elastic-agent.yml.

ycombinator commented 1 month ago

Thanks for the investigation, @blakerouse, and thanks for the additional option, @michel-laterman. I'm adding this issue to our next sprint, which starts Monday, and, looking at other priorities and capacity, assigning it to @michel-laterman. So let's pick it up then.

blakerouse commented 1 month ago

Additional option: 4. If I'm understanding the cloud startup process correctly, we can use the existing fleet-setup.yml file that containers use

https://github.com/elastic/elastic-agent/blob/66c24833957bd67aedbc45b3c39d09414c81de47/internal/pkg/agent/cmd/container.go#L201-L214

This uses the setupConfig struct, which would require #5199 to be implemented, and would mean changing the allocator on cloud to write fleet-setup.yml instead of elastic-agent.yml.

That could work if fleet-setup.yml options persisted after enrollment. Looking at the code, it's not clear that they do; I don't believe they are.

I lean towards option 3, but I don't know if we want that behavior. It seems okay to me, but it might not be what we want for the product (i.e. allowing all options to be used locally before the policy is applied on top).

cmacknz commented 1 month ago

Don't overwrite the elastic-agent.yml if it already contains the fleet.enabled: true

Interestingly, we already attempted to do this in https://github.com/elastic/elastic-agent/pull/4166; see the shouldSkipReplace calls.

This feels like it's along the lines of what users want for Fleet-managed agents: their initial configuration turns something on, and it stays enabled as long as Fleet either also turns it on or doesn't set it at all. There is no window where a feature is briefly disabled during the transition from standalone to Fleet during enrollment.

ycombinator commented 1 month ago

Discussed this in today's team meeting, and @cmacknz suggested an alternative approach that might solve the problem for APM Server in ESS. To be clear, this issue still points to a larger problem we have with config replacement, so it needs to be tackled per the options discussed in the last few comments, but it might not be as high priority if we can implement @cmacknz's proposed approach for APM Server in ESS.

That approach is basically to set the APM tracing configuration as part of the APM Server policy in ESS. I'm going to test it using the overrides API and verify that the config is passed all the way down to the APM Server component by looking at the diagnostic. cc: @simitt

ycombinator commented 1 month ago

That approach is basically to set the APM tracing configuration as part of the APM Server policy in ESS. I'm going to test it using the overrides API and verify that the config is passed all the way down to the APM Server component by looking at the diagnostic. cc: @simitt

I tested this today and it should work.

I created a new policy (because I couldn't override the preset Elastic Cloud agent policy in ESS even with the policy override API), enrolled an Agent with it, took a diagnostic and observed that agent.monitoring.traces was not set:

$ cat diag-before/pre-config.yaml | yq '.agent.monitoring'
{
  "enabled": true,
  "logs": true,
  "metrics": true,
  "namespace": "default",
  "use_output": "default"
}

Then, using the policy override API, I set agent.monitoring.traces: true in this policy, took another diagnostic, and observed that agent.monitoring.traces was indeed set:

$ cat diag-after/pre-config.yaml | yq '.agent.monitoring'
{
  "enabled": true,
  "logs": true,
  "metrics": true,
  "namespace": "default",
  "traces": true,
  "use_output": "default"
}

So I believe the solution would be to update the policy-elastic-agent-on-cloud policy in ESS to enable traces as part of monitoring. I'm not sure if the right way to do that is to add tracing to this array — @kpollich could you advise?

cc: @simitt

kpollich commented 1 month ago

The agent policy config schema would also need to be updated to allow traces to be set in that array, here:

https://github.com/elastic/kibana/blob/7340abb888614670502bda936bf162aad80d5fa7/x-pack/plugins/fleet/server/types/models/agent_policy.ts#L57-L61

Without this change, the config including monitoring_enabled: ['traces'] would be considered invalid and Kibana would throw an error on startup.

ycombinator commented 1 month ago

Thanks @kpollich. Here is the PR to allow traces to be added to the monitoring_enabled array: https://github.com/elastic/kibana/pull/189908

simitt commented 1 month ago

@ycombinator thanks for looking into alternatives here. We are very eager to finally get APM tracing enabled for apm-server; however, I am a bit worried about building a snowflake solution here, and would prefer a general fix, to avoid any conflicts with future changes that might not consider this specific solution. We also won't be able to turn it on by default unless EA supports sampling: https://github.com/elastic/elastic-agent/issues/5211. Will this be supported in 8.16.0?

ycombinator commented 1 month ago

I am a bit worried about building a snowflake solution here, and would prefer a general fix, to avoid any conflicts with future changes that might not consider this specific solution.

Hi @simitt, sorry I wasn't clearer in https://github.com/elastic/elastic-agent/issues/5204#issuecomment-2263395538, but the proposed approach is not a snowflake or workaround solution. @cmacknz can keep me honest but the thinking here is that since the APM Server in ESS is part of a Fleet-managed EA, it makes sense for any configuration for that APM Server to come from the Fleet-managed policy (as opposed to from the EA policy that's locally on disk).

ycombinator commented 1 month ago

We also won't be able to turn it on by default unless EA supports sampling https://github.com/elastic/elastic-agent/issues/5211. Will this be supported in 8.16.0?

Yes, as of now 8.16.0 is the target for this issue. We have allocated it to a sprint that will finish before 8.16.0 feature freeze.

simitt commented 1 month ago

Thanks for the clarifications and timelines!

ycombinator commented 2 weeks ago

I tested on ESS QA with an 8.16.0-SNAPSHOT deployment and I see that the "Elastic Cloud Agent policy" now has agent.monitoring.enabled: true and agent.monitoring.traces: true.

[Screenshot: the Elastic Cloud Agent policy showing agent.monitoring.enabled and agent.monitoring.traces set to true]

However, I'm not 100% sure if this is sufficient. @juliaElastic do we also need to inject the following section under agent.monitoring?

apm:
    hosts:
      - <apm host url>
    environment: <apm environment>
    secret_token: <secret token>

If so, where would one get the <apm host url>, <apm environment> and <secret token> from? And where/how would the above section be injected under https://github.com/elastic/cloud-assets/blob/4e9cf8979f57fd08db8a7ebb2b476b852fbd72bf/stackpack/kibana/config/kibana.yml#L250-L252?

juliaElastic commented 2 weeks ago

@ycombinator Yeah, I think it's needed. There is logic in the cloud repo that sets these values; the issue is that they are not applied in the agent: https://github.com/elastic/cloud/blob/master/scala-services/runner/src/main/scala/no/found/runner/allocation/stateless/ApmDockerContainer.scala#L434

ycombinator commented 2 weeks ago

Thanks @juliaElastic.

There is logic in the cloud repo that sets these values...

In that case, why am I not seeing them in the "Elastic Cloud Agent policy" (see the screenshot in https://github.com/elastic/elastic-agent/issues/5204#issuecomment-2319333019)? Do I need to do some extra configuration elsewhere to have these values show up in the policy?

... the issue is that they are not applied in the agent.

Yes, I see the temporary workaround in Agent code in: https://github.com/elastic/elastic-agent/blob/fd477ec3aa00afc77ae86e7fa9a4ce978a6958b6/internal/pkg/agent/application/apm_config_modifier.go#L111-L114

Once we can confirm that Agent is able to receive the values from Fleet (for that I need to know the answer to the questions above), we can work on making the necessary changes in Agent to remove the temporary workaround.

juliaElastic commented 2 weeks ago

@ycombinator AFAIK these APM configs are not added to the agent policy, but written directly to elastic-agent.yml in cloud; see this issue. @AlexP-Elastic should know the details of how this works in cloud.

ycombinator commented 2 weeks ago

@juliaElastic Right, that's what we are trying to move away from and have the APM configuration be part of the Fleet-managed Agent policy instead 🙂. See https://github.com/elastic/elastic-agent/issues/5204#issuecomment-2263395538.

So is there some way we can make that happen? I was able to get agent.monitoring.traces: true to show up in the Fleet-managed policy (see the screenshot in https://github.com/elastic/elastic-agent/issues/5204#issuecomment-2319333019), but now I'm wondering how I can get the following section to also show up under agent.monitoring:

apm:
    hosts:
      - <apm host url>
    environment: <apm environment>
    secret_token: <secret token>

juliaElastic commented 2 weeks ago

I see, it seems an APM config is already there in the cloud policy based on a conditional, so maybe we just need to tweak the condition to enable it on all internal ESS clusters: https://github.com/elastic/cloud-assets/blob/4e9cf8979f57fd08db8a7ebb2b476b852fbd72bf/stackpack/kibana/config/kibana.yml#L315-L343

nchaulet commented 2 weeks ago

To be able to send the APM data, something like the following would need to be added:

xpack.fleet.agentPolicies:
  # Test policy
  - name: Elastic Cloud agent policy
    description: Default agent policy for agents hosted on Elastic Cloud
    id: test-policy
    is_default: false
    is_managed: true
    overrides:
      agent.monitoring:
        apm:
          hosts:
            - <apm host url>
          environment: <apm environment>
          secret_token: <secret token>
    is_default_fleet_server: false
    namespace: default
    monitoring_enabled: ['traces']

@juliaElastic @ycombinator We need to figure out where we want those traces to be sent; it seems that's not clear. Do we want them sent to the APM server running on the deployment (so they're accessible to the user), or are they for our own usage, in which case we'd want to send them to a monitoring cluster?

If it's to a monitoring cluster, it should probably be set with something like this: https://github.com/elastic/cloud/blob/a8739e38c6184e57fc5a5477f3b795f8fa71b79a/scala-services/runner/src/main/scala/no/found/runner/allocation/stateless/KibanaDockerContainer.scala#L569

ycombinator commented 2 weeks ago

We need to figure out where we want those traces to be sent; it seems that's not clear. Do we want them sent to the APM server running on the deployment (so they're accessible to the user), or are they for our own usage, in which case we'd want to send them to a monitoring cluster?

@simitt is probably the best person to answer this question.

simitt commented 1 week ago

The APM data should be sent to the internal Elastic cluster, for support engineers and developers to leverage the data for troubleshooting.

ycombinator commented 1 week ago

Thanks @simitt.

@juliaElastic @nchaulet Given this, do you know what the policy should use for the values of <apm host url>, <apm environment>, and <secret token>? And, more importantly, from where/how it should get those values to inject them into the policy?

ycombinator commented 1 week ago

BTW, since I've been testing on ESS QA and feature freeze for ms-113.0 is coming up, I'm going to revert my Cloud PRs (specifically https://github.com/elastic/cloud/pull/130605, https://github.com/elastic/cloud/pull/130769, and https://github.com/elastic/cloud-assets/pull/1569) for now, so these changes don't accidentally get released to production. Once ms-113.0 has been released, I will reintroduce these (and any other changes based on the answer to the previous comment) and resume testing in ESS QA.

juliaElastic commented 1 week ago

I'm not sure how to reference the internal Elastic cluster in kibana config. @simitt Could you help with that?

simitt commented 1 week ago

I think @AlexP-Elastic might be able to provide the details here, as ES and Kibana are already shipping tracing data to cloud regional clusters.

AlexP-Elastic commented 1 week ago

I'm not sure how to reference the internal Elastic cluster in kibana config

The Elastic cluster is injected by control plane into the templated config files in the stack pack: https://github.com/elastic/cloud/blob/master/scala-services/runner/src/main/scala/no/found/runner/allocation/stateless/KibanaDockerContainer.scala#L561-L589

for regional values, and

https://github.com/elastic/cloud/blob/ca0c6cabbf811a9fc2da3d5fe71738c56679e563/scala-services/adminconsole/src/main/scala/no/found/adminconsole/api/v1/deployments/services/controlplanesettings/providers/kibana/KibanaInternalApmSettings.scala#L49-L57

for global values

So, assuming I'm understanding correctly and you want some Kibana code to inject them into the APM policy, the easiest would be if you could reference elastic.apm.serverUrl / elastic.apm.secretToken / elastic.apm.environment from the YAML.

(There's an additional complication in getting the fields that are needed to bypass IP filtering.)

juliaElastic commented 1 week ago

so assuming I'm understanding correctly and you want some Kibana code to inject them into the APM policy, easiest would be if you could reference elastic.apm.serverUrl / elastic.apm.secretToken / elastic.apm.environment from the YAML

Are these values already available to reference in kibana.yml, or is a code change needed to make them work?

AlexP-Elastic commented 1 week ago

These values are already in the Kibana YAML (for versions of Kibana that support them).

You can't see them in the stackpack you linked because they are injected by the control plane infrastructure explicitly (not via the templating we use for defaults).

juliaElastic commented 5 days ago

I tested locally, setting monitoring in a preconfigured policy with overrides as Nicolas suggested here: https://github.com/elastic/elastic-agent/issues/5204#issuecomment-2321621585. It seems to work, so I'm going to add this to the cloud config.

juliaElastic commented 4 days ago

@AlexP-Elastic I tested the change in the latest 8.16-SNAPSHOT and it seems the substitutions of elastic.apm.serverUrl / elastic.apm.secretToken / elastic.apm.environment didn't work. How can we fix this? Instance: https://staging.found.no/deployments/edc7201fc9a5448982955f7017e02106

agent:
  monitoring:
    enabled: true
    traces: true
    apm:
      hosts:
        - ''
      environment: null
      secret_token: null

AlexP-Elastic commented 4 days ago

Here are extracts from the Kibana YAML file for that cluster (assuming I'm correctly understanding what you were doing - reading entries from the Kibana YAML and using that to fill in fields in a policy dynamically?)

elastic.apm.serverUrl: "https://27111427665b4b01a36414f98516aa81.apm.eastus2.staging.azure.foundit.no:443"
elastic.apm.secretToken: "REDACTED"
elastic.apm.serviceNodeName: "edc7201fc9a5448982955f7017e02106-kibana-instance-0000000000"
elastic.apm.globalLabels.deploymentId: "edc7201fc9a5448982955f7017e02106"
elastic.apm.globalLabels.deploymentName: "jb test 8.16 sept 10"
elastic.apm.globalLabels.organizationId: "922053753"

It seems like if this (reading fields from the YAML) works locally on a laptop, it should work everywhere?

One thing I don't fully understand is what the goal is here.

The elastic-agent YAML local to the APM container has the right config also:

agent.monitoring.apm.hosts:
- "https://27111427665b4b01a36414f98516aa81.apm.eastus2.staging.azure.foundit.no:443"
agent.monitoring.apm.secret_token: "REDACTED"
agent.monitoring.apm.global_labels.serviceNodeName: "edc7201fc9a5448982955f7017e02106-apm-instance-0000000000"
agent.monitoring.apm.global_labels.deploymentId: "edc7201fc9a5448982955f7017e02106"
agent.monitoring.apm.global_labels.deploymentName: "jb test 8.16 sept 10"
agent.monitoring.apm.global_labels.organizationId: "922053753"

Is it not preferable for the agent to inject this config to APM?

AlexP-Elastic commented 4 days ago

Oh no, I just realized I was guilty of totally not understanding what you were doing and giving your last PR a distracted LGTM instead of reading it :(

Sorry about that

What I thought you were proposing was to have Kibana inject the fields into the policy (and that the PR https://github.com/elastic/cloud-assets/pull/1573/files was just adding some placeholder fields).

What it actually does

      agent.monitoring:
        apm:
          hosts:
            - "{{ elastic.apm.serverUrl }}"
          environment: {{ elastic.apm.environment }}
          secret_token: {{ elastic.apm.secretToken }}

doesn't work at all because elastic.apm.serverUrl isn't an actual variable in the templating (it's part of the rendered YAML, which cannot be used to render the same YAML :) )

Hang on, I need to think about this for a moment, now that I know you are trying to do it all via templates and not via code in Kibana.

juliaElastic commented 4 days ago

@AlexP-Elastic No worries, we could inject the fields from Kibana too; I'm just not sure where to get these values from in Kibana.

AlexP-Elastic commented 4 days ago

This is one option: https://github.com/elastic/cloud/pull/131470/files ... I think this is preferable to writing code inside Kibana, if we have to do it using the existing settings.

I think my preferred architectural solution would be for the APM container to take the values injected into elastic-agent.yaml and merge them with the policy ... but presumably that is not easy to do?

juliaElastic commented 4 days ago

I think my preferred architectural solution would be for the APM container to take the values injected into elastic-agent.yaml and merge them with the policy ... but presumably that is not easy to do?

Yeah, I'm not sure if we could change the preconfiguration or call the Kibana Fleet API from the APM container to modify the cloud agent policy.

kpollich commented 2 days ago

@juliaElastic - Is this blocked because the root-cause fix here would involve making changes to APM itself, or because we are waiting on https://github.com/elastic/cloud/pull/131470?

juliaElastic commented 2 days ago

I'm waiting to see if https://github.com/elastic/cloud/pull/131470 works; otherwise we will need to get some help from the APM team to see if we can add the config from the APM container.

AlexP-Elastic commented 2 days ago

I'm just testing https://github.com/elastic/cloud/pull/131470 now. I think we (control plane) are happy to go forward with this as the plan, so once it's working we'll get it merged (hopefully by the end of the week), and you can follow the kibana.yml in that PR to make the changes to cloud-assets (which is what ESS runs).