Consume APM configuration over the elastic-agent control protocol

joshdover commented 1 year ago

In order to enable APM tracing when APM Server is managed by Elastic Agent, APM Server needs to apply APM configuration settings when they're passed over the control protocol. This is necessary to enable tracing of APM Server in ESS and ECE.

These settings should be used to enable APM Server's self tracing functionality, and send the traces to the configured APM Server passed via the control protocol This may require that we change how the apm.Tracer object is created and passed to internal objects as the config may change during runtime.

References:

elastic-agent-client changes: https://github.com/elastic/elastic-agent-client/pull/82
https://github.com/elastic/elastic-agent/issues/2612

simitt commented 11 months ago

https://github.com/elastic/elastic-agent/issues/2612 has been merged; moving to 8.12-candidate to fit timelines.

kruskall commented 7 months ago

It seems beats is already using the agent control protocol under the hood (https://github.com/elastic/apm-server/pull/12508#discussion_r1469817098) so we can't add an additional client.

We can either stop using the beats manager and implement the feature in apm-server or try to work with beats (I'm not sure if this is possible since beats is hardcoding the values passed to the control protocol).

Another issue is that some of the libraries we use hardcode the tracer (e.g. go-docappender) so we can only stop and recreate the indexer or implement an option to retrieve the indexer dynamically.

Any feedback, opinion or alternative solution ?

axw commented 7 months ago

We can either stop using the beats manager and implement the feature in apm-server or try to work with beats (I'm not sure if this is possible since beats is hardcoding the values passed to the control protocol).

Could we update libbeat to consume the TriggeredAPMChange, translate it to YAML settings, and merge that into a config change and update the beat config?

Another issue is that some of the libraries we use hardcode the tracer (e.g. go-docappender) so we can only stop and recreate the indexer or implement an option to retrieve the indexer dynamically.

We already hot-reload everything on config changes, so if we do the above then we can construct a new tracer with the new config.

axw commented 7 months ago

@joshdover @cmacknz does my suggest above (to consume TriggeredAPMChange, update the beat config, and trigger a reload) sound sensible? How do you intend to deal with TriggeredAPMChange for Beats? I think this should probably be solved in libbeat, and then apm-server can consume updates from there. Is that something the control plane team can take on?

cmacknz commented 7 months ago

Taking a closer look at this, the APM configuration is being received in libbeat but it is not exposed in any way.

The APMConfig we need is attached to each unit that may have changed already as part of the expected state, we just don't do anything with it.

https://github.com/elastic/elastic-agent-client/blob/16361c4c4895b633926dcf784ed90f92c985a1a3/pkg/client/unit.go#L156-L163

// Expected contains the expected state, log level, features and config for a unit.
type Expected struct {
    Config    *proto.UnitExpectedConfig
    Features  *proto.Features
    LogLevel  UnitLogLevel
    State     UnitState
    APMConfig *proto.APMConfig
}

Here it is in our management code at the point where we reload the Beat configurations. https://github.com/elastic/beats/blob/7546ae1ceefa19c9a4886ab012fa661eaeaccf4b/x-pack/libbeat/management/managerV2.go#L568-L569

    for _, unit := range units {
        expected := unit.Expected()

The question I have is what we should do with the APM configuration at this point? In what form can APM server actually use it? Our agent client is quite isolated from the rest of Beats, so at the point where we receive the APM configuration we could change a Beat global tracing configuration or add a way to watch changes to it externally.

With respect to TriggeredAPMChange the intent is to allow us to tell when the APM configuration has actually changed when we receive a new configuration from agent. This is something we would want to be looking at in the block below we wanted to build an API for externally observing changes to the tracing configuration outside of the management client:

https://github.com/elastic/beats/blob/7546ae1ceefa19c9a4886ab012fa661eaeaccf4b/x-pack/libbeat/management/managerV2.go#L494-L507

            switch change.Type {
            // Within the context of how we send config to beats, I'm not sure if there is a difference between
            // A unit add and a unit change, since either way we can't do much more than call the reloader
            case client.UnitChangedAdded:
                cm.addUnit(change.Unit)
                // reset can be called here because `<-t.C` is handled in the same select
                t.Reset(cm.changeDebounce)
            case client.UnitChangedModified:
                cm.modifyUnit(change.Unit)
                // reset can be called here because `<-t.C` is handled in the same select
                t.Reset(cm.changeDebounce)

This Beat management code was originally written before the change triggers existed so it doesn't make use of them yet.

axw commented 7 months ago

The question I have is what we should do with the APM configuration at this point? In what form can APM server actually use it? Our agent client is quite isolated from the rest of Beats, so at the point where we receive the APM configuration we could change a Beat global tracing configuration or add a way to watch changes to it externally.

What do you think about introducing a new libbeat/common/reload registry key (say "apm"), and calling any registered reloader(s) when TriggeredAPMChange is received? We would need to marshal it to a config.C, and I think we could just use the original Elastic Agent config YAML structure for that.

In APM Server we would update https://github.com/elastic/apm-server/blob/17969f98f0c54727d8c104af0d03f40b32f43369/internal/beatcmd/reloader.go#L68-L73 to also watch for APM config changes, and hot-reload the process like we already do for input & output config changes.

simitt commented 7 months ago

Moving this out of 8.13 scope due to the uncovered issues. @cmacknz is it reasonable for the EA team to make the required changes (once there is agreement on a path forward) early enough in 8.14 dev cycle so that we can ensure this works gets into apm-server 8.14?

cmacknz commented 7 months ago

Hooking into the existing libbeat configuration reload API makes sense to me as a way to expose this.

Our team could do this, answering when we could do it I'll have to leave with @pierrehilbert.

simitt commented 6 months ago

@pierrehilbert could you please provide an update when this could be done by the Elastic Agent team? We are eager of enabling apm tracing when in managed mode.

simitt commented 4 months ago

@pierrehilbert any update on this please? We are really eager to finally get this in.

pierrehilbert commented 4 months ago

Really sorry @simitt I missed the first mentions. @pchila will probably be the best candidate to spend time on this as he already spend some time on the APM topic. What would be the ideal realistic target to have this done? cc @ycombinator as @pchila will be in his team with the new org on Agent side.

simitt commented 4 months ago

What would be the ideal realistic target to have this done?

Not certain if the question is directed towards @pchila or me, but from our end, the sooner the better. If you can aim for providing functionality in 8.15, so we can do the follow up in 8.16, it would be good.

ycombinator commented 4 months ago

Created https://github.com/elastic/beats/issues/39606 to summarize the discussion between @cmacknz and @axw and track this request in the Ingest team's roadmap.

blakerouse commented 2 months ago

I have PR https://github.com/elastic/beats/pull/40030 up for review. You can start with that PR as a base until it is merged, to implement the feature you need in APM server. If you have any questions just let me know and I can help.

kyungeunni commented 2 months ago

I've done the manual testing by running the APM Server in managed mode with locally built Elastic Agent image and left the result in the comment.

simitt commented 2 months ago

@kyungeunni please ensure with your testing that the self instrumentation can be used in the ESS setup. That is the main use case for us to be able to leverage apm self instrumentation when troubleshooting SDHs.

kyungeunni commented 2 months ago

I'll test this in ESS.

Just sharing some notes:

Fleet-server uses the same APMConfig and seems to have the tracing enabled on their side, not relying on instrumentation package but having their own. Here's the link to the doc for the APM Tracing.

The configs used to make the tracing work on fleet-server are as follows:

Hosts
SecretToken
Global Labels

In the current state, we will be able to send the tracing based on what I've tested locally, but we won't be able to set the Global Labels so we need extra work on this part.

I see three options to support global labels:

set ELASTIC_APM_GLOBAL_LABELS env var before creating a new tracer using instrumentation package from beats (needs to be verified if it works or no)
update beats' instrumentation package to be somewhat similar to what is implemented in fleet-server
have our own implementation of instrumentation like fleet-server does so we don't need to rely on beats.

1 is quick and short, but it's not declarative and a bit hacky. if we choose to go with 2 or 3, we will be able to parse and support more config values coming from APMConfig (e.g. TLS)

axw commented 2 months ago

1 is quick and short, but it's not declarative and a bit hacky. if we choose to go with 2 or 3, we will be able to parse and support more config values coming from APMConfig (e.g. TLS)

If (3) is feasible without significantly more effort, then I would be in favour of that, to reduce our dependency on libbeat.

simitt commented 2 months ago

Required follow up tasks to make this fully work are created and assigned:

1pkg commented 2 months ago

Validated as a part of https://github.com/elastic/apm-server/issues/13604

elastic / apm-server

Consume APM configuration over the elastic-agent control protocol #11381