elastic / apm-server

Multiple output support for apm-server (for multiple viewers) #8886

Open matschaffer opened 2 years ago

matschaffer commented 2 years ago

We have a couple of efforts in flight that are using (or might use) APM as a delivery mechanism for important stack data.

For example:

As these shape up, we'll reach a point where we want the APM-delivered data to be available both for internal debugging/support purposes and for external self-service operational use by people running Elastic products in their own orgs.

One option for making this data available could be to have apm-server output to two Elasticsearch deployments (internal + external).
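For illustration, here's a minimal Go sketch of what server-side fan-out could look like. Everything in it (`BulkWriter`, `FanOutProcessor`) is invented for the sketch; apm-server only supports a single output today:

```go
package main

// Hypothetical sketch: duplicate every event batch to two (or more)
// Elasticsearch deployments. BulkWriter and FanOutProcessor are
// invented names; apm-server has no such types today.

import (
	"context"
	"errors"
	"sync"
)

// BulkWriter abstracts bulk-indexing into one Elasticsearch deployment.
type BulkWriter interface {
	WriteBulk(ctx context.Context, docs [][]byte) error
}

// FanOutProcessor sends each batch to all configured outputs,
// e.g. one internal and one external deployment.
type FanOutProcessor struct {
	Outputs []BulkWriter
}

func (p *FanOutProcessor) ProcessBatch(ctx context.Context, docs [][]byte) error {
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		errs []error
	)
	for _, out := range p.Outputs {
		wg.Add(1)
		go func(w BulkWriter) {
			defer wg.Done()
			// Writes happen independently so a slow or failing
			// deployment doesn't block the other one.
			if err := w.WriteBulk(ctx, docs); err != nil {
				mu.Lock()
				errs = append(errs, err)
				mu.Unlock()
			}
		}(out)
	}
	wg.Wait()
	// Simplistic error handling; real code would need per-output
	// retry/backpressure semantics, which is where this gets hard.
	return errors.Join(errs...)
}
```

The hard design questions are less the fan-out itself and more what happens when one deployment is down: block, drop, or buffer per output.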

There may be other options as well. If we have something viable we can close this in favor of whatever we find.

axw commented 2 years ago

> One option for making this data available could be to have apm-server output to two Elasticsearch deployments (internal + external).

One major complication is that APM Server is currently stack-version aligned. It's expected that Elasticsearch is at least as new as APM Server, and that it has an integration package installed that is at least as new as APM Server. As things are today, this would likely make it difficult to send data to a shared internal cluster, which would have to be kept up-to-date at all times.

As it happens, we've been talking about changing things a bit so that APM Server is aware of the target integration version and adjusts the documents it writes accordingly. So in theory we could have two outputs that write the docs in different ways, depending on the target ES/integration package versions.
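To make that concrete, a hedged sketch of per-target-version document encoding. The field names and the version cutoff are invented; this is not how apm-server actually encodes documents:

```go
package main

// Hypothetical sketch: each output declares the integration package
// version installed on its target deployment, and the server renders
// documents in the schema that target understands.

import (
	"encoding/json"
	"fmt"
)

type Event struct {
	TraceID string
	Name    string
}

type Output struct {
	Name               string
	IntegrationVersion int // invented: major version of the target's APM integration
}

func encodeFor(e Event, out Output) ([]byte, error) {
	if out.IntegrationVersion >= 8 {
		// Newer-style dotted field names (illustrative only).
		return json.Marshal(map[string]string{
			"trace.id":         e.TraceID,
			"transaction.name": e.Name,
		})
	}
	// Older target: a made-up legacy layout.
	return json.Marshal(map[string]string{
		"traceId": e.TraceID,
		"name":    e.Name,
	})
}

func main() {
	e := Event{TraceID: "abc123", Name: "GET /orders"}
	for _, out := range []Output{{"internal", 7}, {"external", 8}} {
		doc, _ := encodeFor(e, out)
		fmt.Printf("%s -> %s\n", out.Name, doc)
	}
}
```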

matschaffer commented 2 years ago

> As things are today, this would likely make it difficult to send data to a shared internal cluster, which would have to be kept up-to-date at all times.

Interesting! APM traces for Kibana on ESS only go to a shared regional cluster today. It's the external-facing part that we'll need to address eventually.

joshdover commented 2 years ago

I discussed this problem today with @AlexP-Elastic and there was a strong preference for sending to multiple outputs from the APM agent level, rather than from APM Server. This simplifies operations significantly, as APM Server won't need to be able to authorize and route to N ES clusters (which are constantly changing).

The downside would be that processing gets duplicated, but I think it's likely preferable for ESS in general, as it would also allow customers to define their own APM settings, like tail-based sampling policies.
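For comparison, a sketch of what agent-level fan-out would amount to. The `/intake/v2/events` path and ndjson content type are the real APM Server intake API, but no Elastic agent duplicates payloads like this today:

```go
package agent

// Hypothetical sketch of agent-side multiplexing: POST the same
// serialized intake payload to several APM Server endpoints.

import (
	"bytes"
	"log"
	"net/http"
)

func sendToAll(servers []string, ndjsonPayload []byte) {
	for _, base := range servers {
		// Every extra endpoint is another network round trip and
		// another failure mode the agent must handle in-process.
		resp, err := http.Post(
			base+"/intake/v2/events",
			"application/x-ndjson",
			bytes.NewReader(ndjsonPayload),
		)
		if err != nil {
			log.Printf("send to %s failed: %v", base, err)
			continue
		}
		resp.Body.Close()
	}
}
```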

joshdover commented 2 years ago

cc @felixbarny: has there been any discussion about supporting this at the agent level?

felixbarny commented 2 years ago

cc @gregkalapos, who's the new APM agent tech lead.

My 2c: we should not add the ability for APM agents to multiplex data to multiple APM Servers. That wouldn't be trivial, it comes with a performance penalty, and we'd need to duplicate the effort across all agents. Architecturally, I think it's cleaner if APM agents send the data to one location, which can then route it to different outputs.

joshdover commented 2 years ago

Thanks for the response, Felix. I don't think either option is trivial 😄

The APM Server routing option (this ticket) could be more feasible if we were to send metrics/traces to the customer's APM Server first, and then have that server configured to duplicate to our internal clusters as well as the customer's. This avoids the potential issue of mixing multiple customers' data and then having to conditionally route back out to their clusters. A few downsides:

1. Our own visibility into customer deployments becomes dependent on the health of the customer's cluster.
2. Additional CPU and memory usage would be required on the customer's APM Server instance.
3. As mentioned above, it's unclear how we would configure features like different tail-based sampling policies when sending to multiple clusters.

> comes with a performance penalty

Do we have a way to quantify this? My assumption is that the bulk of an APM agent's overhead comes from the actual instrumentation, stack trace capturing, and processing, and not the sending of data.
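One rough way to get a number, sketched as a Go benchmark that isolates just the send path against a local no-op server. Loopback latency won't reflect real networks, and this deliberately excludes instrumentation and serialization costs:

```go
package agent

import (
	"bytes"
	"io"
	"net/http"
	"net/http/httptest"
	"testing"
)

// BenchmarkSendOnly measures only the HTTP send of a fixed ndjson
// payload, so the extra cost of sending to a second endpoint could
// be compared against the agent's total overhead.
func BenchmarkSendOnly(b *testing.B) {
	srv := httptest.NewServer(http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) {
			io.Copy(io.Discard, r.Body) // accept and discard, like a no-op server
		}))
	defer srv.Close()

	payload := bytes.Repeat([]byte(`{"span":{}}`+"\n"), 100)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		resp, err := http.Post(srv.URL, "application/x-ndjson",
			bytes.NewReader(payload))
		if err != nil {
			b.Fatal(err)
		}
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
	}
}
```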

felixbarny commented 1 year ago

I think we need to have a high-bandwidth discussion about this. There are so many things that play into it, and I don't have full clarity on all of them.

Some unordered thoughts: