elastic / apm-server

https://www.elastic.co/guide/en/apm/guide/current/index.html

Introduce host.id attribute to traces #4368

Open alex-fedotyev opened 3 years ago

alex-fedotyev commented 3 years ago

Work on metrics and logs identified a problem with using host.name for correlation when ingesting data from cloud environments, as they don't provide a proper host name. The proposed solution is to introduce a host.id field which is "calculated": it equals host.name for on-premises environments and cloud.instance.id for cloud environments. Original issue and spreadsheet with the breakdown.

This seems to align well with the OTel spec, which uses the cloud instance_id as the host.id.

The proposal for APM is to calculate host.id dynamically: use cloud.instance.id when cloud metadata is present, and host.name otherwise. We would leverage this when integrating products together, i.e. linking from Infra to APM and vice versa.
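A minimal sketch of that calculation in Go, for illustration only. The `Metadata` struct and the `resolveHostID` helper are hypothetical names made up for this example, not existing apm-server types, and the sample host names and instance ID are invented.

```go
package main

import "fmt"

// Metadata is a hypothetical, simplified view of the two fields discussed
// in this issue; it is not the apm-server model.
type Metadata struct {
	HostName        string // host.name
	CloudInstanceID string // cloud.instance.id, empty outside cloud environments
}

// resolveHostID returns the value proposed for host.id: the cloud instance ID
// when cloud metadata is present, otherwise the host name.
func resolveHostID(m Metadata) string {
	if m.CloudInstanceID != "" {
		return m.CloudInstanceID
	}
	return m.HostName
}

func main() {
	// Invented sample values, one on-premises host and one cloud instance.
	onPrem := Metadata{HostName: "web-01.example.internal"}
	cloud := Metadata{HostName: "ip-10-0-0-12", CloudInstanceID: "i-0abc123def456"}

	fmt.Println(resolveHostID(onPrem))
	fmt.Println(resolveHostID(cloud))
}
```

Treating an empty cloud.instance.id as "no cloud metadata" keeps the fallback to host.name in one place, so on-premises and cloud hosts go through the same code path.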

We would also need to recognize host.id when ingesting data from OTel.

CC: @graphaelli @felixbarny

alex-fedotyev commented 3 years ago

Pinging: @kaiyan-sheng @exekias @sorantis I just realized that OTel spec suggests using cloud.instance.id while we suggest using cloud.instance.name.

Are those fields the same? Or would it make more sense to align around cloud.instance.id?

axw commented 3 years ago

The OTel spec is vague in the non-cloud case though. In that case what is the unique ID? Is it /etc/machine-id or is it FQDN...?

exekias commented 3 years ago

I just realized that OTel spec suggests using cloud.instance.id while we suggest using cloud.instance.name.

We are suggesting cloud.instance.id too, see https://github.com/elastic/observability-dev/pull/1137/files#diff-c5a9ab0ff94fc3963d0bb04177a5a800457970a01608274951e8a6a0b0023057R40

The OTel spec is vague in the non-cloud case though. In that case what is the unique ID? Is it /etc/machine-id or is it FQDN...?

I would say FQDN works better: machine-id can only be retrieved from inside the machine, so while it is guaranteed to be unique, it's not very useful for correlation (specifically, to correlate events coming from monitoring the machine from outside vs. inside).

alex-fedotyev commented 3 years ago

The OTel spec is vague in the non-cloud case though. In that case what is the unique ID? Is it /etc/machine-id or is it FQDN...?

@cyrille-leclerc - any chance you know how OTel defines host.id in non-cloud environments?

kaiyan-sheng commented 3 years ago

I just realized that OTel spec suggests using cloud.instance.id while we suggest using cloud.instance.name.

Are those fields the same? Or would it make more sense to align around cloud.instance.id?

Yes, we are also using cloud.instance.id. The problem with cloud.instance.name is that it is not a required field for some cloud providers. For example, in AWS EC2 the instance name is optional and is defined by the Name tag.

cyrille-leclerc commented 3 years ago

@axw my understanding is that the only host information we collect in OpenTelemetry traces is host.id, and only when there is network communication, mapping the OTel net.* namespace.

I collected the documents of the transaction and all the spans of a trace. Unfortunately, everything runs on my local MacBook without Docker, which makes it harder to understand the usage of the host.hostname, host.ip... attributes, as everything is localhost/127.0.0.1. See https://gist.github.com/cyrille-leclerc/e5b4a1fb214f83cc9e7819953ebbd3e3 I only found 2 occurrences of host on span documents, on the connection spans.

@axw Could we have omitted to map other Otel host attributes?

I looked at https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/master/exporter/elasticexporter/internal/translator/elastic/traces.go but I didn't find any hint.

axw commented 3 years ago

@axw Could we have omitted to map other Otel host attributes?

Yes; what is there is not comprehensive. We will need to add support for translating host.id, among others.
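As a rough illustration of what such a translation could look like, here is a small Go sketch. It assumes the OTel resource attributes arrive as a plain string map and copies `host.id` and `host.name` into a hypothetical `Host` struct. The `Host` type and the `translateHostAttributes` function are invented for this example and do not reproduce the elasticexporter translator linked above.

```go
package main

import "fmt"

// Host is a hypothetical, simplified stand-in for the Elastic host fields.
type Host struct {
	ID   string // host.id
	Name string // host.name
}

// translateHostAttributes copies the OTel host.* resource attributes this
// sketch knows about into a Host; all other attributes are ignored.
func translateHostAttributes(attrs map[string]string) Host {
	var h Host
	for key, value := range attrs {
		switch key {
		case "host.id":
			h.ID = value
		case "host.name":
			h.Name = value
		}
	}
	return h
}

func main() {
	// Invented sample resource attributes.
	attrs := map[string]string{
		"host.id":      "i-0abc123def456",
		"host.name":    "ip-10-0-0-12",
		"service.name": "checkout",
	}
	h := translateHostAttributes(attrs)
	fmt.Printf("host.id=%s host.name=%s\n", h.ID, h.Name)
}
```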

cyrille-leclerc commented 3 years ago

@alex-fedotyev OpenTelemetry host.id is NOT defined by the OpenTelemetry collector outside of cloud deployments. I only found enrichment of host.id on AWS and GCP so far.



axw commented 3 years ago

https://github.com/elastic/apm-server/pull/4955 will add host.id for OpenTelemetry data.

We still need some conclusion on what to do for our agents. We could just set it to cloud instance ID for now, when it's set.

I would say FQDN works better: machine-id can only be retrieved from inside the machine, so while it is guaranteed to be unique, it's not very useful for correlation (specifically, to correlate events coming from monitoring the machine from outside vs. inside).

@exekias does Beats already do this? I just took a quick look and it appears to be using go-sysinfo's "HostInfo.UniqueID", which is populated using machine-id.

exekias commented 3 years ago

Not yet. Right now Beats reports host.id as the machine ID, so we will need to make a breaking change, or introduce the change directly in the agent. @kaiyan-sheng I think you had an issue to discuss this?

kaiyan-sheng commented 3 years ago

Sorry I just saw this message 🤕 Yes here is the issue: https://github.com/elastic/beats/issues/22739