elastic / ecs-logging-java

https://www.elastic.co/guide/en/ecs-logging/java/current/intro.html
Apache License 2.0
139 stars 74 forks source link

support for OTel APM trace_id and span_id #214

Open ty-elastic opened 9 months ago

ty-elastic commented 9 months ago

Assume you want to use the OTel APM library (for tracing) along with this ECS logging library for ECS-formatted logs (to be ingested, for example, via Filebeat). In Elasticsearch, we want correlation between OTel APM traces and app logs.

The OTel APM library (via OTel's Logback MDC) today puts trace_id and span_id on the MDC (good), but of course today, these field names are not ECS-compliant (e.g., "trace.id" and "span.id"). To allow Elasticsearch to correlate today, we need to translate these fields either in Filebeat (Agent) or in an ingest pipeline.

One solution might be to convince OTel's Logback MDC to adopt a flexible naming convention, or to align to ECS.

That said, given that "trace.id" and "span.id" are first-class ECS fields, it kind of feels like the ECS logging library should explicitly handle (special case) these fields, rather than just assuming they are named correctly on the MDC (e.g., as our own APM library would do). I saw another PR that suggested prefixing MDC field with "label" (make sense), but that would of course further obscure trace.id and span.id.

Should we consider special handling for "trace.id" and "span.id" (always top-level), perhaps with specific code to pick these up from our own APM library (basically pass-through copy from the MDC) and from OTel (with translation from trace_id and span_id)?

SylvainJuge commented 9 months ago

Hi @ty-elastic ,

in ECS-logging, the implementation is just "take anything in MDC and write it as-is", thus the responsibility for taking the trace.id, transaction.id and span.id is on the APM agent side.

Thus in this case making the OTel agent able to configure the keys is probably the best option, I have added a proposal for the configuration here: https://github.com/open-telemetry/opentelemetry-java-instrumentation/issues/9390#issuecomment-1733663945, and it is to be something we could contribute.

SylvainJuge commented 9 months ago

Another aspect that could also be contributed to OTel agent is the ability to inject the "service level" attributes in the MDC, those are called resource attributes in otel.

The OTLP log format already have a defined field for resources in a log event, and even service.name , however the resource map is not injected in the MDC (yet), which could be used in our case to copy those attributes in the ECS log output.

ty-elastic commented 9 months ago

Hey @SylvainJuge , thanks for getting back to me.

Yeah - I think a contribution on the OTel side makes sense here, in the spirit of us, in general, providing best-in-class OTel support.

That said, I guess I view (in modern apps) APM data as "first-class", and since these are first-class ECS fields, it sort of feels like maybe an ECS-logging library should take care of mapping these attributes to ECS, rather than relying on something upstream to produce ECS-compliant fields? Sort of a "the buck stops here" mentality with regard to providing ECS-compliant output. From that perspective, one could argue that our library should have some ability to map any incoming MDC fields to ECS fields, even beyond APM (e.g., mapping some arbitrary key containing service.version to service.version). thoughts?

felixbarny commented 9 months ago

From that perspective, one could argue that our library should have some ability to map any incoming MDC fields to ECS fields, even beyond APM (e.g., mapping some arbitrary key containing service.version to service.version). thoughts?

To me, that sounds like a job for an ingest pipeline rather than the ECS logging library.

Long-term I think it makes sense to donate ECS loggers to OTel. In the context of the ECS/OTel semantic convention merger, there should be a decision on whether the canonical name is trace.id or trace_id.

@AlexanderWert has that been discussed already?

AlexanderWert commented 9 months ago

@AlexanderWert has that been discussed already?

Yes, those will stay trace_id, span_id in non-OTLP log formats:

felixbarny commented 9 months ago

Does that imply that we should change trace.id to trace_id in ECS and in the documents that APM Server produces?

ty-elastic commented 9 months ago

@felixbarny ,

To me, that sounds like a job for an ingest pipeline rather than the ECS logging library.

yeah - I could see it either way. To me, I would expect an ECS logging library to take-on the burden of mapping local app parlance to ECS, if an ECS field is available for a given field that the local app maintains. Else, it starts to feel like this library maps some things, but then I also need an ingest pipeline to map other things => 2 'things' to maintain.

That said, whatever is decided, it seems like ideally there is at least very good support for including/mapping OTel trace_id and span_id in log statements without requiring an ingest pipeline, in the spirit of supporting OTel APM Agents. Whether that is done with a PR to the OTel logging lib or is done via mapping in this library, that would help us with customers who want to use OTel for APM but Elastic (native) for logging (since OTel logging is still largely beta, and has a different design pattern than customers have in their existing elastic-logging infra).

felixbarny commented 9 months ago

I would expect an ECS logging library to take-on the burden of mapping local app parlance to ECS, if an ECS field is available for a given field that the local app maintains

To me, it's more about mapping well-known attributes to ECS. I definitely agree that there needs to be a solution for the trace.id vs trace_id problem. However, it seems to me that the ECS/OTel semconv merger is responsible for solving this. When it comes to application-specific attributes, I don't think ECS loggers are responsible for the mapping. That seems to be better managed centrally in an ingest pipeline.

That said, whatever is decided, it seems like ideally there is at least very good support for including/mapping OTel trace_id and span_id in log statements without requiring an ingest pipeline, in the spirit of supporting OTel APM Agents.

Fully agree here. Maybe a short-term solution here could be to re-write trace_id to trace.id. However, if the end-goal is to fully align with OTel, we may need to think about re-writing trace.id to trace_id. But that would not only affect log correlation. We also need to think about how migration would work here.

jack78901 commented 8 months ago

The library responsible for assisting in helping logging compliance with ECS should help to do just that with trace.id and span.id. I thought that is what the ecs-logging-java was supposed to do?

Why would we require a pipeline to do this?

As a user of this library, I would love not to have to involve a DevOps team to manage a pipeline if I could handle this in code.

jack78901 commented 8 months ago

Also, as a point, OpenTelemetry, even with the advent of ECS being merged with the Semantic Conventions, will not resolve this issue for us, as they are saying it is not their issue on their end but to be resolved by users as well. This can be seen: https://github.com/open-telemetry/oteps/blob/d02a3e2e75dc934fb38c5db88eb41fbe85730af4/text/0199-support-elastic-common-schema-in-opentelemetry.md#known-differences.