elastic / apm

Elastic Application Performance Monitoring - resources and general issue tracking for Elastic APM.
https://www.elastic.co/apm
Apache License 2.0
384 stars 114 forks source link

Otel Tracing Spec: Use netName in messaging target name if it's available #787

Closed dmathieu closed 1 year ago

dmathieu commented 1 year ago

I'm not sure the template applies, as this spec change is more of a clarification, and agents would either not do anything, or have already handle that lack of information.

When handling messaging traces, if there is no messaging.destination, we may use messaging.url, or netName as the target name instead. We previously had to set netName, but it was never used. So this change uses the set value.

cc @SylvainJuge


SylvainJuge commented 1 year ago

Looks good, as it just provides an extra fallback service name value the impact should be quite minimal. I'll add to the next agents weekly meeting to get more feedback.

gregkalapos commented 1 year ago

I also think this is fine.

One thing I'd mention is that we set serviceTargetName here to something generic (in this case it'll be the endpoint based on network attributes). serviceTargetName drives our service map and we identify dependencies based on that. This means if 2 (or more) messaging systems run on the same endpoint and none of them have messaging.destination, then they'll be merged into a single dependency - and we'd use netName in the name of this single dependency.

But I don't see this as a blocker for this PR, since the dependency detection wasn't better prior to this PR either and this is just a fallback - ideally we have messaging.destination or messaging.url.

SylvainJuge commented 1 year ago

When trying to implement this change in the Java agent in https://github.com/elastic/apm-agent-java/pull/3129, I found there was some aspects that I completely overlooked when reviewing this PR.

In the gherkin specs, we have a few strategies for values in target_service_name, and destination resource:

The strategy for DB and messaging spans is inherited from the way the span destination resource was initially implemented, which could explain why the network level fallback are omitted.

So, the change in the gherkin spec for messaging seems a bit inconsistent because we mix the two approaches here. If we want to go in this direction, then maybe it would make sense to also do it for other protocols (which might definitely be tricky to do). Also, maybe covering the 1st and 4th cases where only messaging.url or net.peer.ip is provided would be relevant.

The initial idea for target.service.name and service.target.type would be to replace destination.resource, but the latter was kept for compatibility reasons. For example, for DB spans having service.target.{type,name} allows to reconstruct an equivalent destination.resource (the UI map and service dependencies might still rely on this).

In an ideal situation, we should not have to deal with such mapping in the agents, but we have a few agent features that rely on resource/service target type,name. I wonder if we could find a way to make it rely only on "set of otel field names" rather than having such tedious specification.

@dmathieu what do you think we can do to improve this ?

dmathieu commented 1 year ago

I agree that in an ideal situation, we shouldn't have to do this kind of mapping. The ingest (or UI) should know about semantic conventions (/ECS), and derive data from those appropriate fields.

I'm not sure I get the full context of this issue. The main reason for this change here was that netName was set, but never used for messaging, so it made sense to use it the same way as the spec specifies for HTTP spans. If that made more sense, we could also entirely remove the notion of netName in messaging spans.

SylvainJuge commented 1 year ago

Yes, I think it would make more sense (and be simpler) to leave netName outside of the messaging spans for now, let's discuss this in next agents meeting.

SylvainJuge commented 1 year ago

@dmathieu During last agents weekly meeting, I did not manage to explain why I think we should probably be better off keeping this as-is (as it was before this PR), so here are my thoughts on this change. Also, we could add a special case for otel bridge as a compromise, however that would further complicate the destination.resource field specification as it is expected to be derived from service.target.type and service.target.name.

If you have ideas how to better handle this, I am definitely interested. Ideally we should be able to make the agents and apm-server mapping easier to maintain and independent from UI, but that's not really something we can tackle right now.