Otel Tracing Spec: Use netName in messaging target name if it's available

dmathieu commented 1 year ago

I'm not sure the template applies, as this spec change is more of a clarification, and agents would either not do anything, or have already handle that lack of information.

When handling messaging traces, if there is no messaging.destination, we may use messaging.url, or netName as the target name instead. We previously had to set netName, but it was never used. So this change uses the set value.

cc @SylvainJuge

[ ] Create PR as draft
[ ] Approval by at least one other agent
[ ] Mark as Ready for Review (automatically requests reviews from all agents and PM via CODEOWNERS)
- Remove PM from reviewers if impact on product is negligible
- Remove agents from reviewers if the change is not relevant for them
[ ] Merge after 2 business days passed without objections \ To auto-merge the PR, add /schedule YYYY-MM-DD to the PR description.

May the instrumentation collect sensitive information, such as secrets or PII (ex. in headers)?
- [ ] Yes
- [ ] Add a section to the spec how agents should apply sanitization (such as sanitize_field_names)
- [ ] No
- [ ] Why?
- [ ] n/a
[ ] Create PR as draft
[ ] Approval by at least one other agent
[ ] Mark as Ready for Review (automatically requests reviews from all agents and PM via CODEOWNERS)
- Remove PM from reviewers if impact on product is negligible
- Remove agents from reviewers if the change is not relevant for them
[ ] Approved by at least 2 agents + PM (if relevant)
[ ] Merge after 7 days passed without objections \ To auto-merge the PR, add /schedule YYYY-MM-DD to the PR description.
[ ] Create implementation issues through the meta issue template (this will automate issue creation for individual agents)
[ ] If this spec adds a new dynamic config option, add it to central config.

SylvainJuge commented 1 year ago

Looks good, as it just provides an extra fallback service name value the impact should be quite minimal. I'll add to the next agents weekly meeting to get more feedback.

gregkalapos commented 1 year ago

I also think this is fine.

One thing I'd mention is that we set serviceTargetName here to something generic (in this case it'll be the endpoint based on network attributes). serviceTargetName drives our service map and we identify dependencies based on that. This means if 2 (or more) messaging systems run on the same endpoint and none of them have messaging.destination, then they'll be merged into a single dependency - and we'd use netName in the name of this single dependency.

But I don't see this as a blocker for this PR, since the dependency detection wasn't better prior to this PR either and this is just a fallback - ideally we have messaging.destination or messaging.url.

SylvainJuge commented 1 year ago

When trying to implement this change in the Java agent in https://github.com/elastic/apm-agent-java/pull/3129, I found there was some aspects that I completely overlooked when reviewing this PR.

In the gherkin specs, we have a few strategies for values in target_service_name, and destination resource:

use the network host/IP as an identifier in the protocol: HTTP (part of the URL) and RPC are fitting this model
only use an "high-level" value: DB and messaging (at least before this PR) fits this model

The strategy for DB and messaging spans is inherited from the way the span destination resource was initially implemented, which could explain why the network level fallback are omitted.

So, the change in the gherkin spec for messaging seems a bit inconsistent because we mix the two approaches here. If we want to go in this direction, then maybe it would make sense to also do it for other protocols (which might definitely be tricky to do). Also, maybe covering the 1st and 4th cases where only messaging.url or net.peer.ip is provided would be relevant.

The initial idea for target.service.name and service.target.type would be to replace destination.resource, but the latter was kept for compatibility reasons. For example, for DB spans having service.target.{type,name} allows to reconstruct an equivalent destination.resource (the UI map and service dependencies might still rely on this).

In an ideal situation, we should not have to deal with such mapping in the agents, but we have a few agent features that rely on resource/service target type,name. I wonder if we could find a way to make it rely only on "set of otel field names" rather than having such tedious specification.

@dmathieu what do you think we can do to improve this ?

dmathieu commented 1 year ago

I agree that in an ideal situation, we shouldn't have to do this kind of mapping. The ingest (or UI) should know about semantic conventions (/ECS), and derive data from those appropriate fields.

I'm not sure I get the full context of this issue. The main reason for this change here was that netName was set, but never used for messaging, so it made sense to use it the same way as the spec specifies for HTTP spans. If that made more sense, we could also entirely remove the notion of netName in messaging spans.

SylvainJuge commented 1 year ago

Yes, I think it would make more sense (and be simpler) to leave netName outside of the messaging spans for now, let's discuss this in next agents meeting.

SylvainJuge commented 1 year ago

@dmathieu During last agents weekly meeting, I did not manage to explain why I think we should probably be better off keeping this as-is (as it was before this PR), so here are my thoughts on this change. Also, we could add a special case for otel bridge as a compromise, however that would further complicate the destination.resource field specification as it is expected to be derived from service.target.type and service.target.name.

we need to keep service.target.type and service.target.name in line with destination.service.resource for compatibility with UI and data sent by existing agents.
the spec for destination.service.resource is defined in https://github.com/elastic/apm/blob/main/specs/agents/tracing-spans-destination.md#destination-resource and the inference algorithm currently only has an exception for external spans.
- adding another exception for messaging spans would also impact our own messaging instrumentation (not just OTel)
- adding an exception just for OTel could be doable, but further complicates the destination.service.resource heuristic, and it would require at least to update the gherkin spec that was changed in this PR.
the (probably optimistic) plan was to deprecate this field and only rely on service.target.{type,name} but it's still required for now.
from experience, changing the spec in this area tends to be tedious (need to change all agents), and also bug prone as it's used by map and service dependencies.
I think further improvement here will be better handled once we make progress on the ECS+OTel merge as the ways to identify services and dependency have been stabilized in semantic conventions (doing it right now would likely having to change things afterwards)
we expect only a few users of the otel bridge to see a difference here, so keeping it consistent for now is probably a better choice to limit maintenance burden.

If you have ideas how to better handle this, I am definitely interested. Ideally we should be able to make the agents and apm-server mapping easier to maintain and independent from UI, but that's not really something we can tackle right now.

elastic / apm

Otel Tracing Spec: Use netName in messaging target name if it's available #787