divviup / janus

Experimental implementation of the Distributed Aggregation Protocol (DAP) specification.
Mozilla Public License 2.0

OpenTelemetry trace propagation #1628

Open divergentdave opened 1 year ago

divergentdave commented 1 year ago

We should enable OpenTelemetry tracing systems to link spans for collection requests to spans in the collection job driver, and to link spans in the aggregation job creator to spans in the aggregation job driver.

I'm still wrapping my head around the available interfaces. Most distributed tracing use cases are HTTP-centric, and propagate this information through the traceparent and tracestate HTTP headers. If possible, I'd like to just use aggregation job IDs and collection job IDs directly as trace IDs (cutting corners by not propagating per-system IDs). If that doesn't work out, the worst case is that we add a jsonb column to each job table to act as a carrier.

divergentdave commented 1 year ago

I think we do need jsonb columns in the database for trace propagation after all. For two spans to be connected, they need to share a 16-byte trace ID and an 8-byte span ID (either as one span's own span ID and the other's parent span ID, or in a link). Additionally, we need to be able to transfer the span flags field in order to make sampling work correctly. For example, if the aggregator process is configured to sample 1% of its root spans, then the "sampled" flag in any one collection job's trace determines whether all subsequent related job processing spans are exported.
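
To make the shape of such a carrier concrete, here is a minimal std-only sketch. It assumes the jsonb column would hold a flat string-to-string map with a single traceparent entry in the W3C Trace Context format (version, trace ID, span ID, flags). The StoredSpanContext type and its methods are hypothetical names for illustration, not existing Janus or OpenTelemetry APIs, and a real implementation would more likely go through an OpenTelemetry propagator than hand-roll the encoding.

```rust
use std::collections::BTreeMap;

/// Hypothetical type: the span context fields that must survive the trip
/// through the database for two spans to be connected.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct StoredSpanContext {
    pub trace_id: [u8; 16],
    pub span_id: [u8; 8],
    pub flags: u8,
}

/// Bit 0 of the trace flags is the "sampled" flag in W3C Trace Context.
const SAMPLED_FLAG: u8 = 0x01;

fn to_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}

/// Decodes exactly N bytes of lowercase hex, rejecting anything else.
fn from_hex<const N: usize>(s: &str) -> Option<[u8; N]> {
    let is_lower_hex = |b: u8| matches!(b, b'0'..=b'9' | b'a'..=b'f');
    if s.len() != 2 * N || !s.bytes().all(is_lower_hex) {
        return None;
    }
    let mut out = [0u8; N];
    for (i, byte) in out.iter_mut().enumerate() {
        *byte = u8::from_str_radix(&s[2 * i..2 * i + 2], 16).ok()?;
    }
    Some(out)
}

impl StoredSpanContext {
    /// Whether spans continuing this trace should be exported.
    pub fn is_sampled(&self) -> bool {
        (self.flags & SAMPLED_FLAG) != 0
    }

    /// Builds the map that would be serialized into the jsonb column
    /// (assumed layout: {"traceparent": "00-<trace id>-<span id>-<flags>"}).
    pub fn to_carrier(&self) -> BTreeMap<String, String> {
        let value = format!(
            "00-{}-{}-{:02x}",
            to_hex(&self.trace_id),
            to_hex(&self.span_id),
            self.flags
        );
        BTreeMap::from([("traceparent".to_string(), value)])
    }

    /// Recovers the span context from a carrier read back from the database.
    /// Returns None for a missing or malformed entry, or for all-zero IDs,
    /// which the W3C format treats as invalid.
    pub fn from_carrier(carrier: &BTreeMap<String, String>) -> Option<Self> {
        let mut parts = carrier.get("traceparent")?.split('-');
        let version = parts.next()?;
        let trace_id: [u8; 16] = from_hex(parts.next()?)?;
        let span_id: [u8; 8] = from_hex(parts.next()?)?;
        let [flags]: [u8; 1] = from_hex(parts.next()?)?;
        let valid = version == "00"
            && parts.next().is_none()
            && trace_id != [0; 16]
            && span_id != [0; 8];
        valid.then_some(Self { trace_id, span_id, flags })
    }
}
```

In this sketch, whichever component creates a job would write to_carrier() alongside the job row, and the job driver would call from_carrier() when it picks the job up, then use the recovered IDs as a parent or link and honor is_sampled() when deciding whether to export its own spans.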