census-ecosystem / opencensus-go-exporter-stackdriver

OpenCensus Go exporter for Stackdriver Monitoring and Trace
Apache License 2.0

trace: allow for concurrent uploads to Stackdriver #245

Closed · nicktrav closed this issue 4 years ago

nicktrav commented 4 years ago

Is your feature request related to a problem? Please describe.

When a process is exporting a large volume of spans (O(100s) per second), the trace exporter can get backed up, resulting in a large number of goroutines being scheduled and blocking on a semaphore. Spans are eventually dropped when the BundleByteLimit is reached, with errors like the following visible in the logs:

2019-12-24T16:18:04.395695Z     info    OpenCensus Stackdriver exporter: failed to upload 376 spans: buffer full
2019-12-24T16:18:09.395916Z     info    OpenCensus Stackdriver exporter: failed to upload 381 spans: buffer full
2019-12-24T16:18:14.396216Z     info    OpenCensus Stackdriver exporter: failed to upload 308 spans: buffer full
2019-12-24T16:18:19.396421Z     info    OpenCensus Stackdriver exporter: failed to upload 332 spans: buffer full
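
For reference, a minimal sketch of the kind of setup that runs into this under sustained load (the project ID and always-on sampler are illustrative, not the values we actually use):

package main

import (
    "log"

    "contrib.go.opencensus.io/exporter/stackdriver"
    "go.opencensus.io/trace"
)

func main() {
    exporter, err := stackdriver.NewExporter(stackdriver.Options{
        ProjectID: "my-gcp-project", // illustrative
    })
    if err != nil {
        log.Fatalf("failed to create Stackdriver exporter: %v", err)
    }
    defer exporter.Flush()

    trace.RegisterExporter(exporter)
    trace.ApplyConfig(trace.Config{DefaultSampler: trace.AlwaysSample()})
    // Under O(100s) of exported spans per second, uploads queue behind a
    // single in-flight request and the bundler starts reporting "buffer full".
}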

This problem appears to be due, at least in part, to the exporter allowing only a single in-flight upload to Stackdriver at a time, enforced by the wait on the Bundler:

// From google.golang.org/api/support/bundler: a handler runs only once its
// ticket is next and fewer than HandlerLimit handlers are active; HandlerLimit
// defaults to 1, hence the single in-flight upload.
for !(ticket == b.nextHandled && len(b.active) < b.HandlerLimit) {
    b.cond.Wait()
}

Describe the solution you'd like

Allow the trace exporter to have a configurable number of "workers", making use of Options.NumberOfWorkers (added in v0.12.8) to set the maximum number of concurrent goroutines that upload spans to Stackdriver.
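
A rough sketch of the intended wiring, assuming the worker count is plumbed through to the bundler's HandlerLimit (the package, helper name, and default below are illustrative, not the exporter's actual code):

package sketch

import (
    "google.golang.org/api/support/bundler"
    tracepb "google.golang.org/genproto/googleapis/devtools/cloudtrace/v2"
)

// defaultNumberOfWorkers preserves today's behaviour of a single uploader.
const defaultNumberOfWorkers = 1

// newSpanBundler is a hypothetical helper showing the relevant change.
func newSpanBundler(numWorkers int, upload func(spans []*tracepb.Span)) *bundler.Bundler {
    b := bundler.NewBundler((*tracepb.Span)(nil), func(bundle interface{}) {
        upload(bundle.([]*tracepb.Span))
    })
    if numWorkers <= 0 {
        numWorkers = defaultNumberOfWorkers
    }
    // HandlerLimit bounds how many bundle handlers run concurrently; it
    // defaults to 1, which is exactly the single in-flight upload above.
    b.HandlerLimit = numWorkers
    return b
}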

Describe alternatives you've considered

Increasing the batch size of the trace exporter helps (via Options.BundleCountThreshold), as there is a material fixed cost per export call to Stackdriver (measured as a few hundred milliseconds on GCP). Larger batches result in better throughput, but this doesn't address the single in-flight request limitation.
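
For concreteness, this workaround only changes the Options relative to the setup sketch above (the values are illustrative and would need tuning against the observed span rate; "time" must also be imported):

exporter, err := stackdriver.NewExporter(stackdriver.Options{
    ProjectID:            "my-gcp-project",
    BundleCountThreshold: 1000,            // more spans per bundle: fewer, larger uploads
    BundleDelayThreshold: 5 * time.Second, // let bundles fill up before flushing
})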

Additional context

To provide some more context: this is currently an issue (as of Istio 1.4.2) in Istio's "mixer" component, which serves as an aggregation point for tracing data sent from remote proxy instances. Mixer is responsible for uploading the spans to Stackdriver, and the concurrency bottleneck is apparent under high load.