GoogleCloudPlatform / opentelemetry-operations-go

Apache License 2.0
134 stars 102 forks source link

Recommended span batch write size #871

Closed jameshartig closed 6 days ago

jameshartig commented 3 months ago

The trace Readme.md doesn't call WithMaxExportBatchSize which means the export batch size is the default, 512. This means that every 512 spans the batch span processor will synchronously [1] export the spans during which new spans will be queued up to the MaxQueueSize which defaults to 2048. If you are regularly doing 1000s of spans per second you will end up calling batchWriteSpans several times a second needlessly using up the 4,800/minute quota [2] and if any of those RPC calls takes more than a couple hundred milliseconds you will end up dropping spans.

The quotas page [2] says that the "Maximum number of spans per PatchTraces call" is 25,000 (assuming that PatchTraces is the same as batchWriteSpans). Does that mean we can set the batch export size to 25,000? I know that many of Google's RPC methods have internal request size limits and I'm concerned that setting it at 25,000 might trip the limit. Is there a recommended value?

There was a discussion [3] a few years ago where it seemed like it was settled on 6k for that particular case mostly because higher batch sizes were slower. Is that still the case now? I also saw that the batching implementation was replaced since then which means there isn't a need for increasing the grpc pool size since there isn't more than 1 outstanding RPC at a time when using the batch span processor, correct?

We have replaced the batch span processor internally with our own implementation to help improve the situation but I still believe the recommendations here are relevant for either one.

[1] https://github.com/open-telemetry/opentelemetry-go/blob/main/sdk/trace/batch_span_processor.go#L317 [2] https://cloud.google.com/trace/docs/quotas [3] #456

jameshartig commented 3 months ago

We've gone with a limit of 20,000 for now. I'll report back if we have issues.

aabmass commented 3 months ago

Thanks for raising the issue. We'll try to figure out good defaults and update the documentation.

damemi commented 1 month ago

PatchTraces is the v1 cloud trace API while BatchWriteSpans is v2 which our exporter uses. There doesn't actually seem to be any limit on BatchWriteSpans. I'll work on a summary of @dashpole's performance tuning findings from #456 for the readme