Tracing misbehaviour when service under load

eclipse-ditto / ditto

Eclipse Ditto™: Digital Twin framework of Eclipse IoT - main repository

https://eclipse.dev/ditto/

Eclipse Public License 2.0

Tracing misbehaviour when service under load #1645

Status: Closed (closed by vvasilevbosch 1 year ago)

vvasilevbosch commented 1 year ago

Tracing is incomplete when load testing the service with modifyThing commands sent via a Kafka connection. The trace contains spans with invalid parent span IDs, and many spans that should be there are missing. I attach a JSON export of two traces (one complete, one incomplete) as well as a screenshot from the Jaeger UI.

Attachments: jaeger-invalid-parent-span-ids, traces.zip

thjaeckle commented 1 year ago

@vvasilevbosch are you sampling 100% of all requests? And how much load are we talking about?

I would assume that some traced requests are dropped before tracing slows down the service or overwhelms the OTEL endpoint. The logback logstash appender in use does the same: under heavy load, not all log statements may be available.

Maybe this is even configurable in Kamon, the library Ditto uses for tracing. Did you check?

vvasilevbosch commented 1 year ago

@thjaeckle I have the following setup: 1,000,000 things; 8 instances each of connectivity, policies and things; 1 things-search; 1 gateway; and 1 Kafka connection with 8 clients, to which I send 5000 modifyThing messages per second. I will check the Kamon configuration further. Thanks!

thjaeckle commented 1 year ago

OK, with this load I would expect that you have to scale your Jaeger backend. Every command produces at least 5 spans per trace, reported by at least 3 services in Ditto.
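
To put a number on that, here is a rough back-of-the-envelope calculation using only the figures from this thread (5000 messages per second, at least 5 spans per command). The function name `spans_per_second` is made up for illustration, and the result is a lower bound, not a measurement.

```python
# Rough lower bound on span volume for the load described in this thread.
MESSAGES_PER_SECOND = 5000   # modifyThing messages per second (from the thread)
MIN_SPANS_PER_COMMAND = 5    # "at least 5 spans" per command (from the thread)


def spans_per_second(messages_per_second, spans_per_command, sample_rate=1.0):
    """Hypothetical helper: spans emitted per second at a given sampling rate."""
    return messages_per_second * spans_per_command * sample_rate


# Every request traced.
full = spans_per_second(MESSAGES_PER_SECOND, MIN_SPANS_PER_COMMAND)
# Only 1% of requests traced.
sampled = spans_per_second(MESSAGES_PER_SECOND, MIN_SPANS_PER_COMMAND, sample_rate=0.01)

print(f"100% sampling: at least {full:.0f} spans/s")
print(f"  1% sampling: at least {sampled:.0f} spans/s")
```

So with every request sampled, the tracing reporter and the Jaeger backend have to absorb at least 25,000 spans per second.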

IMO it would be more realistic to configure sampling so that only e.g. 1% of the requests are traced.
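
As a generic illustration of what probabilistic sampling means (this is not Kamon's or Ditto's actual implementation), the sketch below makes the sampling decision once at the root span and lets child spans inherit it, so a trace is either kept whole or dropped whole. The `Span` class, the `start_span` helper and the span names are all made up for this example.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    name: str
    sampled: bool
    parent: Optional["Span"] = None


def start_span(name, parent=None, probability=0.01, rng=random):
    """Hypothetical helper: decide sampling at the root, inherit it below."""
    if parent is None:
        # Root span: keep roughly `probability` of all traces.
        sampled = rng.random() < probability
    else:
        # Child span: follow the decision already made for this trace.
        sampled = parent.sampled
    return Span(name, sampled, parent)


rng = random.Random(42)  # fixed seed so repeated runs behave the same
total = 10_000
kept = 0
for _ in range(total):
    root = start_span("consume-kafka-message", probability=0.01, rng=rng)
    child = start_span("modify-thing", parent=root)
    grandchild = start_span("enforce-policy", parent=child)
    # All spans of one trace share the same decision.
    assert root.sampled == child.sampled == grandchild.sampled
    kept += root.sampled

print(f"kept {kept} of {total} traces")
```

Because the decision is made per trace rather than per span, sampling reduces the volume without producing traces that have missing spans.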

vvasilevbosch commented 1 year ago

I tried increasing the buffer size of the tracing reporter, but there seems to be a bug in the Kamon library. I have raised an issue in their repo: https://github.com/kamon-io/Kamon/issues/1281

Closing this issue.