Detect and acknowledge inter-worker imbalances

oschaaf commented 5 years ago

It would be nice to add a feature that makes NH (Nighthawk) highlight worker-local latencies/counters in the output when they diverge significantly from the global/aggregated view.

/cc @jmarantz @htuch

mum4k commented 4 years ago

@oschaaf could you expand the description a bit to help me understand the issue and its scope?

oschaaf commented 4 years ago

It’s not super uncommon that worker x gets lucky and observes better latencies than worker y. When the difference is significant we may want to call that out. It can imply a noisy run, but when that is not the case it is an interesting piece of information. We already report per-worker results in our proto/json output, so this would be a UX thing for the CLI, I guess. And maybe a log warning.

A next step would be to also do this per connection, which is a little more targeted but is more work. An imbalance might arise from an unfair distribution of capacity at the connection level over at our test target. Possibly just using a single connection per worker with lots of threads/workers may suffice to check this scenario.
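To make the per-worker idea concrete, here is a minimal sketch of what such a check could look like. It is not Nighthawk code: the function name `find_divergent_workers`, the input shape (a mapping from worker name to mean latency in microseconds), and the 25% threshold are all assumptions for illustration. It also uses the median across workers as a stand-in for the global/aggregated value, which is one possible choice, not necessarily the right one.

```python
from statistics import median


def find_divergent_workers(worker_latencies_us, rel_threshold=0.25):
    """Return {worker: relative deviation} for workers that stand out.

    worker_latencies_us: hypothetical mapping of worker name -> mean latency (us).
    rel_threshold: hypothetical cutoff; 0.25 means "more than 25% away".
    The baseline is the median across workers (an assumed stand-in for the
    aggregated view), so a single outlier does not drag the baseline with it.
    """
    if len(worker_latencies_us) < 2:
        # With fewer than two workers there is nothing to compare against.
        return {}
    baseline = median(worker_latencies_us.values())
    if baseline <= 0:
        # Avoid dividing by zero on empty or degenerate measurements.
        return {}
    flagged = {}
    for worker, latency in worker_latencies_us.items():
        deviation = (latency - baseline) / baseline
        if abs(deviation) > rel_threshold:
            flagged[worker] = deviation
    return flagged


if __name__ == "__main__":
    # Made-up sample numbers, only to exercise the function.
    sample = {
        "worker_0": 510.0,
        "worker_1": 495.0,
        "worker_2": 900.0,
        "worker_3": 505.0,
    }
    for worker, deviation in find_divergent_workers(sample).items():
        print(f"warning: {worker} mean latency deviates {deviation:+.0%} "
              "from the per-worker median")
```

With one connection per worker, the same check would effectively run per connection. A real implementation would probably want to compare percentiles rather than means, and to distinguish a noisy run from a genuine imbalance.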

mum4k commented 4 years ago

Thank you. On the topic of doing this per connection: you are suggesting using a single connection per worker. How do workers utilize connections today? I don't fully follow the second imbalance you described.

oschaaf commented 4 years ago

Sorry, that was not super clear. Let me attempt to clarify:

We have one connection pool per worker, and we also report statistics per worker. So if one configures each worker to use a single connection, we effectively report statistics per connection as well.

htuch commented 4 years ago

@oschaaf agreed, per-worker stats and some outlier detection would be nice to have.

envoyproxy / nighthawk

Detect and acknowledge inter-worker imbalances #142