Netflix / mantis

A platform that makes it easy for developers to build realtime, cost-effective, operations-focused applications
Apache License 2.0

Add a buffering functionality to the SseWorkerConnection #721

Closed · crioux-stripe closed this 3 weeks ago

crioux-stripe commented 3 weeks ago

Context

The SseWorkerConnection uses a DropOperator with a metric name suffixed with withBuffer, set up in the SseWorkerConnectionFunction class. The class takes a bufferSize parameter but does nothing with it! I've been under the illusion that we've been buffering here for darn near a decade!

This change is a pretty straightforward implementation: rebatchRequests sends n requests into the DropOperator and stays ahead of the requests made by the downstream by up to n. RxJava uses a batching implementation inside the operator to achieve this.
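
As a rough sketch of the wiring (the DropOperator constructor and names here are assumptions for illustration, not the exact Mantis API), the chain looks something like:

```java
import rx.Observable;

// A minimal sketch, assuming a DropOperator constructor that takes a
// metric-group name; the actual Mantis signature (and import) may differ.
static Observable<byte[]> connectWithBuffer(Observable<byte[]> sseEvents, int bufferSize) {
    return sseEvents
            .lift(new DropOperator<byte[]>("SseWorkerConnection_withBuffer"))
            // rebatchRequests(n) keeps up to n items requested from the
            // DropOperator ahead of actual downstream demand.
            .rebatchRequests(bufferSize);
}
```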

Users can then set the buffer size via the mantisClient.buffer.size and workerClient.buffer.size properties.
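
Hypothetically, if these are read as plain JVM system properties (the exact wiring may differ), setting them could look like:

```java
// Hypothetical usage: the property names come from this PR; the value 1000
// is an arbitrary example, and system properties being the right mechanism
// is an assumption.
System.setProperty("mantisClient.buffer.size", "1000");
System.setProperty("workerClient.buffer.size", "1000");
```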

github-actions[bot] commented 3 weeks ago

Test Results

615 tests   +1    605 ✅ +1    10 💤 ±0    0 ❌ ±0
142 suites  ±0    142 files  ±0    8m 7s ⏱️ +6s

Results for commit 5756e685. ± Comparison against base commit 7f996f85.

♻️ This comment has been updated with latest results.

crioux-stripe commented 3 weeks ago

On the subject of the pushes for test values: It took me a few cycles to realize that multiple unit tests were sharing the same metrics, so I needed to use a unique ID.
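
For instance, a hedged sketch of the fix (not the actual test code) is to key each test's metrics with a unique ID:

```java
import java.util.UUID;

// Hypothetical test setup: give each unit test its own metric-group name so
// tests running in the same JVM don't read each other's shared metrics.
String metricGroup = "SseWorkerConnectionTest-" + UUID.randomUUID();
```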

I believe the current publish failure is unrelated; it looks like an actor call flaked/timed out on an unrelated piece of code.

crioux-stripe commented 3 weeks ago

By committing a no-op comment change, I confirmed that the CI failure was a flaky, unrelated test.

crioux-stripe commented 3 weeks ago

We have a job internally that uses the same format as JobSource but for our gRPC client. Here's what happened to drops after deploying this change to that source.

[Screenshot 2024-10-31: drop metrics after deploying the change]

Andyz26 commented 3 weeks ago

> On the subject of the pushes for test values: It took me a few cycles to realize that multiple unit tests were sharing the same metrics, so I needed to use a unique ID.
>
> I believe the current publish failure is unrelated; it looks like an actor call flaked/timed out on an unrelated piece of code.

This is very cool! BTW, did you see any latency comparison when the buffer size is set to something higher?

crioux-stripe commented 3 weeks ago

We didn't see much latency impact, mostly because the use case is just a little bursty: we have enough capacity in any given 50ms period, but we may not have the capacity in the 5ms window when the data actually comes in.

We did see some memory usage increase, but the job is also handling a million RPS across three workers, so that was to be expected! We also had memory to spare.