Shopify / statsd-instrument

A StatsD client for Ruby apps. Provides metaprogramming methods to inject StatsD instrumentation into your code.
http://shopify.github.io/statsd-instrument
MIT License

Let the UDP sink use a dispatcher thread #257

Closed methodmissing closed 3 years ago

methodmissing commented 4 years ago

Why?

Although UDP uses a connectionless communication model, the user-space code paths involved in sending a datagram on the producer's thread aren't free.

Using a Queue as the contract with a simple dispatch Thread reduces datagram dispatch for the producer to an array push and a user-space thread interrupt, while preserving FIFO semantics.
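
A minimal sketch of that contract (illustrative class and method names, not the PR's actual implementation): the producer-facing << only pushes onto a Thread::Queue, and a dedicated dispatcher thread owns the socket and performs the writes.

```ruby
require "socket"

# Minimal sketch: the hot path is a Queue#push; a single dispatcher thread owns
# the UDP socket and performs all writes, preserving FIFO order.
class DispatchingUDPSink
  def initialize(host, port)
    @host = host
    @port = port
    @queue = Thread::Queue.new
    @dispatcher = Thread.new { dispatch_loop }
  end

  # Called by the metrics producer: no IO here, just an enqueue that wakes the
  # dispatcher if it is blocked on Queue#pop.
  def <<(datagram)
    @queue << datagram
    self
  end

  private

  def dispatch_loop
    socket = UDPSocket.new
    socket.connect(@host, @port)
    while (datagram = @queue.pop) # returns nil once the queue is closed and drained
      begin
        socket.send(datagram, 0)
      rescue SystemCallError
        # StatsD traffic is best effort: drop the datagram and rebuild the socket.
        socket.close rescue nil
        socket = UDPSocket.new
        socket.connect(@host, @port)
      end
    end
  ensure
    socket&.close
  end
end
```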

Master:

methodmissing:statsd-instrument lourens$ benchmark/send-metrics-to-local-udp-receiver
Warming up --------------------------------------
StatsD metrics to local UDP receiver (branch: master, sha: 07fb158)
                         1.213k i/100ms
Calculating -------------------------------------
StatsD metrics to local UDP receiver (branch: master, sha: 07fb158)
                         11.876k (± 3.9%) i/s -     59.437k in   5.012616s

To compare the performance of this revision against another revision (e.g. master),
check out a different branch and run this benchmark script again.

This PR:

methodmissing:statsd-instrument lourens$ benchmark/send-metrics-to-local-udp-receiver
Warming up --------------------------------------
StatsD metrics to local UDP receiver (branch: dispatch-thread, sha: 3137f7e)
                         4.199k i/100ms
Calculating -------------------------------------
StatsD metrics to local UDP receiver (branch: dispatch-thread, sha: 3137f7e)
                         44.251k (±25.9%) i/s -    201.552k in   5.042619s

Comparison:
StatsD metrics to local UDP receiver (branch: dispatch-thread, sha: 3137f7e):    44251.3 i/s
StatsD metrics to local UDP receiver (branch: master, sha: 07fb158):    11921.7 i/s - 3.71x  (± 0.00) slower

Why not?

Threads 😄 The dispatcher is very thin and cleanly separates IO and socket init + teardown on error from the metrics producer thread. Cleanup is handled with a GC finalizer (which of course only fires on clean exit, but that is true for ObjectSpace finalizers in general). See the throwaway section below.
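
As a rough illustration of the finalizer approach (a sketch, not the PR's code), the finalizer proc is built in a class method so it does not capture self, which would otherwise keep the sink alive and prevent the finalizer from ever running:

```ruby
# Sketch of finalizer-based cleanup for the dispatcher sketch above.
class DispatchingUDPSink
  def self.finalizer(queue, dispatcher)
    proc do |_object_id|
      queue.close        # Queue#pop returns nil once closed and drained, ending the loop
      dispatcher.join(1) # give the dispatcher up to a second to flush what is left
    end
  end

  def initialize(host, port)
    @queue = Thread::Queue.new
    @dispatcher = Thread.new do
      while (datagram = @queue.pop)
        # ... write the datagram to the UDP socket, as in the earlier sketch ...
      end
    end
    ObjectSpace.define_finalizer(self, self.class.finalizer(@queue, @dispatcher))
  end
end
```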

Why this is throwaway

Queues do not work across processes because of a pid check, so this test would always fail: the datagram is enqueued, but num_waiting is reset to 0, and the dispatcher thread effectively never sees the datagram because the queue has no waiters.
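
A small standalone demonstration of the behaviour (hedged; it only shows that the dispatcher thread does not survive fork, so anything enqueued in the child is never popped):

```ruby
queue = Thread::Queue.new
Thread.new do
  while (datagram = queue.pop)
    puts "dispatched: #{datagram}"
  end
end

pid = fork do
  # Only the thread that called fork exists in the child; the dispatcher is gone.
  queue << "child-metric:1|c"
  sleep 0.5
  puts "child still holds #{queue.size} undispatched datagram(s)"
end

Process.wait(pid)
queue << "parent-metric:1|c" # the parent's dispatcher is alive and handles this one
sleep 0.1
```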

wvanbergen commented 4 years ago

Pinged some other reviewers.

I think a future improvement could be to buffer some packets and flush them out as a batch - IIRC the StatsD protocol does support this, but it would require some testing.

casperisfine commented 4 years ago

I love the idea. However, I wonder if it would be worth having a single handler rather than one per UDP sink.

But maybe that's not worth it, as it's unlikely that there will be multiple sink instances at the same time?

csfrancis commented 4 years ago

The Queue instance would still involve a mutex / synchronization point, right? So the perf difference comes from making the syscall async.

No, there's no mutex. The difference is that for MRI, the GVL is preventing multiple threads from running at the same time. The Queue#push will trigger an MRI interrupt, which wakes up a thread that is blocking via Queue#pop: https://github.com/ruby/ruby/blob/115fec062ccf7c6d72c8d5f64b7a5d84c9fb2dd8/thread_sync.c#L785. This all happens in user-space without a syscall.

dylanahsmith commented 4 years ago

No, there's no mutex. [...]. This all happens in user-space without a syscall.

If another thread is blocked on the queue, then it will take a mutex to interrupt the other thread in rb_threadptr_interrupt_common. I would also expect waking up another native thread to require a system call via the unblock function call (th->unblock.func)(th->unblock.arg). So it looks like this only happens entirely in user space when the queue isn't already empty.

I'm concerned the benchmark results aren't representative of real use of this library, since it seems like the performance benefit comes from queuing multiple datagrams consecutively. If datagrams are always queued to an empty queue, then it seems like this could add extra overhead.
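
One hedged way to quantify that concern would be a micro-benchmark of the empty-queue case, comparing a push that has to wake a blocked consumer against sending the datagram on the producer's own thread (illustrative script; it assumes benchmark-ips and a local receiver listening on 127.0.0.1:8125, as in the repo's benchmark):

```ruby
require "benchmark/ips"
require "socket"

DATAGRAM = "events.page_view:1|c"

queue = Thread::Queue.new
Thread.new { loop { queue.pop } } # keep the queue empty so every push wakes the consumer

socket = UDPSocket.new
socket.connect("127.0.0.1", 8125)

Benchmark.ips do |x|
  x.report("Queue#push to an empty queue") { queue << DATAGRAM }
  x.report("UDPSocket#send on the producer thread") { socket.send(DATAGRAM, 0) }
  x.compare!
end
```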

I think a future improvement could be to buffer some packets and flush them out as batch

This seems like it could be a much better approach. For instance, we could buffer datagrams during a unit of work (e.g. web request) and use sendmmsg to flush the buffer at the end of that unit of work or when the buffer gets full in the middle of that unit of work.

wvanbergen commented 4 years ago

The DogStatsD library can buffer metrics in a single UDP datagram by separating them with newlines: https://github.com/DataDog/dogstatsd-ruby/blob/master/lib/datadog/statsd/batch.rb
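
For illustration, a minimal sketch of that batching style (made-up class name; the 1432-byte limit mirrors a commonly used conservative UDP payload size): datagrams are buffered during a unit of work and flushed as one newline-separated packet, either when the buffer would overflow or at the end of the unit of work.

```ruby
require "socket"

class BatchingUDPSink
  MAX_PACKET_SIZE = 1432 # conservative payload size for a single UDP packet

  def initialize(host, port)
    @socket = UDPSocket.new
    @socket.connect(host, port)
    @buffer = []
    @buffer_bytesize = 0
  end

  def <<(datagram)
    flush if @buffer_bytesize + datagram.bytesize + 1 > MAX_PACKET_SIZE
    @buffer << datagram
    @buffer_bytesize += datagram.bytesize + 1 # +1 accounts for the joining newline
    self
  end

  # Call at the end of a unit of work, e.g. from a Rack middleware's ensure block.
  def flush
    return if @buffer.empty?
    @socket.send(@buffer.join("\n"), 0)
    @buffer.clear
    @buffer_bytesize = 0
  end
end
```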

wvanbergen commented 4 years ago

@dylanahsmith If we deploy this branch to production, what metrics should we track to know how much of an impact this change has? Regular profiling?

dylanahsmith commented 4 years ago

I would look at whatever stats we were looking at that led us to do this optimization. It sounds like we found this hotspot through profiling, so I think it makes sense to compare a profile captured after this change to one we captured before. Also, if we believe this was related to a performance regression we noticed and hoped this would resolve, we could check whether it helps with that metric.

Do we plan on reverting this if it has a negligible effect?

casperisfine commented 3 years ago

I took the liberty of rebasing this and adding a few fixes. Notably:

Fork handling

When forking, all threads but the main one are dead in the child process. So if we use a dispatcher thread, it needs to be respawned after fork, either automatically (e.g. by checking Process.pid all the time, or by decorating Process.fork) or by requiring users to call some method.

I went for the latter and introduced a .after_fork method for that purpose. But I'm open to implementing "fork tracking" too. It would be easier to use, but libraries monkey patching core methods are not generally well received.

It's also important to clear the queues after fork, because any message left in them would have been inserted in the parent process, so we don't want to duplicate them.
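
A hedged sketch of what such an .after_fork hook could look like for a dispatcher-thread sink (illustrative names, not the PR's code): the child replaces the queue, dropping the parent's leftover datagrams, and respawns the dispatcher thread.

```ruby
require "socket"

class ThreadedUDPSinkSketch
  def initialize(host, port)
    @host = host
    @port = port
    start_dispatcher
  end

  # To be called by the application in the child process right after forking.
  def after_fork
    start_dispatcher # a fresh queue drops whatever the parent left behind
  end

  def <<(datagram)
    @queue << datagram
    self
  end

  private

  def start_dispatcher
    @queue = Thread::Queue.new
    @dispatcher = Thread.new do
      socket = UDPSocket.new
      socket.connect(@host, @port)
      while (datagram = @queue.pop)
        socket.send(datagram, 0) rescue nil # best effort: drop datagrams on socket errors
      end
    end
  end
end
```

A preforking server would call after_fork from its worker boot hook (e.g. Unicorn's after_fork or Puma's on_worker_boot).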

Make it optional

Instead of changing the existing UDPSink, I added an extra ThreadedUDPSink. I feel like it's better to have it as an option, at least until we experiment with it in production. We can always make it the default later on.

Tests

I refactored the tests a bit to be more reliable, mostly the one about sending events from a signal handler. Since the mutex is eliminated, that's no longer a problem. However, if the forked process exits too quickly after the packet is written, it won't make it to the parent.

@methodmissing @wvanbergen @dylanahsmith what do you think?

casperisfine commented 3 years ago

makes sense to compare the profile from after this to one we captured before.

The problem with this is that the overhead isn't really reduced, just moved to a background thread. I think a profiler would probably report a very similar amount of time spent in the sink.

casperisfine commented 3 years ago

Closing since https://github.com/Shopify/statsd-instrument/pull/280 is pretty much the same idea with added batching.