Closed methodmissing closed 3 years ago
Pinged some other reviewers.
I think a future improvement could be to buffer some packets and flush them out as a batch; IIRC the StatsD protocol does support this, but it would require some testing.
I love the idea. However, I wonder if it would be worth having a single handler rather than one per UDP sink.
But maybe that's not worth it, as it's unlikely to have multiple sink instances at the same time?
The Queue instance would still involve a mutex / synchronization point, right? So the perf difference comes from making the syscall async.
No, there's no mutex. The difference is that for MRI, the GVL prevents multiple threads from running at the same time. The `Queue#push` will trigger an MRI interrupt, which wakes up a thread that is blocking in `Queue#pop`: https://github.com/ruby/ruby/blob/115fec062ccf7c6d72c8d5f64b7a5d84c9fb2dd8/thread_sync.c#L785. This all happens in user space without a syscall.
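The pattern being described can be sketched as follows. This is an illustrative example, not the PR's actual code: the producer only does a `Queue#push`, while a dispatcher thread blocks in `Queue#pop` and performs the UDP write, moving the syscall off the hot path.

```ruby
require "socket"

# Stand-in for a local StatsD agent, so the sketch is self-contained.
receiver = UDPSocket.new
receiver.bind("127.0.0.1", 0)
port = receiver.addr[1]

queue = Queue.new

dispatcher = Thread.new do
  socket = UDPSocket.new
  socket.connect("127.0.0.1", port)
  # Blocks in user space until woken by a push from the producer.
  while (datagram = queue.pop)
    socket.send(datagram, 0) # the syscall happens on this thread
  end
end

queue.push("pageviews:1|c") # producer cost: array push + thread interrupt
queue.push(nil)             # sentinel: shut the dispatcher down
dispatcher.join

payload, = receiver.recvfrom(512)
```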
No, there's no mutex. [...]. This all happens in user-space without a syscall.
If another thread is blocked on the queue, then it will take a mutex to interrupt that thread in `rb_threadptr_interrupt_common`. Also, I would expect that waking up another native thread requires a system call in the unblock function call `(th->unblock.func)(th->unblock.arg)`. So it looks like this only all happens in user space when the queue isn't already empty.
I'm concerned the benchmark results aren't representative of real use of this library, since it seems like the performance benefit comes from queuing multiple datagrams consecutively. If datagrams are always queued to an empty queue, then it seems like this could add extra overhead.
I think a future improvement could be to buffer some packets and flush them out as batch
This seems like it could be a much better approach. For instance, we could buffer datagrams during a unit of work (e.g. web request) and use sendmmsg to flush the buffer at the end of that unit of work or when the buffer gets full in the middle of that unit of work.
The DogStatsD library can buffer metrics in a single UDP datagram by separating them with newlines: https://github.com/DataDog/dogstatsd-ruby/blob/master/lib/datadog/statsd/batch.rb
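A minimal sketch of that newline-batching approach, assuming a class and payload limit of my own invention (the name `BatchingSink` and the 1432-byte cap are illustrative, not from either library): metrics are appended to a buffer during a unit of work and flushed as a single datagram when the buffer fills or the unit of work ends.

```ruby
require "socket"

class BatchingSink
  # Keep the datagram under a typical Ethernet-MTU-sized payload.
  MAX_PAYLOAD = 1432

  def initialize(host, port)
    @socket = UDPSocket.new
    @socket.connect(host, port)
    @buffer = +""
  end

  # Append a metric; flush first if it wouldn't fit in the current datagram.
  def <<(datagram)
    flush if @buffer.bytesize + datagram.bytesize + 1 > MAX_PAYLOAD
    @buffer << "\n" unless @buffer.empty?
    @buffer << datagram
    self
  end

  # Send all buffered metrics as one newline-separated datagram.
  def flush
    return if @buffer.empty?
    @socket.send(@buffer, 0)
    @buffer = +""
  end
end
```

The `sendmmsg` idea would go further by flushing multiple datagrams in one syscall, but Ruby's stdlib doesn't expose it directly, so the single-datagram newline batching above is the more accessible variant.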
@dylanahsmith If we deploy this branch to production, what metrics should we track to know how much of an impact this change has? Regular profiling?
I would look at whatever stats we were looking at that led us to do this optimization. It sounds like we found this hotspot through profiling, so I think it makes sense to compare the profile from after this to one we captured before. Also, if we believe this was related to a performance regression that we noticed and we were hoping this would help resolve, then we could see if this helps with that metric.
Do we plan on reverting this if it has a negligible effect?
I took the liberty to rebase this and to add a few fixes. Notably:
When forking, all threads but the main one are dead in the child process. So if we use a dispatcher thread, it needs to be respawned after fork, either automatically (e.g. by checking `Process.pid` all the time, or by decorating `Process.fork`) or by requiring users to call some method.
I went for the latter and introduced an `.after_fork` method for that purpose. But I'm open to implementing "fork tracking" too. It would be easier to use, but libraries monkey-patching core methods are not generally well received.
It's also important to clear the queues after fork, because any message left in them would have been inserted in the parent process, so we don't want to duplicate them.
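The after-fork handling described above can be sketched like this (the class name and internals are hypothetical, not this PR's exact API): clear the inherited queue so datagrams enqueued by the parent aren't duplicated, and respawn the dispatcher thread, which does not survive the fork.

```ruby
class ThreadedSink
  def initialize
    @queue = Queue.new
    spawn_dispatcher
  end

  # Intended to be called in the child process after Process.fork.
  def after_fork
    @queue.clear     # messages enqueued by the parent must not be re-sent
    spawn_dispatcher # the dispatcher thread is dead in the child
  end

  private

  def spawn_dispatcher
    @dispatcher = Thread.new do
      while (datagram = @queue.pop)
        # write the datagram to the UDP socket here
      end
    end
  end
end
```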
Instead of changing the existing `UDPSink`, I added an extra `ThreadedUDPSink`. I feel it's better to have it as an option, at least until we experiment with it in production. We can always make it the default later on.
I refactored the tests a bit to be more reliable, mostly the one about sending events from a signal handler. Since the mutex is eliminated, that's no longer a problem. However, if the forked process exits too quickly after the packet is written, it won't make it to the parent.
@methodmissing @wvanbergen @dylanahsmith what do you think?
makes sense to compare the profile from after this to one we captured before.
The problem with this is that the overhead isn't really reduced, just moved into a background thread. I think a profiler would probably report a very similar time spent in the sink.
Closing since https://github.com/Shopify/statsd-instrument/pull/280 is pretty much the same idea with added batching.
Why?
Although UDP uses a connectionless communication model, the following user-space code paths aren't free:
Using a `Queue` as the contract with a simple dispatch Thread reduces datagram dispatch for the producer to an array push and a user-space thread interrupt, and preserves FIFO semantics.

Master:

This PR:
Why not?
Threads 😄 The dispatcher is very thin and cleanly separates IO, and socket init + teardown on error, from the metrics-producing thread. Cleanup is handled with a GC finalizer (which of course only fires on clean exit, but this is true for ObjectSpace in general too). See the throwaway section below.
Why this is throwaway
Queues do not work across processes because of a pid check, so this test would always fail: the datagram is enqueued, but `num_waiting` is reset to 0, so the dispatcher thread effectively never sees the datagram because the queue has no waiters.