crossbario / crossbar

Crossbar.io - WAMP application router
https://crossbar.io/

High CPU utilization with auto ping enabled #722

Closed DZabavchik closed 8 years ago

DZabavchik commented 8 years ago

When auto ping is enabled on a websocket transport, each instance of WebSocketServerProtocol schedules _sendAutoPing with

self.autoPingPendingCall = txaio.call_later(self.autoPingInterval, self._sendAutoPing)

which in turn schedules onAutoPingTimeout:

self.autoPingTimeoutCall = txaio.call_later(self.autoPingTimeout, self.onAutoPingTimeout)

which is then cancelled 99.9% of the time, when the pong response arrives.

With just 30K connected clients, a CPU core is at 60% utilization from auto ping/pong alone (no messages). Granted, the load is inversely proportional to auto_ping_interval, and increasing the interval to 120s brings it down to 30% CPU utilization. Unfortunately, there is a substantial number of environments where the TCP idle timeout is too short (< 120s), which results in disconnects/reconnects (and there may be other reasons for using short ping intervals).
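
Rough arithmetic: with 30K connections and a 60s ping interval, on average ~500 _sendAutoPing calls fire every second, and each one schedules an onAutoPingTimeout call that is (almost always) cancelled again when the pong arrives, plus the rescheduling of the next ping - on the order of 1,500-2,000 timer operations per second before any application traffic.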

"transports":` [
        {
          "type": "websocket",
          "endpoint": {
            "type": "tcp",
            "port": 8880
          },
          ...
          "options": {
            ...
            "auto_ping_interval": 60000,
            "auto_ping_timeout": 10000,
            "auto_ping_size": 4
          }
        },

Since I started writing up this issue, I have implemented a proof of concept for batched pings. Will post results shortly.

DZabavchik commented 8 years ago

So by batching pings into slots/shards I was able to get CPU utilization down and increase the density of connected clients per core by a factor of ~6.

Same amount of traffic as before. [chart: traffic bytes and packets]

Same memory (top chart), but CPU utilization is 1/6 of the original (where each protocol schedules its own ping and ping-timeout check).

[chart: memory/CPU utilization]

Clients are connecting and registering callees at a steady rate. Once the registrations stop, CPU utilization drops (around 9:50 pm for normal pings and 12:00 am for batched). The 6x figure is for idle connections (pings only, no messages or registrations).

Now I can confidently get 100K nodes per CPU core.

This is just a proof of concept - the protocol factory doesn't feel like the appropriate place for it. So no PR, just a patch for consideration (pardon my Python ignorance, I'm a total Python noob). 0001-Batched-pings-PoC.patch.zip

oberstet commented 8 years ago

Are you running on PyPy? Could you post the output of crossbar version please?

Other reasons to have a relatively short ping interval/timeout: NATs. In particular on mobile networks.

Yet another reason: mobile devices will aggressively put the radio into low-power states quickly, up to near shutdown of the radio. When that happens and an event comes in (for a subscription), the event will be delayed, since the radio first needs to be powered up again. Of course the tradeoff is battery.

DZabavchik commented 8 years ago
[root@connect ~]# crossbar version
     __  __  __  __  __  __      __     __
    /  `|__)/  \/__`/__`|__) /\ |__)  |/  \
    \__,|  \\__/.__/.__/|__)/~~\|  \. |\__/

 Crossbar.io        : 0.13.2
   Autobahn         : 0.13.1 (with JSON, MessagePack, CBOR)
   Twisted          : 16.1.1-EPollReactor
   LMDB             : 0.89/lmdb-0.9.18
   Python           : 2.7.5/CPython
 OS                 : Linux-3.10.0-327.13.1.el7.x86_64-x86_64-with-redhat-7.2-Maipo
 Machine            : x86_64

DZabavchik commented 8 years ago

Even with PyPy, batching increases density ~5x and allows shorter ping intervals (I'd love to do ~30s) - I'll take that any day, no question.

oberstet commented 8 years ago

Could you retry using our binary packages for Crossbar.io? Those have the latest PyPy and everything bundled:

This is just to see the absolute amount of CPU load you get - of course, things don't change qualitatively just by using PyPy.


There are downsides to sending pings at the same time across all connections. I thought about essentially what you propose when I wrote this code, and at the time decided against it.

oberstet commented 8 years ago

E.g. you get bursts of WebSocket pings being sent every N seconds across all connections. This might lead to other issues.

Note: I am not saying there isn't something to be improved here! There is overhead in maintaining timers per connection, and should we determine it's too much, we'll have to find a way around it.

@bbangert I am wondering what Mozilla's experience is with this in https://github.com/mozilla-services/autopush - I know you are using plain AutobahnPython at the WebSocket level, but these timers are the same there.

DZabavchik commented 8 years ago

That was exactly my thought about blasting pings to everyone at once. So instead I'm batching them into slots and sending one batch every second (see patch). The number of slots equals the ping interval in seconds, and each protocol instance falls into a slot depending on when it connected, so there are no spikes of traffic.
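
Roughly, the idea looks like this (a minimal sketch only - the actual PoC patch lives on the WebSocket protocol factory, and the names here are illustrative, not the patch's):

    import os

    class PingSlotScheduler(object):
        """
        Minimal sketch of slot-based ping batching (illustrative only).
        One slot per second of the ping interval; each slot holds the
        protocols that get pinged when that slot comes up.
        """

        def __init__(self, reactor, ping_interval, ping_size=4):
            self.reactor = reactor
            self.ping_size = ping_size
            self.slot_count = int(ping_interval)                   # one slot per second
            self.slots = [set() for _ in range(self.slot_count)]   # protocols per slot
            self.current_slot = 0
            self.reactor.callLater(1.0, self._run_slot)

        def add_protocol(self, proto):
            # a protocol lands in a slot based on when it connects, so pings
            # stay spread evenly over the whole interval (no traffic spikes)
            slot = int(self.reactor.seconds()) % self.slot_count
            self.slots[slot].add(proto)

        def remove_protocol(self, proto):
            for slot in self.slots:
                slot.discard(proto)

        def _run_slot(self):
            # ping only the protocols in the current slot, then advance;
            # per-connection pong-timeout handling is omitted for brevity
            payload = os.urandom(self.ping_size)
            for proto in self.slots[self.current_slot]:
                proto.autoPingPending = payload
                proto.sendPing(payload)
            self.current_slot = (self.current_slot + 1) % self.slot_count
            self.reactor.callLater(1.0, self._run_slot)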

meejah commented 8 years ago

@dzabavchik nice graphs :)

Personally, I kind of like the idea of "buckets" for the pings.. I can see there are some obvious tradeoffs here of course, and it will make bandwidth (at least for pings) somewhat "burst-y". But "on average", if you've got a ton of connections, it seems like we should be able to make something work nicely such that "pings out per second" are approximately the same as before.

Especially if we're a bit careful about what happens within a bucket: if we've got a bucket with 5k pings in it that's expired and ready to send, then when processing those, if we yield to the reactor every 5 or 10 pings it'll still give "other stuff" a chance to run, and hopefully smooth out the bandwidth a bit -- because presumably "ping traffic" shouldn't be the majority of the traffic... We could even put in small timeouts per loop so that we e.g. take several seconds to send out all the pings in a single bucket, further smoothing the bandwidth...

p.s. I haven't read through the patch yet...

meejah commented 8 years ago

So to make my suggestion more concrete, we could probably further reduce "bursty-ness" by making on_auto_ping_run_slot async, and replacing the loop with something like (just for the idea; there are probably bugs in the below ;):

        # (inside an @inlineCallbacks-decorated on_auto_ping_run_slot;
        #  needs: from twisted.internet import task)
        protocols = list(self.ping_slot_protocols)
        for start in range(0, len(protocols), chunksize):
            # send one chunk of pings ...
            for protocol in protocols[start:start + chunksize]:
                protocol.autoPingPending = autopingpending
                protocol.sendPing(autopingpending)
            # ... then yield a Deferred so the reactor can run other work
            yield task.deferLater(self.reactor, 0, lambda: None)

That is, only send chunksize pings at once before yielding to the reactor. Maybe if the slots are narrow enough this isn't even really a concern...

DZabavchik commented 8 years ago

There is definitely room for improvement:

DZabavchik commented 8 years ago

@meejah, as I said, I'm a total Python noob, ignorant of the details of the reactor's inner workings. It appears that combining chunks/yields with multiple slots per second may be the best approach. So a _slot_multiplier would change the total number of slots

self.slotCount = int(self.autoPingInterval) * self._slot_multiplier

And change the scheduling interval

 self.autoPingPendingCall = self.reactor.callLater(1. / self._slot_multiplier, self.on_auto_ping_run_slot)
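
(For example, with autoPingInterval = 60 and a _slot_multiplier of 4, that gives 240 slots, with one slot fired every 250 ms.)
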
DZabavchik commented 8 years ago

@oberstet to answer your question about PyPy. It does provide an improvement of about 30%. I'm wondering if that will be the case when PyPy is combined with batched pings.

[chart: CPython 13.2 non-batched, CPython 13.2 + batched, PyPy non-batched]

Couldn't do chunks because sets are not sliceable. I did add an option to thin-slice pings with a slot multiplier. New patch: 0001-Batched-pings-PoC-with-Slot-Multiplier.patch.zip

DZabavchik commented 8 years ago

Multiplier test results. CPU utilization went up by ~40% with 4x multiplier (250ms slices)

[chart: 1x multiplier vs 4x multiplier]

oberstet commented 8 years ago

I think the combination of batching and chunking (yielding into the reactor) is going in the right direction. It'll make ping traffic less bursty (batching) and it'll let the event loop do other stuff in between (yield).

Rgd CPython vs PyPy: the more CPU work there is to do, the more gain PyPy will bring. PyPy might have a bigger memory footprint - we (Autobahn/Crossbar.io) have some homework to do here (__slots__ etc). I also think I saw a discussion on PyPy IRC where benbangert and fijal were discussing a leak of 30MB over a 24h run in Mozilla's https://github.com/mozilla-services/autopush - which is using AutobahnPython for WebSocket. As far as I know, they (still) run CPython in prod.

Heard on IRC (rgd Mozilla's use of Autobahn): "we're using it at fairly large scale, a mere 600k connections at once, but we're moving up to tens of millions soon"

Now, this autoping optimization (whatever we do exactly) needs to go into AutobahnPython (not Crossbar.io), and because of Mozilla's use above (which, obviously, is pretty cool and high-profile for us), we have to make sure this stuff works at Mozilla's scale - because I think they do use the current autoping as-is.

oberstet commented 8 years ago

@DZabavchik the charts are quite cool and informative! It's late here, but I'll be around tomorrow ..

glyph commented 8 years ago

Rather than implementing this as something specific to pings, it seems to me that this should be an IReactorTime implementation; a callLater that expects lots of low-resolution work and batches it together. This would be a lot more generally useful (and perhaps suitable for inclusion in Twisted itself) if this sort of CPU utilization metrics gaming crops up in other areas.

glyph commented 8 years ago

@oberstet - PyPy actually typically reduces memory footprint. __slots__ are useful for memory utilization only on CPython, as PyPy will introduce phantom classes based on your actual attribute usage, making simple (new-style) classes actually more memory-efficient than anything hand-tuned with __slots__ or custom C code.

oberstet commented 8 years ago

@glyph

What you touch on with a generic, batching/chunking IReactorTime sounds pretty neat. And yes, I expect this (CPU load due to massive use of short-interval timers) to pop up in more places (at the WAMP layer of Autobahn .. e.g. RPC timeouts and such).

The problem (well, ours) with doing it in Twisted: Autobahn transparently supports Twisted and asyncio. We would then need to paper over it here anyway to make it work on asyncio too.

__slots__: ok, I really have to look at all this more deeply. And measure it. Presumably also use vmprof more.

There are some places in Autobahn where we can reduce object dicts considerably (not via __slots__) though .. and that seems good in any case.

Also: @hawkowl was mentioning TWISTED_NEWSTYLE=1 => ++1

She was also mentioning fast tracebacks coming to PyPy 6 .. to optimize Twisted's yield trampoline .. that also sounds quite desirable! Awesome.

glyph commented 8 years ago

We would need to paper over it then anyway here to make it work on asyncio too.

call_later is a roughly analogous API :). It could be a trivially thin wrapper.
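
For reference, the two calls line up closely (a rough side-by-side, just to illustrate the analogy):

    import asyncio
    from twisted.internet import reactor

    def on_timer(msg):
        print(msg)

    # Twisted: returns an IDelayedCall with a .cancel() method
    delayed_call = reactor.callLater(5.0, on_timer, "ping")

    # asyncio: returns an asyncio.TimerHandle, also with a .cancel() method
    loop = asyncio.get_event_loop()
    timer_handle = loop.call_later(5.0, on_timer, "ping")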

meejah commented 8 years ago

If this happened to be added to Twisted, we could still "paper over" in txaio similar to the way Failures work -- in txaio.tx you get a "real" Failure, whereas in txaio.aio you get a fake/wrapper one.

@glyph I'm not very familiar with how callLater actually works, but would you see this as an "option for 'the' reactor", or more like "a helper object that implements IReactorTime" and provides quantized call-laters as needed? E.g. like with Clock, where you'd pass the special IReactorTime-provider in to things you wanted to get quantized/chunked call-later behavior...?

glyph commented 8 years ago

@meejah - A helper object, definitely. A decorator in the design-pattern sense of the word; an IReactorTime that wraps another IReactorTime and quantizes its wake-up interval.
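
A minimal sketch of what such a quantizing wrapper might look like (purely illustrative; not an existing Twisted or txaio class):

    import math

    from zope.interface import implementer
    from twisted.internet.interfaces import IReactorTime

    @implementer(IReactorTime)
    class QuantizedReactorTime(object):
        """
        Decorator (in the design-pattern sense) around another IReactorTime
        that rounds wake-up times up to a fixed granularity, so many nearby
        callLater()s collapse onto the same reactor wake-up.
        """

        def __init__(self, wrapped, granularity=1.0):
            self._wrapped = wrapped
            self._granularity = granularity

        def seconds(self):
            return self._wrapped.seconds()

        def callLater(self, delay, f, *args, **kwargs):
            # round the absolute wake-up time up to the next quantum
            now = self._wrapped.seconds()
            target = math.ceil((now + delay) / self._granularity) * self._granularity
            return self._wrapped.callLater(target - now, f, *args, **kwargs)

        def getDelayedCalls(self):
            return self._wrapped.getDelayedCalls()

A real implementation would presumably also want to coalesce all calls landing in the same quantum onto a single underlying delayed call, so the reactor keeps one timer per bucket rather than one per callLater().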

DZabavchik commented 8 years ago

Didn't mean to be this disruptive. I appreciate all the consideration and attention given to this issue. I like where this is going with the reactor / AutobahnPython changes (as I mentioned in the original post, the changes were not made in an appropriate place; it was just a quick and dirty PoC). Thank you.

Can't wait for v0.14 w/ multi-core + clusterization/federation.

oberstet commented 8 years ago

I like the idea of adding this to txaio. The whole idea of txaio is to accumulate all our "paper over" code anyway ..

Here is a concrete proposal:

We define a txaio.interfaces.IBatchTimer, which then gets implemented in concrete classes in tx and aio flavors. The ctor of a BatchTimer can have knobs to configure the batching/chunking, and it can expose an IBatchTimer.call_later with the same interface as the current txaio.call_later (like here and here).

Then, in Autobahn(Python), we use the above to implement a better autoping. That is, we would add a concrete BatchTimer of the respective flavor to autobahn.twisted.websocket.WebSocketAdapterProtocol and autobahn.asyncio.websocket.WebSocketAdapterProtocol. We could add it as a class variable, like we do with the logger. That is, the batch timer lives there, not on the factories. This would still allow tweaking the batch timer / setting it to something different on a per-protocol-instance basis.
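
To make that a bit more concrete, here is a very rough sketch of what a Twisted-flavor BatchTimer could look like (hypothetical names and structure, not an actual txaio API; the asyncio flavor would mirror it on top of loop.call_later):

    class BatchTimer(object):
        """
        Sketch of a Twisted-flavor batch timer (hypothetical). Calls are
        rounded up into buckets of `bucket_size` seconds and each bucket is
        driven by a single underlying reactor.callLater().
        """

        def __init__(self, reactor, bucket_size=1.0):
            self._reactor = reactor
            self._bucket_size = bucket_size
            self._buckets = {}   # bucket index -> (IDelayedCall, list of pending calls)

        def call_later(self, delay, fun, *args, **kwargs):
            # same shape as txaio.call_later(), but the call may fire up to
            # one bucket later than requested
            now = self._reactor.seconds()
            bucket = int((now + delay) / self._bucket_size) + 1
            if bucket not in self._buckets:
                fire_in = bucket * self._bucket_size - now
                delayed = self._reactor.callLater(fire_in, self._fire, bucket)
                self._buckets[bucket] = (delayed, [])
            calls = self._buckets[bucket][1]
            entry = (fun, args, kwargs)
            calls.append(entry)
            return _BatchedCall(calls, entry)

        def _fire(self, bucket):
            _, calls = self._buckets.pop(bucket)
            for fun, args, kwargs in calls:
                fun(*args, **kwargs)


    class _BatchedCall(object):
        """Cancellation handle mirroring the .cancel() of txaio.call_later()."""

        def __init__(self, calls, entry):
            self._calls = calls
            self._entry = entry

        def cancel(self):
            if self._entry in self._calls:
                self._calls.remove(self._entry)

The per-protocol autoping code would then route its call_later()s through such a timer instead of scheduling one exact timer per connection.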

oberstet commented 8 years ago

@DZabavchik

Can't wait for v0.14 w/ multi-core + clusterization/federation.

Yeah.

Actually, I already had 2 nodes with a router-to-router PoC working (both events and calls) months ago. But it's all a little bit crazy for me currently, as there are so many things happening in parallel with WAMP, Autobahn and Crossbar.io - and we need to tie up some loose ends to not get overrun / stay on top of the change front.

Rgd multi-core, not sure if we already talked about it, but I am confident this will scale. My experiments at the Web level are here: https://github.com/crossbario/crossbarexamples/tree/master/benchmark/web#scaling-up-crossbario---web-services

Note: I cheated a little in the above .. it renders the HTTP response from code rather than serving files - Twisted would need some caching here, otherwise it bogs down on file access. Wouldn't be hard, but .. no time ;) But 600k requests/sec is good enough for 99.99% of people, I guess. Nginx will of course still be faster .. but not that much. PyPy is just incredible.

DZabavchik commented 8 years ago

No doubt it can scale, but routing RPC is a different beast, incomparable to stateless web request/response. Maintaining a global registration map, finding registration-owner nodes, discovering peers, forwarding calls to peers, failover for management nodes ... a huge and fascinating project.

When I was looking for a place to stick the ping-batching prototype into, I did run across NodeManagementBridgeSession and the CDC uplink code. I also tried running two routers with different affinities and a shared port and uplink config (apparently that would work for pub/sub in the current version).

Sorry for going off-topic. There is probably a better place for this discussion.

oberstet commented 8 years ago

Migrated to: