sametmax opened 9 years ago

I have this code:

*(snippet omitted)*

And it works: I can read the event on all the clients.

Now if I do:

*(snippet omitted)*

I don't get any event on any client. Is it too fast, flooding the router? In that case, shouldn't we find a way to handle this by processing the publishes batch by batch?
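From the discussion below, the failing variant was evidently an infinite publish loop; a reconstruction for illustration only (class name, topic and payload are made up):

```python
from autobahn.twisted.wamp import ApplicationSession
from twisted.internet.defer import inlineCallbacks

class MyComponent(ApplicationSession):

    @inlineCallbacks
    def onJoin(self, details):
        # Infinite publish loop: no events ever reach the clients.
        while True:
            yield self.publish('com.example.topic', 'hello')
```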
I think the issue here is that `self.publish` doesn't actually return a Deferred unless you ask for an acknowledgement via `PublishOptions`. So what's happening is that `inlineCallbacks` keeps going on each `yield`, as there's nothing to actually wait for ... however, no data ever goes out on the wire until the function exits (see below).
So, if you don't need acks, you could use an idiom like this to get 100-at-a-time batches:
```python
from twisted.internet import reactor

def do_publishes(self):
    # Publish a batch, then return to the reactor so the buffered
    # bytes actually get flushed before the next batch is scheduled.
    for _ in range(100):
        self.publish('foo', 'bar')
    reactor.callLater(0, self.do_publishes)
```
Alternatively, if you do ask for acks but still want to do "batches", you'd want to use a `DeferredList` and choose a chunk size in an inner loop, something like:
```python
from autobahn.wamp.types import PublishOptions
from twisted.internet.defer import DeferredList

while True:
    pubs = []
    for x in range(10):
        pubs.append(self.publish('foo', 'bar', options=PublishOptions(acknowledge=True)))
    # Wait until the router has acknowledged the whole chunk.
    yield DeferredList(pubs)
```
I did try making `publish` return an already-succeeded Deferred (instead of None), but that exhibits the same behavior (i.e. nothing gets sent). It may be worth noting that in my test I had the publisher and subscriber running in the same container (over loopback), but I also tried them running in separate processes... Interestingly, in the infinite-loop case there's actually no data going out on the wire at all, which leads me to suspect perhaps a Twisted issue ...
AutobahnPython could solve this by, basically, doing the `callLater` trick itself -- that is, always return a Deferred and, if no acknowledgement was requested, do a `callLater(0, ...)` to resolve it. While this would work for use-cases like the above, it would cause a lot more event-scheduling to happen in all (most) `publish()` use-cases, and I'm not sure that's worth it ... @oberstet what do you think?
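A rough sketch of that idea (hypothetical code, not the actual Autobahn internals; `_publish_without_ack` stands in for whatever the current fire-and-forget path is):

```python
from twisted.internet import reactor
from twisted.internet.defer import Deferred

def publish(self, topic, *args, **kwargs):
    # Sketch: always hand back a Deferred, even without acknowledgement.
    self._publish_without_ack(topic, *args, **kwargs)  # assumed existing path
    d = Deferred()
    # Fire on the next reactor iteration, so a `yield` on this Deferred
    # actually takes a trip through the event loop.
    reactor.callLater(0, d.callback, None)
    return d
```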
In any case, I think it's worth understanding what Twisted is doing here (and either change Autobahn's code or documentation).
Thanks for this very insightful answer.
I do think it's an important issue to fix. I tried to demonstrate crossbar to a friend working in a big American Python shop and basically ended up making a fool of myself, spending half an hour trying to fix a non-existent bug.
Given that error reporting with autobahn is lacking, at best, it made the experience quite frustrating because I couldn't find what I was doing wrong. Combined with the usual gotchas (not calling connect() in JS, or forgetting that publishers don't receive their own publications by default), I had nothing working and no clue why for a loooong time.
Now, as a crossbar fan, I'm willing to encounter these problems and deal with them, but somebody trying it on their own will just throw the product in the trash bin.
I agree that autobahn "fails to log/re-raise" too many errors. The logging situation will be getting better soon.
As to the issue in this ticket: I tried similar code in a bare Twisted protocol sending some bytes and saw the same behavior -- everything is getting buffered in userspace and no bytes go out until the loop exits. So it does indeed seem to be an issue with Twisted.
The `inlineCallbacks` isn't really doing anything here, since the Deferreds are always already completed (so it keeps going). Hence, if we follow this down, you're essentially just calling `transport.write()` in an infinite loop, and Twisted never interrupts this, since that would mean every single `.write()` call might jump out of your code for a trip down the reactor loop for some OS write syscalls (and they don't do that on philosophical grounds). (Thanks to @MostAwesomeDude in #twisted for the help with this.)
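For reference, a bare-Twisted reproduction of that buffering behavior might look like this (illustrative only; host, port and payload are made up):

```python
from twisted.internet import reactor
from twisted.internet.protocol import Protocol, ClientFactory

class Flooder(Protocol):
    def connectionMade(self):
        # write() only appends to Twisted's userspace buffer; since we
        # never return control to the reactor, nothing is ever flushed
        # to the socket.
        while True:
            self.transport.write(b'x' * 1024)

factory = ClientFactory()
factory.protocol = Flooder
reactor.connectTCP('127.0.0.1', 9000, factory)
reactor.run()
```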
I'm not sure what your "real" use-case is here, but you'll have to either do the `callLater` trick I posted above or use a producer/consumer sort of flow (sketched below). I'm not really sure there's anything sensible we can do in Autobahn's `publish`, as we'd either have to always schedule a `callLater` (with the obvious overhead of, probably, growing the scheduled-call list and inducing more syscalls) or do something more complex in (subclasses of) the underlying Twisted transports (to force Twisted to write out data if our buffer has gotten "too full").
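The producer/consumer flow would look roughly like this (a sketch only, not an Autobahn API; `PublishProducer` and its wiring are hypothetical):

```python
from zope.interface import implementer
from twisted.internet.interfaces import IPushProducer

@implementer(IPushProducer)
class PublishProducer(object):
    # Hypothetical helper: lets the TCP transport throttle our publishing.
    def __init__(self, session, transport):
        self._session = session
        self._paused = False
        transport.registerProducer(self, True)  # streaming (push) producer

    def resumeProducing(self):
        # Called by the transport when its buffer has drained; publish
        # until pauseProducing() is invoked (re-entrantly, from write()).
        # Call this once yourself to kick things off.
        self._paused = False
        while not self._paused:
            self._session.publish('foo', 'bar')

    def pauseProducing(self):
        # Called by the transport when its buffer gets "too full".
        self._paused = True

    def stopProducing(self):
        self._paused = True
```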
That latter (transport-level) idea isn't completely bad: there are cases besides "infinite loop" where we'll end up with lots of stuff buffered in Python. However, this needs more consideration and concrete use-cases, I think, since there are trade-offs here as well between buffering-in-Python and how many syscalls we're making. For maximum throughput you'd want fewer syscalls with more data per call, but that hurts latency; for low latency you're happy to make many syscalls, but of course that hurts throughput. It's hard to imagine how Autobahn could ever get this right for everyone, so it would need some kind of configuration/knob to control...
For now, this trade-off gets to be made by the application developer: in the `do_publishes` snippet from my earlier comment, if you make the loop go 1000 times you'll get higher throughput (but more latency), since you'll be making OS-level `write` calls with huge buffers; whereas if you make it 10 or 1 you'll get lower latency (i.e. every `publish` will (probably) induce a syscall unless the system is overloaded), but your maximum throughput will be lower as your process will context-switch more often.
I've learned more from this ticket than during the last month of reading tutorials, thanks.
The infinite loop is rarely useful in a production situation, but it's often used in development, tests or demos to flood all your clients with data. I'm not the only one who's going to do this, and because it's so tempting to do it to show off how quick crossbar is, it will be used in front of colleagues, friends or business people.
So even if it's not an important technical issue (we could just document it and leave it as is), it's an important PR issue.
As already discussed, with standard publish (no ACK) it'll just blow up your memory: this is how Twisted works (and asyncio) without a producer/consumer pattern or flow-control. With acknowledged publications, the flow-control is essentially done by the peer (the router): it sends the ACK.
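In other words, with acknowledged publications even the infinite loop behaves, because each iteration actually waits. A minimal sketch (topic and payload are made up):

```python
from autobahn.wamp.types import PublishOptions
from twisted.internet.defer import inlineCallbacks

@inlineCallbacks
def flood(self):
    while True:
        # Each publish waits for the router's ACK, so the reactor gets
        # control back on every iteration and buffered bytes go out.
        yield self.publish('foo', 'bar', options=PublishOptions(acknowledge=True))
```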
slightly related: https://github.com/tavendo/AutobahnPython/tree/write_coalescing
@sametmax so what should we do? Better docs, I guess ;)
Pretty much :)