crossbario / autobahn-python

WebSocket and WAMP in Python for Twisted and asyncio
https://crossbar.io/autobahn
MIT License

Twisted issue #9446 causes `ProtocolError` with WAMP RPC and large messages/busy server on Windows #993

Closed asodeur closed 6 years ago

asodeur commented 6 years ago

Edit: The root cause is with Twisted #9446. See end of thread.

One of our applications is calling WAMP procedures with large arguments (GraphQL queries) returning large query results. We are getting regular errors autobahn.wamp.exception.ProtocolError: INVOCATION received for request ID <id> already invoked. We suspect there is an issue when calls/results are split over frames concurrently.

Tried to reproduce this with an isolated test case, but the code below produces autobahn.wamp.exception.ProtocolError: RESULT received for non-pending request ID <id> in more than half of the runs (which ID actually fails varies); at least it originates from the very same function as the exception above. The code works fine when ARG_SIZE is chosen << 1000.
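
For context, the check that raises this is essentially caller-side bookkeeping of pending request IDs. A simplified sketch (not autobahn's actual code) of the invariant being violated:

from autobahn.wamp.exception import ProtocolError

pending = {}  # request_id -> Future awaiting the RESULT

def on_result(request_id, value):
    future = pending.pop(request_id, None)
    if future is None:
        # a RESULT arrived for an ID we never sent, or that was answered already
        raise ProtocolError(
            "RESULT received for non-pending request ID {0}".format(request_id))
    future.set_result(value)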

We are running Python 3.6.0 64bit on Windows 7. Any ideas?

from autobahn.asyncio.component import Component
from autobahn.wamp.types import RegisterOptions
from asyncio import Future, gather, get_event_loop, sleep

component = Component(
    transports=[
        {
            "type": "websocket",
            "url": "ws://127.0.0.1:5000/ws",
            "endpoint": {
                "type": "tcp",
                "host": "127.0.0.1",
                "port": 5000
            }
        }
    ],
    authentication={
      "wampcra": {
        'authid': 'backend',
        'secret': 'super secret'
      }
    },
    realm='default'
)

_last = Future()

@component.register('org.test.echo_next_call', options=RegisterOptions(details_arg='details'))
async def echo_next_call(arg, details):
    """last call will be ready when `echo_next_call` gets called again, interlacing calls and results"""
    global _last

    if not _last.done():
        _last.set_result(arg)
    else:
        # only get here when things went wrong
        pass

    future = Future()
    _last = future

    return await future

ARG_SIZE = 10000

@component.on_ready
async def on_ready(session):
    global _last

    future = gather(*[session.call('org.test.echo_next_call', 'bla' * ARG_SIZE) for _ in range(8)])

    # wait long enough such that we are sure to .set_result on the last future, otherwise future will never finish
    await sleep(1.)

    if not _last.done():
        _last.set_result('done')
    else:
        # we only get here when an exception was raised in gather(...)
        pass

    await future

    session.leave()

if __name__ == '__main__':
    loop = get_event_loop()
    future = component.start(loop)
    loop.run_until_complete(future)

Raises the following exception in about half of all runs:

Traceback (most recent call last):
  File "C:\virtualenvs\weboptions\lib\site-packages\autobahn\wamp\websocket.py", line 95, in onMessage
    self._session.onMessage(msg)
  File "C:\virtualenvs\weboptions\lib\site-packages\autobahn\wamp\protocol.py", line 828, in onMessage
    raise ProtocolError("RESULT received for non-pending request ID {0}".format(msg.request))
autobahn.wamp.exception.ProtocolError: RESULT received for non-pending request ID 2
meejah commented 6 years ago

Not sure if this is relevant here at all, but there was a past instance of a 32-bit number being used for IDs -- if any request IDs were generated from a space that small again, this could be a symptom (i.e. with enough outstanding requests, you start getting occasional collisions).
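
As a back-of-the-envelope check (assuming uniformly random IDs), the birthday bound says collisions in a 32-bit space become likely around sqrt(2^32) ~ 65k outstanding requests, while a 2^53 space is safe at these call counts:

from math import exp

def collision_probability(n, d):
    # standard birthday approximation: P(collision) ~ 1 - exp(-n*(n-1)/(2*d))
    return 1.0 - exp(-n * (n - 1) / (2.0 * d))

print(collision_probability(65000, 2 ** 32))  # ~0.39
print(collision_probability(8, 2 ** 53))      # ~3e-15, i.e. negligible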

asodeur commented 6 years ago

The number of WAMP requests should not be an issue; the example is not doing too many of them (10 method invocations).

I tried to track down whether the issue is on the sending or the receiving side by capturing the packets (see attached; it is actually a Wireshark .pcap file, but GitHub does not like that extension). This run terminated with ProtocolError: RESULT received for non-pending request ID 3.

My knowledge of the protocols is not sufficient to make proper sense of this, but to me it looks like things go wrong between packets 310 and 325 (325 has an unknown Opcode). After that Wireshark seems to get confused and does not properly reassemble the websocket messages. I could not make sense of the raw client->router packets as they are masked.
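
(For anyone repeating this: per RFC 6455 the client XORs each frame payload with a 4-byte masking key, so a post-processing script can unmask captured payloads with the same XOR. A minimal sketch:)

def unmask(masking_key, payload):
    # RFC 6455 section 5.3: payload[i] XOR masking_key[i mod 4]
    return bytes(b ^ masking_key[i % 4] for i, b in enumerate(payload))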

Tried to turn off masking on the client to be able to read the result packets ("apply_mask": False). After that the original ProtocolError no longer appears, but the following error shows up in the Crossbar log sporadically (with apply_mask=True, 4/10 runs fail; with apply_mask=False, no more than 1/10):

Traceback (most recent call last):
  File "c:\virtualenvs\weboptions\lib\site-packages\autobahn\wamp\serializer.py", line 104, in unserialize
    raw_msgs = self._serializer.unserialize(payload)
  File "c:\virtualenvs\weboptions\lib\site-packages\autobahn\wamp\serializer.py", line 239, in unserialize
    return [_loads(data.decode('utf8')) for data in chunks]
  File "c:\virtualenvs\weboptions\lib\site-packages\autobahn\wamp\serializer.py", line 239, in <listcomp>
    return [_loads(data.decode('utf8')) for data in chunks]
  File "c:\virtualenvs\weboptions\lib\site-packages\autobahn\wamp\serializer.py", line 183, in _loads
    return json.loads(s, cls=_WAMPJsonDecoder)
  File "c:\users\r994051\appdata\local\programs\python\python36\Lib\json\__init__.py", line 367, in loads
    return cls(**kw).decode(s)
  File "c:\users\r994051\appdata\local\programs\python\python36\Lib\json\decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "c:\users\r994051\appdata\local\programs\python\python36\Lib\json\decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "c:\virtualenvs\weboptions\lib\site-packages\autobahn\wamp\websocket.py", line 88, in onMessage
    for msg in self._serializer.unserialize(payload, isBinary):
  File "c:\virtualenvs\weboptions\lib\site-packages\autobahn\wamp\serializer.py", line 106, in unserialize
    raise ProtocolError("invalid serialization of WAMP message ({0})".format(e))
autobahn.wamp.exception.ProtocolError: invalid serialization of WAMP message (Expecting value: line 1 column 1 (char 0))

localhost_failed_result_3_pcap.txt
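
For reference, this exact message is what the stdlib json module produces for an empty (or mid-message) string, which points at frames being dropped or mis-reassembled rather than at malformed WAMP messages:

import json

try:
    json.loads('')  # nothing where a JSON value should be
except json.JSONDecodeError as e:
    print(e)  # Expecting value: line 1 column 1 (char 0)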

asodeur commented 6 years ago

By now I have found the logOctets/logFrames options and simplified the example (see below). You can already cause trouble with a simple echo example. Full log files of Crossbar, the client, and a packet capture are attached.

This run went down with a double invocation of request 5019. As can be seen from the Crossbar log, Crossbar tries to send the invocation exactly once (lines 31408, 31409). However, it shows up twice on the wire (packets 2934, 2973) and hence twice on the client (octets in lines 51779, 53833; frames in 51831, 54755).

Am I missing some fundamental configuration? Otherwise this is starting to look like a fairly low-level (Windows-related?) networking issue. How to take it from here?

Test script used:

from argparse import ArgumentParser
parser = ArgumentParser()
parser.add_argument('--use_twisted', action='store_true')
parser.add_argument('--use_rawsocket', action='store_true')
parser.add_argument('--msg_counts_to', type=int, default=10)
parser.add_argument('--num_of_calls', type=int, default=10000)
parser.add_argument('--loglevel', type=str, default='info')
parser.add_argument('--asyncio_use_selector_loop', action='store_true')
parser.add_argument('--address', type=str, default='127.0.0.1')
parser.add_argument('--no_client', action='store_true')
parser.add_argument('--no_server', action='store_true')

args = parser.parse_args()

if not args.use_twisted:
    from asyncio import get_event_loop, ProactorEventLoop, SelectorEventLoop, set_event_loop
    if args.asyncio_use_selector_loop:
        loop = SelectorEventLoop()
    else:
        loop = ProactorEventLoop()
    set_event_loop(loop)

    from autobahn.asyncio.component import Component
else:
    from autobahn.twisted.component import Component, run

from autobahn.wamp.types import RegisterOptions
from logging import basicConfig, DEBUG

from txaio import add_callbacks, gather, start_logging, using_asyncio

start_logging(level=args.loglevel)
basicConfig(level=DEBUG)

if not args.use_rawsocket:
    transports = [
        {
         "type": "websocket",
         "url": "ws://{}:5000/ws".format(args.address),
         "endpoint": {
             "type": "tcp",
             "host": args.address,
             "port": 5000
         },
         "max_retries": 0,
         "options": {
             "maxFramePayloadSize": 1048576,
             "maxMessagePayloadSize": 1048576,
             "autoFragmentSize": 65536,
         }
        }
    ]
else:
    transports = [
         {
             "type": "rawsocket",
             "url": "ws://{}:5000/ws".format(args.address),
             "endpoint": {
                 "type": "tcp",
                 "host": args.address,
                 "port": 5001
             },
             "serializer": "json",
             "max_retries": 0
         }
    ]

component = Component(
    transports=transports,
    authentication={
      "wampcra": {
        'authid': 'backend',
        'secret': 'super secret'
      }
    },
    realm='default'
)

if not args.no_server:
    @component.register('org.test.echo', options=RegisterOptions(details_arg='details'))
    def echo(arg, details):
        return arg

if not args.no_client:
    @component.on_ready
    def on_ready(session):
        future = gather([session.call('org.test.echo', ''.join(str(ii) for ii in range(args.msg_counts_to))) for ii in range(args.num_of_calls)])

        def leave_session(value):
            num_ok_results = sum(isinstance(v, str) for v in value)
            session.leave()
            print("Got {} ok results".format(num_ok_results))
            return value

        add_callbacks(future, leave_session, None)

if using_asyncio:
    print("Running on asyncio.")
    loop.set_debug(True)
    future = component.start(loop)
    loop.run_until_complete(future)
else:
    print("Running on Twisted.")
    run([component])

logs.zip

oberstet commented 6 years ago

to track this down, more info that allows us to bisect the root cause would be good. here are some things that could shed new light:

the latter is a statically built crossbar (a very new version) for windows that I created recently: use it for testing only (it is fine, I promise) ..

using the single-file exe allows us to reduce the number of variables in this game, as we can both easily run the exact same bits of crossbar.

oberstet commented 6 years ago

@meejah I think the above shows that - regardless of what the specific issue in this case is (if any) - we should really expand our CI and integrate what we already have (so it runs automatically)

meejah commented 6 years ago

Thanks for the detailed bug-report @asodeur !

I have been playing with your example code on Debian -- if I turn up the "concurrent outstanding calls" to 1M I start to see some failures (e.g. low-level send()s failing), and some of these do result in suspicious-looking failures at the Python level (although not in the same way as your traces). I see different failures if I use two different components (i.e. one registers the echo RPC, the other calls it).

Have you tried with two components? (Looking at the code, we shouldn't overlap IDs until 2^53 outstanding calls -- but that doesn't necessarily rule that possibility out.)

meejah commented 6 years ago

@oberstet yes, getting more automated testing "actually fully automated" would be really good!

I'm not very familiar with the websocket testsuite; perhaps making it run when github wants it to would be a good excuse :)

meejah commented 6 years ago

@asodeur if you use IRC, please ping me in #autobahn on Freenode (I'm "meejah" there too)

asodeur commented 6 years ago

Been in meetings for most of the day, but this is what I have found so far. Updated the script in my previous comment with switches for most of the things I tried.

We are a Windows shop, but I briefly got my hands on a MacBook. Ran all tests with up to --num_of_calls=100000 and did not see a failure; it looks much more robust than Windows. I am not sure I will get access to a Linux box any time soon.

Back on Windows (Windows 7 Enterprise SP1, everything on localhost, in particular no firewalls etc.) I tried running the client on Twisted, running client and server in different processes, and running over raw sockets (in all possible combinations). The failing call and the message vary from run to run, but 10k calls fell over every single time.

Another thing I tried was stressing a bare-bones Twisted echo server and an autobahn WebSocket echo server. Did not see any issues even with at least ten times the load the WAMP test script creates.

I will keep working through the checklist.

meejah commented 6 years ago

At 100k requests on asyncio + linux, I can get a few weird things to happen if the callee and caller are in the same process (but they go away when they're in different processes). I think there is some underlying problem here and for whatever reason Windows is making it crop up "sooner"? (i.e. under less load).

(The "weird things" are that it looks like the transport goes away prematurely, so some self._transport.send(reply) calls fail causing some del self._invocation[] calls to also fail)

meejah commented 6 years ago

(Interestingly, I can only get the 'weird things' to happen on asyncio, not twisted). Also I have to turn off the ping-timeouts (or make the timeout bigger) using the example-router config.
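
(For reference, a sketch of the client-side knobs involved -- the names below are autobahn's standard WebSocket protocol options, where an interval of 0 disables automatic pings; the crossbar router config has corresponding auto-ping settings:)

transports = [{
    "type": "websocket",
    "url": "ws://127.0.0.1:5000/ws",
    "options": {
        "autoPingInterval": 0,   # 0 disables automatic WebSocket pings
        # or keep pings but give a busy peer more time to answer:
        # "autoPingInterval": 10,
        # "autoPingTimeout": 30,
    },
}]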

meejah commented 6 years ago

Ahhh, okay, so it's not a problem with the calls -- whatever happens when tearing down connections is slightly different in asyncio, I guess; with long enough timeouts both asyncio and twisted succeed for 100k outstanding calls at once, but if an auto-ping takes out (some of) the calls there are some errors in the asyncio case.

So, I still can't get any of the "original bad stuff" from this report to happen (on Linux, anyway).

asodeur commented 6 years ago

I have auto-ping turned off as well. So it looks like Linux and macOS cope pretty well, but Windows is pretty fragile.

The self._transport.send(reply) issue shows up with my test after the first double invocation kills the transport. A quick and dirty fix might be skipping the reply if the transport went away before the success callback gets called (patch attached). Probably a better strategy would be to cancel() the InvocationRequests.
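
The idea of the patch, as a minimal sketch (not the attached diff itself; _transport is an internal ApplicationSession attribute, shown for illustration only):

def send_reply_safely(session, reply):
    # the session's transport is gone once the connection was dropped,
    # e.g. after the first double invocation killed it
    if session._transport is None:
        return  # skip the reply instead of raising on a dead transport
    session._transport.send(reply)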

Another tiny py3 glitch I noticed while running the autobahn-testsuite is an integer division in WebSocketProtocol.

0009-Skipping-Yield-message-if-transport-gets-closed-befo.txt
0064-integer-division-for-py3.txt
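
The py2-to-py3 pitfall behind that glitch, for reference: / returns a float on py3, so length/offset arithmetic needs floor division:

length = 7
print(length / 2)   # 3.5 on py3 (was 3 on py2)
print(length // 2)  # 3 -- what frame-length arithmetic actually wants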

With that, the autobahn-testsuite seems to pass for testeeclient vs fuzzingserver and fuzzingclient vs testeeserver. (I am saying seems because I had to do a quick port to py3, since our py2.7.3 no longer works with the new PyPI, and I updated to autobahn 18.4.2.dev. Not 100% sure I got all that right.)

Using the static crossbar.exe from above did not change results.

If I find some time on Wednesday I'll check whether some of the 9.x.x tests from autobahn-testsuite break down on Windows under sufficient load.

oberstet commented 6 years ago

if you have the testsuite (the fuzzing client part), then you can run that directly against crossbar, as it has a built-in websocket testee endpoint: https://github.com/crossbario/crossbar-examples/tree/master/benchmark/websocket

this is a streaming-capable websocket endpoint, so comparing it to WAMP will be "apples and oranges", but nevertheless: if you can pump big chunks through the websocket tests using endpoint A (testee type) in crossbar while, at the same time, endpoint B (running in the very same crossbar) fails for largish WAMP payloads, that would further narrow down the lines of code that are candidates for the source of the issue.

anyways: yeah, crossbar on windows hasn't been put under load and fire the way it has on *nix ..

asodeur commented 6 years ago

I found a working configuration and have by now tracked the main issue down to Twisted's IOCPReactor.

Crossbar's websocket testee endpoint passed the testsuite after some wrestling with the new config-file format. There is a typo in Autobahn's asyncio rawsocket implementation that causes trouble with messages split across TCP segments (0002-fixed-issue-with-split-messages.txt). After this fix the failures became pretty deterministic in rawsocket mode.
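
The class of code involved, as a generic sketch (not Autobahn's actual implementation): rawsocket frames carry a 4-byte header (1 octet frame type, 3 octets payload length), and the receiver has to buffer until a complete frame has arrived:

import struct

class FrameBuffer(object):
    def __init__(self):
        self._buf = b''

    def feed(self, data):
        """Accumulate raw TCP bytes; yield every complete frame payload."""
        self._buf += data
        while len(self._buf) >= 4:
            frame_type = self._buf[0]
            length = struct.unpack('>I', b'\x00' + self._buf[1:4])[0]
            if len(self._buf) < 4 + length:
                break  # message split across TCP segments; wait for more
            payload = self._buf[4:4 + length]
            self._buf = self._buf[4 + length:]
            yield frame_type, payload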

The issue with Twisted's IOCPReactor seems to be that if you write new data to a saturated TCP connection, all in-flight data gets sent again (iocpreactor.abstract.FileHandle.doWrite can get called again before ._handleWrite has had a chance to update the count of bytes already written). If the router worker is configured to run on the select reactor, everything works fine (I had overlooked that crossbar --reactor=select start is not enough to change the router worker's reactor).
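
A highly simplified sketch of that race (not Twisted's actual code): the offset into the send buffer is only advanced in the completion callback, so a second doWrite issued before the callback runs re-submits bytes that are already in flight:

class FileHandleSketch(object):
    def __init__(self):
        self._buffer = b''
        self._offset = 0  # bytes already handed to the OS

    def write(self, data):
        self._buffer += data
        self.doWrite()

    def doWrite(self):
        # BUG CLASS: if this runs again before _handleWrite advanced
        # self._offset, the same slice is submitted to the OS twice
        pending = self._buffer[self._offset:]
        self._issue_overlapped_send(pending, self._handleWrite)

    def _handleWrite(self, bytes_written):
        # completion callback: only now does the offset move forward
        self._offset += bytes_written

    def _issue_overlapped_send(self, data, callback):
        # stand-in for the asynchronous IOCP WSASend; completion fires later
        pass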

I hope I'll get the time to write up a bug report for the Twisted folks next week. Would it make sense to make select the default reactor on Windows in the meantime?

PS: Would you be interested in the py3/autobahn 18.4.2 port of autobahn-testsuite? Shall I prepare a pull request?

oberstet commented 6 years ago

@asodeur woah! you've been digging deep ;) ok, if the issue is with backpressure in the Twisted IOCP reactor, then yeah, best is to first file an issue on Twisted. But it will (obviously) require digging even deeper. The IOCP reactor is based on the Windows completion port model ("signal when done"), which is quite different from the Unix model ("signal when actionable"), so the handling of backpressure etc. could be tricky. anyways, thanks for all the research work!

btw, I fixed the bug you found in rawsocket/asyncio as part of this https://github.com/crossbario/autobahn-python/compare/59f119272e61%5E...acdb623e3d03#diff-bdbe67d81608026a166c457814e89a2f

rgd using the select reactor on windows: so you patched the code? because it should work with:

CROSSBAR_REACTOR=select crossbar start

on windows in principle. with master. because I restructured the whole reactor loading, in part because of windows work we recently did, and other reasons. if it doesn't work as above on windows, that is a bug we should fix. that being said, the select reactor has limits, at least on the number of file descriptors (255 on windows pseudo-unix, if I remember correctly), and .. the signal handling again .. mmh, no clue how reliable that is. and the worker-to-node-controller communication (runs over pipes). all of these depend on deep-down reactor features in the end.

oberstet commented 6 years ago

Would you be interested in the py3/autobahn 18.4.2 port of autobahn-testsuite

yeah, please submit it! cool. I really cannot promise when I will have time to look deeply into it .. because a lot has happened in the code ;) and there is still the idea to pretty much version-freeze the testsuite (we also have a docker image) to have a stable reference testsuite (even though an imperfect one, of course). but this can be looked at from another angle too ... so actually, having an open PR would allow us to gather feedback on those two views already. so yeah, it would be great to have that work at least saved in a PR!

asodeur commented 6 years ago

Closing this as the root cause is with Twisted #9446. Also renaming the issue to refer to the Twisted issue and marking it as Windows-only.

Using the select reactor is a workaround if the open connection limit is not an issue.

oberstet commented 6 years ago

@asodeur thanks for coming back and filing the twisted issue. I added an issue on our side (even though it's upstream business) to make sure we eventually follow up on this.

one practical idea: did you try to increase the TCP window buffer size manually in windows? eg HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\TcpWindowSize => 16777216. normally this is auto-tuned at run-time and per connection these days, but I wonder if you could work around this, as it is backpressure related ..
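
a hypothetical sketch of setting that value programmatically (needs an elevated shell; TcpWindowSize is a REG_DWORD under the Tcpip parameters key, and modern windows auto-tunes this, so the setting may be ignored):

import winreg

key_path = r"SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"
with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path, 0,
                    winreg.KEY_SET_VALUE) as key:
    winreg.SetValueEx(key, "TcpWindowSize", 0, winreg.REG_DWORD, 16777216)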

asodeur commented 6 years ago

The current iocp reactor implementation will run into trouble as soon as the Windows system call is not writing all data immediately and returning synchronously. I have no clue about Windows internals, but I would be surprised if writing to sockets were synchronous until the TCP window size is exceeded. The good news is I am fairly confident the issue with the iocp reactor has a very easy fix. Already opened a pull request.
