dangling registration - possible race cond

colmaengus commented 7 years ago

We been using crossbar 16.10.1 since its release last Nov and its been very stable. As soon as we upgraded to 17.2.1 we've been getting unhandled exceptions when handling an RPC from autobahn.js.

I know there is an existing fix to avoid the unhandled exception however I'm more concerned about why the session transport has disappeared.

What is strange is that there are no errors in the autobahn.js side (running in node.js). The call promise just never gets resolved.

As soon as I roll back to 16.10.1 the problem disappears.

The stack trace is

3/2/2017 8:26:29 PM2017-03-02T20:26:29+0000 [Router         13] Unhandled error in Deferred:
3/2/2017 8:26:29 PM2017-03-02T20:26:29+0000 [Router         13]
3/2/2017 8:26:29 PMTraceback (most recent call last):
3/2/2017 8:26:29 PM  File "/usr/local/lib/python3.6/site-packages/crossbar/router/router.py", line 192, in process
3/2/2017 8:26:29 PM    self._dealer.processCall(session, msg)
3/2/2017 8:26:29 PM  File "/usr/local/lib/python3.6/site-packages/crossbar/router/dealer.py", line 464, in processCall
3/2/2017 8:26:29 PM    txaio.add_callbacks(d, on_authorize_success, on_authorize_error)
3/2/2017 8:26:29 PM  File "/usr/local/lib/python3.6/site-packages/txaio/tx.py", line 490, in add_callbacks
3/2/2017 8:26:29 PM    future.addCallbacks(callback, errback)
3/2/2017 8:26:29 PM  File "/usr/local/lib/python3.6/site-packages/twisted/internet/defer.py", line 310, in addCallbacks
3/2/2017 8:26:29 PM    self._runCallbacks()
3/2/2017 8:26:29 PM--- <exception caught here> ---
3/2/2017 8:26:29 PM  File "/usr/local/lib/python3.6/site-packages/twisted/internet/defer.py", line 653, in _runCallbacks
3/2/2017 8:26:29 PM    current.result = callback(current.result, *args, **kw)
3/2/2017 8:26:29 PM  File "/usr/local/lib/python3.6/site-packages/crossbar/router/dealer.py", line 447, in on_authorize_success
3/2/2017 8:26:29 PM    self._call(session, call, registration, authorization)
3/2/2017 8:26:29 PM  File "/usr/local/lib/python3.6/site-packages/crossbar/router/dealer.py", line 627, in _call
3/2/2017 8:26:29 PM    self._router.send(callee, invocation)
3/2/2017 8:26:29 PM  File "/usr/local/lib/python3.6/site-packages/crossbar/router/router.py", line 163, in send
3/2/2017 8:26:29 PM    session._transport.send(msg)
3/2/2017 8:26:29 PMbuiltins.AttributeError: 'NoneType' object has no attribute 'send'

lelit commented 7 years ago

Strange, I just tried to upgrade from CB 16.10.1/AB 0.17.1 to 17.2.1/0.17.2 to see if the problem is still there... I get the same error at shutdown in both cases.

oberstet commented 7 years ago

yeah, expected, this is fixed on the master branches .. new releases are coming this week

lelit commented 7 years ago

Great, thank you

colmaengus commented 7 years ago

I've been some more isolated testing and I'm seeing the above error a lot. It seems that when connections die their registrations aren't always cleaned up. I'm using an auto ping time of 10 seconds so I would expect that after this time everything should be cleaned up. But this is not always the case. I'm using shared registrations and I've seen that the router can try to call a procedure on a socket that is no longer connected and you get the above exception. When this happens the autobahn.js client never completes the call function. So a few quick questions regarding your fix: Will the client will get some kind of notification that the RPC call failed ? If its a shared procedure should the router try the next registered handler ?

With regard to the problem of registrations not being cleaned up, is this a known issue ?

colmaengus commented 7 years ago

I'm also separately testing a reconnect issue and came across #456. I need to register the shared procedure on reconnect. If there are zombie registrations hanging and a client calls the RPC will this be handled automatically by the router in the fix you mention above ?

colmaengus commented 7 years ago

I've updated to docker crossbario/crossbar:community-17.2.1 and now I'm seeing lots of : "[Router 14] skip sending msg - transport already closed" messages. They seem to be coming from RPC calls. Is there any way to debug these messages further ? Is there a mis match between the registration list at the active sessions ?

colmaengus commented 7 years ago

more info on this ... when I get details of the registrations using wamp.registration.list/wamp.registration.get/wamp.registration.list_callees I'm getting more workers than are currently running for a given uri. In one case the extra callee has a null for callee session id. Is this callee still being counted in the round robin RPC routing ? When should registrations with no active callees be removed ?

colmaengus commented 7 years ago

this issue is closely related to crossbario/autobahn-js/issues/127. In fact it would seem that dangling registrations are the underlying cause of the deferred exceptions referred to in this ticket.

oberstet commented 7 years ago

There is guarding code now https://github.com/crossbario/crossbar/blame/master/crossbar/router/router.py#L192

colmaengus commented 7 years ago

I've found a way to reproduce a dangling registration by having the caller and callee both disconnect mid-way though and rpc. I'm not sure if this is really what is happening in our production environment but at least it shows there is one way to get a dangling registration. In this case the detach handling in the dealer.py is calling invoke.caller._transport.send(reply) which never returns and so the removal of the registration never happens. Changing this to self._router.send(invoke.caller, reply) resolves the issue.

oberstet commented 7 years ago

@colmaengus did you actually create a test case that can reproduce the issue?

The situation you describe is like this?

caller issues call
crossbar forward invocation to callee
now both caller and callee disconnect (uncleanly or cleanly)
registration is dangling

oberstet commented 7 years ago

@markope are you actually running into this (or similar) too?

oberstet commented 7 years ago

Ah, and to @colmaengus (and @markope ): what crossbar version? a lot of fixes have gone into 17.5.1 .. I won't claim it is bug free though;)

colmaengus commented 7 years ago

Exactly, except its a manual test case. We are seeing dangling registrations in production but I'm not sure if this is the mechanism by which they occur. I've got a clone of master and I'm using the following to drop in local changes docker run -p 8080:8080 -v /$(pwd)/crossbar:/usr/local/lib/python3.6/site-packages/crossbar crossbario/crossbar:cpy3-17.5.1 --loglevel trace

oberstet commented 7 years ago

that docker line dropping in local changes, creative use;) curious: what changes?

I've reopened the issue: we have to investigate ..

colmaengus commented 7 years ago

Apart from debug logging to figure out what is going on the only change was in detach method of dealer.py

invoke.caller._transport.send(reply) to self._router.send(invoke.caller, reply)

markope commented 7 years ago

It's not enough to file and issue, but here is what we experience:

Our pi registers a function at a crossbar router
The pi is unplugged from power
Pi restarts and tries to reregister the same function
crossbar returns an error, that there is a registration in place already

While firing onJoin: Traceback (most recent call last):
Failure: autobahn.wamp.exception.ApplicationError: ApplicationError(error=<wamp.error.procedure_already_exists>, args=[u"register for already registered procedure 're.mgmt.000000004a3c584b.docker_pull'"], kwargs={}, enc_algo=None)

Also I can't tell you what autobahn that specific pi is running on (It's in the office and not older than one month I guess). The crossbar is

Crossbar.io : 17.3.1 (Crossbar.io COMMUNITY) Autobahn : 0.18.1 (with JSON, MessagePack, CBOR, UBJSON) Twisted : 17.1.0-EPollReactor LMDB : 0.92/lmdb-0.9.18 Python : 3.6.1/CPython OS : Linux-3.16.0-4-amd64-x86_64-with Machine : x86_64

Not very helpful I know. Maybe worth mentioning, we have

                  "auto_ping_interval": 10000,
                  "auto_ping_timeout": 5000,
                  "auto_ping_size": 4

in our web transport config in the crossbar config.

colmaengus commented 7 years ago

We see that exact behaviour with remote agents running autobahn.js. Because they are using a single rpc registration with a unique name once the situation occurs the only solution is to reboot crossbar which impacts all users of the system.

meejah commented 7 years ago

I have repeated this with crossbar on master; the line @colmaengus mentions in dealer.py is indeed the culprit. I'll try a more-reliable test and put in a fix.

meejah commented 7 years ago

Oh and BTW the key here really is both sides disconnecting:

the caller disconnects during a call
the router is trying to send the "in-flight disconnect" error back to the caller
...but the caller meantime has disconnected
...and so when it tries to send the "in-flight disconnect" there's an eaten error (transport is None) and the rest of the method doesn't run (the part that cleans up the reg).

So there's kind of two problems: the registration not being cleaned up, and the error from 4 not being logged.

meejah commented 7 years ago

Just "for the record" I made an end-to-end style example/test-case that works like this:

session0 registers a function
session1 calls the function above and does callLater(0, transport.abort) essentially
the handler for the session0 function does callLater(0.1, transport.abort) (basically) inside the registered function
the handler in session0 waits 0.2 seconds and returns a result (so, transport should be aborted first)

This reliably causes the bug.

oberstet commented 7 years ago

@markope from your description of your scenario, I'd say this isn't exactly the issue at hand, but a "works as expected", because of this:

CB will recognize the loss of connectivity to the Pi only after 5s (at most), because of the auto-ping-timeout setting you use.

Now, if the Pi reconnects faster than within 5s, the old connection will not have been recognized as being broken by CB yet, and hence the original procedure registration will still be around.

Here are the possible solutions I see:

decrease the auto-ping-timeout
increase the minimum/base auto-reconnect delay
have an option for register that allows to kick any existing procedure registration
have the client call the CB meta API to explicitly kick any existing procedure registration before (re-)registering

is implemented, but probably doesn't cut it, because a client (by default) will "immediately" try to reconnect
isn't exposed currently (the reconnection parameters, including the minimal base delay) are hard-coded (https://github.com/crossbario/autobahn-python/blob/master/autobahn/twisted/wamp.py#L331)
isn't there, but would be easy to add
is there, but requires additional app code, for a probably very common use case.

what do you think?

oberstet commented 7 years ago

rgd 2.: https://github.com/crossbario/autobahn-python/issues/588 rgd 3.: https://github.com/crossbario/crossbar/issues/1075

markope commented 7 years ago

yes, that would explain the behaviour. nr 3. sounds really like a common use case. So I guess three would be great. How much effort is this in EUR? Until then I'd explore 4.

oberstet commented 7 years ago

rgd 4: http://crossbar.io/docs/Registration-Meta-Events-and-Procedures/#forcefully-removing-a-callee

You need to first use the other meta API procedures to find the original (now dangling) session ID and registration ID to kick the registration - which is why this will be quite some lines of code. And you need to do that for each procedure. The meta API is powerful, and not tailored to this specific use case.

Rgd 3: I agree that this is a) common scenario and b) the proper approach.

From a user code perspective, that would be as easy as:

session.register("com.example.add2", add2, {force_reregister: true};

Plus (if used) respective permission settings for the authrole/URI in the CB node config to allow a client to do such thing.

rgd EUR: I mail you

colmaengus commented 7 years ago

Thanks for the fix guys. When do you think it will be available in a docker release ?

Regarding your feature proposal option 3 allowing a registration to be kicked would be a nice feature.

Sent from my iPhone

On 14 May 2017, at 09:54, Tobias Oberstein notifications@github.com wrote:

rgd 4: http://crossbar.io/docs/Registration-Meta-Events-and-Procedures/#forcefully-removing-a-callee

You need to first use the other meta API procedures to find the original (now dangling) session ID and registration ID to kick the registration - which is why this will be quite some lines of code. And you need to do that for each procedure. The meta API is powerful, and not tailored to this specific use case.

Rgd 3: I agree that this is a) common scenario and b) the proper approach.

From a user code perspective, that would be as easy as:

session.register("com.example.add2", add2, {force_reregister: true}; Plus (if used) respective permission settings for the authrole/URI in the CB node config to allow a client to do such thing.

rgd EUR: I mail you

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

oberstet commented 7 years ago

@colmaengus the fix of the original issue is shipped in 17.5.1 - and yep, we have Docker images https://github.com/crossbario/crossbar-docker#crossbar

colmaengus commented 7 years ago

I mean when will a release containing Issue980 dangling regs #1072 be available in a docker?

I used 17.5.1 to reproduce the dangling registration.

Sent from my iPhone

On 17 May 2017, at 19:21, Tobias Oberstein notifications@github.com wrote:

@colmaengus the fix of the original issue is shipped in 17.5.1 - and yep, we have Docker images https://github.com/crossbario/crossbar-docker#crossbar

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

oberstet commented 7 years ago

Ah, ok. Yes, this isn't yet released. Maybe this is a good time to put the automation pieces we have together: run the Makefiles in the crossbar-docker repo on Travis and publish to Docker hub each time we push a tag onto the master branch here. That would basically make releasing zero effort in terms of developer time. Which sounds kinda useful;) Mmh. There is this: https://sedden.github.io/blog/2016/04/11/automated-builds-for-arm-docker-containers-on-intel-hosts/

colmaengus commented 7 years ago

Would it be possible to get a new release including this bug fix ?

crossbario / crossbar

dangling registration - possible race cond #980