Closed colmaengus closed 7 years ago
Strange, I just tried to upgrade from CB 16.10.1/AB 0.17.1 to 17.2.1/0.17.2 to see if the problem is still there... I get the same error at shutdown in both cases.
yeah, expected, this is fixed on the master branches .. new releases are coming this week
Great, thank you
I've been some more isolated testing and I'm seeing the above error a lot. It seems that when connections die their registrations aren't always cleaned up. I'm using an auto ping time of 10 seconds so I would expect that after this time everything should be cleaned up. But this is not always the case. I'm using shared registrations and I've seen that the router can try to call a procedure on a socket that is no longer connected and you get the above exception. When this happens the autobahn.js client never completes the call function. So a few quick questions regarding your fix: Will the client will get some kind of notification that the RPC call failed ? If its a shared procedure should the router try the next registered handler ?
With regard to the problem of registrations not being cleaned up, is this a known issue ?
I'm also separately testing a reconnect issue and came across #456. I need to register the shared procedure on reconnect. If there are zombie registrations hanging and a client calls the RPC will this be handled automatically by the router in the fix you mention above ?
I've updated to docker crossbario/crossbar:community-17.2.1 and now I'm seeing lots of : "[Router 14] skip sending msg - transport already closed" messages. They seem to be coming from RPC calls. Is there any way to debug these messages further ? Is there a mis match between the registration list at the active sessions ?
more info on this ... when I get details of the registrations using wamp.registration.list/wamp.registration.get/wamp.registration.list_callees I'm getting more workers than are currently running for a given uri. In one case the extra callee has a null for callee session id. Is this callee still being counted in the round robin RPC routing ? When should registrations with no active callees be removed ?
this issue is closely related to crossbario/autobahn-js/issues/127. In fact it would seem that dangling registrations are the underlying cause of the deferred exceptions referred to in this ticket.
There is guarding code now https://github.com/crossbario/crossbar/blame/master/crossbar/router/router.py#L192
I've found a way to reproduce a dangling registration by having the caller and callee both disconnect mid-way though and rpc. I'm not sure if this is really what is happening in our production environment but at least it shows there is one way to get a dangling registration. In this case the detach handling in the dealer.py is calling invoke.caller._transport.send(reply) which never returns and so the removal of the registration never happens. Changing this to self._router.send(invoke.caller, reply) resolves the issue.
@colmaengus did you actually create a test case that can reproduce the issue?
The situation you describe is like this?
@markope are you actually running into this (or similar) too?
Ah, and to @colmaengus (and @markope ): what crossbar version? a lot of fixes have gone into 17.5.1 .. I won't claim it is bug free though;)
Exactly, except its a manual test case. We are seeing dangling registrations in production but I'm not sure if this is the mechanism by which they occur. I've got a clone of master and I'm using the following to drop in local changes docker run -p 8080:8080 -v /$(pwd)/crossbar:/usr/local/lib/python3.6/site-packages/crossbar crossbario/crossbar:cpy3-17.5.1 --loglevel trace
that docker line dropping in local changes, creative use;) curious: what changes?
I've reopened the issue: we have to investigate ..
Apart from debug logging to figure out what is going on the only change was in detach method of dealer.py
invoke.caller._transport.send(reply) to self._router.send(invoke.caller, reply)
It's not enough to file and issue, but here is what we experience:
While firing onJoin: Traceback (most recent call last):
Failure: autobahn.wamp.exception.ApplicationError: ApplicationError(error=<wamp.error.procedure_already_exists>, args=[u"register for already registered procedure 're.mgmt.000000004a3c584b.docker_pull'"], kwargs={}, enc_algo=None)
Also I can't tell you what autobahn that specific pi is running on (It's in the office and not older than one month I guess). The crossbar is
Crossbar.io : 17.3.1 (Crossbar.io COMMUNITY) Autobahn : 0.18.1 (with JSON, MessagePack, CBOR, UBJSON) Twisted : 17.1.0-EPollReactor LMDB : 0.92/lmdb-0.9.18 Python : 3.6.1/CPython OS : Linux-3.16.0-4-amd64-x86_64-with Machine : x86_64
Not very helpful I know. Maybe worth mentioning, we have
"auto_ping_interval": 10000,
"auto_ping_timeout": 5000,
"auto_ping_size": 4
in our web transport config in the crossbar config.
We see that exact behaviour with remote agents running autobahn.js. Because they are using a single rpc registration with a unique name once the situation occurs the only solution is to reboot crossbar which impacts all users of the system.
I have repeated this with crossbar on master; the line @colmaengus mentions in dealer.py
is indeed the culprit. I'll try a more-reliable test and put in a fix.
Oh and BTW the key here really is both sides disconnecting:
call
So there's kind of two problems: the registration not being cleaned up, and the error from 4 not being logged.
Just "for the record" I made an end-to-end style example/test-case that works like this:
callLater(0, transport.abort)
essentiallycallLater(0.1, transport.abort)
(basically) inside the registered functionThis reliably causes the bug.
@markope from your description of your scenario, I'd say this isn't exactly the issue at hand, but a "works as expected", because of this:
CB will recognize the loss of connectivity to the Pi only after 5s (at most), because of the auto-ping-timeout
setting you use.
Now, if the Pi reconnects faster than within 5s, the old connection will not have been recognized as being broken by CB yet, and hence the original procedure registration will still be around.
Here are the possible solutions I see:
auto-ping-timeout
register
that allows to kick any existing procedure registrationwhat do you think?
yes, that would explain the behaviour. nr 3. sounds really like a common use case. So I guess three would be great. How much effort is this in EUR? Until then I'd explore 4.
rgd 4: http://crossbar.io/docs/Registration-Meta-Events-and-Procedures/#forcefully-removing-a-callee
You need to first use the other meta API procedures to find the original (now dangling) session ID and registration ID to kick the registration - which is why this will be quite some lines of code. And you need to do that for each procedure. The meta API is powerful, and not tailored to this specific use case.
Rgd 3: I agree that this is a) common scenario and b) the proper approach.
From a user code perspective, that would be as easy as:
session.register("com.example.add2", add2, {force_reregister: true};
Plus (if used) respective permission settings for the authrole/URI in the CB node config to allow a client to do such thing.
rgd EUR: I mail you
Thanks for the fix guys. When do you think it will be available in a docker release ?
Regarding your feature proposal option 3 allowing a registration to be kicked would be a nice feature.
Sent from my iPhone
On 14 May 2017, at 09:54, Tobias Oberstein notifications@github.com wrote:
rgd 4: http://crossbar.io/docs/Registration-Meta-Events-and-Procedures/#forcefully-removing-a-callee
You need to first use the other meta API procedures to find the original (now dangling) session ID and registration ID to kick the registration - which is why this will be quite some lines of code. And you need to do that for each procedure. The meta API is powerful, and not tailored to this specific use case.
Rgd 3: I agree that this is a) common scenario and b) the proper approach.
From a user code perspective, that would be as easy as:
session.register("com.example.add2", add2, {force_reregister: true}; Plus (if used) respective permission settings for the authrole/URI in the CB node config to allow a client to do such thing.
rgd EUR: I mail you
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.
@colmaengus the fix of the original issue is shipped in 17.5.1 - and yep, we have Docker images https://github.com/crossbario/crossbar-docker#crossbar
I mean when will a release containing Issue980 dangling regs #1072 be available in a docker?
I used 17.5.1 to reproduce the dangling registration.
Sent from my iPhone
On 17 May 2017, at 19:21, Tobias Oberstein notifications@github.com wrote:
@colmaengus the fix of the original issue is shipped in 17.5.1 - and yep, we have Docker images https://github.com/crossbario/crossbar-docker#crossbar
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.
Ah, ok. Yes, this isn't yet released. Maybe this is a good time to put the automation pieces we have together: run the Makefiles in the crossbar-docker repo on Travis and publish to Docker hub each time we push a tag onto the master branch here. That would basically make releasing zero effort in terms of developer time. Which sounds kinda useful;) Mmh. There is this: https://sedden.github.io/blog/2016/04/11/automated-builds-for-arm-docker-containers-on-intel-hosts/
Would it be possible to get a new release including this bug fix ?
We been using crossbar 16.10.1 since its release last Nov and its been very stable. As soon as we upgraded to 17.2.1 we've been getting unhandled exceptions when handling an RPC from autobahn.js.
I know there is an existing fix to avoid the unhandled exception however I'm more concerned about why the session transport has disappeared.
What is strange is that there are no errors in the autobahn.js side (running in node.js). The call promise just never gets resolved.
As soon as I roll back to 16.10.1 the problem disappears.
The stack trace is