crossbario / crossbar

Crossbar.io - WAMP application router
https://crossbar.io/

Subscriptions And Registration Not Being Forwarded Over RLink On Reconnect Or Late Join #2079

Open Skully17 opened 1 year ago

Skully17 commented 1 year ago

Hello,

I have a system that uses a local Crossbar router which connects, using RLink, to a remote Crossbar router in the cloud. The remote router needs to be able to disconnect and reconnect at any time, and to be able to connect for the first time some while after the local router has booted. I tested these scenarios and found that WAMP subscriptions and registrations on the remote side did not work after a reconnect or after a late connect.

After poking around at the code, all in rlink.py, I found two issues:

With the first issue, here is what I think happens. When a subscription is made, on_subscription_create is called and the sub's ID is stored in the _subs dictionary. The actual subscription is only made, and stored in this dictionary, if there is an active RLink connection. When an RLink connection is established, for example after a temporary disconnection, forward_current_subs is called, which calls on_subscription_create for every existing subscription. A check in on_subscription_create then stops the creation if the sub ID already exists in _subs. This means that, even though the sub was never made in the first place because the remote router was not connected, the code assumes the subscription has already been made and doesn't try to create it. The same problem affects registrations.
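To make the sequence above concrete, here is a minimal toy model of it. This is not the real rlink.py code: the class `ToyForwarder`, the `strict_guard` flag, the `forwarded` list and the `late_join` helper are all made up for illustration, and the relaxed guard is only one possible way to address the problem, not necessarily the fix used in practice.

```python
class ToyForwarder:
    """Toy stand-in for the forwarding logic described above (hypothetical)."""

    def __init__(self, strict_guard=True):
        self.connected = False            # is the RLink connection up?
        self.strict_guard = strict_guard  # True reproduces the described check
        self._subs = {}                   # sub ID -> forwarded sub, or None
        self.forwarded = []               # sub IDs actually created remotely

    def on_subscription_create(self, sub_id):
        if self.strict_guard:
            # Described behaviour: any known ID counts as "already forwarded".
            already_done = sub_id in self._subs
        else:
            # Assumed alternative: only skip IDs that were really forwarded.
            already_done = self._subs.get(sub_id) is not None
        if already_done:
            return
        self._subs[sub_id] = None         # the ID is recorded even when offline
        if self.connected:
            self._subs[sub_id] = f"remote-sub-{sub_id}"
            self.forwarded.append(sub_id)

    def forward_current_subs(self):
        # The toy replays the IDs it already knows about; the real code would
        # obtain the list of existing subscriptions from the router.
        for sub_id in list(self._subs):
            self.on_subscription_create(sub_id)


def late_join(strict_guard):
    link = ToyForwarder(strict_guard)
    link.on_subscription_create(101)  # subscription made while offline
    link.connected = True             # remote router connects later
    link.forward_current_subs()
    return link.forwarded


print(late_join(True))   # [] : the sub is never forwarded
print(late_join(False))  # [101]
```

With the strict guard, the subscription made while offline is silently skipped on connect, which matches the behaviour described in this issue.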

The second issue is much simpler. When the remote router disconnects, the subscriptions aren't removed from the _subs dictionary. When the remote router reconnects, no subscriptions are forwarded to it because they already exist in _subs. This problem was fixed for registrations by adding an on_leave handler that just sets _regs to an empty dictionary when the remote router disconnects. I have done a similar thing for subscriptions.
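A rough sketch of the second issue and the on_leave approach, again as a toy model rather than the actual rlink.py code. `ToyLink`, `clear_on_leave`, `remote_subs` and `reconnect` are hypothetical names, and the model assumes the remote router loses all forwarded subscriptions when the link drops.

```python
class ToyLink:
    """Toy model of stale _subs entries surviving a disconnect (hypothetical)."""

    def __init__(self, clear_on_leave):
        self.clear_on_leave = clear_on_leave
        self._subs = {}           # sub ID -> forwarded subscription
        self.remote_subs = set()  # what the remote router currently holds

    def on_join(self, local_sub_ids):
        # On (re)connect, forward every local sub that is not yet in _subs.
        for sub_id in local_sub_ids:
            if sub_id in self._subs:
                continue          # stale entries cause the sub to be skipped
            self._subs[sub_id] = f"remote-sub-{sub_id}"
            self.remote_subs.add(sub_id)

    def on_leave(self):
        # Assumption: the remote side forgets everything on disconnect.
        self.remote_subs.clear()
        if self.clear_on_leave:
            self._subs = {}       # same idea as resetting _regs for registrations


def reconnect(clear_on_leave):
    link = ToyLink(clear_on_leave)
    link.on_join([7, 8])   # initial connection forwards both subs
    link.on_leave()        # remote router disconnects
    link.on_join([7, 8])   # remote router reconnects
    return sorted(link.remote_subs)


print(reconnect(False))  # [] : nothing is re-forwarded
print(reconnect(True))   # [7, 8]
```

Resetting the dictionary on leave is the simplest option; it relies on every subscription being replayed on the next join.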

oberstet commented 1 year ago

first of all, kudos for diving that deep into rlinks and router-to-router operations ;)

the code for rlinks, including the features that it depends on (such as the meta API), is pretty complex. there will be bugs. however, it is - I would say - quite organized, and in fact, the problems you described are likely in the file you mentioned:

https://github.com/crossbario/crossbar/blob/master/crossbar/worker/rlink.py

I'm currently busy with other things and hence haven't looked deeply into what you raised. However, from my point of view, one general comment:

The remaining bugs are related to complex situations, such as recovery from node failures in a network of nodes, and reliably fixing them without introducing new bugs will require automated tests.

Poking around in the code without automated test cases is unlikely to succeed (because of the complexity).

My main overarching worry, hence, is: how do we come up with a suitable set of test scenarios and automated tests to cover router-to-router app messaging ... to test the promises r2r makes wrt apps?