crossbario / crossbar

Crossbar.io - WAMP application router
https://crossbar.io/

Shared registration: Callee is None #1850

Open jegger opened 3 years ago

jegger commented 3 years ago

Current setup: one Crossbar router (PyPy, 20.5.1.dev1) with several services registered under a shared registration (round-robin). The services leave and rejoin from time to time when they are updated.

Issue: we noticed that some WAMP RPC calls were never answered. In fact, every sixth call was a shot in the dark. After some investigation we realized that these calls never reached any of the services. Calling the wamp.registration.list_callees endpoint of Crossbar returned the following list: [None, id1, id2, id3, id4, id5]. The five IDs match the five running services, but there is a sixth entry, None. We then tried to remove that None with wamp.registration.remove_callee, but of course without any luck. Only a restart of Crossbar fixed the issue.

My guess is that crossbar did not remove a callee from the list when that callee left.
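The "every sixth call" observation is what a plain round-robin over a six-entry list with one dead slot would produce. The following is a toy model only, assuming the router simply cycles through the callee list in order; it is not Crossbar's dealer code, and `simulate_round_robin` is a hypothetical helper name.

```python
from itertools import cycle


def simulate_round_robin(callees, n_calls):
    """Return the callee chosen for each of n_calls, cycling in list order."""
    slots = cycle(callees)
    return [next(slots) for _ in range(n_calls)]


# Callee list as reported in the issue: one stale None plus five live services.
callees = [None, "id1", "id2", "id3", "id4", "id5"]

targets = simulate_round_robin(callees, 12)

# 1-based call numbers that were routed to the stale slot.
lost = [i for i, target in enumerate(targets, start=1) if target is None]
print(lost)  # [1, 7]
```

With one stale entry among six, one call in six goes nowhere, which matches the failure rate described above.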

Thanks

oberstet commented 3 years ago

> by calling the wamp.registration.list_callees endpoint of crossbar we got the following list returned: [None, id1, id2, id3, id4, id5]

this seems indeed wrong. we need to investigate. we need to be able to create a test case that provokes this issue while running, then fix the bug, see the test case succeed.

we do run a whole set of integration tests in the CI

https://github.com/crossbario/crossbar/runs/1736730732?check_suite_focus=true#step:9:1

https://github.com/crossbario/autobahn-python/blob/a468e5724b588ba51e3bbcad42317b906ac0606b/examples/run-all-examples.py#L138

but not sure about shared registrations
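A test that tries to provoke this would need an invariant to assert on. One possible check, sketched here under assumptions, is that every entry returned by wamp.registration.list_callees is a session that wamp.session.list still reports as joined. `session` is assumed to be a joined asyncio-style Autobahn session that is authorized to call the meta API; `find_stale_callees` is a hypothetical helper name, not part of Crossbar or Autobahn.

```python
import asyncio


async def find_stale_callees(session, registration_id):
    """Return callee entries that are None or no longer joined sessions.

    Assumes session.call(...) is awaitable and that the caller is
    authorized for the registration and session meta procedures.
    """
    callees = await session.call(
        "wamp.registration.list_callees", registration_id
    )
    live_sessions = set(await session.call("wamp.session.list"))
    return [c for c in callees if c is None or c not in live_sessions]
```

An integration test could churn callees on a shared registration and assert after each round that this list is empty.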


> I guess that is a bug in crossbar? Sadly I am not able to provide any useful logs of crossbar

yes, it looks like a bug under some specific condition ..

> Can we apply any configuration to prevent that situation?

you should have both client and router initiated heart-beating enabled

> Can you imagine a way of fixing the issue without a crossbar restart?

yes: unregister all callees. that will make the whole registration go away in the router. that should purge everything still hanging around on that registration.

jegger commented 3 years ago

> you should have both client and router initiated heart-beating enabled

This is in place: we have autoPingInterval=10 and autoPingTimeout=60 on the client, and interval=40000, timeout=60000, size=4 in the Crossbar config.

> yes: unregister all callees. that will make the whole registration go away in the router. that should purge everything still hanging around on that registration.

How would I do that? I only see a procedure to unregister one callee at a time: https://crossbar.io/docs/Registration-Meta-Events-and-Procedures/. Or did you mean that I would need to unregister the five valid IDs, and that this should then remove the None as well?

oberstet commented 3 years ago

using https://crossbar.io/docs/Registration-Meta-Events-and-Procedures/#retrieving-information-about-a-specific-registration you can list all callees, and call into remove callee for each non-None value

once all callees are gone, the registration is removed in the router, and doing so also then fires "wamp.registration.on_delete"

I'm not sure what the None value is .. maybe it is an internal registration observer that slipped into that returned list - in which case, the registration should really go away once you removed all non-None callees. maybe there is a real bug in that a callee that has disappeared has not been removed from the list.
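The workaround described above could be scripted roughly as follows. This is a sketch under assumptions, not a tested fix: `session` is assumed to be a joined asyncio-style Autobahn session authorized for the registration meta API, the argument order of wamp.registration.remove_callee is assumed to be (registration ID, callee session ID), and `purge_registration` is a hypothetical helper name.

```python
import asyncio


async def purge_registration(session, procedure):
    """Remove every non-None callee from the registration for `procedure`.

    Returns the list of callee session IDs that were removed.
    """
    # Resolve the procedure URI to its registration ID (None if not registered).
    registration_id = await session.call("wamp.registration.lookup", procedure)
    if registration_id is None:
        return []

    callees = await session.call(
        "wamp.registration.list_callees", registration_id
    )

    removed = []
    for callee_id in callees:
        if callee_id is None:
            # The stale entry cannot be addressed by ID; skip it.
            continue
        await session.call(
            "wamp.registration.remove_callee", registration_id, callee_id
        )
        removed.append(callee_id)
    return removed
```

Note that the live services lose their registration when this runs, so they would need to register again afterwards. Whether the None entry actually disappears once the last real callee is removed is exactly what this thread leaves open.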

Jopie64 commented 3 years ago

I think we ran into a similar issue. We use shared registrations. Recently we had an incident in which, after we restarted a WAMP node that had such a shared RPC registered, clients reported that their functionality worked only part of the time. We found that some RPC calls were not arriving at the callee. A restart of the Crossbar router did the trick.

I think it is also related to #980, which from the looks of it, appears to be fixed...

Another thing to note is that we have used Crossbar for years now, and this is the first time we know of that we had such an incident. So it seems a rare scenario/race.

oberstet commented 3 years ago

@Jopie64 thanks for your notes!

> So it seems a rare scenario/race.

I think the best way would be to add functional tests which run under high load to Crossbar.io CI

This is possible of course, and actually we do have some already, but not one touching the features in the issues here (meta API under load)

on a side note, as we are talking: looking forward regarding WAMP/Crossbar.io, you might be interested in https://github.com/wamp-proto/wamp-proto/issues/387