Support multiple endpoints and receive data by train stride

philsmt commented 1 year ago

Client part of https://git.xfel.eu/karaboDevices/TrainMatcher/-/merge_requests/42

Adds support to select to only receive certain trains via a divisor and remainder, as well as the DEALER pattern to connect to multiple endpoints, each sending part of a train's data, and match the data afterwards. The matching part is not robust against the bridges going out sync at this point (e.g. asking for even trains may not yield the same result across two bridges). The current plan is to see in operations how much of a problem this is.

codecov[bot] commented 1 year ago

Codecov Report

Attention: 6 lines in your changes are missing coverage. Please review.

Files	Coverage Δ
karabo_bridge/client.py	`86.15% <80.00%> (-6.87%)`	:arrow_down:

:loudspeaker: Thoughts on this report? Let us know!.

takluyver commented 1 year ago

I think you're already aware how this could go wrong if the senders aren't sufficiently aligned, especially if one of them skips over the key train we're waiting for - it will wait for divisor trains to get the next matching one.

But I think it's also going to get very confusing if one of the addresses you give it is wrong - either a typo, or a sender that's not running. Because ZMQ allows you to connect() before the other side calls bind(), connect() always succeeds, and it tries to establish the connection in the background. So you send out N requests but only have M<N connected peers; ZMQ either sends two requests to the same peer, or else queues up messages for 'to be connected' peers. Of course, making a mistake is already a waiting game with a single sender, but I suspect it's going to be both easier to do and more confusing with multiple.

If we focus less on making something similar to the existing Karabo-bridge protocol, what might a solution look like? For instance:

Client connects to a REP-REQ socket, sends a request for a 'modulo 5' channel
- This can have a fairly short timeout so you get a failure fast if there's no reply
Sender creates a PUB socket for 'modulo 5' (if it doesn't already have one), replies with an address to connect to
- The request could also go to a coordinator which talks to N senders and replies with N addresses
Sender starts sending out messages with the remainder (mod 5) as a topic
Client creates a SUB socket, connects to the address it got, and subscribes to its desired remainder

I'm not saying we should do it exactly like that, but I don't think the reply-with-data pattern is a great starting point, and I don't see much benefit of sticking to it given that we have to change both sides anyway.

philsmt commented 1 year ago

Thanks for having a look. I want to stress that this concept tries to enhance the current semantics without being overly intrusive at all. After all, the N senders to M train-parallelized clients is still a bit of a niche pattern.

I think you're already aware how this could go wrong if the senders aren't sufficiently aligned, especially if one of them skips over the key train we're waiting for - it will wait for divisor trains to get the next matching one. But I think it's also going to get very confusing if one of the addresses you give it is wrong - either a typo, or a sender that's not running. Because ZMQ allows you to connect() before the other side calls bind(), connect() always succeeds, and it tries to establish the connection in the background. So you send out N requests but only have M<N connected peers; ZMQ either sends two requests to the same peer, or else queues up messages for 'to be connected' peers. Of course, making a mistake is already a waiting game with a single sender, but I suspect it's going to be both easier to do and more confusing with multiple.

Yes, the matching part on the client side is extraordinary naive at the moment. It definitely needs a timeout to cope with missing bridges, and a check whether the train IDs actually match up. Frankly I'm still unsure how much of a problem misaligned bridges will be. Amusingly, it will become more stable the more clients with a higher divisor are connected.

If we focus less on making something similar to the existing Karabo-bridge protocol, what might a solution look like? For instance:

Client connects to a REP-REQ socket, sends a request for a 'modulo 5' channel

This can have a fairly short timeout so you get a failure fast if there's no reply

Sender creates a PUB socket for 'modulo 5' (if it doesn't already have one), replies with an address to connect to

The request could also go to a coordinator which talks to N senders and replies with N addresses

Sender starts sending out messages with the remainder (mod 5) as a topic

Client creates a SUB socket, connects to the address it got, and subscribes to its desired remainder

I considered the PUB-SUB idea first and found the bridge-side setting of it a major disadvantage. It's indeed an interesting idea to make the configuration part of the protocol. However, in the end we may also simply create additional bridges for such cases (and leave a default one intact, remember TrainMatchers can have multiple bridges already) to avoid handling teardown of such sockets as well.

I'm not saying we should do it exactly like that, but I don't think the reply-with-data pattern is a great starting point, and I don't see much benefit of sticking to it given that we have to change both sides anyway.

Agree, there are much better designs possible with deeper changes. As you suggest, I would actually start with the patterns on both ends, replacing the current push/poll concept with an actual loop involving control messages. This would significantly reduce the risk for the "simple protocol" of this PR to get out of sync, and allow even deeper alignment based on actual train IDs.

European-XFEL / karabo-bridge-py

Support multiple endpoints and receive data by train stride #70

Codecov Report