cbm-fles / flesnet

CBM FLES Timeslice Building

Unexpected heartbeat timeouts at low timeslice rates #99

Open cuveland opened 2 years ago

cuveland commented 2 years ago

Dirk reports: Esteban tested something on mFLES and got "Worker protocol violation: connection heartbeat expired" errors. After each error, the worker disconnects and then connects again. He says he is doing something with the GBT links, but this looks to me like something else; see /home/flesctl/run/2323/slurm.out

Background: The worker (in this case probably the tsclient) emits this message when it is idle and has not received a heartbeat request from the distributor (flesnet) for 2 seconds. It then closes the connection and connects again, on the assumption that flesnet has been restarted. However, the distributor sends a heartbeat request every 0.5 s to each worker that is currently idle, so the heartbeat messages should keep arriving even when no timeslices are being built.
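To make the expected timing concrete, here is a minimal simulation sketch of the behavior described above. It is not the flesnet or tsclient code: the names `IdleWorker`, `simulate`, and `distributor_stalled` are hypothetical, and the only values taken from this issue are the 0.5 s heartbeat interval and the 2 s timeout. It assumes the worker restarts its timer after reconnecting.

```python
from dataclasses import dataclass

HEARTBEAT_INTERVAL_S = 0.5  # distributor -> idle worker, as described in this issue
HEARTBEAT_TIMEOUT_S = 2.0   # idle worker gives up after this much silence


@dataclass
class IdleWorker:
    """Hypothetical worker-side watchdog; not the actual flesnet implementation."""
    last_heartbeat: float = 0.0
    reconnects: int = 0

    def on_heartbeat(self, now: float) -> None:
        # Any heartbeat request from the distributor resets the timer.
        self.last_heartbeat = now

    def check(self, now: float) -> bool:
        """Return True if the heartbeat expired and the worker reconnects."""
        if now - self.last_heartbeat > HEARTBEAT_TIMEOUT_S:
            self.reconnects += 1
            self.last_heartbeat = now  # assumption: a fresh connection restarts the timer
            return True
        return False


def simulate(duration_s, distributor_stalled=lambda t: False, step_s=0.25):
    """Step a simulated clock and return how often the worker reconnected.

    distributor_stalled(t) is a hypothetical hook that returns True while the
    distributor fails to send heartbeats at simulated time t.
    """
    worker = IdleWorker()
    next_send = 0.0
    for i in range(int(duration_s / step_s) + 1):
        now = i * step_s
        if now >= next_send:
            if not distributor_stalled(now):
                worker.on_heartbeat(now)
            next_send = now + HEARTBEAT_INTERVAL_S
        worker.check(now)
    return worker.reconnects


# Heartbeats every 0.5 s never trip the 2 s timeout, even with no timeslices.
healthy = simulate(10.0)
# A distributor that goes silent for 3 s does trip it.
stalled = simulate(10.0, distributor_stalled=lambda t: 3.0 <= t < 6.0)
```

In this model, the error can only appear if the distributor stays silent towards an idle worker for more than 2 s, which would point at the sending side or the worker's idle bookkeeping rather than at the GBT links.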

Maybe the interface was never tested with interruptions in the timeslice data stream? The workers connect to the shared memory only when they receive their first timeslice, not immediately at "login", so it may well be that this scenario was not covered by the tests.