The Kronos daemon has had problems for years, and we have applied various workarounds to keep it running. It is now time for an overall rework of the daemon to fix the problems for good. The to-dos from the meeting with the AMQ admin, the Rucio project leader, and the CMS developers are listed below; a lot of the work needs to be done on the Rucio development side.
To-Dos
Discuss connection lifetime with Radu (he was on vacation). His patch reuses existing connections and removes stalled ones, but it still times out connections. Should a connection be kept open for as long as possible, or should it time out on every run of the run_once_kronos_file function in kronos.py?
Handle the case where a file is opened frequently. Martin suggested adding a cache of recently updated DIDs and ignoring a DID if its next access_time falls within n seconds of the last recorded one, where n is configurable.
Rucio should move to the latest version of the STOMP protocol and library.
Configure the client to send heartbeats to the broker.
Disconnect gracefully.
Explicitly remove subscriptions when they are not needed anymore.
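
To make the cache item above more concrete, here is a minimal sketch of what Martin's suggestion could look like. It is not existing Rucio code: the class name RecentDIDCache, the method should_update, and the parameters min_interval_seconds and max_entries are all hypothetical, and it assumes each trace carries an access time in seconds. The real design (where the cache lives, how n is read from the config) still needs to be agreed.

```python
from collections import OrderedDict


class RecentDIDCache:
    """Hypothetical helper: remember the last recorded access time per DID
    and report whether a new access is far enough away to be worth an update."""

    def __init__(self, min_interval_seconds=60.0, max_entries=100_000):
        # min_interval_seconds plays the role of the configurable "n".
        self.min_interval_seconds = min_interval_seconds
        # Bound memory use by evicting the least recently recorded DIDs.
        self.max_entries = max_entries
        self._last_access = OrderedDict()  # (scope, name) -> last recorded access time

    def should_update(self, scope, name, accessed_at):
        """Return True if the DID should be updated for this access.

        accessed_at is assumed to be a timestamp in seconds taken from the trace.
        """
        key = (scope, name)
        last = self._last_access.get(key)
        if last is not None and accessed_at - last < self.min_interval_seconds:
            # Accessed again within n seconds (or an older, out-of-order trace): skip.
            return False
        self._last_access[key] = accessed_at
        self._last_access.move_to_end(key)
        while len(self._last_access) > self.max_entries:
            self._last_access.popitem(last=False)
        return True


cache = RecentDIDCache(min_interval_seconds=10, max_entries=2)
for scope, name, accessed_at in [("cms", "/a", 100.0), ("cms", "/a", 105.0), ("cms", "/a", 110.0)]:
    print(name, accessed_at, cache.should_update(scope, name, accessed_at))
```

With n = 10, the second access (5 seconds after the first) is skipped, while the first and third are kept. Whether the cache should be per thread or shared across Kronos threads is an open question for the discussion.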