RolnickLab / ami-platform

GNU General Public License v3.0

Stage Implementation Monitoring #538

Open kaviecos opened 3 weeks ago

kaviecos commented 3 weeks ago

We need a way to see whether a stage implementation is online, or when it was last seen. Two options: add a basic health-check endpoint (see the Kubernetes /livez and /readyz pattern), or check when a stage last took a message from the queue.
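For the health-check option, a minimal sketch using only the Python standard library could look like the following. The handler class, the `ready` flag, and `health_response` are hypothetical names for illustration, not part of the existing code base; only the `/livez` and `/readyz` paths come from the Kubernetes pattern mentioned above.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def health_response(path, ready):
    """Map a request path to an (HTTP status, JSON payload) pair."""
    if path == "/livez":
        # Liveness: the process is up and able to answer HTTP.
        return 200, {"status": "alive"}
    if path == "/readyz":
        # Readiness: the stage is able to take work from its queue.
        if ready:
            return 200, {"status": "ready"}
        return 503, {"status": "not ready"}
    return 404, {"error": "not found"}


class HealthHandler(BaseHTTPRequestHandler):
    # Hypothetical flag: a stage would set it once connected to its queue.
    ready = threading.Event()

    def do_GET(self):
        code, payload = health_response(self.path, self.ready.is_set())
        body = json.dumps(payload).encode("utf-8")
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # Keep the sketch quiet; a real service would log requests.


def start_health_server(host="127.0.0.1", port=0):
    """Serve the endpoints on a background thread; port 0 picks a free port."""
    server = ThreadingHTTPServer((host, port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Note that this only works if the controller (or whoever polls) can reach the stage over the network, which may not hold for compute clusters behind a firewall.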

kaviecos commented 2 weeks ago

Suggestion:

Consumers register with the controller when they come online (project + stage, queue name). While online, they periodically send "alive" (heartbeat) messages to the controller. Registration and heartbeats can be done via an HTTP API or via RabbitMQ. The controller keeps track of all consumers, and users can query this information through the controller API.
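A rough sketch of the bookkeeping the controller would need, independent of whether registration and heartbeats arrive over HTTP or RabbitMQ. All names here (`ConsumerRegistry`, `consumer_id`, the 30-second timeout) are assumptions for illustration, not an existing API:

```python
import time
from dataclasses import dataclass


@dataclass
class ConsumerRecord:
    project: str
    stage: str
    queue_name: str
    registered_at: float
    last_seen: float


class ConsumerRegistry:
    """In-memory record of registered consumers and their last heartbeat."""

    def __init__(self, timeout_seconds=30.0, clock=time.time):
        # A consumer counts as offline once no heartbeat has arrived
        # for longer than timeout_seconds (an assumed default).
        self.timeout_seconds = timeout_seconds
        self._clock = clock
        self._consumers = {}

    def register(self, consumer_id, project, stage, queue_name):
        now = self._clock()
        self._consumers[consumer_id] = ConsumerRecord(
            project, stage, queue_name, registered_at=now, last_seen=now
        )

    def heartbeat(self, consumer_id):
        record = self._consumers.get(consumer_id)
        if record is None:
            return False  # Unknown consumer: it should register again.
        record.last_seen = self._clock()
        return True

    def status(self):
        """Return what a status query on the controller API could expose."""
        now = self._clock()
        return [
            {
                "consumer_id": consumer_id,
                "project": record.project,
                "stage": record.stage,
                "queue_name": record.queue_name,
                "last_seen": record.last_seen,
                "online": now - record.last_seen <= self.timeout_seconds,
            }
            for consumer_id, record in self._consumers.items()
        ]
```

Keeping `last_seen` even after a consumer times out covers the "when was it last seen" part of the original question.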

Requirements for this to work:

  1. Compute clusters must be able to connect to RabbitMQ
  2. If heartbeats are via HTTP, then compute clusters must also be able to connect to the controller over HTTPS

Benefits of using messages for heartbeats: Consumers do not need access to the controller's API key. Downside: The messaging adds overhead, and a heartbeat message could be delayed in the queue.
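One possible way to soften the queue-delay downside is to put a send timestamp in the heartbeat message, so the controller records when the consumer was alive rather than when the message was finally read. The message fields and function names below are hypothetical, and the approach assumes consumer and controller clocks are roughly in sync:

```python
import json
import time


def encode_heartbeat(project, stage, queue_name, now=None):
    """Build the heartbeat message body a consumer would publish."""
    sent_at = time.time() if now is None else now
    message = {
        "type": "heartbeat",
        "project": project,
        "stage": stage,
        "queue_name": queue_name,
        "sent_at": sent_at,
    }
    return json.dumps(message).encode("utf-8")


def decode_heartbeat(body, received_at=None):
    """Parse a heartbeat on the controller side.

    Returns (message, last_seen, queue_delay), where last_seen is the
    consumer's send time and queue_delay is how long the message waited.
    """
    received_at = time.time() if received_at is None else received_at
    message = json.loads(body.decode("utf-8"))
    last_seen = message["sent_at"]
    # Clamp at zero in case of small clock differences between hosts.
    queue_delay = max(0.0, received_at - last_seen)
    return message, last_seen, queue_delay
```

The resulting bytes could be published with any RabbitMQ client. The measured `queue_delay` is also a cheap signal of how backed up the heartbeat queue is.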