GaloisInc / BESSPIN-Tool-Suite

The core tool of the BESSPIN Framework.

Heartbeat monitor #1254

Closed: podhrmic closed 2 years ago

podhrmic commented 3 years ago

Components that respond to Heartbeat Req on CMD bus (TCP)

These need to be queried by the heartbeat monitor and should always be online.

The components are listed in setupEnv.Json -> cyberPhysNodes.

To query these components, periodically do the following (on the CAN TCP bus):

  1. send CAN_ID_HEARTBEAT_REQ
  2. wait for responses (until TIMEOUT)

If the component responds within the TIMEOUT it is considered healthy.
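The request/wait loop above could be sketched roughly as follows. This is only an illustration: `poll_heartbeats`, `send_request`, `recv_response`, and the component names are hypothetical, and the actual CAN framing and socket handling are left to the injected callables rather than guessed here.

```python
import time
from typing import Callable, Dict, Iterable, Optional, Set


def poll_heartbeats(
    send_request: Callable[[], None],
    recv_response: Callable[[float], Optional[str]],
    expected: Iterable[str],
    timeout: float = 1.0,
) -> Dict[str, bool]:
    """Send one heartbeat request and record who answers before the deadline.

    send_request  -- hypothetical callable that sends CAN_ID_HEARTBEAT_REQ
    recv_response -- hypothetical callable that waits up to the given number
                     of seconds and returns the responder's name, or None
    """
    expected_set: Set[str] = set(expected)
    seen: Set[str] = set()

    send_request()
    deadline = time.monotonic() + timeout

    # Keep reading until everyone has answered or the TIMEOUT has elapsed.
    while not expected_set <= seen:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        sender = recv_response(remaining)
        if sender is not None:
            seen.add(sender)

    # A component is healthy only if it responded within the TIMEOUT.
    return {name: name in seen for name in sorted(expected_set)}
```

The same function would apply to both the TCP and the UDP bus, since only the two callables differ.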

Components/services that can be queried over ssh

These are queried with systemctl status $SERVICE_NAME.

To check these components, run the command over ssh and parse the response. If the response contains Active: active (running), the component is healthy. If the response contains Active: inactive, the component is not healthy (error).
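A minimal sketch of the parsing step, assuming the output looks like standard `systemctl status` text. The function names and the `host` argument are hypothetical; states other than the two named above (for example a failed unit) are returned as `None` so the caller can decide how to treat them.

```python
import subprocess
from typing import Optional


def health_from_systemctl(status_output: str) -> Optional[bool]:
    """Return True if running, False if inactive, None for anything else."""
    for line in status_output.splitlines():
        line = line.strip()
        if line.startswith("Active: active (running)"):
            return True
        if line.startswith("Active: inactive"):
            return False
    return None


def query_service(host: str, service: str, timeout: float = 5.0) -> Optional[bool]:
    """Run systemctl status over ssh (host is a hypothetical ssh target)."""
    # systemctl exits non-zero for inactive units, so the exit code is not
    # treated as a failure; only the printed status is inspected.
    proc = subprocess.run(
        ["ssh", host, "systemctl", "status", service],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return health_from_systemctl(proc.stdout)
```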

Components that respond to Heartbeat Req on CAN bus (UDP)

Occasionally these will be down during reset (keep track of reset requests).

To query these components, periodically do the following (on the CAN UDP bus):

  1. send CAN_ID_HEARTBEAT_REQ
  2. wait for responses (until TIMEOUT)

If the component responds within the TIMEOUT, it is considered healthy. In addition, FreeRTOS might send a CAN_ID_CMD_COMPONENT_ERROR(SENSOR_THROTTLE|SENSOR_BRAKE) message (on port 5002). In that case we need to request a Teensy restart with CAN_ID_CMD_RESTART(TEENSY) on the TCP bus. These error messages may arrive in quick succession, so keep track of the reset requests and allow no more than one outstanding request at a time.
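One way to enforce "no more than one at a time" is a small gate around the restart request. This is a sketch under assumptions: `RestartGate` and `send_restart` are hypothetical names, and how a restart is judged complete (for example, Teensy answering a heartbeat again) is not specified above.

```python
import threading
from typing import Callable


class RestartGate:
    """Allows at most one outstanding restart request."""

    def __init__(self, send_restart: Callable[[], None]) -> None:
        # send_restart is a hypothetical callable that sends
        # CAN_ID_CMD_RESTART(TEENSY) on the TCP bus.
        self._send_restart = send_restart
        self._lock = threading.Lock()
        self._pending = False

    def on_component_error(self) -> bool:
        """Handle one CAN_ID_CMD_COMPONENT_ERROR; return True if a restart was sent."""
        with self._lock:
            if self._pending:
                # A restart is already in flight; drop the duplicate.
                return False
            self._pending = True
        self._send_restart()
        return True

    def on_restart_complete(self) -> None:
        """Call once the component is back, so later errors can trigger a new restart."""
        with self._lock:
            self._pending = False
```

The same pending flag could also tell the heartbeat check that a missing response is expected while a reset is in progress.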

Components that respond to HTTP requests

Occasionally these will be down during reset (keep track of reset requests).

To query these components, send an HTTP request to each server.
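A minimal sketch of such a check using only the Python standard library. The function name is hypothetical, and treating any 2xx response as healthy is an assumption, since the issue does not say which status codes count.

```python
import urllib.error
import urllib.request


def http_is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True if the server answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, DNS failures, and
        # 4xx/5xx responses (urlopen raises HTTPError for those).
        return False
```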

podhrmic commented 3 years ago

Have a class that represents the abstract component (see the list of components above). Components have a notion of health (states are HEALTHY, ERROR, and maybe WARNING). Not every component is queried the same way, and some components need to be tested in more than one way to determine whether they are healthy.

The monitor collects this information about the system, and prints it on demand.
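A rough sketch of what that structure could look like. All class and method names here are hypothetical, and the rule that a component is HEALTHY only when every one of its checks passes is an assumption.

```python
from abc import ABC, abstractmethod
from enum import Enum
from typing import Callable, Iterable, List


class Health(Enum):
    HEALTHY = "HEALTHY"
    WARNING = "WARNING"
    ERROR = "ERROR"


class Component(ABC):
    """Abstract component with a notion of health."""

    def __init__(self, name: str) -> None:
        self.name = name

    @abstractmethod
    def check(self) -> Health:
        """Query the component and return its current health."""


class CheckedComponent(Component):
    """Component tested in one or more ways; all checks must pass."""

    def __init__(self, name: str, checks: List[Callable[[], bool]]) -> None:
        super().__init__(name)
        self.checks = checks

    def check(self) -> Health:
        return Health.HEALTHY if all(c() for c in self.checks) else Health.ERROR


class HeartbeatMonitor:
    """Collects component health and reports it on demand."""

    def __init__(self, components: Iterable[Component]) -> None:
        self.components = list(components)

    def report(self) -> str:
        return "\n".join(f"{c.name}: {c.check().value}" for c in self.components)
```

Each check callable could wrap one of the query methods above (heartbeat, systemctl over ssh, HTTP), so a component that needs several tests simply lists several callables.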

podhrmic commented 2 years ago

Fixed in #1263