SimplyStaking / panic

PANIC Monitoring and Alerting For Blockchains
Apache License 2.0

Implement MonitorsManagers' functionality to start the SubstrateNetworkMonitor #251

Open dillu24 opened 2 years ago

dillu24 commented 2 years ago

Technical Story

In a multiprocessing system you have to watch how many processes you spawn. There are two reasons why you need to do this:

If we focus only on the Monitors, we currently create a manager process for every type of monitorable and a monitor process for every monitorable. For example, suppose that the user added 4 Cosmos nodes and 1 DockerHub repo for monitoring. On startup PANIC is going to start a ContractsMonitorsManager, NetworkMonitorsManager, NodeMonitorsManager, SystemMonitorsManager, DockerHubMonitorsManager and a GitHubMonitorsManager, each in a separate process. In addition to this, PANIC will start 4 CosmosNodeMonitors and 1 DockerHubMonitor, each in a separate process. As a result we are creating a lot of processes, and the number will potentially increase as the node operator adds more monitorables. At a larger scale we might end up with a slow system and/or run out of memory.

To solve this it is being proposed that we start reducing the number of processes by using a combination of processes and threads. We can start by first focusing on the Monitors, benchmark the implementation and if there is benefit we would incorporate these changes to other components. The idea is to have a single MonitorsManager which spawns a thread for each monitorable. As per the resources below, threads are more memory efficient and lightweight to handle. When implementing the threaded monitor we have two options:

  1. To implement a long-lived thread which connects to rabbit once and performs work every 10 seconds.
  2. To implement a thread whose work is to connect to rabbit, do work, disconnect from rabbit and terminate.

It is suggested that we go with option 1 because, according to the RabbitMQ docs, the rabbit server works better with long-lived connections.
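A minimal sketch of option 1, to make the idea concrete. It is not PANIC's implementation: `FakeConnection` is a hypothetical stand-in for a RabbitMQ connection (no real client library is used), and `LongLivedMonitor`, `period` and `stop()` are names assumed for illustration only.

```python
import threading


class FakeConnection:
    """Hypothetical stand-in for a RabbitMQ connection (not a real client API)."""

    def __init__(self):
        self.open_count = 0
        self.closed = False
        self.published = []

    def connect(self):
        self.open_count += 1

    def publish(self, message):
        self.published.append(message)

    def close(self):
        self.closed = True


class LongLivedMonitor(threading.Thread):
    """Connects once, then performs one round of work every `period` seconds."""

    def __init__(self, name, connection, period=10.0):
        super().__init__(name=name, daemon=True)
        self._connection = connection
        self._period = period
        self._stop_event = threading.Event()

    def run(self):
        # One connection for the whole lifetime of the thread.
        self._connection.connect()
        try:
            while not self._stop_event.is_set():
                self._connection.publish(f"{self.name}: data")
                # wait() returns early once stop() is called, so shutdown is prompt.
                self._stop_event.wait(self._period)
        finally:
            self._connection.close()

    def stop(self):
        self._stop_event.set()
```

Using an `Event` for both the sleep and the stop signal means the manager can terminate a monitor without waiting for the full 10 second period to elapse.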

For this huge task to be completed we need to tackle the following:

Therefore to easily handle this large change we will break the task described above into granular tickets.

The aim of this ticket is to develop a single MonitorsManager running in a separate process that is able to process the configurations required to start the SubstrateNetworkMonitor.

Resources:

Requirements

We must create a single MonitorsManager running in a separate process that is able to do the following:

Some Notes:

Blocked by

241

Acceptance criteria

Given: The MonitorsManager receives new substrate node configurations Then: The MonitorsManager is able to start a new SubstrateNetworkMonitor in a separate thread

Given: The MonitorsManager receives updated substrate node configurations Then: The MonitorsManager is able to terminate each thread associated with an updated config and start a new one with the updated configs

Given: The MonitorsManager receives removed substrate node configurations Then: The MonitorsManager is able to terminate each thread associated with a removed configuration

Given: The MonitorsManager accesses shared memory Then: It can do so without any race conditions / errors
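The acceptance criteria above could be sketched roughly as follows. This is only an illustration under assumptions: `MonitorThread` is a hypothetical placeholder for a SubstrateNetworkMonitor thread, `process_configs` and `running` are invented names, and configurations are assumed to arrive as a dict mapping a config ID to a config dict. A `threading.Lock` guards the shared dict of monitors.

```python
import threading


class MonitorThread(threading.Thread):
    """Hypothetical placeholder for a SubstrateNetworkMonitor thread."""

    def __init__(self, config):
        super().__init__(daemon=True)
        self.config = config
        self._stop_event = threading.Event()

    def run(self):
        # A real monitor would retrieve and send data here; this one just
        # blocks until it is asked to stop.
        self._stop_event.wait()

    def stop(self):
        self._stop_event.set()


class MonitorsManager:
    """Keeps one monitor thread per configuration ID."""

    def __init__(self):
        self._lock = threading.Lock()  # guards self._monitors
        self._monitors = {}

    def process_configs(self, configs):
        """Reconcile the running threads against `configs` (ID -> config dict)."""
        with self._lock:
            # Terminate threads whose config was removed or updated.
            for config_id in list(self._monitors):
                monitor = self._monitors[config_id]
                if configs.get(config_id) != monitor.config:
                    monitor.stop()
                    monitor.join()
                    del self._monitors[config_id]
            # Start threads for new configs and for the updated ones.
            for config_id, config in configs.items():
                if config_id not in self._monitors:
                    monitor = MonitorThread(dict(config))
                    monitor.start()
                    self._monitors[config_id] = monitor

    def running(self):
        """Return a snapshot of the configs that currently have a thread."""
        with self._lock:
            return {cid: m.config for cid, m in self._monitors.items()}
```

An updated configuration is handled as a stop followed by a start, which matches the second criterion: the old thread is terminated and a new one is started with the updated config.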

dillu24 commented 2 years ago

@simplyrider also suggested another approach for implementing the Monitors architecture:

To make sure that the system never runs out of memory and that CPU usage is kept low, we must make sure that there aren't a lot of threads/processes running at the same time as the user adds more monitorables. A good approach to manage this is to implement a queue which controls how many threads execute at the same time. Therefore we can have the following:

  1. We would have 1 MonitorsManager that has 1 thread listening for all types of monitorable configs and another thread executing a batch of tasks from a multiprocessing queue every X seconds (X should vary according to how many monitorables we have).
  2. When a configuration is received, the MonitorsManager adds a task to the multiprocessing queue for each configuration. In this task we must specify the monitor strategy to execute and the corresponding configuration.
  3. Once 5 seconds elapse for the task thread, the task thread grabs Y tasks from the queue (Y should vary according to how many monitorables we have), checks that their configurations were not updated by the user and, if not, starts a monitor thread for each of those configurations. Afterwards it puts the same task at the end of the queue for another monitoring round later on.
  4. The monitor process needs to connect to rabbit with a separate connection, retrieve data, send it and disconnect from rabbit.
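One batch round of the steps above might look like the sketch below. It rests on several assumptions: it uses the standard library's `queue.Queue` rather than a multiprocessing queue because everything here runs in one process, a task is assumed to be a `(strategy, config_id, config)` tuple, and `run_batch` is a hypothetical name. It also assumes that a stale task is simply dropped, on the expectation that the manager enqueued a fresh task when the updated configuration arrived.

```python
import queue
import threading


def run_batch(task_queue, current_configs, batch_size):
    """Take up to `batch_size` tasks, run the still-valid ones, and requeue them.

    Returns the number of monitor threads that were started.
    """
    batch = []
    for _ in range(batch_size):
        try:
            batch.append(task_queue.get_nowait())
        except queue.Empty:
            break

    threads = []
    for strategy, config_id, config in batch:
        # Skip tasks whose configuration was updated or removed by the user.
        if current_configs.get(config_id) != config:
            continue
        thread = threading.Thread(target=strategy, args=(config,))
        thread.start()
        threads.append(thread)
        # Put the same task back at the end of the queue for a later round.
        task_queue.put((strategy, config_id, config))

    # Wait for this batch to finish so at most `batch_size` threads run at once.
    for thread in threads:
        thread.join()
    return len(threads)
```

Joining the batch before returning is one way to cap concurrency at Y threads; a task thread would call `run_batch` once per interval.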

Some Notes:

Some resources: