Technical Story

In a multiprocessing system you have to be careful about how many processes you spawn, for two reasons:

If the number of processes becomes much larger than the number of CPUs, the operating system will spend most of its time context switching between processes, which is time consuming.
Processes take a lot of memory, so if you spawn a lot of processes you may eventually run out of memory.

If we only focus on the Monitors, we are currently creating a manager process for every type of monitorable and a monitor process for every monitorable. For example, suppose that the user added 4 Cosmos nodes and 1 DockerHub repo for monitoring. On startup PANIC is going to start a ContractsMonitorsManager, NetworkMonitorsManager, NodeMonitorsManager, SystemMonitorsManager, DockerHubMonitorsManager and a GitHubMonitorsManager, each in a separate process. In addition to this, PANIC will start 4 CosmosNodeMonitors and 1 DockerHubMonitor, each in a separate process. As a result we are creating a lot of processes, and the number will potentially increase as the node operator adds more monitorables. At a larger scale we might end up with a slow system and/or run out of memory.

To solve this we propose reducing the number of processes by using a combination of processes and threads. We can start by focusing on the Monitors, benchmark the implementation, and if there is a benefit incorporate these changes into other components. The idea is to have a single MonitorsManager which spawns a thread for each monitorable. As per the resources below, threads are more memory efficient and lightweight to handle. When implementing the threaded monitor we have two options:

1. Implement a long-lived thread which connects to Rabbit once and performs work every 10 seconds.
2. Implement a thread which connects to Rabbit, does its work, disconnects from Rabbit and terminates.

It is suggested that we go with implementation 1 because, according to the RabbitMQ docs, the Rabbit server works better with long-lived connections.
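
The long-lived option could look roughly like the sketch below. It is only an illustration: `FakeConnection`, `LongLivedMonitor`, `connect` and `do_work` are hypothetical names and not part of PANIC or Pika, and the real monitor would open an actual RabbitMQ connection inside the thread.

```python
import threading
from typing import Callable


class FakeConnection:
    """Hypothetical stand-in for a RabbitMQ connection (not the Pika API)."""

    def __init__(self) -> None:
        self.is_open = True

    def close(self) -> None:
        self.is_open = False


class LongLivedMonitor(threading.Thread):
    """Connects once, then does one round of work per period until stopped."""

    def __init__(self, connect: Callable[[], FakeConnection],
                 do_work: Callable[[FakeConnection], None],
                 period_seconds: float = 10.0) -> None:
        super().__init__(daemon=True)
        self._connect = connect
        self._do_work = do_work
        self._period = period_seconds
        self._stop_event = threading.Event()

    def stop(self) -> None:
        """Ask the thread to finish after its current round of work."""
        self._stop_event.set()

    def run(self) -> None:
        # One connection for the whole lifetime of the thread.
        connection = self._connect()
        try:
            while not self._stop_event.is_set():
                self._do_work(connection)
                # Sleeps for the period, but returns early if stop() is called.
                self._stop_event.wait(self._period)
        finally:
            connection.close()
```

Waiting on the event instead of calling `time.sleep` means a stop request does not have to wait for the full 10 seconds to elapse.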

For this huge task to be completed we need to tackle the following:

Implement the Strategy Pattern for every type of monitor to have a code-base of higher quality
Implement a single MonitorsManager that is able to receive configurations and use the appropriate strategy to start a monitor in a separate thread based on the routing key
Add the monitorables store functionality that we already have in present managers.
Add the heartbeats functionality that we already have in present managers.
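
As a rough sketch of the first two items, the manager could pick a strategy from the routing key and let that strategy decide what the monitor thread runs. Everything here is an assumption made for illustration: the `build_target` method, the `SubstrateNetworkStrategy` class, and the routing-key prefix used in the usage note are hypothetical and do not reflect PANIC's actual interfaces or routing keys.

```python
import threading
from abc import ABC, abstractmethod
from typing import Callable, Dict, List


class MonitorStrategy(ABC):
    """One strategy per monitor type; it knows how to build that monitor."""

    @abstractmethod
    def build_target(self, config: Dict[str, str]) -> Callable[[], None]:
        """Return the function that the monitor thread will run."""


class SubstrateNetworkStrategy(MonitorStrategy):
    def __init__(self, started: List[str]) -> None:
        self._started = started

    def build_target(self, config: Dict[str, str]) -> Callable[[], None]:
        def target() -> None:
            # Placeholder for the real monitoring work.
            self._started.append(f"substrate-network:{config['id']}")
        return target


class MonitorsManager:
    def __init__(self, strategies: Dict[str, MonitorStrategy]) -> None:
        # Maps a routing-key prefix (hypothetical naming) to a strategy.
        self._strategies = strategies

    def _strategy_for(self, routing_key: str) -> MonitorStrategy:
        for prefix, strategy in self._strategies.items():
            if routing_key.startswith(prefix):
                return strategy
        raise KeyError(f"no strategy registered for {routing_key!r}")

    def start_monitor(self, routing_key: str,
                      config: Dict[str, str]) -> threading.Thread:
        target = self._strategy_for(routing_key).build_target(config)
        thread = threading.Thread(target=target, daemon=True)
        thread.start()
        return thread
```

With this shape, supporting a new monitor type means registering one more strategy, for example `MonitorsManager({"chains.substrate.": SubstrateNetworkStrategy([])})`, without touching the manager itself.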

Therefore to easily handle this large change we will break the task described above into granular tickets.

The aim of this ticket is to develop a single MonitorsManager running in a separate process that is able to process the configurations required to start the SubstrateNetworkMonitor

Resources:

Requirements

We must create a single MonitorsManager running in a separate process that is able to do the following:

[ ] Receive new substrate node configurations and start a SubstrateNetworkMonitor for each new configuration in a separate thread
[ ] Receive updated substrate node configurations and restart a SubstrateNetworkMonitor for each updated configuration in a separate thread
[ ] Receive removed substrate node configurations and terminate every related SubstrateNetworkMonitor
[ ] Update the pre-declared config queues from the run_alerter.py which are related to the MonitorsManager
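
The first three requirements depend on telling new, updated and removed configurations apart. A minimal sketch of that comparison, assuming configurations arrive as a dictionary keyed by a node identifier (the `ConfigDiff` and `diff_configs` names and the dictionary layout are hypothetical):

```python
from typing import Dict, NamedTuple, Set

Config = Dict[str, str]


class ConfigDiff(NamedTuple):
    new: Set[str]
    updated: Set[str]
    removed: Set[str]


def diff_configs(current: Dict[str, Config],
                 received: Dict[str, Config]) -> ConfigDiff:
    """Compare the configs already running against the ones just received."""
    current_ids, received_ids = set(current), set(received)
    new = received_ids - current_ids
    removed = current_ids - received_ids
    # Present on both sides, but with different contents.
    updated = {node_id for node_id in current_ids & received_ids
               if current[node_id] != received[node_id]}
    return ConfigDiff(new, updated, removed)
```

The manager would then start a thread for every id in `new`, restart the thread for every id in `updated`, and terminate the thread for every id in `removed`.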

Some Notes:

The MonitorsManager's job is to detect which configurations are new, updated and removed in order to create, update and terminate monitor threads.
To start a monitor in a separate thread the MonitorsManager must interact with the MonitorStarters class via the appropriate MonitorStrategy.
Objects such as data sources may be shared between threads. This means that although the bulk of the SubstrateNetworkMonitor implementation should not be affected, we might require a lock to access shared objects such as the list of SubstrateNodes to be used as data sources. This needs further investigation. If we instead create new objects from dictionaries for each thread we might not need locks, because each object would be unique to its thread. However, with this approach we need to restart a long-lived thread whenever there is a config update/removal.
Since we are able to use shared memory we can use a single Monitors logger for both the manager and the individual monitors running in separate threads. This will help us reduce the number of log files that are currently being generated by PANIC. However, we must make sure that we are able to do so in a safe manner. According to this thread, the logging module is thread safe: https://stackoverflow.com/questions/2973900/is-pythons-logging-module-thread-safe
Each thread must have its own Rabbit connection as Pika is not thread safe
Each thread must be set as daemon just in case the parent process needs to exit.
Each monitor must handle the thread stopping criteria possibly using exit flags or exceptions.
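
The last two notes, together with the idea of giving each thread its own objects, could be combined as in the sketch below. `MonitorThreads` and the `body` callable are hypothetical names used only for illustration; the exit flag is a `threading.Event` that each monitor is expected to poll or wait on.

```python
import copy
import threading
from typing import Callable, Dict, Tuple

Config = Dict[str, str]
# A monitor body receives its own copy of the config and an exit flag.
MonitorBody = Callable[[Config, threading.Event], None]


class MonitorThreads:
    """Keeps track of one daemon thread and one exit flag per monitor id."""

    def __init__(self, body: MonitorBody) -> None:
        self._body = body
        self._threads: Dict[str, Tuple[threading.Thread, threading.Event]] = {}

    def start(self, monitor_id: str, config: Config) -> None:
        exit_flag = threading.Event()
        # The deep copy gives the thread its own config object, so the
        # thread does not share that data with the manager.
        thread = threading.Thread(
            target=self._body,
            args=(copy.deepcopy(config), exit_flag),
            name=f"monitor-{monitor_id}",
            daemon=True,  # does not keep the parent process alive
        )
        self._threads[monitor_id] = (thread, exit_flag)
        thread.start()

    def stop(self, monitor_id: str, timeout: float = 5.0) -> None:
        thread, exit_flag = self._threads.pop(monitor_id)
        exit_flag.set()
        thread.join(timeout)

    def restart(self, monitor_id: str, config: Config) -> None:
        self.stop(monitor_id)
        self.start(monitor_id, config)

    def alive(self) -> Dict[str, bool]:
        return {monitor_id: thread.is_alive()
                for monitor_id, (thread, _) in self._threads.items()}
```

A config update then maps to `restart`, which is the cost mentioned above of not sharing objects: the long-lived thread has to be replaced rather than updated in place.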

Blocked by

241

Acceptance criteria

Given: The MonitorsManager receives new substrate node configurations Then: The MonitorsManager is able to start a new SubstrateNetworkMonitor in a separate thread

Given: The MonitorsManager receives updated substrate node configurations Then: The MonitorsManager is able to terminate each thread associated with an updated config and start a new one with the updated configs

Given: The MonitorsManager receives removed substrate node configurations Then: The MonitorsManager is able to terminate each thread associated with a removed configuration

Given: The MonitorsManager accesses shared memory Then: It can do so without any race conditions / errors
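
If some objects do end up being shared between threads, the last criterion could be met with a lock around every access. The sketch below assumes a hypothetical `SharedDataSources` holder for node identifiers; it is not an existing PANIC class.

```python
import threading
from typing import List


class SharedDataSources:
    """Hypothetical thread-safe holder for node ids shared by monitors."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._nodes: List[str] = []

    def add(self, node_id: str) -> None:
        # The membership check and the append happen under one lock, so
        # two threads cannot both add the same id.
        with self._lock:
            if node_id not in self._nodes:
                self._nodes.append(node_id)

    def remove(self, node_id: str) -> None:
        with self._lock:
            if node_id in self._nodes:
                self._nodes.remove(node_id)

    def snapshot(self) -> List[str]:
        # Return a copy so callers can iterate without holding the lock.
        with self._lock:
            return list(self._nodes)
```

Readers work on the copy returned by `snapshot`, so a concurrent `add` or `remove` cannot change a list while a monitor is iterating over it.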

SimplyStaking / panic

Implement MonitorsManagers' functionality to start the SubstrateNetworkMonitor #251
