Open dillu24 opened 2 years ago
@simplyrider also suggested another approach for implementing the Monitors architecture:
To make sure that the system never runs out of memory and that CPU usage is kept low, we must make sure that, as the user adds more monitorables, there aren't a lot of threads/processes running at the same time. A good way to manage this is to implement a queue that limits how many threads execute at the same time. Therefore we can have the following:
- A `MonitorsManager` that has one thread listening for all types of monitorable configs and another thread executing a batch of tasks from a multiprocessing queue every X seconds (X should vary according to how many monitorables we have).
- The `MonitorsManager` adds a task to the multiprocessing queue for each configuration. Each task must specify the monitor strategy to execute and the corresponding configuration.

Some Notes:
- `MonitorsManager` or remove it altogether if deemed not useful.

Some resources:
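The queue-based approach suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions: `MonitorTask`, `drain_batch` and `run_batches` are hypothetical names, the batch size and interval are placeholders for the "X seconds" mentioned above, and a thread-safe `queue.Queue` is used instead of a multiprocessing queue to keep the example self-contained.

```python
import queue
import threading
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class MonitorTask:
    """One unit of work: the strategy to execute and its configuration."""
    strategy: Callable[[Dict], object]  # hypothetical monitor strategy
    config: Dict


def drain_batch(tasks: "queue.Queue[MonitorTask]",
                batch_size: int) -> List[MonitorTask]:
    """Remove up to batch_size tasks from the queue without blocking."""
    batch: List[MonitorTask] = []
    for _ in range(batch_size):
        try:
            batch.append(tasks.get_nowait())
        except queue.Empty:
            break  # fewer tasks than batch_size were queued
    return batch


def run_batches(tasks: "queue.Queue[MonitorTask]", batch_size: int,
                interval_seconds: float, stop_event: threading.Event,
                results: List[object]) -> None:
    """Execute at most batch_size queued tasks every interval_seconds."""
    while not stop_event.is_set():
        for task in drain_batch(tasks, batch_size):
            results.append(task.strategy(task.config))
        # Sleep until the next batch, waking early if asked to stop.
        stop_event.wait(interval_seconds)
```

In this sketch the listening thread would call `tasks.put(MonitorTask(...))` for each configuration, while a second thread runs `run_batches`. Capping `batch_size` is what bounds how much work executes at once.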
Technical Story
When you have a multiprocessing system you have to watch how many processes you are going to spawn. There are two reasons why you need to do this:
If we only focus on the Monitors, we currently create a manager process for every type of monitorable and a monitor process for every monitorable. For example, suppose that the user added 4 Cosmos nodes and 1 DockerHub repo for monitoring. On startup PANIC is going to start a `ContractsMonitorsManager`, `NetworkMonitorsManager`, `NodeMonitorsManager`, `SystemMonitorsManager`, `DockerHubMonitorsManager` and a `GitHubMonitorsManager`, each in a separate process. In addition to this, PANIC will start 4 `CosmosNodeMonitors` and 1 `DockerHubMonitor`, each in a separate process. As a result we are creating a lot of processes (11 for the monitors in this example), and this number will potentially increase as the node operator adds more monitorables. At a larger scale we might end up with a slow system and/or run out of memory.

To solve this, it is being proposed that we start reducing the number of processes by using a combination of processes and threads. We can start by focusing on the Monitors, benchmark the implementation and, if there is a benefit, incorporate these changes into other components. The idea is to have a single
`MonitorsManager` which spawns a thread for each monitorable. As per the resources below, threads are more memory efficient and lightweight to handle. When implementing the threaded monitor we have two options:

It is suggested that we perform implementation 1 because, according to the `RabbitMQ` docs, the RabbitMQ server works better with long-lived connections.

For this huge task to be completed we need to tackle the following:
- A `MonitorsManager` that is able to receive configurations and use the appropriate strategy to start a monitor in a separate thread based on the routing key

Therefore, to easily handle this large change we will break the task described above into granular tickets.
The aim of this ticket is to develop a single `MonitorsManager` running in a separate process that is able to process the configurations required to start the `SubstrateNetworkMonitor`.
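To illustrate the idea of a single manager that picks a strategy from the routing key and starts one thread per monitorable, here is a minimal sketch. The routing keys, the `STRATEGIES` table, the `period_seconds` and `id` config fields, and `start_monitor` are all hypothetical and not taken from PANIC's codebase; the strategy body is a placeholder loop.

```python
import threading
from typing import Callable, Dict, Tuple

MonitorStrategy = Callable[[Dict, threading.Event], None]


def _monitor_loop(config: Dict, stop_event: threading.Event) -> None:
    """Placeholder strategy: idle in rounds until asked to stop."""
    # Event.wait returns False on timeout and True once the event is set.
    while not stop_event.wait(config.get("period_seconds", 1.0)):
        pass  # a real strategy would perform one monitoring round here


# Hypothetical routing keys mapped to the strategy that handles them.
STRATEGIES: Dict[str, MonitorStrategy] = {
    "substrate.network": _monitor_loop,
    "dockerhub.repo": _monitor_loop,
}


def start_monitor(routing_key: str,
                  config: Dict) -> Tuple[threading.Thread, threading.Event]:
    """Start the strategy selected by routing_key in its own thread."""
    strategy = STRATEGIES[routing_key]  # KeyError for an unknown key
    stop_event = threading.Event()
    thread = threading.Thread(
        target=strategy,
        args=(config, stop_event),
        name=f"{routing_key}:{config['id']}",
        daemon=True,
    )
    thread.start()
    return thread, stop_event
```

Returning the `stop_event` alongside the thread gives the manager a handle for shutting the monitor down later, since Python threads cannot be killed from outside and have to stop cooperatively.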
Resources:
Requirements
We must create a single `MonitorsManager` running in a separate process that is able to do the following:
- `SubstrateNetworkMonitor` for each new configuration in a separate thread
- `SubstrateNetworkMonitor` for each updated configuration in a separate thread
- `SubstrateNetworkMonitor`
- `run_alerter.py` which are related to the `MonitorsManager`
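Reading these requirements together with the acceptance criteria, the manager has to work out which configurations are new, updated or removed and adjust its threads accordingly. A rough sketch of that bookkeeping is shown here; `diff_configs` and `MonitorThreads` are hypothetical names, configurations are assumed to be plain dictionaries keyed by an ID, and the monitor target is assumed to stop when its event is set.

```python
import threading
from typing import Callable, Dict, Set, Tuple

Configs = Dict[str, Dict]


def diff_configs(current: Configs,
                 received: Configs) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return the IDs of new, updated and removed configurations."""
    new = set(received) - set(current)
    removed = set(current) - set(received)
    updated = {key for key in set(current) & set(received)
               if current[key] != received[key]}
    return new, updated, removed


class MonitorThreads:
    """Hypothetical bookkeeping: one monitor thread per configuration."""

    def __init__(self, target: Callable[[Dict, threading.Event], None]):
        self._target = target
        self._configs: Configs = {}
        self._threads: Dict[str, Tuple[threading.Thread, threading.Event]] = {}

    def _start(self, config_id: str, config: Dict) -> None:
        stop_event = threading.Event()
        # Each thread gets its own copy of the config dictionary.
        thread = threading.Thread(target=self._target,
                                  args=(dict(config), stop_event),
                                  daemon=True)
        thread.start()
        self._threads[config_id] = (thread, stop_event)

    def _stop(self, config_id: str) -> None:
        thread, stop_event = self._threads.pop(config_id)
        stop_event.set()  # cooperative stop; the target must honour it
        thread.join()

    def apply(self, received: Configs) -> Tuple[Set[str], Set[str], Set[str]]:
        """Stop threads for removed/updated configs, start new/updated ones."""
        new, updated, removed = diff_configs(self._configs, received)
        for config_id in removed | updated:
            self._stop(config_id)
        for config_id in new | updated:
            self._start(config_id, received[config_id])
        self._configs = {key: dict(value) for key, value in received.items()}
        return new, updated, removed

    def running(self) -> Set[str]:
        """IDs whose monitor thread is currently alive."""
        return {key for key, (thread, _) in self._threads.items()
                if thread.is_alive()}
```

An updated configuration is handled here as a stop followed by a fresh start, which matches the "terminate and start a new one" wording of the acceptance criteria but is only one possible design.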
Some Notes:
- The `MonitorsManager`'s job is to detect which configurations are new, updated and removed in order to create, terminate and update monitor threads.
- The `MonitorsManager` must interact with the `MonitorStarters` class via the appropriate `MonitorStrategy`
- Although the `SubstrateNetworkMonitor` implementation should not be affected, we might require a lock to access shared objects such as the list of `SubstrateNodes` to be used as data sources. This needs further investigation. However, if we create new objects from dictionaries for a particular thread we might not need locks, because each object would be unique. The drawback of this approach is that whenever there is a config update/removal we need to restart a long-lived thread.

Blocked by
241
Acceptance criteria
Given: The `MonitorsManager` receives new substrate node configurations
Then: The `MonitorsManager` is able to start a new `SubstrateNetworkMonitor` in a separate thread

Given: The `MonitorsManager` receives updated substrate node configurations
Then: The `MonitorsManager` is able to terminate each thread associated with an updated config and start a new one with the updated configs

Given: The `MonitorsManager` receives removed substrate node configurations
Then: The `MonitorsManager` is able to terminate each thread associated with a removed configuration

Given: The `MonitorsManager` accesses shared memory
Then: It can do so without any race conditions / errors
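The last criterion relates to the two options raised in the notes: guarding shared objects with a lock, or giving each thread its own objects built from dictionaries. A minimal sketch of both is given here for comparison. `SharedDataSources` and `nodes_for_thread` are hypothetical names, and plain dictionaries stand in for the node objects.

```python
import copy
import threading
from typing import Dict, List


class SharedDataSources:
    """Option A (hypothetical): one shared node list guarded by a lock."""

    def __init__(self, nodes: List[Dict]):
        self._lock = threading.Lock()
        self._nodes = copy.deepcopy(nodes)

    def replace(self, nodes: List[Dict]) -> None:
        """Swap in a new node list, e.g. after a config update."""
        with self._lock:
            self._nodes = copy.deepcopy(nodes)

    def snapshot(self) -> List[Dict]:
        """Return a private copy so callers never touch the shared list."""
        with self._lock:
            return copy.deepcopy(self._nodes)


def nodes_for_thread(node_dicts: List[Dict]) -> List[Dict]:
    """Option B (hypothetical): build thread-private nodes from dictionaries.

    No lock is needed because nothing is shared, but the thread keeps
    using these nodes until it is restarted with a new configuration.
    """
    return copy.deepcopy(node_dicts)
```

Option A lets a long-lived thread pick up changes through `snapshot()` at the cost of lock contention, while option B avoids locks but, as noted above, requires restarting the thread whenever its configuration changes.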