This is a refactor to remedy the bot lifecycle problems experienced recently. It is based on three things:
Changing the messaging architecture to remove the codependency between forta-supervisor and forta-scanner, since it can cause confusing and unpredictable flows
Rethinking the components according to the single responsibility principle and separating them into new components so that the behavior and the relations are clearer
Achieving a more modular composition and increasing unit testability
Review intro
Dive into the architecture under the services/components directory:
lifecycle package: Bot manager manages lifecycle and calls the bot pool through a mediator (i.e. nats)
Bot manager: forta-supervisor component that reads the assigned bot list, detects down bot containers, manages containers, and publishes updates. agent.versions.latest is removed and only agent.status.running is used. There is a new agent.status.stopping message for about-to-be-stopped containers.
Bot pool: forta-scanner component that receives the updates and manages the list of bot clients. It does not communicate back to forta-supervisor anymore. This implementation is a copy and modification of the removed agent pool.
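The manager/pool split above can be sketched as follows. This is a minimal illustration only: the in-memory mediator, the BotConfig fields, and the handler wiring stand in for the real NATS transport and Forta types.

```go
package main

import "fmt"

// BotConfig is a simplified stand-in for the real assigned-bot config
// (field names here are illustrative, not the actual Forta types).
type BotConfig struct {
	ID    string
	Image string
}

// Mediator decouples the bot manager (publisher) from the bot pool
// (subscriber); in the real node this role is played by NATS.
type Mediator interface {
	Publish(subject string, bots []BotConfig)
	Subscribe(subject string, handler func([]BotConfig))
}

// inMemoryMediator is a toy mediator standing in for NATS.
type inMemoryMediator struct {
	handlers map[string][]func([]BotConfig)
}

func newInMemoryMediator() *inMemoryMediator {
	return &inMemoryMediator{handlers: make(map[string][]func([]BotConfig))}
}

func (m *inMemoryMediator) Subscribe(subject string, handler func([]BotConfig)) {
	m.handlers[subject] = append(m.handlers[subject], handler)
}

func (m *inMemoryMediator) Publish(subject string, bots []BotConfig) {
	for _, h := range m.handlers[subject] {
		h(bots)
	}
}

func main() {
	med := newInMemoryMediator()
	// The bot pool (forta-scanner side) listens for updates and never
	// publishes back to the supervisor.
	med.Subscribe("agent.status.running", func(bots []BotConfig) {
		fmt.Printf("pool received %d running bot(s)\n", len(bots))
	})
	// The bot manager (forta-supervisor side) publishes the latest list.
	med.Publish("agent.status.running", []BotConfig{{ID: "0xbot1"}, {ID: "0xbot2"}})
}
```

The one-way direction of Publish here is the point of the refactor: the pool only reacts to updates, so there is no codependent back-and-forth between the two services.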
botio package
Bot client: This replaces the removed poolagent.Agent, and the bot pool in the lifecycle package manages these, as before. It dials and initializes in Initialize() and deals with its own combiner subscriptions.
Request sender: This is the implementation that does SendEvaluateXXXRequest methods. It should look familiar.
registry package: This is not a separate service anymore; it is now called by the bot manager, in forta-supervisor.
containers package
Container definition: Moved the bot container definition and the label values here. The future plan is to move other supervisor container definitions here as well, to simplify the supervisor.
Bot client: This is a bot client that manages bot containers (not to be confused with the bot client in the botio package). It includes a condensed version of the bot container creation logic from the supervisor and some extra methods to manage containers easily.
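A sketch of the moved container definition; the struct fields and label keys below are hypothetical placeholders for the real values that were relocated.

```go
package main

import "fmt"

// ContainerDefinition is an illustrative sketch of a bot container
// definition; field and label names here are hypothetical.
type ContainerDefinition struct {
	Name   string
	Image  string
	Labels map[string]string
}

// NewBotContainerDefinition centralizes how a bot container and its
// identifying labels are built, so the supervisor no longer assembles
// this inline.
func NewBotContainerDefinition(botID, image string) ContainerDefinition {
	return ContainerDefinition{
		Name:  "forta-agent-" + botID,
		Image: image,
		Labels: map[string]string{
			// hypothetical label keys, for illustration only
			"network.forta":          "supervisor-managed",
			"network.forta.is-agent": "true",
		},
	}
}

func main() {
	def := NewBotContainerDefinition("0xbot1", "registry/image:1")
	fmt.Println(def.Name, def.Labels["network.forta.is-agent"])
}
```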
metrics package: I moved this to the service components as it is relevant to the services, and I added an implementation.
New lifecycle metrics: Added the "stopping", "initialized", and "restart" metrics, along with new failure metrics for pulls, bot container launches, dials, and initializations.
Lifecycle metric client: To make metric publishing easier in various parts of the code, I added a lifecycle metrics client that allows calls like bot.lifecycleMetrics.FailureDial(botConfig).
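The metrics client idea can be sketched as below. The interface methods and metric name strings are illustrative assumptions, not the exact names in the code; the real implementation would publish the metrics rather than record them in a slice.

```go
package main

import "fmt"

// BotConfig is a minimal stand-in for the real assigned-bot config.
type BotConfig struct{ ID string }

// Lifecycle sketches the lifecycle metrics client surface, so call
// sites can do metrics.FailureDial(bot) instead of assembling raw
// metric messages. Method names here are illustrative.
type Lifecycle interface {
	StatusStopping(bots ...BotConfig)
	FailurePull(bot BotConfig)
	FailureLaunch(bot BotConfig)
	FailureDial(bot BotConfig)
	FailureInitialize(bot BotConfig)
}

// lifecycleClient records (metric, bot) pairs for demonstration.
type lifecycleClient struct {
	recorded []string
}

func (lc *lifecycleClient) record(metric string, bots ...BotConfig) {
	for _, b := range bots {
		lc.recorded = append(lc.recorded, metric+":"+b.ID)
	}
}

// Metric name strings below are hypothetical.
func (lc *lifecycleClient) StatusStopping(bots ...BotConfig) { lc.record("agent.status.stopping", bots...) }
func (lc *lifecycleClient) FailurePull(bot BotConfig)        { lc.record("agent.failure.pull", bot) }
func (lc *lifecycleClient) FailureLaunch(bot BotConfig)      { lc.record("agent.failure.launch", bot) }
func (lc *lifecycleClient) FailureDial(bot BotConfig)        { lc.record("agent.failure.dial", bot) }
func (lc *lifecycleClient) FailureInitialize(bot BotConfig)  { lc.record("agent.failure.initialize", bot) }

func main() {
	var metrics Lifecycle = &lifecycleClient{}
	metrics.FailureDial(BotConfig{ID: "0xbot1"})
	fmt.Println(metrics.(*lifecycleClient).recorded)
}
```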
forta-supervisor
Now it runs the bot manager, which is in the lifecycle package, and calls botManager.ManageBots() and botManager.RestartExitedBots() every 15 seconds. This causes containers to be run, stopped, and restarted, and bot lists to be published. We keep publishing the latest list of running bots every 15s to enforce corrective behaviour.
forta-scanner
This runs the bot pool, which is in the lifecycle package, and connects to the bot manager through the mediator (nats) so the bot manager can call the bot pool with bot config update messages.
New bot clients are created for the new containers and added to the pool. Shut-down containers' clients are removed. Initialize() is safe to call multiple times, and the gRPC client is replaced each time in a thread-safe way, so we just call it again for the restarted containers. Added a TODO to retry initialization multiple times later, as we did in one of the previous PRs.
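The thread-safe client replacement can be sketched with a lock-guarded swap; the conn type stands in for a real gRPC connection and the dialing step is omitted (illustrative only).

```go
package main

import (
	"fmt"
	"sync"
)

// conn stands in for a gRPC client connection (illustrative).
type conn struct{ target string }

// botClient keeps the active connection behind a lock so Initialize()
// can be called again safely after a container restart.
type botClient struct {
	mu   sync.RWMutex
	conn *conn
}

// Initialize replaces the underlying connection in a thread-safe way;
// calling it multiple times is fine by design.
func (b *botClient) Initialize(target string) {
	newConn := &conn{target: target} // dialing would happen here
	b.mu.Lock()
	b.conn = newConn
	b.mu.Unlock()
}

// current returns the connection in use, guarded for concurrent readers.
func (b *botClient) current() *conn {
	b.mu.RLock()
	defer b.mu.RUnlock()
	return b.conn
}

func main() {
	bc := &botClient{}
	bc.Initialize("bot:8080")
	bc.Initialize("bot:8080") // restarted container: just initialize again
	fmt.Println(bc.current().target)
}
```

Readers take the read lock on every request, so in-flight calls never observe a half-replaced connection while a restarted container is being re-initialized.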
Example lifecycle cases
Case 1: Bot container is up, dial fails
Dialing is something to be retried, and it is already retried for a while. If it finally fails, then something must be wrong with the bot.
Case 2: Bot container is up, dial succeeds, initialization fails
Initialization is async now, so it doesn't lag or delay anything. We need to try initialization many times. If it finally fails, there is nothing the scanner can do, as something must be wrong with the bot.
Case 3: Bot container is up but then it goes down in a second
Dial will fail. The bot manager will detect the exited container, restart it, and publish a "restarted" message so the bot client can do an async initialize (dial + initialize()).
Case 4: Bot container was up but then it crashed after 10 minutes
Dial and initialize() succeeded when the container first launched. This is handled the same as case 3.
Case 5: Bot container calls are timing out
The container must be down and its network interface unavailable. If the bot container is down, we restart, redial, and reinitialize.
Case 6: Bot container is shut down because it is unassigned
The bot manager tells the bot pool that the bot is removed and waits for a few seconds to give the bot pool a chance to catch the message. If the bot pool cannot catch it and the container goes down, this only causes some error messages in the forta-scanner container; it does not cause any runtime failures.
Case 7: Bot container calls are causing "connection refused"
The bot container is up, but the connection is refused because the bot code threw an error: the gRPC server in the bot is not running anymore, yet the container did not exit. The bot must ensure that the container exits so this case can be handled like case 3 or 4.
Case 8: Bot settings are updated
There is a safe call to the bot client in the bot pool to quickly replace the config.
Case 9: Bot download is slow and it failed
We let that fail and retry after the next 15 seconds. The bot is excluded from the running bots.
Case 10: Bot download succeeded but container start is failing
This is different from case 3, as it is a technical problem with the container configuration. We handle this the same as case 9.