forta-network / forta-node

Scan Node software for the Forta Network
https://forta.org

Bot lifecycle management refactor #726

Closed canercidam closed 1 year ago

canercidam commented 1 year ago

Background

This is a refactor to remedy the bot lifecycle problems experienced recently. It is based on three things:

Review intro

Dive into the architecture under the services/components directory:

forta-supervisor

It now runs the bot manager, which lives in the lifecycle package, and calls botManager.ManageBots() and botManager.RestartExitedBots() every 15 seconds. These calls start, stop, and restart containers and publish the resulting bot lists. We keep publishing the latest list of running bots every 15s to enforce corrective behaviour.

forta-scanner

This runs the bot pool, which is also in the lifecycle package, and connects to the bot manager through the mediator (NATS) so that the bot manager can send bot config update messages to the bot pool.

New bot clients are created for new containers and added to the pool. Clients of shut-down containers are removed. Initialize is safe to call multiple times, and the gRPC client is replaced each time in a thread-safe way, so we just call initialize again for restarted containers. Added a TODO to retry initialize multiple times later, as we did in one of the earlier PRs.
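
The add, remove and re-initialize behaviour can be sketched roughly as follows. All names here (grpcConn, botClient, botPool and their methods) are hypothetical stand-ins, and no real dialing happens; the point is only the mutex-guarded swap of the connection.

```go
package main

import (
	"fmt"
	"sync"
)

// grpcConn stands in for a real gRPC client connection (hypothetical type).
type grpcConn struct{ generation int }

// botClient guards its connection with a mutex so it can be swapped while in use.
type botClient struct {
	mu   sync.RWMutex
	conn *grpcConn
}

// Initialize installs a fresh connection. Calling it again after a container
// restart replaces the previous connection.
func (c *botClient) Initialize() {
	c.mu.Lock()
	defer c.mu.Unlock()
	next := 1
	if c.conn != nil {
		next = c.conn.generation + 1
	}
	c.conn = &grpcConn{generation: next} // a real client would dial here
}

// Generation reports how many times the connection has been installed.
func (c *botClient) Generation() int {
	c.mu.RLock()
	defer c.mu.RUnlock()
	if c.conn == nil {
		return 0
	}
	return c.conn.generation
}

// botPool tracks one client per running bot container.
type botPool struct {
	mu      sync.Mutex
	clients map[string]*botClient
}

func newBotPool() *botPool { return &botPool{clients: make(map[string]*botClient)} }

// Add creates and initializes a client for a new container (no-op if present).
func (p *botPool) Add(botID string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if _, ok := p.clients[botID]; ok {
		return
	}
	c := &botClient{}
	c.Initialize()
	p.clients[botID] = c
}

// Remove drops the client of a container that was shut down.
func (p *botPool) Remove(botID string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	delete(p.clients, botID)
}

// Reinitialize calls Initialize again for a restarted container.
// It returns false if the pool has no client for botID.
func (p *botPool) Reinitialize(botID string) bool {
	p.mu.Lock()
	c, ok := p.clients[botID]
	p.mu.Unlock()
	if ok {
		c.Initialize()
	}
	return ok
}

func main() {
	pool := newBotPool()
	pool.Add("bot-1")
	pool.Reinitialize("bot-1") // container restarted
	fmt.Println("generation:", pool.clients["bot-1"].Generation())
	pool.Remove("bot-1")
	fmt.Println("still known:", pool.Reinitialize("bot-1"))
}
```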

Example lifecycle cases

Case 1: Bot container is up, dial fails

Dialing is already retried for a while. If it still fails in the end, then something must be wrong with the bot.

Case 2: Bot container is up, dial succeeds, initialization fails

Initialization is async now and does not delay anything. We need to try initialization many times. If it finally fails, then there is nothing the scanner can do, as something must be wrong with the bot.

Case 3: Bot container is up but then it goes down in a second

Dial will fail. The bot manager will detect the exited container, restart it, and publish a "restarted" message so the bot client can do an async initialize (dial + initialize()).

Case 4: Bot container was up but then it crashed after 10 minutes

Dial and initialize() succeeded on first launch. This is handled the same way as case 3.
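
Cases 3 and 4 share one path: detect the exited container, restart it, publish "restarted". A rough sketch of that path follows, with a hypothetical container type and hypothetical restart and publish callbacks in place of the real Docker client and mediator.

```go
package main

import "fmt"

// container is a hypothetical, simplified view of a bot container.
type container struct {
	BotID  string
	Exited bool
}

// restartExited restarts every exited container and calls publishRestarted for
// each one that came back, so the bot pool can redial and reinitialize.
// It returns the IDs of the bots that were restarted successfully.
func restartExited(containers []*container, restart func(*container) error, publishRestarted func(botID string)) []string {
	var restarted []string
	for _, c := range containers {
		if !c.Exited {
			continue
		}
		if err := restart(c); err != nil {
			continue // still exited; the next periodic pass tries again
		}
		c.Exited = false
		publishRestarted(c.BotID)
		restarted = append(restarted, c.BotID)
	}
	return restarted
}

func main() {
	containers := []*container{{BotID: "bot-a"}, {BotID: "bot-b", Exited: true}}
	restart := func(*container) error { return nil }
	publish := func(botID string) { fmt.Println("restarted:", botID) }
	fmt.Println(restartExited(containers, restart, publish))
}
```

It makes no difference whether the container exited after one second or after ten minutes; the same pass picks it up.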

Case 5: Bot container calls are timing out

This suggests that the container is down and its network interface is unavailable. If the bot container is down, we restart, redial and reinitialize.

Case 6: Bot container is shut down because it is unassigned

The bot manager tells the bot pool that the bot is removed and waits a few seconds to give the bot pool a chance to catch the message. If the bot pool misses the message and the container goes down, this only causes some error messages in the forta-scanner container and no runtime failures.

Case 7: Bot container calls are causing "connection refused"

The bot container is up, but the connection is refused because the bot code threw an error: the gRPC server in the bot is no longer running, yet the container did not exit. The bot must ensure that the container exits so this case can be handled like case 3 or 4.

Case 8: Bot settings are updated

There is a safe call to the bot client in the bot pool to quickly replace the config.
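
One way such a safe replacement could look is an atomic swap of the whole config value. This is a sketch using sync/atomic from the standard library; the botConfig fields and the method names are hypothetical and not taken from the PR.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// botConfig is a hypothetical subset of a bot's settings.
type botConfig struct {
	Image   string
	ChainID int
}

// botClient keeps its config in an atomic.Value so readers never observe a
// half-updated config while the pool replaces it.
type botClient struct {
	cfg atomic.Value // always holds a botConfig
}

func newBotClient(cfg botConfig) *botClient {
	c := &botClient{}
	c.cfg.Store(cfg)
	return c
}

// SetConfig replaces the whole config in one atomic step.
func (c *botClient) SetConfig(cfg botConfig) { c.cfg.Store(cfg) }

// Config returns a copy of the current config.
func (c *botClient) Config() botConfig { return c.cfg.Load().(botConfig) }

func main() {
	client := newBotClient(botConfig{Image: "bot:v1", ChainID: 1})
	client.SetConfig(botConfig{Image: "bot:v1", ChainID: 137})
	fmt.Printf("%+v\n", client.Config())
}
```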

Case 9: Bot download is slow and it failed

We let it fail and retry on the next 15-second pass. Until then, the bot is excluded from the running bots.

Case 10: Bot download succeeded but container start is failing

This is different from case 3, as it is a technical problem with the container configuration. We handle it the same way as case 9.