Open · kevinlul opened this issue 2 years ago
There should now be RAM capacity for this.
The signal mechanism is insufficient in the case of container restarts, since there will be nothing to signal the EventLocker off. A timeout measured from boot should be added instead.
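A minimal sketch of the timeout-from-boot idea; the `EventLocker` interface (a `disable()` method) and the window length are assumptions, since the issue does not specify them:

```ts
// Hypothetical sketch: turn the EventLocker off a fixed time after boot, so a
// restarted container does not keep locking forever when no deployment system
// ever sends the shutoff signal.
const LOCK_WINDOW_MS = 10 * 60 * 1000; // assumed upper bound on a deployment window

export function scheduleLockerShutoff(locker: { disable(): void }): void {
    const timer = setTimeout(() => {
        console.log("Deployment window elapsed, disabling EventLocker");
        locker.disable();
    }, LOCK_WINDOW_MS);
    // Do not keep the process alive just for this timer.
    timer.unref();
}
```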
https://github.com/Wowu/docker-rollout is a new, very well-done script. It can be incorporated to start the new container before removing the old one under Compose, without switching to a single-node Swarm. Installation should be version-pinned with a checksum, like docker-stack-wait for Swarm.
(Optional) Add a third CLI parameter to specify the lock database location. Its presence enables EventLocker. This allows specifying a separate tmpfs, so there is never a disk write. To share the lock database between Docker containers, we can't use Docker's tmpfs volume type, as it can't be shared; instead, bind mount the host's /dev/shm.
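A rough sketch of the parameter handling, assuming better-sqlite3 as the SQLite driver and that the existing parameters occupy argv[2] and argv[3] (both assumptions, not confirmed by the issue):

```ts
import Database from "better-sqlite3";

// Hypothetical sketch: a third CLI argument selects the lock database location
// and enables the EventLocker. Pointing it at a path under the bind-mounted
// /dev/shm keeps the lock database entirely in memory, shared by both containers.
const lockDatabasePath = process.argv[4]; // argv[2..3] assumed to be the existing parameters

export const lockDatabase = lockDatabasePath ? new Database(lockDatabasePath) : null;
lockDatabase?.pragma("journal_mode = WAL");
```

The corresponding bind mount of the host's /dev/shm into both containers would then be declared in the Compose file.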
To properly assess the state of the containers, a healthcheck is needed. Create an HTTP healthcheck using the built-in node:http module, responding 200 OK if and only if all Discord bot shards are ready. This HTTP server may listen on a TCP port or a Unix socket in /run or /tmp. The Dockerfile should have a HEALTHCHECK for this endpoint. The additional timeout EventLocker shutoff mechanism should only start once all shards are ready, if possible.
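A minimal sketch of that endpoint using node:http and the discord.js client the bot already has; the function name is hypothetical:

```ts
import { createServer } from "node:http";
import { Client, Status } from "discord.js";

// Readiness endpoint: 200 only when every shard reports Ready, 503 otherwise.
// Listening on a Unix socket under /run avoids publishing a TCP port, but a
// plain port number works the same way.
export function startHealthcheckServer(client: Client, listenOn: string | number): void {
    const server = createServer((_request, response) => {
        const allShardsReady =
            client.ws.shards.size > 0 &&
            client.ws.shards.every(shard => shard.status === Status.Ready);
        response.writeHead(allShardsReady ? 200 : 503).end();
    });
    server.listen(listenOn); // e.g. "/run/bastion/healthcheck.sock" or 3000
}
```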
Alternate concept based on https://github.com/meister03/discord-hybrid-sharding:
Add a CLI parameter that starts the bot in standby, with the current timestamp as its value. When the bot is started in standby, only a base set of event listeners is registered (warn, error, shard*, ready): https://github.com/DawnbrandBots/bastion-bot/blob/master/src/bot.ts
Since the bot program starts in standby, the new instance will not handle events while the old instance is still up. Once the ready event is emitted, the deployment system can detect this and simultaneously signal the new bot to become active while shutting down the old bot. The new bot becomes active by registering all remaining event listeners (guildCreate, guildDelete, and the listeners array containing interaction, messageCreate, messageDelete). At this point, the takeover and deployment are complete.
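A rough sketch of the two-phase registration; the class name is hypothetical, the real listeners live in src/bot.ts, and console.log stands in for the actual handlers:

```ts
import { Client } from "discord.js";

// Hypothetical sketch of two-phase listener registration.
export class BotLifecycle {
    private active = false;

    constructor(private readonly client: Client) {
        // Base set registered even in standby: warn, error, shard*, ready.
        client.on("warn", console.warn);
        client.on("error", console.error);
        client.on("shardReady", id => console.log(`Shard ${id} ready`));
        client.once("ready", () => console.log("All shards ready, awaiting activation"));
    }

    // Called when the deployment system signals the takeover, or immediately
    // at boot if the bot decides it should not be in standby.
    activate(): void {
        if (this.active) {
            return;
        }
        this.active = true;
        // Remaining listeners that actually respond to users, so the old and
        // new instances never both answer the same event.
        this.client.on("guildCreate", guild => console.log(`Joined ${guild.id}`));
        this.client.on("guildDelete", guild => console.log(`Left ${guild.id}`));
        this.client.on("interactionCreate", interaction => console.log(interaction.id));
        this.client.on("messageCreate", message => console.log(message.id));
        this.client.on("messageDelete", message => console.log(message.id));
    }
}
```

The activation signal itself is not prescribed by this concept; a Unix signal such as SIGUSR2 handled with process.on would be one option.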
Should the bot process crash or otherwise be restarted, it can compare its start time with the timestamp in the CLI parameter. If the delay is too large, it knows it has been restarted and can start in active mode instead of hanging in standby when there is no deployment system.
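A sketch of that check; the `--standby` flag name, the millisecond unit, and the grace period are assumptions for illustration:

```ts
// Hypothetical sketch: the standby parameter's value is the deployment
// timestamp, assumed here to be Unix milliseconds passed as "--standby <ms>".
// If the process starts long after that timestamp, no deployment is in
// progress (e.g. the container was simply restarted), so come up active.
const STANDBY_GRACE_MS = 5 * 60 * 1000; // assumed upper bound on a deployment window

const standbyIndex = process.argv.indexOf("--standby");
const standbyTimestamp = standbyIndex >= 0 ? Number(process.argv[standbyIndex + 1]) : NaN;

export const startInStandby =
    Number.isFinite(standbyTimestamp) &&
    Date.now() - standbyTimestamp < STANDBY_GRACE_MS;

console.log(startInStandby ? "Starting in standby" : "Starting active");
```

If startInStandby is false, the bot would call its activation routine immediately at boot.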
A new release to the live Bastion could cause up to a minute of downtime, rounding up, due to stopping the bot process and then starting a new process, which reconnects shard-by-shard to the Discord gateway to avoid rate limits. To achieve zero downtime, deployments need to start the new process first, ensure that no duplicate responses happen while two processes are running during the deployment window, then stop the old process once the new process has connected all shards to the gateway. Two things must be implemented for this to happen: container start before stop and a lock manager.
Container start before stop
In Swarm, this is configured as `deploy.update_config.order: start-first` (docs). This is not supported by Compose v1 or v2. Therefore, to use this, production must be switched to a single-node Swarm. (We are not yet at the scale where additional benefits are reaped from separating shards into their own processes.) A pure Compose solution could be to use a different project name for each deployment, as long as previous project names are tracked so the old stack can be taken down.
Lock manager
Since Bastion is not yet at the scale where sharding across multiple hosts is needed, the fastest solution should be SQLite in write-ahead-log (WAL) mode. Before processing a message or interaction, attempt to INSERT its snowflake into a table. Continue only if this succeeds, as we hold the lock; if not, a different process has taken the lock. In the general case, this kind of overhead could also help with Discord's eventual consistency (receiving an event N times), though this has never been a problem in practice. The overhead could also be limited to the deployment window, being toggled upon receiving a certain Unix signal.
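A sketch of the lock manager, again assuming better-sqlite3; the table name, database path, and function name are hypothetical:

```ts
import Database from "better-sqlite3";

// Hypothetical sketch of the lock manager: one row per snowflake, and the
// process that wins the INSERT is the one allowed to handle the event.
const db = new Database("/dev/shm/bastion-locks.db"); // assumed bind-mounted tmpfs path
db.pragma("journal_mode = WAL");
db.exec("CREATE TABLE IF NOT EXISTS event_locks (snowflake TEXT PRIMARY KEY)");

const insertLock = db.prepare("INSERT INTO event_locks (snowflake) VALUES (?)");

export function acquireLock(snowflake: string): boolean {
    try {
        insertLock.run(snowflake);
        return true; // we inserted first, so we hold the lock
    } catch {
        // Primary key constraint violation: another process already took it.
        return false;
    }
}

// Before handling a message or interaction:
// if (!acquireLock(interaction.id)) { return; }
```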
Caveats
VM memory demands increase, since the host must support two bot Node.js containers running during the deployment window. The addition of <> card search has already increased memory demands.
In general (not just the zero-downtime case), how button timeouts behave across a redeployment or a bot restart should be considered.