[bug | high priority] READY vs GUILD_CREATE race condition makes guild cache forever unavailable

curz46 commented 4 years ago

Description

Alchemy's event module is asynchronous, so two events received at different times can have Alchemy's internal handlers for them executed simultaneously. The guild cache is actually a group of GenServers, exactly one for each guild, storing the state of its guild. A problem arises when you consider how the cache is started:

READY's handler spawns a GenServer for each guild cache, setting it to: %{"id" => "...", "unavailable" => true}
GUILD_CREATE's handler checks if the GenServer for the relevant guild is running. If it is, then it replaces the cache with the incoming data. If it isn't, then it starts the GenServer with the new data.

Currently, it is very often the case that READY will finish starting all of the guilds' GenServers after GUILD_CREATE does, since they run simultaneously. This means that GUILD_CREATE causes the server to be started since it is not already, then filled with the guild's data. READY then completes its execution, starting the guilds' GenServers without checking if they are already running. This restarts the GenServers and sets them all as unavailable. There is no check in place to see if guilds are still unavailable after X seconds so it remains like this in guild cache forever, with updates to the cache through events just bouncing off of it since there is no structure to it.

Why not just change `READY`'s handler to only start the guild cache if it is not already started?

Consider this timeline of events:

1. READY is received and forwarded to event handlers async.
2. GUILD_CREATE is received and forwarded to event handlers async.
3. READY's event handler begins execution
4. GUILD_CREATE's event handler begins execution
5. READY's event handler checks if the GenServer is already running. It isn't
6. GUILD_CREATE's event handler starts the server with the new data, finishing execution
7. READY's event handler's check succeeded, so the GenServer is restarted as unavailable
8. The problem persists. The guild cache is now set as an unavailable placeholder forever

Proposed Solutions

Assume that READY will always be received first (probably a fair assumption). Block the event dispatcher when READY is received by executing the handler synchronously. (Impl. here) (Credit)
Rewrite the guild cache to work with a single GenServer, using one big map for all guilds. This has performance implications, because all operations on the guild cache are on a single process. Potentially significant when a bot is in a large number of guilds.
Start the guild cache GenServer only in response to a GUILD_CREATE event. Both guilds the bot is not in and guilds that are unavailable (whether that is at launch or otherwise) do not have GenServers running. Determining whether or not the bot user is in a guild can be done through a list of guild snowflakes managed independently of the guild cache.

Dylan

OvermindDL1 commented 4 years ago

I've not looked at the structure, but perhaps the guild cache would work well in ETS instead? That way it would it could work in parallel and be faster than just about any other case as well.

cronokirby commented 4 years ago

I think we should go ahead and integrate 1 as a quick fix to this bug, and then work on moving the cache to ETS or a single process to avoid resorting to workarounds like this.

cronokirby / alchemy