cronokirby / alchemy

A discord library for Elixir
MIT License
151 stars 34 forks source link

[bug | high priority] READY vs GUILD_CREATE race condition makes guild cache forever unavailable #79

Closed curz46 closed 4 years ago

curz46 commented 4 years ago

Description

Alchemy's event module is asynchronous, so two events received at different times can have Alchemy's internal handlers for them executed simultaneously. The guild cache is actually a group of GenServers, exactly one for each guild, storing the state of its guild. A problem arises when you consider how the cache is started:

Currently, it is very often the case that READY will finish starting all of the guilds' GenServers after GUILD_CREATE does, since they run simultaneously. This means that GUILD_CREATE causes the server to be started since it is not already, then filled with the guild's data. READY then completes its execution, starting the guilds' GenServers without checking if they are already running. This restarts the GenServers and sets them all as unavailable. There is no check in place to see if guilds are still unavailable after X seconds so it remains like this in guild cache forever, with updates to the cache through events just bouncing off of it since there is no structure to it.

Why not just change READY's handler to only start the guild cache if it is not already started?

Consider this timeline of events:

1. READY is received and forwarded to event handlers async.
2. GUILD_CREATE is received and forwarded to event handlers async.
3. READY's event handler begins execution
4. GUILD_CREATE's event handler begins execution
5. READY's event handler checks if the GenServer is already running. It isn't
6. GUILD_CREATE's event handler starts the server with the new data, finishing execution
7. READY's event handler's check succeeded, so the GenServer is restarted as unavailable
8. The problem persists. The guild cache is now set as an unavailable placeholder forever

Proposed Solutions

  1. Assume that READY will always be received first (probably a fair assumption). Block the event dispatcher when READY is received by executing the handler synchronously. (Impl. here) (Credit)
  2. Rewrite the guild cache to work with a single GenServer, using one big map for all guilds. This has performance implications, because all operations on the guild cache are on a single process. Potentially significant when a bot is in a large number of guilds.
  3. Start the guild cache GenServer only in response to a GUILD_CREATE event. Both guilds the bot is not in and guilds that are unavailable (whether that is at launch or otherwise) do not have GenServers running. Determining whether or not the bot user is in a guild can be done through a list of guild snowflakes managed independently of the guild cache.

Dylan

OvermindDL1 commented 4 years ago

I've not looked at the structure, but perhaps the guild cache would work well in ETS instead? That way it would it could work in parallel and be faster than just about any other case as well.

cronokirby commented 4 years ago

I think we should go ahead and integrate 1 as a quick fix to this bug, and then work on moving the cache to ETS or a single process to avoid resorting to workarounds like this.