Member cache not being filled

Rapptz / discord.py

An API wrapper for Discord written in Python.

http://discordpy.rtfd.org/en/latest

MIT License

14.75k stars 3.75k forks source link

Member cache not being filled #9571

Closed mniip closed 9 months ago

mniip commented 1 year ago

Summary

Member cache broken if a disconnect happens in the middle of guild chunking.

Reproduction Steps

Have a non-resume reconnect, and while the bot is loading the members of a large guild, trigger another non-resume reconnect.

Minimal Reproducible Code

No response

Expected Results

Bot starts chunking the guild again so that eventually it has its members cached.

Actual Results

The guild is permanently stuck in a broken state in which the bot doesn't try to chunk its members upon any successive reconnect.

Intents

All except presence

System Information

- Python v3.9.1-final
- discord.py v2.3.2-final
- aiohttp v3.8.1
- system info: Linux 4.19.0-20-cloud-amd64 #1 SMP Debian 4.19.235-1 (2022-03-17)

Checklist

[X] I have searched the open issues for duplicates.
[X] I have shown the entire traceback, if possible.
[X] I have removed my token from display, if visible.

Additional Context

I'm doing CPU and memory-intensive work after on_ready, which can cause the bot to lag so much it disconnects (due to swapping), and also in general amplifies any timing issues.

mniip commented 1 year ago

Got some (reasonably redacted) logs from when it happens: log.txt There are two clients in this process, the second client doesn't have the guilds intent, so pay no attention to it.

After a non-resume reconnect at 2023-09-21T15:16:35.561 the library starts chunking a guild. This takes a while so at 2023-09-21T15:17:03.056 it gives up, though we still receive a few. Of note is that the chunk we receive last, at 2023-09-21T15:18:45.620, is not the final chunk: 'chunk_index': 167, 'chunk_count': 170.

Then another non-resume reconnect happens at 2023-09-21T15:18:49.405. And we receive the GUILD_CREATE at 2023-09-21T15:18:57.318. However (ignore the ready at 2023-09-21T15:18:59.779, it's from that second client), we do not receive any GUILD_MEMBERS_CHUNK and the library times out at 2023-09-21T15:19:16.906.

I did not have enable_debug_events so it didn't log whether it send a chunk request or not, but I have a strong suspicion that it didn't.

After a cursory reading of ConnectionState, it seems to me that _chunk_requests can have stale guilds in it if a disconnect happens in the middle of chunking, and such a stale guild will cause a fresh chunk request to never be sent at any future reconnect.

mniip commented 1 year ago

Minimal reproducer:

import asyncio
import logging
import discord
import json
import time

discord.utils.setup_logging(level=logging.INFO)

intents = discord.Intents.all()
intents.presences = False
client = discord.Client(intents=intents, enable_debug_events=True)

async def trigger_reconnect():
    with open("/dev/null", "w") as null:
        proc = await asyncio.create_subprocess_exec("sudo", "tcpkill", "port 443", stdin=null, stdout=null, stderr=null)
        try:
            await asyncio.wait_for(proc.wait(), timeout=5)
        except asyncio.TimeoutError:
            proc.terminate()

first = True
@client.event
async def on_socket_raw_receive(msg):
    global first
    data = json.loads(msg)
    if data["t"] == "GUILD_MEMBERS_CHUNK":
        logging.info(f"Chunk: {data['d']['chunk_index']} / {data['d']['chunk_count']}")

        if first:
            first = False
            await trigger_reconnect()

@client.event
async def on_socket_raw_send(msg):
    data = json.loads(msg)
    if data["op"] == discord.gateway.DiscordWebSocket.REQUEST_MEMBERS:
        logging.info(f"Requesting chunks for {data['d']['guild_id']}")

second = True
@client.event
async def on_guild_available(guild):
    global second
    logging.info(f"Guild available {guild.id}: {len(guild.members)} / {guild.member_count}")

    if second:
        second = False
        await trigger_reconnect()

client.run("...", log_handler=None)

bijij commented 10 months ago

Not much thought into race conditions or what not, but can we get away with just clearing all the chunk requests for a given shard on disconnect?

diff --git a/discord/shard.py b/discord/shard.py
index 5495511a..30404abb 100644
--- a/discord/shard.py
+++ b/discord/shard.py
@@ -132,6 +132,7 @@ class Shard:
     async def _handle_disconnect(self, e: Exception) -> None:
         self._dispatch('disconnect')
         self._dispatch('shard_disconnect', self.id)
+        self._client._connection.clear_chunk_requests(self.id)
         if not self._reconnect:
             self._queue_put(EventItem(EventType.close, self, e))
             return
diff --git a/discord/state.py b/discord/state.py
index 22b0e74e..ed50b50f 100644
--- a/discord/state.py
+++ b/discord/state.py
@@ -110,12 +110,14 @@ class ChunkRequest:
     def __init__(
         self,
         guild_id: int,
+        shard_id: int,
         loop: asyncio.AbstractEventLoop,
         resolver: Callable[[int], Any],
         *,
         cache: bool = True,
     ) -> None:
         self.guild_id: int = guild_id
+        self.shard_id: int = shard_id
         self.resolver: Callable[[int], Any] = resolver
         self.loop: asyncio.AbstractEventLoop = loop
         self.cache: bool = cache
@@ -315,6 +317,16 @@ class ConnectionState(Generic[ClientT]):
         for key in removed:
             del self._chunk_requests[key]

+    def clear_chunk_requests(self, shard_id: int | None) -> None:
+        removed = []
+        for key, request in self._chunk_requests.items():
+            if shard_id is None or request.shard_id == shard_id:
+                request.done()
+                removed.append(key)
+
+        for key in removed:
+            del self._chunk_requests[key]
+
     def call_handlers(self, key: str, *args: Any, **kwargs: Any) -> None:
         try:
             func = self.handlers[key]
@@ -535,7 +547,7 @@ class ConnectionState(Generic[ClientT]):
         if ws is None:
             raise RuntimeError('Somehow do not have a websocket for this guild_id')

-        request = ChunkRequest(guild.id, self.loop, self._get_guild, cache=cache)
+        request = ChunkRequest(guild.id, guild.shard_id, self.loop, self._get_guild, cache=cache)
         self._chunk_requests[request.nonce] = request

         try:
@@ -1204,7 +1216,13 @@ class ConnectionState(Generic[ClientT]):
         cache = cache or self.member_cache_flags.joined
         request = self._chunk_requests.get(guild.id)
         if request is None:
-            self._chunk_requests[guild.id] = request = ChunkRequest(guild.id, self.loop, self._get_guild, cache=cache)
+            self._chunk_requests[guild.id] = request = ChunkRequest(
+                guild.id,
+                guild.shard_id,
+                self.loop,
+                self._get_guild,
+                cache=cache,
+            )
             await self.chunker(guild.id, nonce=request.nonce)

         if wait:

bijij commented 9 months ago

@Rapptz unless you have a better idea I'll PR this tomorrow. My bot got caught by it today so it's fresh in my mind.

mniip commented 9 months ago

My thought was to simply clear ConnectionState._chunk_requests right next to where ConnectionState._ready_state is cleared in ConnectionState.parse_ready.

I suppose if you call Guild.query_members your async threads will be stuck on dead futures, so you do have to fullfill them (as you do in your code).

However putting this in Shard._handle_disconnect seems problematic because at that state we could still have a RESUME, which might fullfill the chunk requests.

mniip commented 9 months ago

Thought: what about calling Future.set_exception on ChunkRequest.waiters?

mniip commented 9 months ago

Fixed in d25b574b3240e4490af6dd694e1d0b9ac30cff3e