Closed mniip closed 9 months ago
Got some (reasonably redacted) logs from when it happens: log.txt There are two clients in this process, the second client doesn't have the guilds intent, so pay no attention to it.
After a non-resume reconnect at 2023-09-21T15:16:35.561
the library starts chunking a guild. This takes a while so at 2023-09-21T15:17:03.056
it gives up, though we still receive a few. Of note is that the chunk we receive last, at 2023-09-21T15:18:45.620
, is not the final chunk: 'chunk_index': 167, 'chunk_count': 170
.
Then another non-resume reconnect happens at 2023-09-21T15:18:49.405
. And we receive the GUILD_CREATE
at 2023-09-21T15:18:57.318
. However (ignore the ready
at 2023-09-21T15:18:59.779
, it's from that second client), we do not receive any GUILD_MEMBERS_CHUNK
and the library times out at 2023-09-21T15:19:16.906
.
I did not have enable_debug_events
so it didn't log whether it send a chunk request or not, but I have a strong suspicion that it didn't.
After a cursory reading of ConnectionState
, it seems to me that _chunk_requests
can have stale guilds in it if a disconnect happens in the middle of chunking, and such a stale guild will cause a fresh chunk request to never be sent at any future reconnect.
Minimal reproducer:
import asyncio
import logging
import discord
import json
import time
discord.utils.setup_logging(level=logging.INFO)
intents = discord.Intents.all()
intents.presences = False
client = discord.Client(intents=intents, enable_debug_events=True)
async def trigger_reconnect():
with open("/dev/null", "w") as null:
proc = await asyncio.create_subprocess_exec("sudo", "tcpkill", "port 443", stdin=null, stdout=null, stderr=null)
try:
await asyncio.wait_for(proc.wait(), timeout=5)
except asyncio.TimeoutError:
proc.terminate()
first = True
@client.event
async def on_socket_raw_receive(msg):
global first
data = json.loads(msg)
if data["t"] == "GUILD_MEMBERS_CHUNK":
logging.info(f"Chunk: {data['d']['chunk_index']} / {data['d']['chunk_count']}")
if first:
first = False
await trigger_reconnect()
@client.event
async def on_socket_raw_send(msg):
data = json.loads(msg)
if data["op"] == discord.gateway.DiscordWebSocket.REQUEST_MEMBERS:
logging.info(f"Requesting chunks for {data['d']['guild_id']}")
second = True
@client.event
async def on_guild_available(guild):
global second
logging.info(f"Guild available {guild.id}: {len(guild.members)} / {guild.member_count}")
if second:
second = False
await trigger_reconnect()
client.run("...", log_handler=None)
Not much thought into race conditions or what not, but can we get away with just clearing all the chunk requests for a given shard on disconnect?
diff --git a/discord/shard.py b/discord/shard.py
index 5495511a..30404abb 100644
--- a/discord/shard.py
+++ b/discord/shard.py
@@ -132,6 +132,7 @@ class Shard:
async def _handle_disconnect(self, e: Exception) -> None:
self._dispatch('disconnect')
self._dispatch('shard_disconnect', self.id)
+ self._client._connection.clear_chunk_requests(self.id)
if not self._reconnect:
self._queue_put(EventItem(EventType.close, self, e))
return
diff --git a/discord/state.py b/discord/state.py
index 22b0e74e..ed50b50f 100644
--- a/discord/state.py
+++ b/discord/state.py
@@ -110,12 +110,14 @@ class ChunkRequest:
def __init__(
self,
guild_id: int,
+ shard_id: int,
loop: asyncio.AbstractEventLoop,
resolver: Callable[[int], Any],
*,
cache: bool = True,
) -> None:
self.guild_id: int = guild_id
+ self.shard_id: int = shard_id
self.resolver: Callable[[int], Any] = resolver
self.loop: asyncio.AbstractEventLoop = loop
self.cache: bool = cache
@@ -315,6 +317,16 @@ class ConnectionState(Generic[ClientT]):
for key in removed:
del self._chunk_requests[key]
+ def clear_chunk_requests(self, shard_id: int | None) -> None:
+ removed = []
+ for key, request in self._chunk_requests.items():
+ if shard_id is None or request.shard_id == shard_id:
+ request.done()
+ removed.append(key)
+
+ for key in removed:
+ del self._chunk_requests[key]
+
def call_handlers(self, key: str, *args: Any, **kwargs: Any) -> None:
try:
func = self.handlers[key]
@@ -535,7 +547,7 @@ class ConnectionState(Generic[ClientT]):
if ws is None:
raise RuntimeError('Somehow do not have a websocket for this guild_id')
- request = ChunkRequest(guild.id, self.loop, self._get_guild, cache=cache)
+ request = ChunkRequest(guild.id, guild.shard_id, self.loop, self._get_guild, cache=cache)
self._chunk_requests[request.nonce] = request
try:
@@ -1204,7 +1216,13 @@ class ConnectionState(Generic[ClientT]):
cache = cache or self.member_cache_flags.joined
request = self._chunk_requests.get(guild.id)
if request is None:
- self._chunk_requests[guild.id] = request = ChunkRequest(guild.id, self.loop, self._get_guild, cache=cache)
+ self._chunk_requests[guild.id] = request = ChunkRequest(
+ guild.id,
+ guild.shard_id,
+ self.loop,
+ self._get_guild,
+ cache=cache,
+ )
await self.chunker(guild.id, nonce=request.nonce)
if wait:
@Rapptz unless you have a better idea I'll PR this tomorrow. My bot got caught by it today so it's fresh in my mind.
My thought was to simply clear ConnectionState._chunk_requests
right next to where ConnectionState._ready_state
is cleared in ConnectionState.parse_ready
.
I suppose if you call Guild.query_members
your async threads will be stuck on dead futures, so you do have to fullfill them (as you do in your code).
However putting this in Shard._handle_disconnect
seems problematic because at that state we could still have a RESUME
, which might fullfill the chunk requests.
Thought: what about calling Future.set_exception
on ChunkRequest.waiters
?
Fixed in d25b574b3240e4490af6dd694e1d0b9ac30cff3e
Summary
Member cache broken if a disconnect happens in the middle of guild chunking.
Reproduction Steps
Have a non-resume reconnect, and while the bot is loading the members of a large guild, trigger another non-resume reconnect.
Minimal Reproducible Code
No response
Expected Results
Bot starts chunking the guild again so that eventually it has its members cached.
Actual Results
The guild is permanently stuck in a broken state in which the bot doesn't try to chunk its members upon any successive reconnect.
Intents
All except presence
System Information
Checklist
Additional Context
I'm doing CPU and memory-intensive work after
on_ready
, which can cause the bot to lag so much it disconnects (due to swapping), and also in general amplifies any timing issues.