Closed Kaisarion closed 1 year ago
There were some WebSocket fixes in our development version. Can you try and see if you can reproduce in the development release?
I'm still able to reproduce this on Development. Exact same behavior. Only thing that stood out to me in stable logs was this about 5-6 times:
[WS => Shard 0] [HeartbeatTimer] Didn't receive a heartbeat ack last time, assuming zombie connection. Destroying and reconnecting.
Status : 0
Sequence : 6049
Connection State: OPEN
[WS => Shard 0] [DESTROY]
Close Code : 4009
Reset : true
Emit DESTROYED: true
[WS => Shard 0] Clearing the heartbeat interval.
[WS => Shard 0] [WebSocket] Destroy: Attempting to close the WebSocket. | WS State: OPEN
[WS => Shard 0] [WebSocket] Close: Tried closing. | WS State: CLOSING
[WS => Shard 0] [WebSocket] Adding a WebSocket close timeout to ensure a correct WS reconnect.
Timeout: 5000ms
[WS => Shard 0] [WebSocket] Clearing the close timeout.
[WS => Shard 0] [CLOSE]
Event Code: 1006
Clean : false
Reason :
[WS => Shard 0] [CONNECT]
Gateway : wss://gateway.discord.gg/
Version : 10
Encoding : json
Compression: none
[WS => Shard 0] Setting a HELLO timeout for 20s.
[WS => Shard 0] [CONNECTED] Took 222ms
[WS => Shard 0] Clearing the HELLO timeout.
Heartbeat was never acknowledged.
Looks like shard 0 doesn’t get any replies on its websocket connection once shard 1 starts its websocket. Could be an issue with your docker configuration. What do the debug logs of shard 1 look like? Any double acknowledgments there or anything else indicating that shard 1 receives ws messages meant for shard 0?
Doesn't look like there's a problem. Here's a few thousand lines of websockets, none of which show something out of the ordinary. https://gist.github.com/Kaisarion/0bde85c7f0f7aa4a62323d1857a4265a
I tried without docker to no avail. Happy to shoot over my docker-compose / Dockerfile
That clearly shows that shard 0 is fine sending and receiving heartbeats but will stop receiving them when shard 1 starts to send one. So that definitely looks like some kind of routing issue. Do you use process or worker mode in ShardingManager?
Manager itself does not appear to be either. See below.
import { Shard, ShardingManager } from 'discord.js';
import { createRequire } from 'node:module';

import { JobService, Logger } from '../services/index.js';

const require = createRequire(import.meta.url);
let Config = require('../../config/config.json');
let Debug = require('../../config/debug.json');
let Logs = require('../../lang/logs.json');

export class Manager {
    constructor(private shardManager: ShardingManager, private jobService: JobService) {}

    public async start(): Promise<void> {
        this.registerListeners();

        let shardList = this.shardManager.shardList as number[];

        try {
            Logger.info(
                Logs.info.managerSpawningShards
                    .replaceAll('{SHARD_COUNT}', shardList.length.toLocaleString())
                    .replaceAll('{SHARD_LIST}', shardList.join(', '))
            );
            await this.shardManager.spawn({
                amount: this.shardManager.totalShards,
                delay: Config.sharding.spawnDelay * 1000,
                timeout: Config.sharding.spawnTimeout * 1000,
            });
            Logger.info(Logs.info.managerAllShardsSpawned);
        } catch (error) {
            Logger.error(Logs.error.managerSpawningShards, error);
            return;
        }

        if (Debug.dummyMode.enabled) {
            return;
        }

        this.jobService.start();
    }

    private registerListeners(): void {
        this.shardManager.on('shardCreate', shard => this.onShardCreate(shard));
    }

    private onShardCreate(shard: Shard): void {
        Logger.info(Logs.info.managerLaunchedShard.replaceAll('{SHARD_ID}', shard.id.toString()));
    }
}
same here in jobservice
import schedule from 'node-schedule';
import { createRequire } from 'node:module';

import { Job } from '@/jobs';
import { Logger } from '@/services';

const require = createRequire(import.meta.url);
let Logs = require('../../lang/logs.json');

export class JobService {
    constructor(private jobs: Job[]) {}

    public start(): void {
        for (let job of this.jobs) {
            schedule.scheduleJob(job.schedule, async () => {
                try {
                    if (job.log) {
                        Logger.info(Logs.info.jobRun.replaceAll('{JOB}', job.name));
                    }

                    await job.run();

                    if (job.log) {
                        Logger.info(Logs.info.jobCompleted.replaceAll('{JOB}', job.name));
                    }
                } catch (error) {
                    Logger.error(Logs.error.job.replaceAll('{JOB}', job.name), error);
                }
            });
            Logger.info(
                Logs.info.jobScheduled
                    .replaceAll('{JOB}', job.name)
                    .replaceAll('{SCHEDULE}', job.schedule)
            );
        }
    }
}
giving this issue a nudge. again, using process sharding on the singular server docker process.
hiya, one more nudge. not sure how to proceed here @Jiralite
just circling back here, some more notable things to do with this:
[WS => Shard 0] [HeartbeatTimer] Didn't receive a heartbeat ack last time, assuming zombie connection. Destroying and reconnecting.
Status : 0
Sequence : 5558
Connection State: OPEN
[WS => Shard 0] [DESTROY]
Close Code : 4009
Reset : true
Emit DESTROYED: true
[WS => Shard 0] Clearing the heartbeat interval.
[WS => Shard 0] [WebSocket] Destroy: Attempting to close the WebSocket. | WS State: OPEN
[WS => Shard 0] [WebSocket] Close: Tried closing. | WS State: CLOSING
[WS => Shard 0] [WebSocket] Adding a WebSocket close timeout to ensure a correct WS reconnect.
Timeout: 5000ms
and one with shard 1:
[WS => Shard 1] Heartbeat acknowledged, latency of 16ms.
[WS => Shard 0] Shard did not receive any more guild packets in 15000 ms. Unavailable guild count: 1
[WS => Shard 0] [HeartbeatTimer] Sending a heartbeat.
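For anyone comparing shards in debug output like the above, a rough way to spot a shard whose heartbeats stop being acknowledged is to count "Sending a heartbeat" and "Heartbeat acknowledged" lines per shard. The sketch below assumes the pre-14.10.0 log wording quoted in this thread. The names `tallyHeartbeats` and `suspectShards` are made up for illustration and are not part of discord.js.

```typescript
// Per-shard count of heartbeats sent and acknowledged, parsed from debug text.
interface HeartbeatTally {
  shardId: number;
  sent: number;
  acked: number;
}

// Line formats copied from the old (pre-14.10.0) debug logs quoted in this thread.
const SENT_PATTERN = /\[WS => Shard (\d+)\] \[HeartbeatTimer\] Sending a heartbeat\./;
const ACKED_PATTERN = /\[WS => Shard (\d+)\] Heartbeat acknowledged, latency of \d+ms\./;

function tallyHeartbeats(log: string): HeartbeatTally[] {
  const tallies: HeartbeatTally[] = [];

  const bump = (shardId: number, key: 'sent' | 'acked'): void => {
    let tally = tallies.find(t => t.shardId === shardId);
    if (!tally) {
      tally = { shardId, sent: 0, acked: 0 };
      tallies.push(tally);
    }
    tally[key] += 1;
  };

  // Scan the whole text so entries that ended up on one line are still counted.
  const scan = (pattern: RegExp, key: 'sent' | 'acked'): void => {
    const re = new RegExp(pattern.source, 'g');
    let match: RegExpExecArray | null;
    while ((match = re.exec(log)) !== null) {
      bump(Number(match[1]), key);
    }
  };

  scan(SENT_PATTERN, 'sent');
  scan(ACKED_PATTERN, 'acked');
  return tallies.sort((a, b) => a.shardId - b.shardId);
}

// Shards with more than `tolerance` heartbeats still unacknowledged.
// The default allows for one heartbeat being in flight when the log was cut.
function suspectShards(tallies: HeartbeatTally[], tolerance = 1): number[] {
  return tallies.filter(t => t.sent - t.acked > tolerance).map(t => t.shardId);
}

const sampleLog = [
  '[WS => Shard 0] [HeartbeatTimer] Sending a heartbeat.',
  '[WS => Shard 1] [HeartbeatTimer] Sending a heartbeat.',
  '[WS => Shard 1] Heartbeat acknowledged, latency of 16ms. [WS => Shard 0] [HeartbeatTimer] Sending a heartbeat.',
].join('\n');

console.log(suspectShards(tallyHeartbeats(sampleLog))); // shard 0 is flagged
```

This only summarises what the logs already say. It does not explain why the acks go missing.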
For the past 2 days, I have also been facing the same problem. The bot randomly went offline and there was nothing in the debug logs. I am on v14.9.0.
We will only address WebSocket-related issues on 14.10.0 and above. In 14.10.0, the code changed greatly, so I'm afraid all we can tell you is to update (which you should be doing anyway). 14.11.0 is currently the latest stable version which includes a more stable WebSocket.
I believe I am facing a similar issue. I was also using DJS v14.9 but I upgraded to v14.11 and the issue still persists. This error appears to occur after my bot has been online for some hours. No idea what's causing it, no error is emitted.
Those errors clearly indicate you're not on 14.11.0 but at most on 14.9.x, because they are debug logs of the old WebSocket implementation and not the one we have used since 14.10.0. So please update too.
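Since the cut-off for WebSocket reports is 14.10.0, it may help to check the installed version numerically before filing. A plain string comparison would rank "14.9.0" above "14.10.0". Below is a minimal sketch. `parseVersion` and `isAtLeast` are hypothetical helpers, not discord.js APIs, and pre-release suffixes are ignored.

```typescript
// Split "14.9.0" or "v14.9.0" into numeric [major, minor, patch].
function parseVersion(version: string): [number, number, number] {
  const match = /^v?(\d+)\.(\d+)\.(\d+)/.exec(version.trim());
  if (!match) {
    throw new Error(`Unrecognised version string: ${version}`);
  }
  return [Number(match[1]), Number(match[2]), Number(match[3])];
}

// True when `installed` is the same as or newer than `minimum`.
function isAtLeast(installed: string, minimum: string): boolean {
  const [aMajor, aMinor, aPatch] = parseVersion(installed);
  const [bMajor, bMinor, bPatch] = parseVersion(minimum);
  if (aMajor !== bMajor) return aMajor > bMajor;
  if (aMinor !== bMinor) return aMinor > bMinor;
  return aPatch >= bPatch;
}

// Versions mentioned in this thread, checked against the 14.10.0 cut-off.
for (const v of ['14.7.1', 'v14.9.0', '14.10.0', '14.11.0']) {
  console.log(v, isAtLeast(v, '14.10.0'));
}
```

In a bot, the installed version string can be taken from the `version` export of discord.js and passed in as `installed`.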
I have updated. If the problem persists, I will report back.
You are correct. That was done on purpose as in 14.11.0, no close event can be seen in the logs but the crashing error still persists without an error. It instead appears to be shown as being told to reconnect by Discord. It's very possible that the issue I am experiencing is separate from this one though, so please let me know if that is the case.
Are either of you using any third-party libraries related to sharding?
An example could be discord-hybrid-sharding.
Not in my case. Since I restarted the bot (after updating), things look pretty good now.
I am using discord-hybrid-sharding. I've talked to them multiple times about the issues I've been having and they've told me repeatedly that it isn't an issue on their end 🤷♂️
I ask because we're seeing people in support having these issues that are also using discord-hybrid-sharding...
Considering that no one else is running into this as far as I can tell (standalone /ws users, some of which have MASSIVE bots), or even the general userbase now that it's in the main discord.js library, it is definitely an issue on their end.
If you're able to reproduce this problem without their package, feel free to return to the issue tracker with more info.
26 Hours since updating Discord.js, no issues found. Bot is running smoothly
You have neither asked us via DM nor contacted us in the corresponding support server. I am able to reproduce a WS issue without using hybrid-sharding and posted it in the #djs-in-dev channel.
Furthermore, I examined the help channels of djs and saw just two issues mentioning hybrid-sharding, both of which have totally different errors (code issues on the user's side).
what version of DJS are you running & how many shards? just curious.
Also: My issue was caused partially by DJS, yes. But it was also a presence overload issue. I recommend all devs working with sharded presences to proceed with caution when writing your code.
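One cautious way to handle presence updates on a sharded bot is to put a small limiter in front of them so a timer or event loop cannot flood the gateway. The sketch below is a generic sliding-window limiter. The class name and the 5-per-20-seconds numbers are placeholders for illustration, not limits taken from Discord's documentation or from discord.js.

```typescript
// Allows at most `maxCalls` within any rolling `windowMs` period.
class SlidingWindowLimiter {
  private timestamps: number[] = [];

  constructor(
    private readonly maxCalls: number,
    private readonly windowMs: number,
    // Injectable clock so the limiter can be exercised without real waiting.
    private readonly now: () => number = () => Date.now(),
  ) {}

  // Returns true and records the call if it fits in the window, false otherwise.
  tryAcquire(): boolean {
    const current = this.now();
    // Drop calls that have aged out of the window.
    this.timestamps = this.timestamps.filter(ts => current - ts < this.windowMs);
    if (this.timestamps.length >= this.maxCalls) {
      return false;
    }
    this.timestamps.push(current);
    return true;
  }
}

// Placeholder numbers: tune them to whatever budget you decide is safe.
const presenceLimiter = new SlidingWindowLimiter(5, 20_000);

// Hypothetical call site: skip the update instead of queueing it.
function maybeUpdatePresence(update: () => void): boolean {
  if (!presenceLimiter.tryAcquire()) {
    return false;
  }
  update();
  return true;
}
```

In a bot, `update` would wrap the actual presence call (for example the `setPresence` call mentioned in the issue description). Skipped updates are simply dropped here, which is usually fine for status text that is refreshed periodically anyway.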
Djs Version - v14.11.0
Shards Count - 2
My RAM usage is below 900 MB (I used sweepers to clear the cache). So far, the bot is running well.
@Jiralite I got something odd today. The bot had been fine for more than 48 hours, then I encountered this in my logs:
0|Raiden | 6:43:39 >> [WS => Shard 0] The gateway closed with an unexpected code 1006, attempting to resume.
0|Raiden | 6:43:39 >> [WS => Shard 0] Destroying shard
0|Raiden | Reason: none
0|Raiden | Code: 1006
0|Raiden | Recover: Resume
0|Raiden | 6:43:39 >> [WS => Shard 0] Connection status during destroy
0|Raiden | Needs closing: false
0|Raiden | Ready state: 3
0|Raiden | 6:43:40 >> [WS => Shard 0] Connecting to wss://gateway-us-east1-d.discord.gg?v=10&encoding=json
0|Raiden | 6:43:40 >> [WS => Shard 0] Waiting for event hello for 60000ms
0|Raiden | 6:43:40 >> [WS => Shard 0] Preparing first heartbeat of the connection with a jitter of 0.08895339445151751; waiting 3669ms
0|Raiden | 6:43:40 >> [WS => Shard 0] Resuming session
0|Raiden | resume url: wss://gateway-us-east1-d.discord.gg
0|Raiden | sequence: 90494461
0|Raiden | shard id: 0
0|Raiden | 6:43:42 >> [WS => Shard 0] Invalid session; will attempt to resume: false
0|Raiden | 6:43:42 >> [WS => Shard 0] Destroying shard
0|Raiden | Reason: Invalid session
0|Raiden | Code: 1000
0|Raiden | Recover: Reconnect
0|Raiden | 6:43:42 >> [WS => Shard 0] Connection status during destroy
0|Raiden | Needs closing: true
0|Raiden | Ready state: 1
0|Raiden | 6:43:42 >> [WS => Shard 0] Cancelled initial heartbeat due to #destroy being called
0|Raiden | 6:43:43 >> [WS => Shard 0] Connecting to wss://gateway.discord.gg?v=10&encoding=json
0|Raiden | 6:43:43 >> [WS => Shard 0] Waiting for event hello for 60000ms
0|Raiden | 6:43:43 >> [WS => Shard 0] Preparing first heartbeat of the connection with a jitter of 0.9228258550393542; waiting 38066ms
0|Raiden | 6:43:43 >> [WS => Shard 0] Waiting for identify throttle
0|Raiden | 6:43:43 >> [WS => Shard 0] Identifying
0|Raiden | shard id: 0
0|Raiden | shard count: 2
0|Raiden | intents: 3248127
0|Raiden | compression: none
0|Raiden | 6:43:43 >> [WS => Shard 0] Waiting for event ready for 15000ms
After the last line, bot was restarted automatically.
do you mean your process crashed?
yes
With no exit error? I sort of doubt that.
yep, nothing
After that, pm2 restarted my bot (which it does after a crash), and it only has logs that come after the start of the repo.
I'm experiencing a similar situation as @Elitex07 using discord.js 14.11.0:
[WS => Shard 14] Heartbeat acknowledged, latency of 41ms.
[WS => Shard 17] Heartbeat acknowledged, latency of 49ms.
[TEMP] Shard Reconnecting: 14
[WS => Shard 14] The gateway closed with an unexpected code 1006, attempting to resume.
[WS => Shard 14] Destroying shard
Reason: none
Code: 1006
Recover: Resume
[WS => Shard 14] Connection status during destroy
Needs closing: false
Ready state: 3
[WS => Shard 14] Connecting to wss://gateway-us-east1-b.discord.gg?v=10&encoding=json
[WS => Shard 14] Waiting for event hello for 60000ms
[WS => Shard 14] Preparing first heartbeat of the connection with a jitter of 0.8669929779136194; waiting 35763ms
[WS => Shard 14] Resuming session
resume url: wss://gateway-us-east1-b.discord.gg
sequence: 2378
shard id: 14
[WS => Shard 14] Resumed and replayed 1 events
[TEMP] Shard Resume: 14
[WS => Shard 13] Heartbeat acknowledged, latency of 52ms.
[WS => Shard 16] Heartbeat acknowledged, latency of 44ms.
[WS => Shard 11] Heartbeat acknowledged, latency of 63ms.
[WS => Shard 18] Heartbeat acknowledged, latency of 46ms.
[WS => Shard 10] Heartbeat acknowledged, latency of 46ms.
[WS => Shard 15] Heartbeat acknowledged, latency of 42ms.
[WS => Shard 19] Heartbeat acknowledged, latency of 63ms.
[WS => Shard 14] First heartbeat sent, starting to beat every 41250ms
[WS => Shard 14] Heartbeat acknowledged, latency of 48ms.
A shard will randomly disconnect and attempt to resume (seemingly successfully). It then receives heartbeat acks fine, but the shard.ready property is false and the client appears offline to servers on that shard.
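As a side note on reading these logs, the "jitter" lines can be checked by hand. Discord's gateway asks clients to send the first heartbeat after heartbeat_interval multiplied by a random jitter between 0 and 1. Using the 41250ms interval reported in the log above, and assuming the same interval applied to the earlier Shard 0 log, rounding down reproduces each logged wait. The floor rounding is inferred from the logged numbers, not taken from the library source, and `firstHeartbeatDelay` is a hypothetical name.

```typescript
// Delay before the first heartbeat: interval scaled by a jitter in [0, 1].
function firstHeartbeatDelay(heartbeatIntervalMs: number, jitter: number): number {
  if (jitter < 0 || jitter > 1) {
    throw new RangeError(`jitter must be between 0 and 1, got ${jitter}`);
  }
  // Assumption: round down, which matches every wait logged in this thread.
  return Math.floor(heartbeatIntervalMs * jitter);
}

const intervalMs = 41250; // "starting to beat every 41250ms" in the log above

// Jitter values and waits as logged in this thread.
const logged: Array<{ jitter: number; waitedMs: number }> = [
  { jitter: 0.08895339445151751, waitedMs: 3669 },
  { jitter: 0.9228258550393542, waitedMs: 38066 },
  { jitter: 0.8669929779136194, waitedMs: 35763 },
];

for (const entry of logged) {
  const computed = firstHeartbeatDelay(intervalMs, entry.jitter);
  console.log(entry.jitter, computed, computed === entry.waitedMs);
}
```

A long wait before the first heartbeat (for example 38066ms) is therefore expected behaviour and not a sign of a stalled connection by itself.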
What shard.ready property? There’s no such property on WebSocketShard. There is shard.status, though, which will have a value equal to Status.Ready, and that is set immediately before emitting the shardResume event.
@Qjuh, let's focus on the issue. It makes my client go offline as well. This issue is fairly rare, though: I have not encountered it in the last 10 days.
As mentioned by OP, the original issue was in user-land and is resolved. If you can still reproduce this (or a slightly different but potentially related issue), please open your own issue with reference to this one (if relevant).
Which package is this bug report for?
discord.js
Issue description
Running startup on either PM2 or general shard manager leaves the same problem, on a docker container if that's any use to know. I'll start up the bot and it'll go online on all guilds, shard 0 and 1. Shard 1 is fully responsive, sends and acknowledges heartbeats. Shard 0 has not sent any logs but I know tries to start up and stay online, however interactions fail. Shards both launch just fine on running npm run start.
Shard 0 commands will just not respond. It goes offline shortly after.
Ram fine (30% usage), CPU fine (150% usage). Shard spawner file: https://gist.github.com/Kaisarion/72fdd396820c2beeb89445960e989ddf Bot.ts file: https://gist.github.com/Kaisarion/76b0bd37185e34b2d4fc3be5e609a08b
CustomClient class:
Attached start-manager.ts file below to launch shards.
Disabled the setPresence activity setInterval, as suggested in the server. Troubleshooting was attempted in the DJS server: see https://discord.com/channels/222078108977594368/1073615824746725506
Code sample
Package version
14.7.1
Node.js version
19.4.0
Operating system
Ubuntu 22.04.1 LTS x86_64
Priority this issue should have
High (immediate attention needed)
Which partials do you have configured?
Channel, Message, Reaction
Which gateway intents are you subscribing to?
Guilds, GuildMembers, GuildBans, GuildVoiceStates, GuildPresences, GuildMessages, GuildMessageReactions, MessageContent
I have tested this issue on a development release
No response