discord-jda / JDA

Java wrapper for the popular chat & VOIP service: Discord https://discord.com
Apache License 2.0

Add support for session checkpoints #1266

Open MinnDevelopment opened 4 years ago

MinnDevelopment commented 4 years ago

General Troubleshooting

Feature Request

The cache is currently thrown away the moment the program restarts. We could provide a way to dump a checkpoint file of the current cache to allow resuming the session after restarting. This could be extremely useful to bigger bots that would otherwise exhaust their session start rate limit too quickly.

Exactly how this would be implemented is up for debate and might have to wait for the new auto-sharding to be implemented first.

Example Use-Case

  1. Disconnect

    File checkpoint = new File(shardId + "-checkpoint.jda");
    jda.detach(checkpoint); // detach session and disconnect
    System.exit(0);

  2. Resume

    JDA shard = JDA.resume(checkpoint);

Andre601 commented 4 years ago

I'm a bit confused. If I understand this correctly, would this create a (temporary) file with information that is later used to connect directly, without first logging into the session? Or how should this process be understood?

MrPowerGamerBR commented 4 years ago

Very late response, but this is how I understood the idea @Andre601:

If you have a big bot, reconnecting sessions takes a long time (even with x16 login it still takes a while) and uses up precious logins (logins are limited; if you use them all up, well, your token is reset, and that's bad).

Here's an example when you need to update your bot:

Without session checkpoints:

With session checkpoints:

This is a great idea for big bots since you could resume sessions without logging in all the shards again.

However I may be completely wrong in my interpretation of the feature, so sorry if I made a mistake!

Andre601 commented 4 years ago

I don't think that is the case, as there seems to be only one login at the start, and the shards then just connect to the websocket. In addition, when a shard is disconnected the bot does not attempt a complete relogin, just a reconnect. Otherwise this could easily hit the rate limit very quickly.

MrPowerGamerBR commented 4 years ago

@Andre601 every shard has its own websocket connection. If you have 512 shards, you will need to log in to the WebSocket 512 times and send an IDENTIFY each time (yes, each of them needs to IDENTIFY before it can receive events; I may be wrong, but I'm 99% sure that's how it works). This uses valuable logins.

According to the example in the original issue, this would be useful when the bot needs some downtime (updates and stuff like that): you save the bot state into a file, then, on reboot, you load the checkpoint file, which lets you resume the session without re-identifying. This is very useful if you have a bot that uses a lot of shards but doesn't have x16 login support yet (which causes the shards to take up to 30 minutes just to log in again!)
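To put a rough number on the cost of re-identifying every shard, here is a back-of-the-envelope sketch. It assumes that the gateway allows one batch of `max_concurrency` IDENTIFY calls per 5-second window (1 for regular bots, 16 for bots with "x16 login"). The class and method names are made up for illustration, and the result is only an estimate of the time spent waiting on that limit.

```java
import java.time.Duration;

/** Rough estimate of a cold start in which every shard has to IDENTIFY. */
final class IdentifyBudget {
    // Assumption: one batch of `maxConcurrency` IDENTIFYs per 5-second window.
    static final Duration WINDOW = Duration.ofSeconds(5);

    static Duration coldStart(int shards, int maxConcurrency) {
        // Ceiling division: a partially filled batch still occupies a whole window.
        int batches = (shards + maxConcurrency - 1) / maxConcurrency;
        return WINDOW.multipliedBy(batches);
    }

    public static void main(String[] args) {
        // 512 shards, one IDENTIFY per window
        System.out.println(coldStart(512, 1).getSeconds() + " s");
        // 512 shards, sixteen IDENTIFYs per window
        System.out.println(coldStart(512, 16).getSeconds() + " s");
    }
}
```

A resumed session skips this wait entirely, which is the whole appeal of a checkpoint.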

In theory you are able to resume the session without any issues by storing the session ID + sequence ID + all loaded guilds to a file and then reloading them when starting the bot again. (Of course, you can resume a session with only the Session ID + Sequence ID, and that's very easy to do with a bit of Reflection magic, but of course, JDA will not trigger any events because the guilds are missing)
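As a minimal sketch of the "Session ID + Sequence ID" part, the snippet below stores both values in a properties file and builds a gateway RESUME payload (opcode 6) from them. The `SessionState` class and the file layout are hypothetical and not part of JDA; how the values are read out of a running JDA instance is deliberately left out.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

/** Hypothetical holder for the two values needed to resume a gateway session. */
final class SessionState {
    final String sessionId;
    final long sequence;

    SessionState(String sessionId, long sequence) {
        this.sessionId = sessionId;
        this.sequence = sequence;
    }

    /** Writes the state as a small properties file (format is an assumption). */
    void save(Path file) {
        Properties props = new Properties();
        props.setProperty("sessionId", sessionId);
        props.setProperty("sequence", Long.toString(sequence));
        try (OutputStream out = Files.newOutputStream(file)) {
            props.store(out, "hypothetical session checkpoint");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static SessionState load(Path file) {
        Properties props = new Properties();
        try (InputStream in = Files.newInputStream(file)) {
            props.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new SessionState(props.getProperty("sessionId"),
                Long.parseLong(props.getProperty("sequence")));
    }

    /**
     * Builds the RESUME payload. No JSON escaping is done; this assumes the
     * token and session ID contain no characters that need it.
     */
    String toResumePayload(String token) {
        return String.format(
                "{\"op\":6,\"d\":{\"token\":\"%s\",\"session_id\":\"%s\",\"seq\":%d}}",
                token, sessionId, sequence);
    }
}
```

This only covers the session itself; without the guild data the resumed session has nothing in its cache, as noted above.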

Andre601 commented 4 years ago

From my experience with my own bot, and from checking the logs, JDA sends messages like these:

[ 06.10.2020 17:31:39 INFO  ] [main] [ShardManager] - Login Successful!
[ 06.10.2020 17:31:39 INFO  ] [JDA [0 / 27] MainWS-ReadThread] [WebSocketClient] - Connected to WebSocket
[ 06.10.2020 17:31:40 INFO  ] [JDA [0 / 27] MainWS-ReadThread] [JDA] - Finished Loading!
[ 06.10.2020 17:31:40 INFO  ] [JDA [0 / 27] MainWS-ReadThread] [ReadyListener] - Shard 0 ready! # This is my own log message
[ 06.10.2020 17:31:39 INFO  ] [JDA [1 / 27] MainWS-ReadThread] [WebSocketClient] - Connected to WebSocket
[ 06.10.2020 17:31:40 INFO  ] [JDA [1 / 27] MainWS-ReadThread] [JDA] - Finished Loading!
[ 06.10.2020 17:31:40 INFO  ] [JDA [1 / 27] MainWS-ReadThread] [ReadyListener] - Shard 1 ready! # This is my own log message

As you can see, the "Login Successful!" message is only sent once and not for each shard separately, so we can safely assume that an actual login only happens once.

I think we should differentiate between "resuming" and "reconnecting" a session/shard. To my knowledge, a resume does not take another login, as the connection was just (intentionally) lost temporarily, while on a reconnect the connection was essentially closed and a new one needs to be established. This is, of course, my understanding of it, and if there is an actual definition of those two terms (in the sense of what Discord means by them), I would like to see it.

My point was mostly about resuming connections, which don't really take any additional logins, while the topic (now that I have looked more closely at the PR itself) seems to be more about a complete bot restart/shutdown, which would cause a complete reconnect.

But my tl;dr here is that, from what I gathered and saw in the logs, the bot only logs in once, using the identify payload and the number of shards it should have, and afterwards just starts the shards one by one.

To close this off, I believe this should be moved to the Discord server, as I don't want to keep flooding this PR with (possibly) unrelated stuff.

MrPowerGamerBR commented 2 years ago

I made a super stupid, bad, and hacky implementation of this, implementing my own idea that I had two years ago.

It works by persisting all guilds to a file when JDA shuts down; the stored format is the same as the one used by the GUILD_CREATE event. When booting up, the gateway session is resumed and all stored events are dispatched to JDA.
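The persist-and-replay idea can be sketched independently of JDA as follows. This is not the code from the linked fork: `GuildCheckpoint` is a hypothetical helper, the payloads are treated as opaque strings, and the `dispatcher` stands in for whatever internal hook feeds an event into JDA. It assumes each payload is compact JSON on a single line.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper: one stored GUILD_CREATE-style payload per line. */
final class GuildCheckpoint {

    /** Called on shutdown with one serialized payload per cached guild. */
    static void write(Path file, List<String> guildCreatePayloads) {
        try {
            Files.write(file, guildCreatePayloads, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Called after the gateway session was resumed. Hands every stored
     * payload to the dispatcher and returns how many were replayed.
     */
    static int replay(Path file, Consumer<String> dispatcher) {
        List<String> lines;
        try {
            lines = Files.readAllLines(file, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        int replayed = 0;
        for (String line : lines) {
            if (line.isEmpty()) {
                continue; // tolerate blank lines in the file
            }
            dispatcher.accept(line);
            replayed++;
        }
        return replayed;
    }
}
```

Replaying only after a successful resume avoids parsing the file for a session that turns out to be invalid.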

Sadly it requires a JDA fork since I needed to make some internal changes to support it, but it does work, and maybe in the future I will clean it up and submit a PR. :3

Anyhow, here's my implementation of it! https://github.com/LorittaBot/DeviousJDA/blob/master/src/examples/java/SessionCheckpointAndGatewayResumeExample.kt#L32

If it was properly implemented, ofc I would not rely on that super crazy hacky hack.

I think the best way of handling it would be to add a DefaultShardManagerBuilder#setCheckpointProvider or something like that, where you would provide the checkpointed data for the shard ID as a CompletableFuture (maybe? in Kotlin it would be a () -> CheckpointData). When the shard is resumed, JDA would invoke the CompletableFuture to load the data. Yeah, that blocks the gateway read thread, but in my experience it takes around 2s to load and fully dispatch all Guild Create events, so it is fast enough to not cause any issues and keeps the code simple. And besides, you don't want to spend time loading the checkpoint data just to end up receiving an invalid session when trying to resume lol
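A possible shape for such a provider is sketched below. Nothing here exists in JDA: `CheckpointProvider`, `CheckpointData`, and `InMemoryCheckpointProvider` are hypothetical names, and `setCheckpointProvider` is only the method proposed above. The idea is that the library asks the provider for a shard's data once the resume has succeeded and blocks on the future.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical checkpoint contents for one shard. */
final class CheckpointData {
    final String sessionId;
    final long sequence;
    final List<String> guildPayloads;

    CheckpointData(String sessionId, long sequence, List<String> guildPayloads) {
        this.sessionId = sessionId;
        this.sequence = sequence;
        this.guildPayloads = guildPayloads;
    }
}

/** Hypothetical callback a shard manager builder could accept. */
interface CheckpointProvider {
    /** Completes with the checkpoint for the shard, or null if there is none. */
    CompletableFuture<CheckpointData> provide(int shardId);
}

/** Trivial provider backed by a map; a real one would read from disk. */
final class InMemoryCheckpointProvider implements CheckpointProvider {
    private final Map<Integer, CheckpointData> store = new ConcurrentHashMap<>();

    void put(int shardId, CheckpointData data) {
        store.put(shardId, data);
    }

    @Override
    public CompletableFuture<CheckpointData> provide(int shardId) {
        // A missing entry yields a future completed with null.
        return CompletableFuture.completedFuture(store.get(shardId));
    }
}
```

On the library side, the call would then look roughly like `CheckpointData data = provider.provide(shardId).join();`, with a `null` result meaning the shard falls back to a normal IDENTIFY.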