games647 / FastLogin

Checks if a minecraft player has a valid paid account. If so, they can skip offline authentication automatically. (premium auto login)
https://www.spigotmc.org/resources/fastlogin.14153
MIT License
501 stars 121 forks source link

FastLogin causes BungeeCord Netty threads to hang. #754

Closed masecla22 closed 2 years ago

masecla22 commented 2 years ago

What happened?

When using FastLogin seems to cause the Bungeecord Netty Threads to hang.

Some relevant information is that

  1. We are running on a replicated instance network with multiple bungee cords
  2. We use Cloudflare to mask the IP (not the Spectrum service, just regular Cloudflare)

We implemented a workaround by editing the plugin source, and replacing everything slf4j with System.out.println calls, and it seems to work. This is obviously not the best solution, which is why we are reporting the bug.

This seems to be a deadlock within bungee, either due to some concurrency issue, or something in slf4j which causes it to hang.

What did you expect?

Bungee cord should function as expected.

Steps to reproduce

  1. Install the plugin
  2. Login

It is inconsistent though.

Plugin list

AuthMeBungee, FastLoginBungee, LuckPermsBungee, RedisBungee, spark

Configuration file

https://www.toptal.com/developers/hastebin/cosanosiva.yaml

Server log

Server log files show no errors, messages or debug messages. However, when we noticed the server hanging we sent a SIG3 to the Bungee instance to dump the javacore file, with the relevant stacktrace attached (since the full file is too large for text sharing services).

These are the threads getting locked.
https://www.toptal.com/developers/hastebin/dokigifuhu.apache

Plugin version

1.11-SNAPSHOT-812cf21

Platform

BungeeCord

Relevance

games647 commented 2 years ago

According to the log, it's more like FastLogin is waiting on a lock from another thread, here the BungeeCord logging thread. That lock isn't released and therefore FastLogin cannot continue.

EDIT: So more a Bungee fault.

games647 commented 2 years ago

As SLF4J is proxied using the standard official Bungee JUL (java.util.logging) API back, there should be no difference. You could also try out to replace SLF4J with JUL logging, but I guess the situation will be the same.

masecla22 commented 2 years ago

Alright, so we found out the issue after a bunch of testing (and it totally did not take 6 entire months of this issue going unnoticed)

I'll be offering the full context.

  1. We are running under a custom instance replication system
  2. We used ProcessBuilder to start our instances

This was our starting instance code:

public void start(String startCommand) throws IOException {
    startCommand = startCommand.replace("%server%", this.getName());
    ProcessBuilder builder = new ProcessBuilder(startCommand.split(" "));
    builder = builder.directory(workingDirectory);
    this.process = builder.start().toHandle();
}

With this code, all log files seemed to "crash" at about 64kb and then all sorts of weird things would happen.

Anything logging stuff before doing an action would become unresponsive, in our case.

  1. Fastlogin would not quite work with AuthmeBungeecord
  2. Instances would refuse to stop
  3. RedisBungee gave an unfixable "Already logged in"

Turns out, the ProcessBuilder has an internal buffer of 64k, and will just hang if you try to give it more data it will simply hang.

In our case, replacing our starting code with this

public void start(String startCommand) throws IOException {
    startCommand = startCommand.replace("%server%", this.getName());
    File bitConsumer = getBitConsumer();
    ProcessBuilder builder = new ProcessBuilder(startCommand.split(" "))
            .redirectOutput(ProcessBuilder.Redirect.appendTo(bitConsumer))
            .redirectError(ProcessBuilder.Redirect.appendTo(bitConsumer));
    builder = builder.directory(workingDirectory);
    this.process = builder.start().toHandle();
}

private File getBitConsumer() {
    File bitbucket;
    if (OSCheck.getOperatingSystemType().equals(OSType.Windows)) {
        bitbucket = new File("NUL");
    } else {
        bitbucket = new File("/dev/null");
    }
    return bitbucket;
}

(as per https://stackoverflow.com/questions/31191391/java-processbuilder-limits) seems to have solved the issue.

This was an extremely dumb bug, and we apologise for wasting time with it ❤️

~ Matt

(ps: I'm leaving it open so the author can have a look, after which you can close it)