PaperMC / Waterfall

BungeeCord fork that aims to improve performance and stability.
https://papermc.io
MIT License
742 stars 302 forks source link

getServersCopy() - NoSuchElementException #645

Open Sneakometer opened 3 years ago

Sneakometer commented 3 years ago

Hey, i'm developing for Waterfall and need to retrieve the server list every now and then. I'm using the getServersCopy() method provided by Waterfall in the ProxyServer instance, as the other one is marked as deprecated. However, i started to notice some weird behaviour lately, throwing the following exception everytime the getServersCopy() method is called:

[21:43:23] [pool-10-thread-5/ERROR]: java.util.NoSuchElementException
[21:43:23] [pool-10-thread-5/ERROR]:     at gnu.trove.impl.hash.THashIterator.moveToNextIndex(THashIterator.java:139)
[21:43:23] [pool-10-thread-5/ERROR]:     at gnu.trove.impl.hash.THashIterator.next(THashIterator.java:91)
[21:43:23] [pool-10-thread-5/ERROR]:     at gnu.trove.map.hash.TCustomHashMap$MapBackedView.toArray(TCustomHashMap.java:729)
[21:43:23] [pool-10-thread-5/ERROR]:     at com.google.common.collect.Iterables.toArray(Iterables.java:295)
[21:43:23] [pool-10-thread-5/ERROR]:     at com.google.common.collect.ImmutableMap.copyOf(ImmutableMap.java:406)
[21:43:23] [pool-10-thread-5/ERROR]:     at com.google.common.collect.ImmutableMap.copyOf(ImmutableMap.java:391)
[21:43:23] [pool-10-thread-5/ERROR]:     at net.md_5.bungee.conf.Configuration.getServersCopy(Configuration.java:189)
[21:43:23] [pool-10-thread-5/ERROR]:     at net.md_5.bungee.BungeeCord.getServersCopy(BungeeCord.java:648)

Restarting the proxy fixed this for some time. The issue will however come back after some hours to a few days. I'm also using the Cloudnet v3.4 orchestration to dynamically create servers at runtime. First i suspected the problem there, but a developer told me it's not caused by cloudnet.

I'm running the build from @Janmm14 (https://github.com/PaperMC/Waterfall/pull/609) for a while but only recently noticed this error. So i thought it's unlikely to be related to the dos mitigations.

Java 8 Debian 10

I'm not sure how to reproduce this other than just lettings the proxy run for a while. I am willing to investigate furthe,r but i'm kinda out of ideas now.

//Edit the code that triggers the exception is:

for (ServerInfo server : ProxyServer.getInstance().getServersCopy().values()) {
...
}
electronicboy commented 3 years ago

Something mutated the map while you where tryna create a copy of it seemingly

Sneakometer commented 3 years ago

How to avoid this? Is there a thread safe way modifying or cloning it?

electronicboy commented 3 years ago

modifying the servers list is best done with the remove/add methods on the ProxyServer class, we already lock the collection in many cases but there is a limitation due to getServers itself returning a mutable list, part of the ideal setup would be to make this immutable, but, shamefully plugins get in the way of that, but, I question if the better alternative might be to override the relevant methods to direct calls to remove, etc, to the synchronized methods for this, but it is pretty gross

Sneakometer commented 3 years ago

modifying the servers list is best done with the remove/add methods on the ProxyServer class,

Found them in the Configuration class, thanks. They synchronize on the serversLock object. Is this a feature from Waterfall, as i can't find this in BungeeCord itself? Also, i found that there are still many cases where the map is accessed without synchronizing on the serversLock object in BungeeCord. This can eventually cause the same issue, or am i wrong?

What is the reason for using troves tmap in the first place? Why not use a ConcurrentHashMap, which would fix most if not all of the concurrency issues?

Janmm14 commented 3 years ago

@Sneakometer The idea is probably that for most servers the server map is not changing, that's probably why it is using a fast map and manual synchronization

astei commented 3 years ago

Sadly, the broken BungeeCord API in question can't be fixed. If a plugin only uses the Waterfall methods, then everything will work, but CloudNet presumably needs to support upstream BungeeCord too, thus you get the issue.

I'm going to sound like a broken record, but that's just one reason why @electronicboy encouraged me to go on my own instead.

Sneakometer commented 3 years ago

@Sneakometer The idea is probably that for most servers the server map is not changing, that's probably why it is using a fast map and manual synchronization

I'm not a java expert, but i can't believe that troves map is that much faster for those few entries (<100) in the servers map.

I'm going to sound like a broken record, but that's just one reason why @electronicboy encouraged me to go on my own instead.

I already thought about switching to velocity at some point, but it requires quite a lot of hassle for me adding the support to all of my self made plugins. Some of them (at my fault) not using any abstraction to bungeecord at all. I was fine with bungeecord before using cloudnet for years, so i'm kinda hoping there is another solution to this :/

astei commented 3 years ago

There really isn't, unless we break BungeeCord API compatibility. And at that point, you might as well just use Velocity.

electronicboy commented 3 years ago

using a CHM as-is is not viable as we'd need to override a good chunk of its functionality in order to work properly The performance is not really a consideration here at all, it's the API which is the issue

Janmm14 commented 3 years ago

You could ask CloudNet to use Waterfall's functions if they're found.

Packsolite commented 12 months ago

So after years this issue still persists, even with the suggested fix by cloudnet in place. The trove map can get into a "corrupted" state from which it will not recover. This code will then always procude the mentioned exception:

for (ServerInfo server : ProxyServer.getInstance()
        .getServersCopy()
        .values()) {
    server.sendData("abc", data);
}

Only temporary "fix" is to restart the proxy.

electronicboy commented 12 months ago

This is likely a won't fix as the underlying issue is generally the unsafe collection that upstream has decided to use here