wingzero54 closed 3 years ago
What happens if we reduce the RAM even more, to something like 20-24 GB? GC pauses would happen more often, but each one would be shorter. With a smaller heap, the pauses should be short enough to be unnoticeable.
I don't think it would work. The server regularly pushes upwards of 12 GB into old gen, which means something is using a very large amount of RAM. Considering that regular Minecraft servers recommend 1 GB of RAM per 10 players, we have plugins using gigantic amounts of RAM. Reducing it would cause GC to occur more frequently, but it would still cause lag spikes, just more small ones instead of fewer big ones.
To expand on the thought of reducing the RAM, what I've always heard is that Java's GC doesn't work well with a large abundance of RAM. If it has an abundance, it will get "lazy" and stop freeing memory. It will then eventually reach the upper limit of the RAM it was given and have to do a massive stop-the-world sweep.
So here are a couple of examples with 52 GB of RAM: https://timings.aikar.co/?id=98637e51acbe4e718f8539463f02cf24 and https://timings.aikar.co/?id=5f4171dff38e4f8f82244914815993ab with one 7.3-second GC every 5 hours, let's say (if I'm reading it correctly). Compare that to the above, which is 4.7 seconds every 12 minutes. So the time per GC was reduced by about 35% while the time between GCs was reduced by 96%. I think that shows that even if we GC more often, the time each GC takes does not shrink enough to eliminate the lag.
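For anyone who wants to check the percentages, here is a quick sketch of the arithmetic in Python. The variable names are made up for illustration, and it assumes the 4.7-second figure comes from the reduced-RAM run; the numbers themselves are the ones quoted in the comment above.

```python
# Rough check of the GC figures quoted above. Names are illustrative only.

def percent_reduction(before: float, after: float) -> float:
    """Return how much smaller `after` is than `before`, in percent."""
    return (before - after) / before * 100.0

# Figures quoted in the thread (assumption: 4.7 s is the reduced-RAM run).
pause_large_heap_s = 7.3            # old gen pause with 52 GB
pause_small_heap_s = 4.7            # old gen pause in the run "above"
interval_large_heap_s = 5 * 3600    # roughly one old gen GC every 5 hours
interval_small_heap_s = 12 * 60     # roughly one old gen GC every 12 minutes

pause_cut = percent_reduction(pause_large_heap_s, pause_small_heap_s)
interval_cut = percent_reduction(interval_large_heap_s, interval_small_heap_s)

# Share of wall-clock time spent frozen in old gen GC, in percent.
frozen_large = pause_large_heap_s / interval_large_heap_s * 100.0
frozen_small = pause_small_heap_s / interval_small_heap_s * 100.0

print(f"pause shrank by {pause_cut:.1f}%, interval shrank by {interval_cut:.1f}%")
print(f"time frozen: {frozen_large:.3f}% (large heap) vs {frozen_small:.3f}% (small heap)")
```

The pause shrinks by roughly a third while the interval shrinks by 96%, so the smaller heap ends up spending a larger share of its time frozen.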
Have you tried switching to ZGC (https://wiki.openjdk.java.net/display/zgc/Main) ? I experimented a bit with it a while ago and ran into some issues, but those should be fixable. I'd expect this to be the biggest improvement you can get in comparison to effort required.
Debugging memory consumption past that would require analyzing a memory dump of the running server (last time I tried making one of those it killed the server, so careful), which contains all kinds of sensitive information (not only ingame stuff, but also database passwords, tokens etc.), so the list of people who could do that is not too long.
That being said, why would we run on only 32 GB of RAM? Seems like too little for our use case.
It was just a test to see how the old gen GC would respond, and it still took longer than expected each time. We may be able to do a memory dump now, since the new server hardware is much beefier.
Can you take a memory snapshot and see what types of objects are using the most RAM?
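One lower-risk way to answer this, sketched below in Python, is a class histogram rather than a full heap dump: `jcmd <pid> GC.class_histogram` lists only class names, instance counts, and byte totals, so it does not write object contents (passwords, tokens) to disk. The helper names and the PID here are hypothetical, and both commands still pause the JVM while they run, so treat this as a sketch rather than a tested procedure for this server.

```python
import shlex
from typing import List

def class_histogram_cmd(pid: int) -> List[str]:
    """Build the jcmd invocation that prints per-class instance and byte counts."""
    return ["jcmd", str(pid), "GC.class_histogram"]

def heap_dump_cmd(pid: int, out_path: str) -> List[str]:
    """Build the jcmd invocation that writes a full heap dump to out_path.

    A full dump contains object contents, so it should be handled as sensitive.
    """
    return ["jcmd", str(pid), "GC.heap_dump", out_path]

# Hypothetical PID and output path, for illustration only.
example_pid = 12345
print(shlex.join(class_histogram_cmd(example_pid)))
print(shlex.join(heap_dump_cmd(example_pid, "/tmp/server-heap.hprof")))
```

Either list can be passed to `subprocess.run` on the machine hosting the server, as the same user that runs the JVM.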
So here's before, during, after an old gen GC, roughly https://spark.lucko.me/71jp1mwAQw https://spark.lucko.me/dmtA2CngXX https://spark.lucko.me/0Lmp9USC6a
At 70 players, the server is clearing 20 GB out of young gen every minute at 70-100 ms per collection, and doing an old gen GC every 10 minutes at 7 seconds each. So memory usage and GC put a big strain on the server at high player counts. Here's a timings report (the big spike is my heap dump summary running, I think): https://timings.aikar.co/?id=f295700033954eceb8a3ba90c3fa8791
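To put those pauses in server terms, here is a small Python sketch that converts them to lost ticks. It assumes the standard Minecraft target of 20 ticks per second (a 50 ms budget per tick); the function and variable names are made up for illustration.

```python
TICKS_PER_SECOND = 20                        # standard Minecraft target tick rate
TICK_BUDGET_MS = 1000 / TICKS_PER_SECOND     # 50 ms per tick

def ticks_lost(pause_ms: float) -> float:
    """Number of tick budgets consumed by a stop-the-world pause."""
    return pause_ms / TICK_BUDGET_MS

# Figures quoted in the comment above.
old_gen_pause_ms = 7000          # 7-second old gen GC
old_gen_interval_s = 10 * 60     # once every 10 minutes
young_gen_pause_ms = (70, 100)   # range per young gen collection

old_gen_ticks = ticks_lost(old_gen_pause_ms)
young_gen_ticks = tuple(ticks_lost(p) for p in young_gen_pause_ms)
old_gen_share = (old_gen_pause_ms / 1000) / old_gen_interval_s * 100.0

print(f"old gen GC: {old_gen_ticks:.0f} ticks lost per pause, "
      f"{old_gen_share:.2f}% of wall-clock time")
print(f"young gen GC: {young_gen_ticks[0]:.1f} to {young_gen_ticks[1]:.1f} ticks per pause")
```

Even the young gen pauses exceed a single tick budget, and each old gen pause freezes the server for well over a hundred ticks.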
I don't know if it's related, but CMC sometimes throws constant thread dumps which also lock up the server. May or may not be related to RAM usage. See #99
Looks like ZGC was only first released in JDK 11 and has gotten a lot of improvements since then, up to JDK 16. So I suggest we wait to try ZGC until we move to Java 16, since on Test ZGC is much rougher than G1GC with Java 11.
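For reference, the flags differ by Java version: ZGC is experimental in JDK 11 through 14 and needs `-XX:+UnlockExperimentalVMOptions`, and it became a production feature in JDK 15. Below is a minimal Python sketch that builds the launch command either way. The helper names, the heap size, and the `server.jar` file name are assumptions for illustration, not our actual start script.

```python
from typing import List

def gc_flags(java_major: int, collector: str) -> List[str]:
    """Return the JVM flags that select the given collector on this Java version."""
    if collector == "g1":
        return ["-XX:+UseG1GC"]
    if collector == "zgc":
        if java_major < 11:
            raise ValueError("ZGC is not available before JDK 11")
        if java_major < 15:
            # Experimental in JDK 11-14, so it must be unlocked first.
            return ["-XX:+UnlockExperimentalVMOptions", "-XX:+UseZGC"]
        return ["-XX:+UseZGC"]
    raise ValueError(f"unknown collector: {collector}")

def launch_cmd(java_major: int, collector: str, heap_gb: int,
               jar: str = "server.jar") -> List[str]:
    """Build a java command line with a fixed heap size and the chosen collector."""
    heap = [f"-Xms{heap_gb}G", f"-Xmx{heap_gb}G"]
    return ["java", *heap, *gc_flags(java_major, collector), "-jar", jar]

# Hypothetical examples: Java 11 with experimental ZGC vs Java 16.
print(" ".join(launch_cmd(11, "zgc", 32)))
print(" ".join(launch_cmd(16, "zgc", 32)))
```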
This issue with Old Gen GC was entirely due to keeping jukealert logs in memory and was resolved by removing the logs from memory
I lowered RAM from 52 GB to 32 GB and it revealed that old gen GC is a big problem. Every time it runs it's a guaranteed 7-second lag spike. At 52 GB of RAM it runs once every several hours; at 32 GB, old gen GC runs every few minutes.
I think CMC was shown to be contributing, so does CMC use large amounts of ram? https://timings.aikar.co/?id=e525712974db4306890f08dc66caef64 https://spark.lucko.me/yEFYhtq5wF