Open rdebath opened 2 years ago
Being able to /copy
from museums is a supported use case
I may be misunderstanding but even if we want to allow /copy, each player should not hold a copy of the museum in memory if they are in the same museum, right? One museum instance should be loaded in server ram only once
(grumbles)
Okay, /copy
right.
Merging duplicate instances of a museum level is still needed.
This should help a lot in many use cases. (ie: "come look at this")
But there may be thousands of different backups.
If we have a large 1024x256x1024 level and 100 users start searching different Museum backups that's 25GB of RAM space. Expensive. For a small VM just a couple of instances is enough to cause the service to die (Out of memory in GC).
main
). I'd say the memory limit should default to something smallish like 100MB, that way only one of these is large levels is allowed. But this memory limit be overridable by users who are allowed to change it, ie a confirm
flag should be available.
Possibly also a flag for goto museum without /copy
support which is turned on automatically for users who can't /copy
-- guest.
Add a lock (or interlock) around the Level import so that only one is running at a time. This should prevent the GC getting a memory allocation failure. This has the problem that too many loaded museums still prevents normal level loading.
If anyone needs a quick workaround, this will stop multiple people decompressing a museum at the same time. It will NOT prevent multiple users going to museums, but it seems to stop MCG crashing because of the garbage collector breaking when everyone tries it at the same time.
commit aefdb9f2f86c452cb1db1c44d985914ebbffeed9 (HEAD -> devl)
Author: Robert de Bath <rdebath@tvisiontech.co.uk>
Date: Thu Sep 30 07:01:02 2021 +0100
Import singleton
diff --git a/MCGalaxy/Commands/World/CmdMuseum.cs b/MCGalaxy/Commands/World/CmdMuseum.cs
index e36ce6c5c..4c9506a6e 100644
--- a/MCGalaxy/Commands/World/CmdMuseum.cs
+++ b/MCGalaxy/Commands/World/CmdMuseum.cs
@@ -28,6 +28,7 @@ namespace MCGalaxy.Commands.World {
const string currentFlag = "*current";
const string latestFlag = "*latest";
+ static int singleton = 0;
public override void Use(Player p, string message, CommandData data) {
if (message.Length == 0) { LevelInfo.OutputBackups(p, p.level.MapName); return; }
@@ -89,7 +90,16 @@ namespace MCGalaxy.Commands.World {
}
static void JoinMuseum(Player p, string formattedMuseumName, string mapName, string path) {
- Level lvl = IMapImporter.GetFor(path).Read(path, formattedMuseumName, false);
+ if (Interlocked.CompareExchange(ref singleton, 1, 0) == 1) {
+ p.Message("Another user is loading a museum, please try again."); return;
+ }
+ Level lvl;
+ try {
+ lvl = IMapImporter.GetFor(path).Read(path, formattedMuseumName, false);
+ } finally {
+ Interlocked.Exchange(ref singleton, 0);
+ }
+
lvl.MapName = mapName;
SetLevelProps(lvl);
If anyone needs a quick workaround, this will stop multiple people decompressing a museum at the same time. It will NOT prevent multiple users going to museums, but it seems to stop MCG crashing because of the garbage collector breaking when everyone tries it at the same time.
Do you have error/crash logs from this? Although an OutOfMemory exception should be thrown here, the server itself should not be taken down because of that
With 53549f29310536efb12b002c0c0b90f6d2bf0ae4, at least multiple people going to the same museum should only cause minimal memory usage increase now (blocks/CustomBlocks shared between Level instances)
(although more work will be required still)
I only have errors like the below in the MC log, I think this is successfully caught. I've tried repeating the "GC" error, but so far I've only been able to slam it so hard the kernel just kills it.
... I suppose that's enough to show that not everything can be caught.
PS: ...
Got one after limiting mono with ulimit
so it doesn't get kill-9'd.
Error: Garbage collector could not allocate 16384 bytes of memory for major heap section.
This error made mono drop out, the kernel did NOT kill it.
This particular OOM is with no swap space; adding swap allows more users, but doesn't prevent the issue.
Kernel OOM Killer
Oct 29 16:48:39 vps-d3ae5905 kernel: [1910634.153803] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=docker-a4c9c627714ad325910b423db24d340555412270731ff1d05bf17503e9fcd104.scope,mems_allowed=0,global_oom,task_memcg=/system.slice/docker-a4c9c627714ad325910b423db24d340555412270731ff1d05bf17503e9fcd104.scope,task=mono,pid=169611,uid=1000
Oct 29 16:48:39 vps-d3ae5905 kernel: [1910634.161161] Out of memory: Killed process 169611 (mono) total-vm:2398732kB, anon-rss:1676788kB, file-rss:0kB, shmem-rss:4kB, UID:1000 pgtables:3636kB oom_score_adj:0
Oct 29 16:48:39 vps-d3ae5905 kernel: [1910634.221378] oom_reaper: reaped process 169611 (mono), now anon-rss:0kB, file-rss:0kB, shmem-rss:4kB
MCG Log error.
----09/30/2021 08:07:45 ----
Type: OutOfMemoryException
Source:
Message: Insufficient memory to continue the execution of the program.
Trace: at (wrapper alloc) System.Object.AllocVector(intptr,intptr)
at MCGalaxy.Level.Init (System.String name, System.UInt16 width, System.UInt16 height, System.UInt16 length) [0x00068] in <
cc64bc4d07034f81a152a779c9b7efe3>:0
at MCGalaxy.Level..ctor (System.String name, System.UInt16 width, System.UInt16 height, System.UInt16 length) [0x0014a] in
<cc64bc4d07034f81a152a779c9b7efe3>:0
at MCGalaxy.Levels.IO.LvlImporter.Read (System.IO.Stream src, System.String name, System.Boolean metadata) [0x0001d] in <cc
64bc4d07034f81a152a779c9b7efe3>:0
at MCGalaxy.Levels.IO.IMapImporter.Read (System.String path, System.String name, System.Boolean metadata) [0x00007] in <cc6
4bc4d07034f81a152a779c9b7efe3>:0
at MCGalaxy.Commands.World.CmdMuseum.JoinMuseum (MCGalaxy.Player p, System.String formattedMuseumName, System.String mapNam
e, System.String path) [0x00024] in <cc64bc4d07034f81a152a779c9b7efe3>:0
at MCGalaxy.Commands.World.CmdMuseum.Use (MCGalaxy.Player p, System.String message, MCGalaxy.CommandData data) [0x00234] in
<cc64bc4d07034f81a152a779c9b7efe3>:0
at MCGalaxy.Player.UseCommand (MCGalaxy.Command command, System.String args, MCGalaxy.CommandData data) [0x000e4] in <cc64b
c4d07034f81a152a779c9b7efe3>:0
-------------------------
I tried ulimit -v 100000
and having two levels sized 512x64x512 and 512x36x512 loaded in memory (trying to generate larger maps would cause OutOfMemoryException to get thrown), but was not able to get mono to die with Error: Garbage collector could not allocate
Generating maps should be pretty safe; the vast majority of memory allocation is for the main byte array and no custom blocks which means there will likely be 100's of kilobytes between your different test sizes.
Loading a museum OTOH allocates the large array then allocates lots of small 4kb chunks for the custom blocks. My guess would be this quickly gets so close to the physical limit that when the GC needs to allocate this tiny bit of space to hold it's management information it fails and then the GC has no option but to bomb.
The GC should trip before this happens, but it also can't know how close it is to the hard limit, especially as this is likely to be less tested (because the hard limit can normally be avoided by adding more swap).
I just noticed that there's a (Nevermind, it just gives a different error)MONO_GC_PARAMS
environment variable containing a max-heap-size
option. I would suggest that servers have this set with sufficient swap space made available that the program can allocate a bit more than that.
BTW, The ulimits I tried were (probably the last one):
#ulimit -v 2097152
#ulimit -v 1048576
#ulimit -v 1700000
With the map from 2d2t
loaded from a museum (several, 8?, times).
When a Museum level is loaded the file is imported into memory and then downloaded to the client. This imported data is private to the player, is not shared in any way and can be quite large.
But the apparent intent is that the map data is never touched again by the server; specifically it's not saved, there are no physics and there are no other players on the level, even if they go to the same museum.
The vast majority of this space is used in the
blocks
andcustomBlocks
arrays so in line with my suspicions I set these items tonull
as soon as the level had been sent to the user.There were no problems and unlike before multiple users going to the museum did not run the service out of memory.
Calling the
Cleanup()
early worked in the same way (It's reasonable that none of the other cleared items are used either) but had no visible change in memory usage.Multiple users switching simultaneously could still cause the service to die, but I expect that can be fixed by a mutex allowing only one user to be actually switching at a time.
Can we please make this change standard.