ClassiCube / MCGalaxy

A Minecraft Classic / ClassiCube server software
GNU General Public License v3.0

Players in Museum level hold large amounts of private allocated memory. #638

Open rdebath opened 2 years ago

rdebath commented 2 years ago

When a Museum level is loaded the file is imported into memory and then downloaded to the client. This imported data is private to the player, is not shared in any way and can be quite large.

But the apparent intent is that the map data is never touched again by the server; specifically, it's not saved, there are no physics, and there are no other players on the level, even if they go to the same museum.

The vast majority of this space is used by the blocks and customBlocks arrays, so, in line with my suspicions, I set these arrays to null as soon as the level had been sent to the user.

There were no problems, and unlike before, multiple users going to the museum did not run the service out of memory.
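For reference, a minimal sketch of that experiment, assuming a hypothetical hook that runs once the museum level has been fully sent to the player (the hook name is made up; blocks and CustomBlocks are the Level arrays named above):

    // Hypothetical hook: called after the museum level has been fully sent.
    // The museum is read-only, has no physics and is private to this player,
    // so the server never reads these arrays again for this Level instance.
    static void OnMuseumSent(Level lvl) {
        lvl.blocks       = null; // main block array, width*height*length bytes
        lvl.CustomBlocks = null; // per-chunk extended block data
    }

(As noted in the reply below, this breaks /copy, which still needs the arrays.)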

Calling Cleanup() early worked in the same way (it's reasonable to assume that none of the other cleared items are used either), but made no visible difference to memory usage.

Multiple users switching simultaneously could still cause the service to die, but I expect that can be fixed by a mutex allowing only one user to be actually switching at a time.

Can we please make this change standard?

UnknownShadow200 commented 2 years ago

Being able to /copy from museums is a supported use case.

Goodlyay commented 2 years ago

I may be misunderstanding, but even if we want to allow /copy, each player should not hold a separate copy of the museum in memory if they are in the same museum, right? One museum instance should only be loaded into server RAM once.

rdebath commented 2 years ago

(grumbles) Okay, /copy, right. Merging duplicate instances of a museum level is still needed; that alone should help a lot in common use cases (i.e. "come look at this").

But there may be thousands of different backups.

If we have a large 1024x256x1024 level and 100 users start searching different museum backups, that's about 25GB of RAM. Expensive. For a small VM, just a couple of instances is enough to cause the service to die (out of memory in the GC).
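For reference, the arithmetic behind that figure, counting only the main block array at one byte per block (customBlocks overhead comes on top):

    1024 x 256 x 1024 = 268,435,456 blocks ≈ 256 MB per level copy
    256 MB x 100 users ≈ 25 GB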

How about this:

I'd say the memory limit should default to something smallish like 100MB, so that only one of these large levels is allowed at a time. But this memory limit should be overridable by users who are allowed to change it, i.e. a confirm flag should be available.

Possibly also a flag for going to a museum without /copy support, which is turned on automatically for users who can't /copy anyway -- guests.
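A rough sketch of what that budget check could look like (purely illustrative; none of these names exist in MCGalaxy, and the default limit is the 100MB suggested above):

    using System.Threading;

    static class MuseumBudget {
        static long bytesInUse = 0;
        public static long Limit = 100 * 1024 * 1024; // 100MB default, overridable

        // Returns false when loading another museum would exceed the budget;
        // the caller would then message the player or ask for a /confirm.
        public static bool TryReserve(long levelBytes) {
            long newTotal = Interlocked.Add(ref bytesInUse, levelBytes);
            if (newTotal > Limit) {
                Interlocked.Add(ref bytesInUse, -levelBytes); // roll back
                return false;
            }
            return true;
        }

        public static void Release(long levelBytes) {
            Interlocked.Add(ref bytesInUse, -levelBytes);
        }
    }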

Minimum fix

Add a lock (or interlock) around the Level import so that only one import runs at a time. This should prevent the GC from hitting a memory allocation failure. The remaining problem is that too many loaded museums still prevent normal levels from loading.

rdebath commented 2 years ago

If anyone needs a quick workaround, this will stop multiple people decompressing a museum at the same time. It will NOT prevent multiple users going to museums, but it seems to stop MCG crashing because the garbage collector breaks when everyone tries it at the same time.

commit aefdb9f2f86c452cb1db1c44d985914ebbffeed9 (HEAD -> devl)
Author: Robert de Bath <rdebath@tvisiontech.co.uk>
Date:   Thu Sep 30 07:01:02 2021 +0100

    Import singleton

diff --git a/MCGalaxy/Commands/World/CmdMuseum.cs b/MCGalaxy/Commands/World/CmdMuseum.cs
index e36ce6c5c..4c9506a6e 100644
--- a/MCGalaxy/Commands/World/CmdMuseum.cs
+++ b/MCGalaxy/Commands/World/CmdMuseum.cs
@@ -28,6 +28,7 @@ namespace MCGalaxy.Commands.World {

         const string currentFlag = "*current";
         const string latestFlag = "*latest";
+        static int singleton = 0;

         public override void Use(Player p, string message, CommandData data) {
             if (message.Length == 0) { LevelInfo.OutputBackups(p, p.level.MapName); return; }
@@ -89,7 +90,16 @@ namespace MCGalaxy.Commands.World {
         }

         static void JoinMuseum(Player p, string formattedMuseumName, string mapName, string path) {
-            Level lvl   = IMapImporter.GetFor(path).Read(path, formattedMuseumName, false);
+            if (Interlocked.CompareExchange(ref singleton, 1, 0) == 1) {
+                p.Message("Another user is loading a museum, please try again."); return;
+            }
+            Level lvl;
+            try {
+                lvl = IMapImporter.GetFor(path).Read(path, formattedMuseumName, false);
+            } finally {
+                Interlocked.Exchange(ref singleton, 0);
+            }
+
             lvl.MapName = mapName;

             SetLevelProps(lvl);
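For anyone lifting the pattern: Interlocked.CompareExchange(ref singleton, 1, 0) atomically writes 1 only if the current value is 0 and returns the previous value, so a result of 1 means another import already holds the gate; the finally block guarantees the gate is released even if Read throws. A self-contained sketch of the same gate outside MCGalaxy (requires only System.Threading):

    using System;
    using System.Threading;

    class ImportGate {
        static int busy = 0;

        // Runs work only if no other thread is inside; returns false otherwise.
        public static bool TryRun(Action work) {
            if (Interlocked.CompareExchange(ref busy, 1, 0) == 1) return false;
            try { work(); }
            finally { Interlocked.Exchange(ref busy, 0); }
            return true;
        }
    }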
UnknownShadow200 commented 2 years ago

> If anyone needs a quick workaround, this will stop multiple people decompressing a museum at the same time. It will NOT prevent multiple users going to museums, but it seems to stop MCG crashing because the garbage collector breaks when everyone tries it at the same time.

Do you have error/crash logs from this? Although an OutOfMemoryException should be thrown here, the server itself should not be taken down because of that.

UnknownShadow200 commented 2 years ago

With 53549f29310536efb12b002c0c0b90f6d2bf0ae4, multiple people going to the same museum should now only cause a minimal memory usage increase (blocks/CustomBlocks are shared between Level instances).

(although more work will be required still)
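A rough illustration of the sharing idea, in terms of the JoinMuseum code from the diff above (this is not the actual commit; FindLoadedMuseum and CloneHeader are hypothetical helpers):

    // Before importing, look for an already-loaded instance of the same
    // backup and share its arrays. Museums are read-only, so sharing the
    // backing arrays between Level instances is safe.
    Level existing = FindLoadedMuseum(path);   // hypothetical lookup by backup path
    Level lvl;
    if (existing != null) {
        lvl = CloneHeader(existing);           // hypothetical: copy metadata only
        lvl.blocks       = existing.blocks;    // share the main byte[]
        lvl.CustomBlocks = existing.CustomBlocks;
    } else {
        lvl = IMapImporter.GetFor(path).Read(path, formattedMuseumName, false);
    }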

rdebath commented 2 years ago

I only have errors like the one below in the MCG log; I think that one is successfully caught. I've tried repeating the "GC" error, but so far I've only been able to slam it so hard that the kernel just kills it.

... I suppose that's enough to show that not everything can be caught.

PS: ... Got one after limiting mono with ulimit so it doesn't get kill -9'd: Error: Garbage collector could not allocate 16384 bytes of memory for major heap section. This error made mono drop out; the kernel did NOT kill it.

This particular OOM is with no swap space; adding swap allows more users, but doesn't prevent the issue.

Kernel OOM Killer

Oct 29 16:48:39 vps-d3ae5905 kernel: [1910634.153803] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=docker-a4c9c627714ad325910b423db24d340555412270731ff1d05bf17503e9fcd104.scope,mems_allowed=0,global_oom,task_memcg=/system.slice/docker-a4c9c627714ad325910b423db24d340555412270731ff1d05bf17503e9fcd104.scope,task=mono,pid=169611,uid=1000
Oct 29 16:48:39 vps-d3ae5905 kernel: [1910634.161161] Out of memory: Killed process 169611 (mono) total-vm:2398732kB, anon-rss:1676788kB, file-rss:0kB, shmem-rss:4kB, UID:1000 pgtables:3636kB oom_score_adj:0
Oct 29 16:48:39 vps-d3ae5905 kernel: [1910634.221378] oom_reaper: reaped process 169611 (mono), now anon-rss:0kB, file-rss:0kB, shmem-rss:4kB

MCG Log error.

----09/30/2021 08:07:45 ----
Type: OutOfMemoryException
Source:
Message: Insufficient memory to continue the execution of the program.
Trace:   at (wrapper alloc) System.Object.AllocVector(intptr,intptr)
  at MCGalaxy.Level.Init (System.String name, System.UInt16 width, System.UInt16 height, System.UInt16 length) [0x00068] in <cc64bc4d07034f81a152a779c9b7efe3>:0
  at MCGalaxy.Level..ctor (System.String name, System.UInt16 width, System.UInt16 height, System.UInt16 length) [0x0014a] in <cc64bc4d07034f81a152a779c9b7efe3>:0
  at MCGalaxy.Levels.IO.LvlImporter.Read (System.IO.Stream src, System.String name, System.Boolean metadata) [0x0001d] in <cc64bc4d07034f81a152a779c9b7efe3>:0
  at MCGalaxy.Levels.IO.IMapImporter.Read (System.String path, System.String name, System.Boolean metadata) [0x00007] in <cc64bc4d07034f81a152a779c9b7efe3>:0
  at MCGalaxy.Commands.World.CmdMuseum.JoinMuseum (MCGalaxy.Player p, System.String formattedMuseumName, System.String mapName, System.String path) [0x00024] in <cc64bc4d07034f81a152a779c9b7efe3>:0
  at MCGalaxy.Commands.World.CmdMuseum.Use (MCGalaxy.Player p, System.String message, MCGalaxy.CommandData data) [0x00234] in <cc64bc4d07034f81a152a779c9b7efe3>:0
  at MCGalaxy.Player.UseCommand (MCGalaxy.Command command, System.String args, MCGalaxy.CommandData data) [0x000e4] in <cc64bc4d07034f81a152a779c9b7efe3>:0

-------------------------
UnknownShadow200 commented 2 years ago

I tried ulimit -v 100000 with two levels sized 512x64x512 and 512x36x512 loaded in memory (trying to generate larger maps would cause an OutOfMemoryException to be thrown), but I was not able to get mono to die with "Error: Garbage collector could not allocate".

rdebath commented 2 years ago

Generating maps should be pretty safe; the vast majority of the memory allocation is the main byte array, with no custom blocks, which means there will likely be hundreds of kilobytes of headroom between your different test sizes.

Loading a museum, OTOH, allocates the large array and then lots of small 4KB chunks for the custom blocks. My guess would be that this quickly gets so close to the physical limit that, when the GC needs to allocate a tiny bit of space to hold its own management information, the allocation fails and the GC has no option but to bomb.

The GC should trip before this happens, but it also can't know how close it is to the hard limit, especially as this is likely to be less tested (because the hard limit can normally be avoided by adding more swap).

I just noticed that there's a MONO_GC_PARAMS environment variable with a max-heap-size option. I would suggest that servers set this, with enough swap made available that the program can allocate a bit more than that. (Nevermind, it just gives a different error.)
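For anyone trying it anyway, the option is passed through the environment when starting the server; the value here is illustrative, and the executable name may differ for your setup:

    MONO_GC_PARAMS=max-heap-size=512m mono MCGalaxyCLI.exe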

BTW, the ulimits I tried were (probably the last one):

    #ulimit -v 2097152
    #ulimit -v 1048576
    #ulimit -v 1700000

With the map from 2d2t loaded from a museum several (8?) times.