Ylianst / MeshCentral

A complete web-based remote monitoring and management web site. Once set up, you can install agents and perform remote desktop sessions to devices on the local network or over the Internet.
https://meshcentral.com
Apache License 2.0

RAM leak to the MeshCentral server #6179

Open · sheshko-as opened this issue 2 weeks ago

sheshko-as commented 2 weeks ago

Describe the bug

During operation, RAM consumption climbs sharply until memory runs out. I increased the amount of memory: it was 4 GB, then 8 GB, now 16 GB; increasing it does not help. I checked on a dedicated server: when the memory runs out, the service keeps running, but memory stays at the limit. I checked on a VPS: there the service restarts when memory runs out. The problem may occur once a day, or perhaps once every three days, but no pattern has been found.

Information from journalctl:

Jun 14 18:56:41 kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/meshcentral.service,task=node,pid=1>
Jun 14 18:56:41 kernel: Out of memory: Killed process 15892 (node) total-vm:28494740kB, anon-rss:15090984kB, file-rss:1304kB, shmem-rss:0kB, UID:0 pgtables:49396kB oom_sco>
Jun 14 18:56:41 systemd[1]: meshcentral.service: A process of this unit has been killed by the OOM killer.
Jun 14 18:56:42 systemd[1]: meshcentral.service: Failed with result 'oom-kill'.
Jun 14 18:56:42 systemd[1]: meshcentral.service: Consumed 4h 57min 33.871s CPU time.
Jun 14 18:56:52 systemd[1]: meshcentral.service: Scheduled restart job, restart counter is at 1.
Jun 14 18:56:52 systemd[1]: Stopped MeshCentral Server.
Jun 14 18:56:52 systemd[1]: meshcentral.service: Consumed 4h 57min 33.871s CPU time.
Jun 14 18:56:52 systemd[1]: Started MeshCentral Server.
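
A rough way to narrow down when the growth starts is to log the node process RSS over time and line it up with the MeshCentral event log. A minimal sketch (the interval, grep pattern, and log path are only placeholders):

# Log the MeshCentral node process memory (RSS in KB) once a minute; adjust paths to your install.
while true; do
  echo "=== $(date '+%F %T') ==="
  ps -o pid,rss,etime,args -C node | grep -i meshcentral
  sleep 60
done >> /var/log/meshcentral-mem.log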

Server Software (please complete the following information):

Client Device (please complete the following information):

Remote Device (please complete the following information):

Your config.json file

{
  "settings": {
    "cert": "XXXXXX",
    "MongoDb": "mongodb://127.0.0.1:27017/meshcentral",
    "WANonly": true,
    "autoBackup": {
      "backupIntervalHours": 24,
      "keepLastDaysBackup": 30,
      "zipPassword": "XXXXXX",
      "webdav": {
        "url": "XXXXXX",
        "username": "XXXXXX",
        "password": "XXXXXX",
        "folderName": "XXXXXX",
        "maxFiles": 30
      }
    }
  },
  "domains": {
    "": {
      "title": "XXXXXX",
      "title2": "XXXXXX",
      "hide": 5
    }
  },
  "letsencrypt": {
    "email": "XXXXXX@XXXXXX",
    "names": "XXXXXX",
    "production": true
  }
}
si458 commented 2 weeks ago

Something happened between 11 and 12, from the looks of the graph.

Going to sound like a daft one, but can you disable/remove the autoBackup, then restart and monitor?

And the fact it looks like it's loading itself over and over again in the pic doesn't look good either?

sheshko-as commented 2 weeks ago

Going to sound like a daft one, but can you disable/remove the autoBackup, then restart and monitor?

I disabled autoBackup, rebooted the server, and am watching how the server behaves.

sheshko-as commented 2 weeks ago

Sometimes this error appears in the logs:

-------- 6/16/2024, 9:33:59 PM ---- 1.1.24 --------

(node:55552) Warning: An error event has already been emitted on the socket. Please use the destroy method on the socket while handling a 'clientError' event. (Use node --trace-warnings ... to show where the warning was created)

but I don't think it's related to the problem

si458 commented 2 weeks ago

@sheshko-as That issue has been around for about a year. It first popped up when we had to move to Node 14 and upgraded Express.js. I haven't been able to track down what line is causing it yet, but I don't think it's affecting you UNLESS the timestamp of the event is WHEN you notice memory being increased?

sheshko-as commented 1 week ago

Going to sound like a daft one, but can you disable/remove the autoBackup, then restart and monitor?

It didn't help

sheshko-as commented 1 week ago

Server Error Log:

-------- 6/19/2024, 9:02:28 PM ---- 1.1.24 --------

<--- Last few GCs --->

[89911:0x5ff48b0] 17700692 ms: Mark-sweep 4047.0 (4138.1) -> 4034.2 (4141.1) MB, 2816.1 / 0.0 ms (average mu = 0.346, current mu = 0.030) allocation failure; scavenge might not succeed
[89911:0x5ff48b0] 17705461 ms: Mark-sweep 4050.1 (4141.1) -> 4037.5 (4144.4) MB, 4691.7 / 0.0 ms (average mu = 0.167, current mu = 0.016) allocation failure; scavenge might not succeed

<--- JS stacktrace --->

FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory

-------- 6/19/2024, 9:02:28 PM ---- 1.1.24 --------

1: 0xb9c310 node::Abort() [/usr/bin/node]
2: 0xaa27ee [/usr/bin/node]
3: 0xd73eb0 v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [/usr/bin/node]
4: 0xd74257 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [/usr/bin/node]
5: 0xf515d5 [/usr/bin/node]
6: 0xf63aad v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [/usr/bin/node]
7: 0xf3e19e v8::internal::HeapAllocator::AllocateRawWithLightRetrySlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [/usr/bin/node]
8: 0xf3f567 v8::internal::HeapAllocator::AllocateRawWithRetryOrFailSlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [/usr/bin/node]
9: 0xf2076a v8::internal::Factory::NewFillerObject(int, v8::internal::AllocationAlignment, v8::internal::AllocationType, v8::internal::AllocationOrigin) [/usr/bin/node]
10: 0x12e599f v8::internal::Runtime_AllocateInYoungGeneration(int, unsigned long, v8::internal::Isolate*) [/usr/bin/node]
11: 0x17125f9 [/usr/bin/node]
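
A side note on this trace: the Mark-sweep lines show the heap dying at roughly 4 GB, i.e. at V8's heap limit, not at the machine's 16 GB. Raising the limit won't fix a leak, but as a temporary diagnostic it can show whether the growth is truly unbounded. A sketch, assuming the service runs under systemd (the drop-in path is hypothetical):

# /etc/systemd/system/meshcentral.service.d/heap.conf  (hypothetical drop-in)
[Service]
Environment=NODE_OPTIONS=--max-old-space-size=8192

# Apply with: systemctl daemon-reload && systemctl restart meshcentral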

sheshko-as commented 1 week ago

The "Useful config.json settings" from the debugging guide also didn't help:

  "AgentsInRAM": false,
  "AgentUpdateBlockSize": 2048,
  "agentUpdateSystem": 1,
  "noAgentUpdate": 1,
  "WsCompression": false,
  "AgentWsCompression": false

https://ylianst.github.io/MeshCentral/meshcentral/debugging/
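
For reference, these flags sit in the top-level "settings" section of config.json; a minimal sketch with the same values (the cert and MongoDb lines are just the placeholders from the config posted above):

{
  "settings": {
    "cert": "XXXXXX",
    "MongoDb": "mongodb://127.0.0.1:27017/meshcentral",
    "WANonly": true,
    "AgentsInRAM": false,
    "AgentUpdateBlockSize": 2048,
    "agentUpdateSystem": 1,
    "noAgentUpdate": 1,
    "WsCompression": false,
    "AgentWsCompression": false
  }
}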

silversword411 commented 1 week ago

Are you using some kind of VPN/proxy between agent and server?

Can you monitor the ws connections between agent and server? By default they will stay up once established for 24hrs...but I've seen VPN and other networking software/proxies prematurely close ws connections.

When mesh realizes its connection is dead, it stands up a new connection, but I think the old one isn't cleaned up... memory leak.
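
A quick way to sanity-check that, assuming agents reach the server on port 443: count established connections to that port over time and see whether the number keeps climbing while the online device count stays flat (that would suggest stale connections are not being cleaned up).

# Count established TCP connections to the server port every 60 s (port 443 is an assumption).
watch -n 60 "ss -Htn state established '( sport = :443 )' | wc -l"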

si458 commented 1 week ago

@silversword411 I'm not sure if that's the case, because my other post here about duplicate agents covers this exact issue: https://github.com/Ylianst/MeshCentral/discussions/6205#discussioncomment-9880303 If it notices duplicate connections/agents, it disconnects the first one and lets the second one through. Although I suppose it could be not closing the connection properly?

si458 commented 1 week ago

We need to find out what you did / what happens when the memory starts to climb. Did you connect to a certain device? Did loads of devices connect? Did someone stop a recording session, so it was saving from memory to file? Were there loads of login attempts? Have you tried switching from MongoDB to just NeDB, or say MySQL, in case it's a database issue?

sheshko-as commented 1 week ago

Are you using some kind of VPN/proxy between agent and server?

We don't use one; agents connect directly by domain name.

sheshko-as commented 1 week ago

We need to find out what you did / what happens when the memory starts to climb. Did you connect to a certain device? Did loads of devices connect? Did someone stop a recording session, so it was saving from memory to file? Were there loads of login attempts? Have you tried switching from MongoDB to just NeDB, or say MySQL, in case it's a database issue?

I'm trying to monitor all this, but I don't see any pattern yet. The database is the only thing I have not changed yet; I will try changing it to PostgreSQL, for example. There are some specifics: we have a lot of client computers with the C: drive frozen by the Shadow Defender program, and also a lot of computers that run diskless from a single VHD image (one VHD image can be used by 30-40 PCs at once). For those computers, the device group is set to remove devices when they go offline.

si458 commented 1 week ago

@sheshko-as Wow, that does sound mad/complex! I mean, you could be duplicating the meshids if you are using shared VHD images!? And that could be causing an issue/confusion on the server! So maybe, yeah, set a few groups to delete devices when they disconnect and see if it helps?

sheshko-as commented 1 week ago

@sheshko-as Wow, that does sound mad/complex! I mean, you could be duplicating the meshids if you are using shared VHD images!? And that could be causing an issue/confusion on the server! So maybe, yeah, set a few groups to delete devices when they disconnect and see if it helps?

All groups that use one VHD image per group are already configured to delete devices after they go offline.

si458 commented 1 week ago

@sheshko-as Hmmm, and they vanish OK? This is getting very strange now, as I literally have no idea what's causing the memory leak/increase. You COULD try running node --trace-warnings node_modules/meshcentral and see if that ever displays any output before it crashes; it could very well be the --trace-warnings issue. Also, you could try using the latest Node 20.15.0, as there was a push for a fix to do with misleading messages: https://github.com/nodejs/node/pull/51204
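
If it runs under systemd rather than being started by hand, one way to pass --trace-warnings is an ExecStart override; the paths below assume a typical /opt/meshcentral install, so adjust them to match your actual unit file:

# /etc/systemd/system/meshcentral.service.d/trace.conf  (hypothetical drop-in)
[Service]
ExecStart=
ExecStart=/usr/bin/node --trace-warnings /opt/meshcentral/node_modules/meshcentral

# Apply with: systemctl daemon-reload && systemctl restart meshcentral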

sheshko-as commented 1 week ago

Hmmm, and they vanish OK?

yes

sheshko-as commented 1 week ago

You COULD try running node --trace-warnings node_modules/meshcentral and see if that ever displays any output before it crashes; it could very well be the --trace-warnings issue. Also, you could try using the latest Node 20.15.0, as there was a push for a fix to do with misleading messages: nodejs/node#51204

Okay, I'll do all of that and write back with the results.

si458 commented 1 week ago

@sheshko-as No worries! In theory, Node 20.13+ will fix the Warning: An error event has already been emitted on the socket. Please use the destroy method on the socket while handling a 'clientError' event. messages (which explains why I'm not seeing them anymore, as I moved to Node 20), but I will downgrade my setup to Node 18 and monitor to see if it increases RAM-wise too.

sheshko-as commented 1 week ago

@sheshko-as No worries! In theory, Node 20.13+ will fix the Warning: An error event has already been emitted on the socket. Please use the destroy method on the socket while handling a 'clientError' event. messages (which explains why I'm not seeing them anymore, as I moved to Node 20), but I will downgrade my setup to Node 18 and monitor to see if it increases RAM-wise too.

Updated to 20.15; it did not help, the problem happened again today.

si458 commented 1 week ago

@sheshko-as Which issue, sorry? You mean the memory increase/crash? Did you see when it started increasing, and did you do anything like connect to a computer? In theory, the An error event has already been emitted warning should vanish with the latest LTS of Node 20.

sheshko-as commented 1 week ago

Did you see when it started increasing, and did you do anything like connect to a computer?

I'm trying to figure out what action causes uncontrolled RAM growth to begin, but it's not working yet.

sheshko-as commented 1 week ago

You mean the memory increase/crash?

yes

sheshko-as commented 2 days ago

I think I found it: the problem is due to relay sessions. For many users, when opening a lot of RDP sessions, several MeshCentral Router windows open instead of just one, as expected. I think the problem occurs when MeshCentral users run multiple copies of the MeshCentral Router application because of an error when clicking the RDP button in the browser. If the user has only one instance of MeshCentral Router running, RAM does not grow. I'll try to find the reason why this is happening.

si458 commented 2 days ago

OK that's an interesting theory!

So it's opening multiple MeshCentral Router sessions that seems to increase the RAM on the server side!

I'll have to test it myself; sadly I don't have 32 computers that have RDP. But I suppose I could open 32 remote desktops and see if the memory starts increasing!