Process list incomplete after reboot

lpgera commented 3 months ago

What's going wrong?

I have several Node.js processes managed by pm2, and I've installed the default systemd startup script. When I reboot the machine, pm2 starts up with an incomplete list of processes.

How could we reproduce this issue?

I managed to narrow down the issue to the following factors:

I have one process from the many that takes a long time to shut down and exit (let's say 100 seconds)
During the shutdown the systemd service will time out after 90 seconds by default (TimeoutStopUSec=1min 30s)
This will cause pm2 to dump the process list in an incomplete state (The following line appears in the logs after 90 seconds: pm2 has been killed by signal, dumping process list before exit...)

I believe #3865 was supposed to solve this issue, but apparently it didn't cover all the cases. For my use-case it'd be nice to have the option to never dump the process list during a shutdown.

Supporting information

I tried with the latest 5.4.2 version of pm2 on Ubuntu 24.04 LTS with Node.js 20.16.0.

--- PM2 report ----------------------------------------------------------------
Date                 : Mon Aug 12 2024 11:40:26 GMT+0200 (Central European Summer Time)
===============================================================================
--- Daemon -------------------------------------------------
pm2d version         : 5.4.2
node version         : 20.16.0
node path            : not found
argv                 : /home/gera/.nvm/versions/node/v20.16.0/bin/node,/home/gera/.nvm/versions/node/v20.16.0/lib/node_modules/pm2/lib/Daemon.js
argv0                : node
user                 : gera
uid                  : 501
gid                  : 501
uptime               : 22min
===============================================================================
--- CLI ----------------------------------------------------
local pm2            : 5.4.2
node version         : 20.16.0
node path            : /home/gera/.nvm/versions/node/v20.16.0/bin/pm2
argv                 : /home/gera/.nvm/versions/node/v20.16.0/bin/node,/home/gera/.nvm/versions/node/v20.16.0/bin/pm2,report
argv0                : node
user                 : gera
uid                  : 501
gid                  : 501
===============================================================================
--- System info --------------------------------------------
arch                 : arm64
platform             : linux
type                 : Linux
cpus                 : unknown
cpus nb              : 10
freemem              : 15780114432
totalmem             : 16742363136
home                 : /home/gera
===============================================================================
--- PM2 list -----------------------------------------------
┌────┬───────────┬─────────────┬─────────┬─────────┬──────────┬────────┬──────┬───────────┬──────────┬──────────┬──────────┬──────────┐
│ id │ name      │ namespace   │ version │ mode    │ pid      │ uptime │ ↺    │ status    │ cpu      │ mem      │ user     │ watching │
├────┼───────────┼─────────────┼─────────┼─────────┼──────────┼────────┼──────┼───────────┼──────────┼──────────┼──────────┼──────────┤
│ 0  │ server    │ default     │ 1.0.0   │ fork    │ 1027     │ 19m    │ 1    │ online    │ 0.8%     │ 49.3mb   │ gera     │ disabled │
└────┴───────────┴─────────────┴─────────┴─────────┴──────────┴────────┴──────┴───────────┴──────────┴──────────┴──────────┴──────────┘
===============================================================================
--- Daemon logs --------------------------------------------
/home/gera/.pm2/pm2.log last 20 lines:
PM2        | 2024-08-12T11:21:17: PM2 log: pid=995 msg=failed to kill - retrying in 100ms
PM2        | 2024-08-12T11:21:17: PM2 log: pid=995 msg=failed to kill - retrying in 100ms
PM2        | 2024-08-12T11:21:17: PM2 log: pid=995 msg=failed to kill - retrying in 100ms
PM2        | 2024-08-12T11:21:17: PM2 log: pid=995 msg=failed to kill - retrying in 100ms
PM2        | 2024-08-12T11:21:18: PM2 log: pid=995 msg=failed to kill - retrying in 100ms
PM2        | 2024-08-12T11:21:18: PM2 log: pid=995 msg=failed to kill - retrying in 100ms
PM2        | 2024-08-12T11:21:18: PM2 log: pid=995 msg=failed to kill - retrying in 100ms
PM2        | 2024-08-12T11:21:18: PM2 log: pid=995 msg=failed to kill - retrying in 100ms
PM2        | 2024-08-12T11:21:18: PM2 log: pid=995 msg=failed to kill - retrying in 100ms
PM2        | 2024-08-12T11:21:18: PM2 log: pid=995 msg=failed to kill - retrying in 100ms
PM2        | 2024-08-12T11:21:18: PM2 log: pid=995 msg=failed to kill - retrying in 100ms
PM2        | 2024-08-12T11:21:18: PM2 log: pid=995 msg=failed to kill - retrying in 100ms
PM2        | 2024-08-12T11:21:18: PM2 log: pid=995 msg=failed to kill - retrying in 100ms
PM2        | 2024-08-12T11:21:18: PM2 log: pid=995 msg=failed to kill - retrying in 100ms
PM2        | 2024-08-12T11:21:19: PM2 log: pid=995 msg=failed to kill - retrying in 100ms
PM2        | 2024-08-12T11:21:19: PM2 log: pid=995 msg=failed to kill - retrying in 100ms
PM2        | 2024-08-12T11:21:19: PM2 log: App [server:0] exited with code [0] via signal [SIGINT]
PM2        | 2024-08-12T11:21:19: PM2 log: pid=995 msg=process killed
PM2        | 2024-08-12T11:21:19: PM2 log: App [server:0] starting in -fork mode-
PM2        | 2024-08-12T11:21:19: PM2 log: App [server:0] online

wastikansari commented 3 months ago

This issue can occur due to several reasons, such as the timing of process shutdowns or the way PM2 saves the process list. Here’s how you can reproduce and potentially resolve this issue:

Set up multiple Node.js processes managed by PM2.
Install the default systemd startup script for PM2: pm2 startup systemd
Save the current process list pm2 save
Reboot the machine: sudo reboot

lpgera commented 3 months ago

@wastikansari Absolutely no offense, but are you generating comments with an LLM AI model?

Seems like your comment here, plus this, and this one are trying to be genuinely helpful but don't add too much to the conversation or are missing the point of the original question.

maganap commented 2 months ago

@lpgera Same situation as yours, and very similar issue I encounter here.

PM2@5.3.1, Debian 12. Node 20.11.0

I have 2 apps with about 30 instances each in cluster mode. Every once in a while, one of the processes is killed by OOM killer, then it all happens as you describe.

This just happened to us, notice the OOM killer starts at 14:30:28 and the pm2 dump is saved at 14:31:58, 90 seconds as you describe:

From journalctl:

Sep 04 14:30:28 pinche-mamey-2 kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/pm2-root.service,task=cpdf,pid=1232147,uid=0 Sep 04 14:30:28 pinche-mamey-2 kernel: Out of memory: Killed process 1232147 (cpdf) total-vm:37012792kB, anon-rss:29360524kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:69420kB oom_score_adj:0 Sep 04 14:30:28 pinche-mamey-2 systemd[1]: pm2-root.service: A process of this unit has been killed by the OOM killer. Sep 04 14:30:28 pinche-mamey-2 pm2[1232957]: [PM2] Applying action deleteProcessId on app [all](ids: [ Sep 04 14:30:28 pinche-mamey-2 pm2[1232957]: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, Sep 04 14:30:28 pinche-mamey-2 pm2[1232957]: 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, Sep 04 14:30:28 pinche-mamey-2 pm2[1232957]: 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, Sep 04 14:30:28 pinche-mamey-2 pm2[1232957]: 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, Sep 04 14:30:28 pinche-mamey-2 pm2[1232957]: 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, Sep 04 14:30:28 pinche-mamey-2 pm2[1232957]: 60, 61, 62 Sep 04 14:30:28 pinche-mamey-2 pm2[1232957]: ]) ... ... ... Sep 04 14:31:58 pinche-mamey-2 systemd[1]: pm2-root.service: Stopping timed out. Terminating. Sep 04 14:31:58 pinche-mamey-2 systemd[1]: pm2-root.service: Control process exited, code=killed, status=15/TERM Sep 04 14:31:58 pinche-mamey-2 systemd[1]: pm2-root.service: Failed with result 'oom-kill'. Sep 04 14:31:58 pinche-mamey-2 systemd[1]: pm2-root.service: Consumed 1w 3d 14h 53min 30.446s CPU time. Sep 04 14:31:58 pinche-mamey-2 systemd[1]: pm2-root.service: Scheduled restart job, restart counter is at 1. Sep 04 14:31:58 pinche-mamey-2 systemd[1]: Stopped pm2-root.service - PM2 process manager. Sep 04 14:31:58 pinche-mamey-2 systemd[1]: pm2-root.service: Consumed 1w 3d 14h 53min 30.446s CPU time. Sep 04 14:31:58 pinche-mamey-2 systemd[1]: Starting pm2-root.service - PM2 process manager... Sep 04 14:31:59 pinche-mamey-2 pm2[1234310]: [PM2] Spawning PM2 daemon with pm2_home=/root/.pm2 Sep 04 14:31:59 pinche-mamey-2 pm2[1234310]: [PM2] PM2 Successfully daemonized Sep 04 14:31:59 pinche-mamey-2 pm2[1234310]: [PM2][Module] Starting NPM module pm2-logrotate Sep 04 14:31:59 pinche-mamey-2 pm2[1234310]: [PM2][WARN] Applications pm2-logrotate not running, starting... Sep 04 14:31:59 pinche-mamey-2 pm2[1234310]: [PM2] App [pm2-logrotate] launched (1 instances) Sep 04 14:31:59 pinche-mamey-2 pm2[1234310]: [PM2] Resurrecting Sep 04 14:31:59 pinche-mamey-2 pm2[1234310]: [PM2] Restoring processes located in /root/.pm2/dump.pm2 Sep 04 14:31:59 pinche-mamey-2 pm2[1234310]: [PM2] Process /opt/nubik/process-server/index.js restored Sep 04 14:31:59 pinche-mamey-2 pm2[1234310]: [PM2] Process /opt/nubik/print-server2/index.js restored ... ... Sep 04 14:31:59 pinche-mamey-2 systemd[1]: Started pm2-root.service - PM2 process manager.

So I go from ~60 processes down to just 2.

In pm2.log:

2024-09-04T14:30:28: PM2 log: Stopping app:pm2-logrotate id:0 ... ... ... 2024-09-04T14:31:58: PM2 log: pid=1224571 msg=failed to kill - retrying in 100ms 2024-09-04T14:31:58: PM2 log: pm2 has been killed by signal, dumping process list before exit...

90 seconds, right. Since I also have a longer 120 second timeout for my own processes in PM2, as I understand from your description, decreasing PM2 process timeout to, let's say, 60 seconds, would avoid PM2 dumping the process list? Just because it ends before getting killed by the system as well.

What I don't get is why PM2 is deleting all the processes at T0. I guess it wouldn't mind as long as PM2 is not killed by the system (and dump.pm2 overwritten).

Sep 04 14:30:28 pinche-mamey-2 systemd[1]: pm2-root.service: A process of this unit has been killed by the OOM killer. Sep 04 14:30:28 pinche-mamey-2 pm2[1232957]: [PM2] Applying action deleteProcessId on app [all](ids: [

I'll try it out, but I have no easy way to reproduce the OOM situation. I guess I can try a different forced scenario just for testing. Any suggestions are very much appreciated.

maganap commented 2 months ago

@lpgera Just to confirm, setting kill_timeout to 60 seconds does help avoiding this:

pm2 has been killed by signal, dumping process list before exit...

I still have the main issue about hitting OOM, but not losing the original dump file helps a lot. Thx

lpgera commented 2 months ago

I'll try it out, but I have no easy way to reproduce the OOM situation. I guess I can try a different forced scenario just for testing. Any suggestions are very much appreciated.

Yeah, triggering the OOM killer may not be the best way to debug this. I suggest fiddling with the timeout settings and creating a process that is slow to shut down on purpose. You'll be able to reproduce the exact same bug.

@lpgera Just to confirm, setting kill_timeout to 60 seconds does help avoiding this:

pm2 has been killed by signal, dumping process list before exit...

I still have the main issue about hitting OOM, but not losing the original dump file helps a lot. Thx

Decreasing pm2's kill_timeout is definitely a workaround that may work for some people, but my problem is that I want to have a longer grace period to ensure that my system is shut down in a consistent state, in case a long-running job is still in progress. And sometimes multiple processes may be slow to shut down, in which case killing them is fine, but losing the process list is not fine. 😬

So while tweaking kill_timeout may yield a solution for a specific setup, I still believe that there's a bug, and pm2 should never overwrite the dump file if a shutdown is already in progress.

maganap commented 2 months ago

I agree, definitively a bug and the dump file should never be overwritten this way.

In my case, our processes handle tasks that last from a few seconds to a few minutes. So having a grace period of 2 minutes was just to give more chance for them to finish. In case they don't, once they restart, they'll start over the same interrupted task. So, keeping kill_timeout at 2 min was not critical to me, I'd rather re-process some tasks than losing the dump file.

BTW, I just noticed we hit OOM again 3 hours ago... and nobody complained because nobody noticed 😄 PM2 went up again smoothly. I know it's just a workaround but it's saved us a few headaches already, so thank you again for mentioning this in your original post.

Unitech / pm2