Open lpgera opened 3 months ago
@lpgera Same situation as yours; I'm running into a very similar issue here.
PM2 5.3.1 on Debian 12, Node 20.11.0.
I have 2 apps with about 30 instances each in cluster mode. Every once in a while, one of the processes is killed by the OOM killer, and then everything happens exactly as you describe.
This just happened to us. Notice the OOM killer starts at 14:30:28 and the pm2 dump is saved at 14:31:58, the 90 seconds you describe:
From `journalctl`:

```
Sep 04 14:30:28 pinche-mamey-2 kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/pm2-root.service,task=cpdf,pid=1232147,uid=0
Sep 04 14:30:28 pinche-mamey-2 kernel: Out of memory: Killed process 1232147 (cpdf) total-vm:37012792kB, anon-rss:29360524kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:69420kB oom_score_adj:0
Sep 04 14:30:28 pinche-mamey-2 systemd[1]: pm2-root.service: A process of this unit has been killed by the OOM killer.
Sep 04 14:30:28 pinche-mamey-2 pm2[1232957]: [PM2] Applying action deleteProcessId on app [all](ids: [
Sep 04 14:30:28 pinche-mamey-2 pm2[1232957]: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
Sep 04 14:30:28 pinche-mamey-2 pm2[1232957]: 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23,
Sep 04 14:30:28 pinche-mamey-2 pm2[1232957]: 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
Sep 04 14:30:28 pinche-mamey-2 pm2[1232957]: 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
Sep 04 14:30:28 pinche-mamey-2 pm2[1232957]: 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59,
Sep 04 14:30:28 pinche-mamey-2 pm2[1232957]: 60, 61, 62
Sep 04 14:30:28 pinche-mamey-2 pm2[1232957]: ])
...
Sep 04 14:31:58 pinche-mamey-2 systemd[1]: pm2-root.service: Stopping timed out. Terminating.
Sep 04 14:31:58 pinche-mamey-2 systemd[1]: pm2-root.service: Control process exited, code=killed, status=15/TERM
Sep 04 14:31:58 pinche-mamey-2 systemd[1]: pm2-root.service: Failed with result 'oom-kill'.
Sep 04 14:31:58 pinche-mamey-2 systemd[1]: pm2-root.service: Consumed 1w 3d 14h 53min 30.446s CPU time.
Sep 04 14:31:58 pinche-mamey-2 systemd[1]: pm2-root.service: Scheduled restart job, restart counter is at 1.
Sep 04 14:31:58 pinche-mamey-2 systemd[1]: Stopped pm2-root.service - PM2 process manager.
Sep 04 14:31:58 pinche-mamey-2 systemd[1]: pm2-root.service: Consumed 1w 3d 14h 53min 30.446s CPU time.
Sep 04 14:31:58 pinche-mamey-2 systemd[1]: Starting pm2-root.service - PM2 process manager...
Sep 04 14:31:59 pinche-mamey-2 pm2[1234310]: [PM2] Spawning PM2 daemon with pm2_home=/root/.pm2
Sep 04 14:31:59 pinche-mamey-2 pm2[1234310]: [PM2] PM2 Successfully daemonized
Sep 04 14:31:59 pinche-mamey-2 pm2[1234310]: [PM2][Module] Starting NPM module pm2-logrotate
Sep 04 14:31:59 pinche-mamey-2 pm2[1234310]: [PM2][WARN] Applications pm2-logrotate not running, starting...
Sep 04 14:31:59 pinche-mamey-2 pm2[1234310]: [PM2] App [pm2-logrotate] launched (1 instances)
Sep 04 14:31:59 pinche-mamey-2 pm2[1234310]: [PM2] Resurrecting
Sep 04 14:31:59 pinche-mamey-2 pm2[1234310]: [PM2] Restoring processes located in /root/.pm2/dump.pm2
Sep 04 14:31:59 pinche-mamey-2 pm2[1234310]: [PM2] Process /opt/nubik/process-server/index.js restored
Sep 04 14:31:59 pinche-mamey-2 pm2[1234310]: [PM2] Process /opt/nubik/print-server2/index.js restored
...
Sep 04 14:31:59 pinche-mamey-2 systemd[1]: Started pm2-root.service - PM2 process manager.
```
So I go from ~60 processes down to just 2.
In `pm2.log`:

```
2024-09-04T14:30:28: PM2 log: Stopping app:pm2-logrotate id:0
...
2024-09-04T14:31:58: PM2 log: pid=1224571 msg=failed to kill - retrying in 100ms
2024-09-04T14:31:58: PM2 log: pm2 has been killed by signal, dumping process list before exit...
```
90 seconds, right. Since I also have a longer, 120-second timeout for my own processes in PM2, then as I understand from your description, decreasing the PM2 process timeout to, say, 60 seconds would avoid PM2 dumping the process list? Simply because the shutdown finishes before PM2 itself gets killed by the system.
What I don't get is why PM2 deletes all the processes at T0. I suppose it wouldn't matter, as long as PM2 is not killed by the system (and dump.pm2 overwritten).
```
Sep 04 14:30:28 pinche-mamey-2 systemd[1]: pm2-root.service: A process of this unit has been killed by the OOM killer.
Sep 04 14:30:28 pinche-mamey-2 pm2[1232957]: [PM2] Applying action deleteProcessId on app [all](ids: [
```
I'll try it out, but I have no easy way to reproduce the OOM situation. I guess I can try a different forced scenario just for testing. Any suggestions are very much appreciated.
@lpgera Just to confirm, setting `kill_timeout` to 60 seconds does help avoid this:

```
pm2 has been killed by signal, dumping process list before exit...
```
I still have the main issue about hitting OOM, but not losing the original dump file helps a lot. Thx
> I'll try it out, but I have no easy way to reproduce the OOM situation. I guess I can try a different forced scenario just for testing. Any suggestions are very much appreciated.

Yeah, triggering the OOM killer may not be the best way to debug this. I suggest fiddling with the timeout settings and creating a process that is slow to shut down on purpose. You'll be able to reproduce the exact same bug.
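A sketch of such a deliberately slow process (the timings are arbitrary; the point is just to exceed systemd's 90-second stop timeout):

```javascript
// slow-shutdown.js — delays its own exit to simulate a long-running job
process.on('SIGINT', () => {
  console.log('SIGINT received, pretending to finish a long job...');
  // exit only after 3 minutes, well past systemd's default 90 s stop timeout
  setTimeout(() => process.exit(0), 3 * 60 * 1000);
});

// keep the event loop alive so the process doesn't exit on its own
setInterval(() => {}, 1000);
```

Run it under pm2 with a long `kill_timeout`, then stop the systemd unit and check whether dump.pm2 gets overwritten.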
> @lpgera Just to confirm, setting `kill_timeout` to 60 seconds does help avoid this: `pm2 has been killed by signal, dumping process list before exit...`
>
> I still have the main issue about hitting OOM, but not losing the original dump file helps a lot. Thx
Decreasing pm2's `kill_timeout` is definitely a workaround that may work for some people, but my problem is that I want a longer grace period to ensure my system shuts down in a consistent state in case a long-running job is still in progress. And sometimes multiple processes may be slow to shut down; killing them is fine, but losing the process list is not. 😬
So while tweaking `kill_timeout` may yield a solution for a specific setup, I still believe there's a bug: pm2 should never overwrite the dump file if a shutdown is already in progress.
I agree, it's definitely a bug, and the dump file should never be overwritten this way.
In my case, our processes handle tasks that last from a few seconds to a few minutes, so the 2-minute grace period was just to give them more of a chance to finish. If they don't finish, then once they restart they'll start the interrupted task over.
So keeping `kill_timeout` at 2 minutes was not critical for me; I'd rather re-process some tasks than lose the dump file.
BTW, I just noticed we hit OOM again 3 hours ago... and nobody complained because nobody noticed 😄 PM2 came back up smoothly. I know it's just a workaround, but it has already saved us a few headaches, so thank you again for mentioning this in your original post.
What's going wrong?
I have several Node.js processes managed by pm2, and I've installed the default systemd startup script. When I reboot the machine, pm2 starts up with an incomplete list of processes.
Related to https://github.com/Unitech/pm2/issues/3865.
How could we reproduce this issue?
I managed to narrow down the issue to the following factors:

- At least one managed process is slow to shut down, so stopping pm2 takes longer than systemd's stop timeout (`TimeoutStopUSec=1min 30s`)
- When systemd then kills pm2, pm2 dumps the partially deleted process list before exiting (`pm2 has been killed by signal, dumping process list before exit...`)

I believe #3865 was supposed to solve this issue, but apparently it didn't cover all the cases. For my use-case it'd be nice to have the option to never dump the process list during a shutdown.
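For context, the 90-second window comes from systemd's stop timeout for the unit; it can be inspected and, if desired, overridden with a drop-in (service name taken from the logs above, the override value is an example):

```shell
# show the effective stop timeout for the pm2 unit
systemctl show pm2-root.service -p TimeoutStopUSec

# override it via a drop-in (opens an editor); add, e.g.:
#   [Service]
#   TimeoutStopSec=150
sudo systemctl edit pm2-root.service
```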
Supporting information
I tried with the latest 5.4.2 version of pm2 on Ubuntu 24.04 LTS with Node.js 20.16.0.