Closed: NorseGaud closed this issue 2 years ago
Might be speaking too soon, but if I keep the backup blocks/chainstate I restore from up to date (speeding up the time the new replacement instance takes to get to MASTERNODE_SYNC_FINISHED), it seems to be fine.
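For reference, keeping that restore source current is just a periodic copy of the chain data while the source daemon is stopped; a minimal sketch, with hypothetical paths (adjust to your own layout and unit name):

sudo systemctl stop firod                                        # stop so the DB files are consistent
rsync -a --delete ~/.firo/blocks/ /backup/firo/blocks/           # block files
rsync -a --delete ~/.firo/chainstate/ /backup/firo/chainstate/   # UTXO/chainstate DB
sudo systemctl start firod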
Ok, it still happens about 3% of the time, even when the blocks/chainstate dirs are up to date and firod doesn't take long to sync. It also seems random: the node is under fairly low load when the problem starts occurring.
Even with SSDs instead of magnetic drives and low CPU usage, the firod service is still restarting every so often:
grep "Firo version" .firo/debug.log
2022-07-31 05:45:23 Firo version v0.14.11.1-g1e9bad81f
2022-07-31 08:49:42 Firo version v0.14.11.1-g1e9bad81f
2022-07-31 11:16:22 Firo version v0.14.11.1-g1e9bad81f
2022-07-31 11:20:21 Firo version v0.14.11.1-g1e9bad81f
2022-07-31 11:56:45 Firo version v0.14.11.1-g1e9bad81f
2022-07-31 11:58:51 Firo version v0.14.11.1-g1e9bad81f
2022-07-31 12:03:37 Firo version v0.14.11.1-g1e9bad81f
2022-07-31 12:28:37 Firo version v0.14.11.1-g1e9bad81f
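The systemd journal tells the same story, assuming firod runs as a unit literally named firod (adjust the unit name if yours differs):

journalctl -u firod --since today | grep -E 'Started|Main process exited'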
It seems the arm binary is just sensitive... I migrated from amd64 to arm and changed nothing else.
"date; grep 'Firo version' .firo/debug.log"
Sun Jul 31 13:43:28 UTC 2022
2022-07-31 05:38:30 Firo version v0.14.11.1-g1e9bad81f
2022-07-31 12:56:40 Firo version v0.14.11.1-g1e9bad81f
2022-07-31 13:40:54 Firo version v0.14.11.1-g1e9bad81f
I can't imagine such a small CPU spike would cause this...
It was the oom-killer on the host... Somehow it got turned back on when I migrated to arm.
sudo grep oom-kill /var/log/messages
Jul 31 12:56:10 ip-172-31-13-30 kernel: yum invoked oom-killer: gfp_mask=0x400dc0(GFP_KERNEL_ACCOUNT|__GFP_ZERO), order=0, oom_score_adj=0
Jul 31 12:56:10 ip-172-31-13-30 kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/,task=firod,pid=2211,uid=1000
Jul 31 13:40:23 ip-172-31-13-30 kernel: kthreadd invoked oom-killer: gfp_mask=0x2dc2(GFP_KERNEL|__GFP_HIGHMEM|__GFP_NOWARN|__GFP_ZERO), order=0, oom_score_adj=0
Jul 31 13:40:23 ip-172-31-13-30 kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/,task=firod,pid=7645,uid=1000
Jul 31 14:12:38 ip-172-31-13-30 kernel: firo-net invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
Jul 31 14:12:38 ip-172-31-13-30 kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/,task=firod,pid=8315,uid=1000
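For anyone comparing notes, the relevant knobs and firod's OOM exposure can be inspected read-only before changing anything (assumes a single running firod process):

sysctl vm.overcommit_memory vm.swappiness
cat /proc/$(pidof firod)/oom_score        # current OOM badness score
cat /proc/$(pidof firod)/oom_score_adj    # adjustment (default 0)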
I was able to prevent this from happening with sysctl changes:
vm.swappiness = 1
vm.overcommit_memory = 2
vm.oom-kill = 0
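To persist those across reboots, a minimal sketch assuming a standard sysctl.d setup (the file name is arbitrary, and vm.oom-kill may be rejected as an unknown key on newer kernels):

# /etc/sysctl.d/99-firod-oom.conf
vm.swappiness = 1
vm.overcommit_memory = 2
vm.oom-kill = 0

sudo sysctl --system    # reload all sysctl configuration files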
Describe the issue
I'm migrating masternodes from Intel to ARM, following this process:
At this point, things look good. I see MASTERNODE_SYNC_FINISHED, a Ready state, no PoSePenalty, etc. I even check the quorum and it has what I expected. However, a few hours later the MN goes into a restart loop.
Here is my debug.log: debug.log
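One thing worth watching in the window between everything looking good and the restart loop is firod's memory footprint, since the restarts here eventually traced back to memory pressure; a read-only sketch:

watch -n 30 'ps -o pid,rss,vsz,comm -C firod; free -m'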
Can you reliably reproduce the issue?
Yes, see above.
Expected behaviour
The firod service shouldn't restart.
Actual behaviour
The firod service restarts every few hours and the MN ends up in a restart loop.
Machine specs & versions: