firoorg / firo

The privacy-focused cryptocurrency
https://firo.org
MIT License

Masternode restarting loop #1197

Closed NorseGaud closed 2 years ago

NorseGaud commented 2 years ago

Describe the issue

I'm migrating masternodes from Intel to ARM using this process:

  1. Start the replacement arm instance alongside the old one.
  2. Install all of the tools onto the new instance.
  3. Install Firo on the new instance, let it start with the default config for 10 seconds, then stop it.
  4. Copy blocks, chainstate, database, evodb, llmq, and firo.conf from the old, still-running masternode instance to the new one.
  5. Start firod on the new ARM instance and wait 120 seconds for it to boot and start catching up.
  6. Swap the IPs, moving the old masternode instance's IP to the new instance.
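
A minimal sketch of step 4, assuming the old datadir is already reachable as a local path on the new instance (for example a mounted snapshot; how the files get there is not covered). The function name `copy_firo_data` and the paths in the usage line are hypothetical and not part of Firo.

```shell
# Copy the chain data and config listed in step 4 from SRC to DEST.
# Hypothetical helper; stop firod on the new instance before running it.
copy_firo_data() (
  src="$1"
  dest="$2"
  mkdir -p "$dest" || exit 1
  for item in blocks chainstate database evodb llmq firo.conf; do
    if [ ! -e "$src/$item" ]; then
      echo "missing in source: $item" >&2
      exit 1
    fi
    # Drop whatever the short default-config start created, then copy,
    # preserving ownership, modes, and timestamps.
    rm -rf "${dest:?}/$item"
    cp -a "$src/$item" "$dest/$item" || exit 1
  done
)
```

Usage would look like `copy_firo_data /mnt/old-firo "$HOME/.firo"`. The function stops at the first missing item rather than leaving a partially populated datadir unnoticed.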

At this point, things look good. I see MASTERNODE_SYNC_FINISHED, a Ready state, no PoSePenalty, etc. I even checked the quorum and it has what I expected.

However, a few hours later the MN goes into a restart loop.

Here is my debug.log: debug.log

Can you reliably reproduce the issue?

Yes, see above.

Expected behaviour

It shouldn't restart

Actual behaviour

It restarts

Machine specs & versions:

NorseGaud commented 2 years ago

Might be speaking too soon, but if I keep the backup blocks/chainstate I restore from up to date (speeding up the time the new replacement instance takes to get to MASTERNODE_SYNC_FINISHED), it seems to be fine.

NorseGaud commented 2 years ago

Ok, still happens 3% of the time even if the blocks/etc dirs are up to date and firod doesn't take long to sync. It seems random too, as the node is somewhat low usage when the problem starts occurring.

NorseGaud commented 2 years ago

Even with SSDs instead of magnetic drives and low CPU usage, the firod service is restarting every so often

grep "Firo version" .firo/debug.log
2022-07-31 05:45:23 Firo version v0.14.11.1-g1e9bad81f
2022-07-31 08:49:42 Firo version v0.14.11.1-g1e9bad81f
2022-07-31 11:16:22 Firo version v0.14.11.1-g1e9bad81f
2022-07-31 11:20:21 Firo version v0.14.11.1-g1e9bad81f
2022-07-31 11:56:45 Firo version v0.14.11.1-g1e9bad81f
2022-07-31 11:58:51 Firo version v0.14.11.1-g1e9bad81f
2022-07-31 12:03:37 Firo version v0.14.11.1-g1e9bad81f
2022-07-31 12:28:37 Firo version v0.14.11.1-g1e9bad81f

It seems the ARM binary is just sensitive... I migrated from amd64 to ARM and changed nothing else.
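
To see whether the restarts cluster in time, the startup lines can be grouped by hour. This is a rough sketch that assumes every firod start writes one "Firo version" line beginning with a date and a time, as in the output above; the function name `restarts_per_hour` is made up for illustration.

```shell
# Count firod startups per hour from a debug.log.
# Assumes lines of the form "YYYY-MM-DD HH:MM:SS Firo version ...".
restarts_per_hour() {
  awk '/Firo version/ { n[$1 " " substr($2, 1, 2)]++ }
       END { for (k in n) print k ":00", n[k] }' "$1" | sort
}
```

Running `restarts_per_hour .firo/debug.log` on the eight startup lines shown above would print one start in the 05:00 hour, one at 08:00, four at 11:00, and two at 12:00.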

NorseGaud commented 2 years ago
[Screenshot: Screen Shot 2022-07-31 at 9 50 03 AM]
"date; grep 'Firo version' .firo/debug.log"
WARNING: This is a private, closely monitored server.  Unless you are the only authorized user, your actions will be traced and reported to the appropriate authorities.
Sun Jul 31 13:43:28 UTC 2022
2022-07-31 05:38:30 Firo version v0.14.11.1-g1e9bad81f
2022-07-31 12:56:40 Firo version v0.14.11.1-g1e9bad81f
2022-07-31 13:40:54 Firo version v0.14.11.1-g1e9bad81f

I can't imagine such a small cpu spike would cause this...

NorseGaud commented 2 years ago

It was oom-killer on the host... Somehow it got turned back on when I migrated to ARM.

sudo grep oom-kill /var/log/messages
Jul 31 12:56:10 ip-172-31-13-30 kernel: yum invoked oom-killer: gfp_mask=0x400dc0(GFP_KERNEL_ACCOUNT|__GFP_ZERO), order=0, oom_score_adj=0
Jul 31 12:56:10 ip-172-31-13-30 kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/,task=firod,pid=2211,uid=1000
Jul 31 13:40:23 ip-172-31-13-30 kernel: kthreadd invoked oom-killer: gfp_mask=0x2dc2(GFP_KERNEL|__GFP_HIGHMEM|__GFP_NOWARN|__GFP_ZERO), order=0, oom_score_adj=0
Jul 31 13:40:23 ip-172-31-13-30 kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/,task=firod,pid=7645,uid=1000
Jul 31 14:12:38 ip-172-31-13-30 kernel: firo-net invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
Jul 31 14:12:38 ip-172-31-13-30 kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/,task=firod,pid=8315,uid=1000
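
In the log above, the process that "invoked oom-killer" (yum, kthreadd, firo-net) is only the one whose allocation triggered the kill; the victim is named in the `task=` field of the following `oom-kill:` line. A small sketch that pulls out just the victims, assuming the traditional 15-character syslog timestamp seen here (the function name `oom_victims` is hypothetical):

```shell
# Print "timestamp task pid" for every process the OOM killer selected.
# Only the kernel's "oom-kill:" summary lines carry task= and pid=.
oom_victims() {
  grep 'oom-kill:' "$1" |
    sed -n 's/^\(.\{15\}\).*task=\([^,]*\),pid=\([0-9]*\).*/\1 \2 \3/p'
}
```

Run as root (or on a readable copy of the log), for example `oom_victims /var/log/messages`. For the six lines above it would list firod three times, with pids 2211, 7645, and 8315.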

I was able to prevent this from happening with sysctl changes:

vm.swappiness = 1
vm.overcommit_memory = 2
vm.oom-kill = 0
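
Since the setting apparently did not carry over to the new instance, it may be worth confirming which of these keys the running kernel actually exposes before relying on them; not every kernel provides every key. The sketch below uses the standard mapping of sysctl keys to files under /proc/sys. The function names are invented for illustration.

```shell
# Map a sysctl key such as vm.swappiness to its /proc/sys path.
sysctl_path() {
  printf '/proc/sys/%s\n' "$(printf '%s' "$1" | tr . /)"
}

# Report, for each key given, whether the kernel exposes it and its value.
check_sysctl_keys() {
  for key in "$@"; do
    path=$(sysctl_path "$key")
    if [ -e "$path" ]; then
      echo "$key: present, current value $(cat "$path" 2>/dev/null)"
    else
      echo "$key: not exposed by this kernel"
    fi
  done
}
```

For example, `check_sysctl_keys vm.swappiness vm.overcommit_memory vm.oom-kill`. Values set with `sysctl -w` last only until reboot; putting the same `key = value` lines in a file under /etc/sysctl.d/ makes them persistent on most distributions.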