MinaProtocol / mina

Mina is a cryptocurrency protocol with a constant size blockchain, improving scaling while maintaining decentralization and security.
https://minaprotocol.com
Apache License 2.0
1.97k stars 523 forks source link

Mina OOM #10456

Open mnguyen-io opened 2 years ago

mnguyen-io commented 2 years ago

Preliminary Checks

Description

Prior to 1.3.0 we were running 1.2.2-feee67c as a seed now (we also run BPs but haven't tested on them).

After upgrading to 1.3.0 we're seeing OOM errors and crashes. Nothing has changed on our end aside from the update.

12 core/32 GB RAM/500GB NVMe drives

I've had to downgrade back to 1.2.2 and the crashes are gone.

dmesg:

[6028203.504587] mina invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0 [6028203.504588] mina cpuset=/ mems_allowed=0 [6028203.504591] CPU: 9 PID: 5338 Comm: mina Not tainted 4.15.0-156-generic #163-Ubuntu [6028203.504591] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./E3C242D4U2-2T, BIOS L2.12D 12/23/2019 [6028203.504591] Call Trace: [6028203.504595] dump_stack+0x6d/0x8b [6028203.504597] dump_header+0x71/0x282 [6028203.504598] ? security_capable+0x51/0x70 [6028203.504600] oom_kill_process+0x21f/0x420 [6028203.504601] out_of_memory+0x116/0x4e0 [6028203.504602] alloc_pages_slowpath+0xa53/0xe00 [6028203.504604] ? alloc_pages_current+0x6a/0xe0 [6028203.504605] alloc_pages_nodemask+0x29a/0x2c0 [6028203.504606] alloc_pages_current+0x6a/0xe0 [6028203.504607] page_cache_alloc+0x81/0xa0 [6028203.504608] filemap_fault+0x42f/0x750 [6028203.504609] ? filemap_map_pages+0x181/0x390 [6028203.504611] ext4_filemap_fault+0x31/0x50 [6028203.504612] __do_fault+0x34/0x130 [6028203.504613] handle_mm_fault+0xba1/0xff0 [6028203.504614] handle_mm_fault+0xe7/0x260 [6028203.504616] __do_page_fault+0x281/0x4b0 [6028203.504617] do_page_fault+0x2e/0xe0 [6028203.504618] ? page_fault+0x2f/0x50 [6028203.504619] page_fault+0x45/0x50 [6028203.504620] RIP: 0033:0x7fc54451ad1f [6028203.504620] RSP: 002b:00007fc53cbfd7e0 EFLAGS: 00010293 [6028203.504621] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00007fc54451ad1f [6028203.504621] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 [6028203.504622] RBP: 0000000000000000 R08: 00007fc53cbfd820 R09: 0000000000000000 [6028203.504622] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000 [6028203.504623] R13: 0000000000000000 R14: 00007fc53cbfd820 R15: 00007ffda9e80690 [6028203.504623] Mem-Info: [6028203.504625] active_anon:7552535 inactive_anon:445018 isolated_anon:0 active_file:190 inactive_file:20 isolated_file:0 unevictable:0 dirty:1 writeback:27 unstable:0 slab_reclaimable:21114 slab_unreclaimable:35744 mapped:112 shmem:3 pagetables:33979 bounce:0 free:50716 free_pcp:0 free_cma:0 [6028203.504627] Node 0 active_anon:30210140kB inactive_anon:1780072kB active_file:760kB inactive_file:80kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:448kB dirty:4kB writeback:108kB shmem:12kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 3215360kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes [6028203.504627] Node 0 DMA free:15896kB min:32kB low:44kB high:56kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15896kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB [6028203.504629] lowmem_reserve[]: 0 1954 31860 31860 31860 [6028203.504630] Node 0 DMA32 free:123692kB min:4140kB low:6140kB high:8140kB active_anon:1797204kB inactive_anon:136392kB active_file:220kB inactive_file:156kB unevictable:0kB writepending:80kB present:2128184kB managed:2060744kB mlocked:0kB kernel_stack:0kB pagetables:644kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB [6028203.504631] lowmem_reserve[]: 0 0 29905 29905 29905 [6028203.504632] Node 0 Normal free:63276kB min:63404kB low:94024kB high:124644kB active_anon:28414604kB inactive_anon:1640616kB active_file:740kB inactive_file:0kB unevictable:0kB writepending:280kB present:31178752kB managed:30629684kB mlocked:0kB kernel_stack:6832kB pagetables:135272kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB [6028203.504634] lowmem_reserve[]: 0 0 0 0 0 [6028203.504634] Node 0 DMA: 24kB (U) 28kB (U) 216kB (U) 132kB (U) 364kB (U) 2128kB (U) 0256kB 0512kB 11024kB (U) 12048kB (M) 34096kB (M) = 15896kB [6028203.504638] Node 0 DMA32: 744kB (UME) 848kB (UME) 14516kB (UME) 16732kB (UME) 12364kB (UME) 76128kB (UME) 34256kB (UME) 19512kB (UME) 741024kB (UME) 22048kB (M) 04096kB = 124536kB [6028203.504641] Node 0 Normal: 2994kB (UM) 5418kB (UM) 25816kB (UM) 50932kB (UM) 17064kB (UM) 61128kB (UM) 35256kB (UM) 16512kB (UM) 31024kB (M) 02048kB 0*4096kB = 64852kB [6028203.504644] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB [6028203.504644] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [6028203.504645] 943 total pagecache pages [6028203.504646] 416 pages in swap cache [6028203.504647] Swap cache stats: add 104069820, delete 104123295, find 61616366/92480408 [6028203.504647] Free swap = 0kB [6028203.504647] Total swap = 17823732kB [6028203.504648] 8330732 pages RAM [6028203.504648] 0 pages HighMem/MovableOnly [6028203.504648] 154151 pages reserved [6028203.504648] 0 pages cma reserved [6028203.504649] 0 pages hwpoisoned [6028203.504649] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name [6028203.504654] [ 722] 0 722 96030 63 741376 2998 0 systemd-journal [6028203.504656] [ 808] 0 808 26478 0 94208 78 0 lvmetad [6028203.504657] [ 817] 0 817 11430 0 118784 347 -1000 systemd-udevd [6028203.504658] [ 1041] 0 1041 1878 0 61440 91 0 mdadm [6028203.504659] [ 1175] 62583 1175 35447 0 184320 146 0 systemd-timesyn [6028203.504659] [ 1294] 100 1294 17930 0 172032 156 0 systemd-network [6028203.504660] [ 1358] 101 1358 17624 0 180224 136 0 systemd-resolve [6028203.504661] [ 1372] 111 1372 264199 0 405504 11868 0 named [6028203.504661] [ 1403] 0 1403 7509 0 106496 79 0 cron [6028203.504662] [ 1410] 102 1410 65761 0 159744 558 0 rsyslogd [6028203.504663] [ 1442] 0 1442 18206 0 180224 191 0 systemd-logind [6028203.504664] [ 1463] 103 1463 12589 1 143360 230 -900 dbus-daemon [6028203.504664] [ 1533] 0 1533 42276 1 221184 1948 0 networkd-dispat [6028203.504665] [ 1653] 0 1653 27642 0 110592 121 0 irqbalance [6028203.504666] [ 1727] 0 1727 7085 0 106496 65 0 atd [6028203.504666] [ 1737] 0 1737 151073 0 143360 372 0 lxcfs [6028203.504667] [ 1763] 0 1763 72305 0 200704 424 0 accounts-daemon [6028203.504667] [ 1797] 115 1797 79941 0 339968 525 -900 postgres [6028203.504668] [ 1804] 0 1804 46488 0 258048 1999 0 unattended-upgr [6028203.504669] [ 1862] 115 1862 79941 0 307200 528 0 postgres [6028203.504670] [ 1863] 115 1863 79941 0 319488 508 0 postgres

[6028203.504670] [ 1864] 115 1864 79941 0 311296 515 0 postgres [6028203.504671] [ 1866] 115 1866 80076 0 319488 600 0 postgres [6028203.504671] [ 1867] 115 1867 43530 0 294912 541 0 postgres [6028203.504672] [ 1869] 115 1869 80080 0 311296 567 0 postgres [6028203.504673] [ 2009] 114 2009 1293432 355 1126400 15928 0 agent [6028203.504673] [ 2010] 114 2010 529439 156 446464 5918 0 process-agent [6028203.504674] [ 2013] 114 2013 522383 179 405504 5187 0 trace-agent [6028203.504675] [ 2208] 0 2208 3668 0 69632 34 0 agetty [6028203.504675] [ 2209] 0 2209 3668 0 73728 35 0 agetty [6028203.504676] [ 2216] 0 2216 3724 0 69632 37 0 agetty [6028203.504677] [ 2410] 0 2410 16846 0 118784 138 0 master [6028203.504677] [ 2414] 113 2414 18486 0 135168 189 0 qmgr [6028203.504678] [ 4619] 0 4619 51224 4 409600 454 0 sssd [6028203.504679] [ 4633] 0 4633 65069 0 528384 1547 0 sssd_be [6028203.504679] [ 4643] 0 4643 49993 0 421888 370 0 sssd_nss [6028203.504680] [ 4645] 0 4645 45396 0 389120 364 0 sssd_pam [6028203.504680] [ 4646] 0 4646 46679 0 393216 390 0 sssd_ssh [6028203.504681] [ 4648] 0 4648 43301 0 364544 368 0 sssd_sudo [6028203.504682] [ 5750] 0 5750 18077 0 184320 210 -1000 sshd [6028203.504682] [18716] 0 18716 72222 0 212992 289 0 polkitd [6028203.504683] [29211] 0 29211 36817 0 299008 276 0 sshd [6028203.504684] [29338] 1655987419 29338 36863 0 290816 327 0 sshd [6028203.504685] [29339] 1655987419 29339 9561 1 110592 415 0 bash [6028203.504685] [29351] 0 29351 25382 0 212992 144 0 sudo [6028203.504686] [29352] 0 29352 23692 0 204800 134 0 su [6028203.504687] [29353] 0 29353 19200 1 188416 305 0 systemd [6028203.504687] [29354] 0 29354 67437 0 294912 734 0 (sd-pam) [6028203.504688] [29366] 0 29366 5375 1 86016 421 0 bash [6028203.504689] [ 1973] 113 1973 21516 0 143360 141 0 pickup [6028203.504689] [ 5320] 1004 5320 1437316 614562 11153408 63916 0 mina [6028203.504690] [ 5474] 1004 5474 667242 127 5271552 544846 0 exe [6028203.504691] [ 5623] 1004 5623 2652799 35003 20897792 364706 0 exe [6028203.504691] [ 5684] 1004 5684 66142 136 495616 11039 0 exe [6028203.504693] [13686] 1004 13686 11826805 7345052 88743936 3410845 0 coda-libp2p_hel [6028203.504693] [15082] 0 15082 36793 0 299008 276 0 sshd [6028203.504694] [15206] 1655987419 15206 36822 5 290816 291 0 sshd [6028203.504695] [15207] 1655987419 15207 9561 1 102400 415 0 bash [6028203.504695] [15228] 0 15228 25382 1 212992 144 0 sudo [6028203.504696] [15229] 0 15229 23692 0 196608 134 0 su [6028203.504696] [15230] 0 15230 5374 0 90112 438 0 bash [6028203.504697] Out of memory: Kill process 13686 (coda-libp2p_hel) score 853 or sacrifice child [6028203.608533] Killed process 13686 (coda-libp2p_hel) total-vm:47307220kB, anon-rss:29380208kB, file-rss:0kB, shmem-rss:0kB

coda_crash_report_2022-03-10_13-13-03.530653.tar.gz

Steps to Reproduce

upgrade node - wait to see if it comes up.

Expected Result

No crash

Actual Result

oom crash/kill

[6028203.504697] Out of memory: Kill process 13686 (coda-libp2p_hel) score 853 or sacrifice child [6028203.608533] Killed process 13686 (coda-libp2p_hel) total-vm:47307220kB, anon-rss:29380208kB, file-rss:0kB, shmem-rss:0kB

How frequently do you see this issue?

Frequently

What is the impact of this issue on your ability to run a node?

High

Status

Mina daemon status
-----------------------------------

Max observed block height:              116394
Max observed unvalidated block height:  0
Local uptime:                           14m5s
Chain id:                               5f704cc0c82e0ed70e873f0893d7e06f148524e3f0bdae2afb02e7819a0c24d1
Git SHA-1:                              9b0369c27bb85c8ab2f8725c6e977eb27b53b826
Configuration directory:                /home/mina/.mina-config
Peers:                                  84
User_commands sent:                     0
SNARK worker:                           None
SNARK work fee:                         100000000
Sync status:                            Bootstrap
Block producers running:                0
Coinbase receiver:                      Block producer
Consensus time now:                     epoch=24, slot=732
Consensus mechanism:                    proof_of_stake
Consensus configuration:                
        Delta:                     0
        k:                         290
        Slots per epoch:           7140
        Slot duration:             3m
        Epoch duration:            14d21h
        Chain start timestamp:     2021-03-17 00:00:00.000000Z
        Acceptable network delay:  3m

Addresses and ports:                    
        External IP:    XXXXX
        Bind IP:        0.0.0.0
        Libp2p PeerID:  12D3KooWSkfwArLtqGMht1a9w3z3QiiqA2E6seBRAk378rvanGRZ
        Libp2p port:    8302
        Client port:    8301

Additional information

No response

nholland94 commented 2 years ago

Hi @mnguyen-io. I took a look at the logs, and it looks like the libp2p_helper process is the one crashing due to OOM. I'm not sure how you are deploying, but just for a sanity check, could you send the output of md5sum $(which libp2p_helper) so that I can confirm it is the correct distribution? Also, if you can recreate the issue and upload all the *.log file from ~/.mina-config, that would provide a lot more information to debug this issue, as the crash log only contains the most recent few lines leading up to the crash.

mnguyen-io commented 2 years ago

Hi - thank you for replying. I'll reinstall the latest mina again shortly and report back.

$ md5sum /usr/local/bin/coda-libp2p_helper 45f5cfb20b94a3374a3007f8be35de48 /usr/local/bin/coda-libp2p_helper

Nothing special about how we're deploying, just via apt install. No docker containers here.

mnguyen-io commented 2 years ago

mina_logs.tar.gz mina_crash.tar.gz

Hi - just reinstalled. Please see new crash logs and mina logs

Szymongib commented 1 year ago

We are running upstream docker images and are seeing similar issue:

09:18:08 UTC [Fatal] libp2p_helper process died unexpectedly: "died after receiving sigkill (signal number 9)"
Out of memory: Killed process 2776276 (coda-libp2p_hel) total-vm:129404584kB, anon-rss:125029196kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:247992kB oom_score_adj:0

Image version is minaprotocol/mina-daemon:1.3.1.1-f361ba1-bullseye-mainnet

winlin commented 1 year ago

we are run mina node with 8CPU & 16 G RAM, often got OOM:

Nov 27 01:00:08  kernel: [41991323.132296] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/supervisor.service,task=mina,pid=2617814,uid=0
Nov 27 01:00:08  kernel: [41991323.132370] Out of memory: Killed process 2617814 (mina) total-vm:10924232kB, anon-rss:5752348kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:16404kB oom_score_adj:0
Nov 27 01:00:08  kernel: [41991323.290090] oom_reaper: reaped process 2617814 (mina), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Nov 27 01:00:08  supervisord[6974]: 2022-11-27 01:00:08,497 INFO exited: minabp (terminated by SIGKILL; not expected)
Nov 27 01:00:09  supervisord[6974]: 2022-11-27 01:00:09,562 INFO spawned: 'minabp' with pid 2633471
Nov 27 01:01:00  supervisord[6974]: 2022-11-27 01:01:00,088 INFO success: minabp entered RUNNING state, process has stayed up for > than 50 seconds (startsecs)
Nov 27 01:05:01  CRON[2633805]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
YuXiaoCoder commented 1 year ago

I have the same problem