cyberFund / cybernode-archive

🚀 Manager of docker images for cybernomics
MIT License

Repeating Mars failures #114

Closed abitrolly closed 6 years ago

abitrolly commented 6 years ago

UPDATE: Native AMD drivers seem to make the system more stable. The ticket can be closed after 20 days of uptime.

sudo less /var/log/syslog

Nov 28 14:15:04 mars kernel: [99703.337210] traps: Verifier #1[18176] general protection ip:7efd23d3b177 sp:7efd191f9610 error:0 in libc-2.19.so[7efd23d01000+1be000]
Nov 28 14:15:04 mars kernel: [99703.494453] BUG: unable to handle kernel paging request at ffff94000a937d68
Nov 28 14:15:04 mars kernel: [99703.494850] IP: jbd2_journal_grab_journal_head+0x9/0x40
Nov 28 14:15:04 mars kernel: [99703.495231] PGD afa123067 
Nov 28 14:15:04 mars kernel: [99703.495232] P4D afa123067 
Nov 28 14:15:04 mars kernel: [99703.495635] PUD 0 
Nov 28 14:15:04 mars kernel: [99703.496006] 
Nov 28 14:15:04 mars kernel: [99703.496746] Oops: 0002 [#1] SMP
Nov 28 14:15:04 mars kernel: [99703.497130] Modules linked in: veth xt_nat xt_tcpudp ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack libcrc32c br_netfilter bridge stp llc overlay snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi edac_mce_amd snd_hda_intel kvm_amd snd_hda_codec snd_hda_core kvm snd_hwdep snd_pcm irqbypass crct10dif_pclmul snd_seq_midi crc32_pclmul snd_seq_midi_event ghash_clmulni_intel pcbc snd_rawmidi snd_seq snd_seq_device aesni_intel snd_timer eeepc_wmi aes_x86_64 asus_wmi snd crypto_simd glue_helper sparse_keymap cryptd video wmi_bmof serio_raw soundcore ccp i2c_piix4 input_leds joydev shpchp 8250_dw mac_hid parport_pc ppdev lp parport ip_tables
Nov 28 14:15:04 mars kernel: [99703.500025]  x_tables autofs4 hid_generic usbhid hid amdkfd amd_iommu_v2 amdgpu mxm_wmi ttm drm_kms_helper syscopyarea sysfillrect sysimgblt psmouse fb_sys_fops igb drm dca i2c_algo_bit ahci ptp pps_core nvme libahci nvme_core gpio_amdpt wmi gpio_generic
Nov 28 14:15:04 mars kernel: [99703.501525] CPU: 7 PID: 128 Comm: kswapd0 Not tainted 4.13.0-17-generic #20-Ubuntu
Nov 28 14:15:04 mars kernel: [99703.502036] Hardware name: System manufacturer System Product Name/CROSSHAIR VI HERO, BIOS 1701 09/22/2017
Nov 28 14:15:04 mars kernel: [99703.502561] task: ffff94110c8a9740 task.stack: ffffb4a686ea4000
Nov 28 14:15:04 mars kernel: [99703.503086] RIP: 0010:jbd2_journal_grab_journal_head+0x9/0x40
Nov 28 14:15:04 mars kernel: [99703.503615] RSP: 0018:ffffb4a686ea79d0 EFLAGS: 00010206
Nov 28 14:15:04 mars kernel: [99703.504145] RAX: 0000000000000000 RBX: ffff94000a937d68 RCX: 0000000000000000
Nov 28 14:15:04 mars kernel: [99703.504694] RDX: 0000000000000000 RSI: ffffe928d43ccd00 RDI: ffff94000a937d68
Nov 28 14:15:04 mars kernel: [99703.505280] RBP: ffffb4a686ea79d0 R08: 0000000000020120 R09: 0000000000000010
Nov 28 14:15:04 mars kernel: [99703.505762] R10: ffff940b28b78700 R11: 0000000000000011 R12: 0000000000000000
Nov 28 14:15:04 mars kernel: [99703.506285] R13: ffff94080a937d68 R14: ffffe928d43ccd00 R15: ffff941108923388
Nov 28 14:15:04 mars kernel: [99703.506812] FS:  0000000000000000(0000) GS:ffff94111e7c0000(0000) knlGS:0000000000000000
Nov 28 14:15:04 mars kernel: [99703.507346] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 28 14:15:04 mars kernel: [99703.507904] CR2: ffff94000a937d68 CR3: 00000003e19bc000 CR4: 00000000003406e0
Nov 28 14:15:04 mars kernel: [99703.508448] Call Trace:
Nov 28 14:15:04 mars kernel: [99703.509049]  jbd2_journal_try_to_free_buffers+0x90/0x110
Nov 28 14:15:04 mars kernel: [99703.509543]  ext4_releasepage+0x52/0xb0
Nov 28 14:15:04 mars kernel: [99703.510060]  try_to_release_page+0x41/0x50
Nov 28 14:15:04 mars kernel: [99703.510608]  invalidate_inode_page+0x66/0x80
Nov 28 14:15:04 mars kernel: [99703.511155]  invalidate_mapping_pages+0x129/0x2a0
Nov 28 14:15:04 mars kernel: [99703.511715]  inode_lru_isolate+0x131/0x180
Nov 28 14:15:04 mars kernel: [99703.512259]  __list_lru_walk_one.isra.5+0x8c/0x130
Nov 28 14:15:04 mars kernel: [99703.512842]  ? iput+0x220/0x220
Nov 28 14:15:04 mars kernel: [99703.513380]  list_lru_walk_one+0x23/0x30
Nov 28 14:15:04 mars kernel: [99703.513874]  prune_icache_sb+0x4f/0x80
Nov 28 14:15:04 mars kernel: [99703.514367]  super_cache_scan+0x134/0x1b0
Nov 28 14:15:04 mars kernel: [99703.514859]  shrink_slab.part.48+0x1d6/0x3d0
Nov 28 14:15:04 mars kernel: [99703.515351]  shrink_slab+0x1b/0x30
Nov 28 14:15:04 mars kernel: [99703.515874]  shrink_node+0x11e/0x300
Nov 28 14:15:04 mars kernel: [99703.516410]  kswapd+0x2cc/0x750
Nov 28 14:15:04 mars kernel: [99703.516942]  kthread+0x125/0x140
Nov 28 14:15:04 mars kernel: [99703.517471]  ? mem_cgroup_shrink_node+0x180/0x180
Nov 28 14:15:04 mars kernel: [99703.518028]  ? kthread_create_on_node+0x70/0x70
Nov 28 14:15:04 mars kernel: [99703.518549]  ret_from_fork+0x25/0x30
Nov 28 14:15:04 mars kernel: [99703.519066] Code: 48 89 c1 31 c0 eb c6 48 c7 c6 60 11 65 ad 48 c7 c7 a4 2c 8d ad e8 a9 91 d9 ff e9 6b ff ff ff 0f 1f 00 0f 1f 44 00 00 55 48 89 e5 <f0> 0f ba 2f 18 72 19 48 8b 07 a9 00 00 02 00 74 1d 48 8b 47 40 
Nov 28 14:15:04 mars kernel: [99703.520186] RIP: jbd2_journal_grab_journal_head+0x9/0x40 RSP: ffffb4a686ea79d0
Nov 28 14:15:04 mars kernel: [99703.520752] CR2: ffff94000a937d68
Nov 28 14:15:04 mars kernel: [99703.521319] ---[ end trace 768b15826ea4e161 ]---
abitrolly commented 6 years ago

Need to run fsck when the server is free. A memtest run a month ago didn't reveal any problems.

abitrolly commented 6 years ago

Mars was down again. /var/log/syslog.1:

Dec  3 11:51:16 mars systemd-timesyncd[914]: Timed out waiting for reply from 91.189.89.199:123 (ntp.ubuntu.com).
Dec  3 11:51:26 mars systemd-timesyncd[914]: Timed out waiting for reply from 91.189.94.4:123 (ntp.ubuntu.com).
Dec  3 11:53:39 mars NetworkManager[1153]: <info>  [1512291219.3707] connectivity: (enp35s0) response shorter than expected 'NetworkManager is online'; assuming captive portal.
Dec  3 11:58:39 mars NetworkManager[1153]: <info>  [1512291519.2776] connectivity: (enp35s0) response shorter than expected 'NetworkManager is online'; assuming captive portal.
Dec  3 12:02:19 mars systemd[1]: Started Run anacron jobs.
Dec  3 12:02:19 mars anacron[29104]: Anacron 2.3 started on 2017-12-03
Dec  3 12:02:19 mars anacron[29104]: Normal exit (0 jobs run)
Dec  3 12:03:39 mars NetworkManager[1153]: <info>  [1512291819.3449] connectivity: (enp35s0) response shorter than expected 'NetworkManager is online'; assuming captive portal.
Dec  3 12:08:39 mars NetworkManager[1153]: <info>  [1512292119.3716] connectivity: (enp35s0) response shorter than expected 'NetworkManager is online'; assuming captive portal.
Dec  3 12:09:57 mars colord[1199]: failed to get session [pid 27963]: No data available
Dec  3 12:13:39 mars NetworkManager[1153]: <info>  [1512292419.3713] connectivity: (enp35s0) response shorter than expected 'NetworkManager is online'; assuming captive portal.
[... several lines of NUL bytes (^@) omitted ...]
Dec  4 11:18:47 mars rsyslogd: [origin software="rsyslogd" swVersion="8.16.0" x-pid="1092" x-info="http://www.rsyslog.com"] start
Dec  4 11:18:47 mars systemd-modules-load[389]: Inserted module 'lp'
Dec  4 11:18:47 mars systemd-modules-load[389]: Inserted module 'ppdev'
Dec  4 11:18:47 mars systemd-modules-load[389]: Inserted module 'parport_pc'
Dec  4 11:18:47 mars keyboard-setup.sh[388]: cannot open file /tmp/tmpkbd.vaj1SS
Dec  4 11:18:47 mars systemd[1]: Started udev Kernel Device Manager.
Dec  4 11:18:47 mars systemd[1]: Starting Remount Root and Kernel File Systems...
Dec  4 11:18:47 mars systemd[1]: Started Remount Root and Kernel File Systems.

@litvintech @hleb-albau if that repeats, please add more data to this issue.
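To make it easier to add data next time, here is a minimal sketch of a syslog scanner. It picks out kernel lines containing the fault markers seen in the traces above and counts NUL bytes, which `less` displays as `^@`. The marker list and the function name `scan_syslog` are assumptions for illustration, not an exhaustive detector.

```python
# Markers copied from the traces in this issue; an assumed, non-exhaustive list.
FAULT_MARKERS = ("BUG:", "Oops:", "general protection")


def scan_syslog(text):
    """Return (fault_lines, nul_count) for a chunk of syslog text."""
    # Keep only kernel lines that contain one of the fault markers.
    fault_lines = [
        line
        for line in text.splitlines()
        if " kernel: " in line and any(marker in line for marker in FAULT_MARKERS)
    ]
    # A run of NUL bytes usually appears as ^@ in a pager.
    nul_count = text.count("\x00")
    return fault_lines, nul_count
```

It could be fed the file contents with something like `open("/var/log/syslog", errors="replace").read()`, run with enough privileges to read the log.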

abitrolly commented 6 years ago

We need a solution for server monitoring.

https://github.com/etsy/statsd seems to be what the Wargaming web team was talking about at the last Minsk Python Meetup.
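For reference, StatsD accepts plain-text metrics over UDP (port 8125 by default), so reporting from a server needs very little code. A minimal sketch, assuming a StatsD daemon is listening on the given host; the metric name in the usage note is hypothetical.

```python
import socket


def format_gauge(name, value):
    """Build a StatsD gauge datagram of the form '<name>:<value>|g'."""
    return f"{name}:{value}|g".encode("ascii")


def send_gauge(name, value, host="127.0.0.1", port=8125):
    """Send one gauge over UDP; delivery is fire-and-forget."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(format_gauge(name, value), (host, port))
```

Calling `send_gauge("mars.swap_used_percent", 99.9)` from a cron job would be one way to track the server between hangs.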

abitrolly commented 6 years ago
Dec 19 17:52:27 mars kernel: [ 5210.729762] general protection fault: 0000 [#2] SMP
Dec 19 17:52:27 mars kernel: [ 5210.731261] Modules linked in: btrfs xor raid6_pq ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs veth cfg80211 xt_nat xt_tcpudp ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack libcrc32c br_netfilter bridge stp llc overlay edac_mce_amd kvm_amd kvm irqbypass crct10dif_pclmul crc32_pclmul snd_hda_codec_realtek ghash_clmulni_intel pcbc snd_hda_codec_generic aesni_intel snd_hda_codec_hdmi aes_x86_64 crypto_simd glue_helper cryptd snd_hda_intel snd_seq_midi snd_seq_midi_event snd_hda_codec snd_rawmidi snd_hda_core eeepc_wmi asus_wmi snd_hwdep sparse_keymap serio_raw video snd_seq wmi_bmof snd_pcm snd_seq_device i2c_piix4 snd_timer ccp snd soundcore
Dec 19 17:52:27 mars kernel: [ 5210.737824]  joydev input_leds shpchp 8250_dw mac_hid parport_pc ppdev lp parport ip_tables x_tables autofs4 hid_generic usbhid hid amdkfd amd_iommu_v2 amdgpu mxm_wmi ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm igb psmouse dca i2c_algo_bit ptp ahci nvme pps_core libahci nvme_core gpio_amdpt wmi gpio_generic
Dec 19 17:52:27 mars kernel: [ 5210.741425] CPU: 8 PID: 25302 Comm: bitcoin-httpwor Tainted: G      D         4.13.0-19-generic #22-Ubuntu
Dec 19 17:52:27 mars kernel: [ 5210.743185] Hardware name: System manufacturer System Product Name/CROSSHAIR VI HERO, BIOS 1701 09/22/2017
Dec 19 17:52:27 mars kernel: [ 5210.744953] task: ffff94f40b2645c0 task.stack: ffffa4f68e9dc000
Dec 19 17:52:27 mars kernel: [ 5210.746776] RIP: 0010:list_lru_del+0x94/0x140
Dec 19 17:52:27 mars kernel: [ 5210.748399] RSP: 0000:ffffa4f68e9df980 EFLAGS: 00010006
Dec 19 17:52:27 mars kernel: [ 5210.750074] RAX: ffff94fc8b1d07e0 RBX: ffff94fc8c8abe80 RCX: 9c2a8969439858af
Dec 19 17:52:27 mars kernel: [ 5210.751729] RDX: d64808357a5a976e RSI: fffff2e77bab9d9f RDI: ffff94fc8c8abe80
Dec 19 17:52:27 mars kernel: [ 5210.753438] RBP: ffffa4f68e9df9a0 R08: 00000000ffffffff R09: 0000000000000000
Dec 19 17:52:27 mars kernel: [ 5210.755095] R10: ffff94fbaae774a0 R11: 0000000000000001 R12: ffff94fc2ae77490
Dec 19 17:52:27 mars kernel: [ 5210.756749] R13: 0000000000000000 R14: ffff94fbaae77490 R15: 0000000000000000
Dec 19 17:52:27 mars kernel: [ 5210.758417] FS:  00007fbc1b7fe700(0000) GS:ffff94fc9e800000(0000) knlGS:0000000000000000
Dec 19 17:52:27 mars kernel: [ 5210.760062] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 19 17:52:27 mars kernel: [ 5210.761803] CR2: 00007fbbeeb41aec CR3: 00000009b99b1000 CR4: 00000000003406e0
Dec 19 17:52:27 mars kernel: [ 5210.763479] Call Trace:
Dec 19 17:52:27 mars kernel: [ 5210.765185]  ? count_shadow_nodes+0xb0/0xb0
Dec 19 17:52:27 mars kernel: [ 5210.767281]  workingset_update_node+0x4f/0x70
Dec 19 17:52:27 mars kernel: [ 5210.768944]  __radix_tree_replace+0x70/0xf0
Dec 19 17:52:27 mars kernel: [ 5210.770535]  page_cache_tree_insert+0x84/0xc0
Dec 19 17:52:27 mars kernel: [ 5210.772060]  __add_to_page_cache_locked+0xc3/0x200
Dec 19 17:52:27 mars kernel: [ 5210.773536]  add_to_page_cache_lru+0x4e/0xe0
Dec 19 17:52:27 mars kernel: [ 5210.775010]  ext4_mpage_readpages+0x144/0x980
Dec 19 17:52:27 mars kernel: [ 5210.776384]  ? alloc_pages_current+0x6a/0xe0
Dec 19 17:52:27 mars kernel: [ 5210.777735]  ext4_readpages+0x33/0x40
Dec 19 17:52:27 mars kernel: [ 5210.779075]  __do_page_cache_readahead+0x1c3/0x280
Dec 19 17:52:27 mars kernel: [ 5210.780508]  filemap_fault+0x354/0x5e0
Dec 19 17:52:27 mars kernel: [ 5210.781826]  ? filemap_fault+0x354/0x5e0
Dec 19 17:52:27 mars kernel: [ 5210.783149]  ? filemap_map_pages+0x179/0x320
Dec 19 17:52:27 mars kernel: [ 5210.784609]  ext4_filemap_fault+0x31/0x50
Dec 19 17:52:27 mars kernel: [ 5210.786472]  __do_fault+0x1e/0xb0
Dec 19 17:52:27 mars kernel: [ 5210.788519]  __handle_mm_fault+0xba7/0x1020
Dec 19 17:52:27 mars kernel: [ 5210.790035]  handle_mm_fault+0xb1/0x200
Dec 19 17:52:27 mars kernel: [ 5210.791467]  __do_page_fault+0x24d/0x4d0
Dec 19 17:52:27 mars kernel: [ 5210.792879]  ? filp_close+0x53/0x80
Dec 19 17:52:27 mars kernel: [ 5210.794279]  do_page_fault+0x22/0x30
Dec 19 17:52:27 mars kernel: [ 5210.795672]  page_fault+0x28/0x30
Dec 19 17:52:27 mars kernel: [ 5210.797047] RIP: 0033:0x558d8f992ae5
Dec 19 17:52:27 mars kernel: [ 5210.798472] RSP: 002b:00007fbc1b7fc810 EFLAGS: 00010246
Dec 19 17:52:27 mars kernel: [ 5210.800050] RAX: 0000000000000000 RBX: 00007fbc1b7fc8b0 RCX: 0000000000000000
Dec 19 17:52:27 mars kernel: [ 5210.801452] RDX: 00007fbc1b7fc8a0 RSI: 00007fbc1b7fc8e0 RDI: 00007fbc1b7fc8b0
Dec 19 17:52:27 mars kernel: [ 5210.802866] RBP: 00007fbbeeb41ac0 R08: 00007fbc1b7fc8a0 R09: 00007fbc1b7fc900
Dec 19 17:52:27 mars kernel: [ 5210.804278] R10: 0000000000000001 R11: 0000000000000000 R12: 00007fbc1b7fc8a0
Dec 19 17:52:27 mars kernel: [ 5210.805687] R13: 0000000000000000 R14: 00007fbc1b7fc8a0 R15: 0000558d923f7250
Dec 19 17:52:27 mars kernel: [ 5210.807106] Code: 0f 1f 40 00 31 c0 5b 41 5c 41 5d 41 5e 5d c3 48 8b 53 20 48 85 d2 74 05 e9 3b 00 00 00 48 8d 43 08 49 8b 0e 49 8b 56 08 48 89 df <48> 89 51 08 48 89 0a 4d 89 36 4d 89 76 08 48 83 68 10 01 48 83 
Dec 19 17:52:27 mars kernel: [ 5210.808634] RIP: list_lru_del+0x94/0x140 RSP: ffffa4f68e9df980
Dec 19 17:52:27 mars kernel: [ 5210.810342] ---[ end trace 964d9a4894744b45 ]---
abitrolly commented 6 years ago

It hung again last Saturday with this screen, which shows 99.9% swap used and 49.6% memory used (???).

[screenshot: 20171225_130229]
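To cross-check numbers like these without the GUI, the same percentages can be derived from `/proc/meminfo`. A minimal sketch, assuming the standard `MemTotal`, `MemAvailable`, `SwapTotal` and `SwapFree` fields; the tool in the screenshot may define "used memory" differently, so the results need not match it exactly.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: value in kB}."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        # Skip lines that do not carry a numeric value.
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def usage_percent(info):
    """Return (memory_used_percent, swap_used_percent)."""
    mem = 100.0 * (info["MemTotal"] - info["MemAvailable"]) / info["MemTotal"]
    swap_total = info["SwapTotal"]
    # Avoid dividing by zero on hosts without swap.
    swap = 100.0 * (swap_total - info["SwapFree"]) / swap_total if swap_total else 0.0
    return mem, swap
```

On the server this would be called as `usage_percent(parse_meminfo(open("/proc/meminfo").read()))`.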

abitrolly commented 6 years ago

I have killed the GUI part since then. I needed to get sound out of the Radeon RX 470/480 HDMI, and because the open source driver doesn't support HDMI, I installed the AMD Pro driver, which doesn't work. It looks like we need the standard desktop driver, not Pro.

abitrolly commented 6 years ago

Now getting a newer driver from http://support.amd.com/en-us/kb-articles/Pages/Radeon-Software-for-Linux-Release-Notes.aspx

abitrolly commented 6 years ago

@YodaMike rebooted the server several days ago because docker refused to kill a frozen container. New uptime is here:

$ uptime
 13:29:47 up 1 day, 23:06,  3 users,  load average: 1.75, 1.87, 1.94

The ticket can be closed after 20 days of uptime.
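The closing criterion can be checked mechanically from `/proc/uptime`, whose first field is the number of seconds since boot. A minimal sketch; the function names and the `required_days` parameter are assumptions for illustration.

```python
def uptime_days(proc_uptime_text):
    """Convert the first field of /proc/uptime (seconds since boot) to days."""
    return float(proc_uptime_text.split()[0]) / 86400.0


def can_close(proc_uptime_text, required_days=20):
    """True once the machine has stayed up for at least required_days."""
    return uptime_days(proc_uptime_text) >= required_days
```

On the server itself this would be called as `can_close(open("/proc/uptime").read())`.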

abitrolly commented 6 years ago

Updates require us to reboot the server from time to time, so it is hard to reach 20 days of uptime. But since installing the native AMD drivers we haven't caught it hanging. We've also got sound over HDMI as a result. Closing for now.