adoptium / infrastructure

This repo contains all information about machine maintenance.
Apache License 2.0

build-marist-rhel77-s390x-[234] crashing daily #1154

Closed sxa closed 4 years ago

sxa commented 4 years ago

The first build machine seems ok, but the second one (148.100.86.218) is repeatedly falling over:

linux1   pts/0        195.212.29.74    Thu Feb 20 06:39   still logged in   
reboot   system boot  3.10.0-1062.12.1 Wed Feb 19 15:50 - 06:41  (14:50)    
reboot   system boot  3.10.0-1062.12.1 Tue Feb 18 15:49 - 06:41 (1+14:51)   
root     pts/0        195.212.29.94    Tue Feb 18 13:02 - 13:21  (00:19)    
root     pts/1        195.212.29.94    Tue Feb 18 10:25 - 12:40  (02:15)    
linux1   pts/0        195.212.29.94    Tue Feb 18 09:55 - 12:14  (02:18)    
reboot   system boot  3.10.0-1062.12.1 Mon Feb 17 06:09 - 06:41 (3+00:31)   
reboot   system boot  3.10.0-1062.12.1 Sat Feb 15 19:11 - 06:41 (4+11:29)   
reboot   system boot  3.10.0-1062.12.1 Sat Feb 15 14:14 - 06:41 (4+16:26)   
reboot   system boot  3.10.0-1062.12.1 Thu Feb 13 14:34 - 06:41 (6+16:06)   
reboot   system boot  3.10.0-1062.12.1 Wed Feb 12 14:14 - 06:41 (7+16:26)   
root     pts/0        195.212.29.66    Wed Feb 12 10:21 - 12:34  (02:12)    
reboot   system boot  3.10.0-1062.12.1 Mon Feb 10 19:12 - 06:41 (9+11:28)   
linux1   pts/0        195.212.29.86    Mon Feb 10 10:01 - 10:07  (00:06)    
linux1   pts/0        195.212.29.86    Mon Feb 10 07:30 - 07:30  (00:00)    
linux1   pts/0        195.212.29.86    Fri Feb  7 11:09 - 12:06  (00:56)    
linux1   pts/0        195.212.29.86    Fri Feb  7 10:44 - 10:46  (00:01)    
linux1   pts/0        195.212.29.86    Thu Feb  6 14:16 - 14:16  (00:00)    
reboot   system boot  3.10.0-1062.12.1 Thu Feb  6 07:51 - 06:41 (13+22:49)  
reboot   system boot  3.10.0-1062.9.1. Tue Feb  4 14:13 - 06:41 (15+16:27)  
linux1   pts/2        195.212.29.82    Tue Feb  4 10:18 - 10:19  (00:00)    
linux1   pts/1        195.212.29.82    Tue Feb  4 10:10 - 12:32  (02:21)    
linux1   pts/0        195.212.29.82    Tue Feb  4 08:57 - 11:23  (02:25)    
reboot   system boot  3.10.0-1062.9.1. Mon Feb  3 07:40 - 06:41 (16+23:00)  
reboot   system boot  3.10.0-1062.9.1. Sun Feb  2 07:43 - 06:41 (17+22:57)  
reboot   system boot  3.10.0-1062.9.1. Sat Feb  1 23:40 - 06:41 (18+07:00)  
reboot   system boot  3.10.0-1062.9.1. Sat Feb  1 14:17 - 06:41 (18+16:23)  
reboot   system boot  3.10.0-1062.9.1. Sat Feb  1 00:41 - 06:41 (19+05:59)  
reboot   system boot  3.10.0-1062.9.1. Thu Jan 30 14:18 - 06:41 (20+16:22)  
reboot   system boot  3.10.0-1062.9.1. Wed Jan 29 14:32 - 06:41 (21+16:09)  
reboot   system boot  3.10.0-1062.9.1. Tue Jan 28 07:23 - 06:41 (22+23:17)  
reboot   system boot  3.10.0-1062.9.1. Fri Jan 24 00:44 - 06:41 (27+05:56)  
reboot   system boot  3.10.0-1062.9.1. Wed Jan 22 10:54 - 06:41 (28+19:46)  
reboot   system boot  3.10.0-1062.9.1. Wed Jan 22 09:15 - 06:41 (28+21:25)  
reboot   system boot  3.10.0-1062.9.1. Tue Jan 21 14:13 - 06:41 (29+16:27)  
reboot   system boot  3.10.0-1062.9.1. Sun Jan 19 13:47 - 06:41 (31+16:53)  
reboot   system boot  3.10.0-1062.9.1. Fri Jan 17 23:50 - 06:41 (33+06:50)  
reboot   system boot  3.10.0-1062.9.1. Fri Jan 17 05:51 - 06:41 (34+00:49)  
reboot   system boot  3.10.0-1062.9.1. Thu Jan 16 00:30 - 06:41 (35+06:10)  
reboot   system boot  3.10.0-1062.9.1. Tue Jan 14 14:25 - 06:41 (36+16:15)  
reboot   system boot  3.10.0-1062.9.1. Mon Jan 13 15:11 - 06:41 (37+15:29)  
reboot   system boot  3.10.0-1062.9.1. Sun Jan 12 01:08 - 06:41 (39+05:32)  
reboot   system boot  3.10.0-1062.9.1. Fri Jan 10 14:28 - 06:41 (40+16:12)  
reboot   system boot  3.10.0-1062.9.1. Wed Jan  8 10:42 - 06:41 (42+19:58)  
reboot   system boot  3.10.0-1062.9.1. Sun Jan  5 05:37 - 06:41 (46+01:03)  
reboot   system boot  3.10.0-1062.9.1. Sat Jan  4 00:37 - 06:41 (47+06:03)  
reboot   system boot  3.10.0-1062.9.1. Thu Jan  2 14:13 - 06:41 (48+16:27)  
linux1   pts/0        195.212.29.70    Thu Jan  2 10:05 - 12:32  (02:27)    
reboot   system boot  3.10.0-1062.9.1. Wed Jan  1 01:42 - 06:41 (50+04:58)  
reboot   system boot  3.10.0-1062.9.1. Tue Dec 31 13:09 - 06:41 (50+17:31)  
reboot   system boot  3.10.0-1062.9.1. Sun Dec 29 23:36 - 06:41 (52+07:04)  
reboot   system boot  3.10.0-1062.9.1. Sun Dec 29 00:19 - 06:41 (53+06:21)  
reboot   system boot  3.10.0-1062.9.1. Fri Dec 27 04:03 - 06:41 (55+02:37)  
reboot   system boot  3.10.0-1062.9.1. Thu Dec 26 14:13 - 06:41 (55+16:27)  
reboot   system boot  3.10.0-1062.9.1. Mon Dec 23 23:37 - 06:41 (58+07:03)  
reboot   system boot  3.10.0-1062.9.1. Sat Dec 21 14:23 - 06:41 (60+16:17)  
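To put a number on how often the machine is falling over, the `reboot` entries in the `last` output can be counted per day. This is a minimal sketch that assumes the column layout shown above. `count_reboots_per_day` is a hypothetical helper name, and the sample is a few lines copied from the listing.

```python
from collections import Counter

def count_reboots_per_day(last_output: str) -> Counter:
    """Count 'reboot' pseudo-user entries per (month, day) in `last` output."""
    counts = Counter()
    for line in last_output.splitlines():
        fields = line.split()
        # Assumed layout: reboot system boot <kernel> <weekday> <month> <day> <time> ...
        if len(fields) >= 7 and fields[:3] == ["reboot", "system", "boot"]:
            month, day = fields[5], fields[6]
            counts[(month, int(day))] += 1
    return counts

sample = """\
reboot   system boot  3.10.0-1062.12.1 Sat Feb 15 19:11 - 06:41 (4+11:29)
reboot   system boot  3.10.0-1062.12.1 Sat Feb 15 14:14 - 06:41 (4+16:26)
linux1   pts/0        195.212.29.86    Mon Feb 10 10:01 - 10:07  (00:06)
reboot   system boot  3.10.0-1062.9.1. Sat Feb  1 23:40 - 06:41 (18+07:00)
"""

for (month, day), n in sorted(count_reboots_per_day(sample).items()):
    print(month, day, n)
```

Note that `last` records every boot, so a count like this includes any deliberate reboots as well as crashes.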

Looks like a kernel crash as follows:

[86417.737380] Unable to handle kernel pointer dereference at virtual kernel address 10180004a7f40000
[86417.737423] Oops: 0038 [#1] SMP 
[86417.737427] Modules linked in: isofs xt_pkttype ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat iptable_mangle iptable_security iptable_raw nf_conntrack libcrc32c ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter loop af_iucv qeth_l2 vmur ip_tables ext4 mbcache jbd2 dasd_fba_mod dasd_mod qeth ccwgroup qdio prng sha512_s390 ghash_s390 des_s390 des_generic aes_s390
[86417.737479] CPU: 2 PID: 38890 Comm: cc1plus Kdump: loaded Not tainted 3.10.0-1062.12.1.el7.s390x #1
[86417.737483] task: 0000000001e35d80 ti: 000000007dd9c000 task.ti: 000000007dd9c000
[86417.737486] Krnl PSW : 0704e00180000000 00000000004886e0 (__radix_tree_lookup+0x50/0x118)
[86417.737498]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 EA:3
Krnl GPRS: 0000000000279080 10180004a7f40089 0000000000000001 10180004a7f40088
[86417.737500]            0000000000000000 000000007dd9fcd8 000002001e070000 0000000000000001
[86417.737501]            0000000000000000 000000007dd9fcd8 0000000000000040 000000000000a2f8
[86417.737502]            18000000003b55d0 000000000000a2f8 000000007dd9fc38 000000007dd9fbe8
[86417.737510] Krnl Code: 00000000004886ce: ec213ebf0055    risbg   %r2,%r1,62,191,0
       00000000004886d4: ec260063017c   cgij    %r2,1,6,48879a
      #00000000004886da: ec3100be0055   risbg   %r3,%r1,0,190,0
      >00000000004886e0: e33030000094   llc %r3,0(%r3)
       00000000004886e6: eb3a3000000d   sllg    %r3,%r10,0(%r3)
       00000000004886ec: a73bffff       aghi    %r3,-1
       00000000004886f0: ecc300492065   clgrj   %r12,%r3,2,488782
       00000000004886f6: ec26004a017c   cgij    %r2,1,6,48878a
[86417.737521] Call Trace:
[86417.737522] ([<0000000000b97700>] contig_page_data+0x700/0x1600)
[86417.737526]  [<00000000004887d4>] radix_tree_lookup_slot+0x2c/0x50
[86417.737528]  [<00000000002790b4>] __find_get_page+0x4c/0xd0
[86417.737531]  [<0000000000279174>] find_get_page+0x3c/0x58
[86417.737532]  [<00000000002ca24a>] lookup_swap_cache+0x7a/0x178
[86417.737534]  [<00000000002caa7c>] swap_readahead_detect+0xac/0x318
[86417.737535]  [<00000000002b4d00>] __handle_mm_fault+0x238/0x1028
[86417.737537]  [<00000000002b5bd6>] handle_mm_fault+0xe6/0x188
[86417.737538]  [<000000000075d5f4>] do_dat_exception+0x194/0x308
[86417.737542]  [<000000000075b728>] pgm_check_handler+0x168/0x16c
[86417.737543]  [<0000000080a65a2e>] 0x80a65a2e
[86417.737545] Last Breaking-Event-Address:
[86417.737546]  [<00000000004887ce>] radix_tree_lookup_slot+0x26/0x50
[86417.737547]  
[86417.737548] Kernel panic - not syncing: Fatal exception: panic_on_oops
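
If more of these crashes get logged, pulling out the process name, kernel release and faulting symbol makes it easier to see whether they share a signature. This is a rough sketch based only on the two lines of the oops above that carry those fields. `summarize_oops` is a hypothetical name, and the `Not tainted` pattern is an assumption that will not match a tainted kernel.

```python
import re

def summarize_oops(log: str) -> dict:
    """Extract process name, kernel release and faulting symbol from an s390x oops."""
    patterns = {
        "comm": r"Comm: (\S+)",
        # Assumption: the kernel is reported as "Not tainted", as in the log above.
        "kernel": r"Not tainted (\S+)",
        "symbol": r"Krnl PSW : \S+ \S+ \((\w+)\+0x[0-9a-f]+/0x[0-9a-f]+\)",
    }
    summary = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, log)
        summary[name] = match.group(1) if match else None
    return summary

sample = """\
[86417.737479] CPU: 2 PID: 38890 Comm: cc1plus Kdump: loaded Not tainted 3.10.0-1062.12.1.el7.s390x #1
[86417.737486] Krnl PSW : 0704e00180000000 00000000004886e0 (__radix_tree_lookup+0x50/0x118)
"""

print(summarize_oops(sample))
```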

sxa commented 4 years ago

Two new machines have been added, build-marist-rhel77-s390x-3 and build-marist-rhel77-s390x-4, with the same kernel level as the failing machine. It's also worth noting that the failing machine hasn't crashed since it was disabled in Jenkins. I've re-enabled it to see if it falls over tonight.

sxa commented 4 years ago

New machine 148.100.245.197 (-4) has just crashed during a build: https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk14/job/jdk14-linux-s390x-openj9lastFailedBuild/console

M-Davies commented 4 years ago

build-marist-rhel77-s390x-2 crashed during a jdk14 build https://ci.adoptopenjdk.net/view/Failing%20Builds/job/build-scripts/job/jobs/job/jdk14/job/jdk14-linux-s390x-hotspot/39

sxa commented 4 years ago

The swapfile is now set to be disabled on the next reboot on machines -2 to -4. I've rebooted -2 so that it takes effect there immediately. We'll see if that makes any difference tonight.
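
One way to confirm after a reboot that swap really is off is to check `/proc/swaps`, which lists the active swap areas under a header line. A small sketch of that check follows. `active_swap_areas` is a hypothetical helper, and `/swapfile` in the sample is an assumed path, not necessarily the one used on these machines.

```python
def active_swap_areas(proc_swaps_text: str) -> list:
    """Return the names of active swap areas listed in /proc/swaps content."""
    lines = proc_swaps_text.strip().splitlines()
    # The first line is the column header (Filename, Type, Size, Used, Priority).
    return [line.split()[0] for line in lines[1:] if line.strip()]

# Assumed sample content; /swapfile is a placeholder path.
sample_enabled = """\
Filename                                Type            Size    Used    Priority
/swapfile                               file            2097148 0       -2
"""
sample_disabled = "Filename\t\t\t\tType\t\tSize\tUsed\tPriority\n"

print(active_swap_areas(sample_enabled))
print(active_swap_areas(sample_disabled))
```

On a real machine the input would come from `open("/proc/swaps").read()`, and an empty list would mean no swap is active.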

sxa commented 4 years ago

-3 has been upgraded to have 16GB of RAM so I've re-enabled it alongside -1 and we'll see if it's any more stable

sxa commented 4 years ago

-3 worked ok yesterday, although it only seems to have run one of the jobs: https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk8u/job/jdk8u-linux-s390x-openj9-linuxXL/103/consoleFull

sxa commented 4 years ago

Rerunning another pipeline and the following are running on -3:

I think I'll mark -1 offline tonight to force everything to -3 and see what happens

sxa commented 4 years ago

The 16GB system failed again today. I'm going to start logging the failures: -2 https://ci.adoptopenjdk.net/view/Failing%20Builds/job/build-scripts/job/jobs/job/jdk11u/job/jdk11u-linux-s390x-openj9/536/consoleFull

sxa commented 4 years ago

The latest kernel update from Red Hat appears to have resolved this on all machines regardless of memory/swap setup. It was installed on the 19th March and none of the machines have crashed in the last week. OpenJ9 (via @jdekonin) is reporting the same success, so I'm going to close this :-)

[root@adoptopenjdk01 ~]# rpm -qi kernel-3.10.0-1062.18.1.el7.s390x
Name        : kernel
Version     : 3.10.0
Release     : 1062.18.1.el7
Architecture: s390x
Install Date: Thu 19 Mar 2020 01:00:29 EDT
Group       : System Environment/Kernel

For completeness, the original machine was on this kernel:

[linux1@localhost ~]$ uname -a
Linux localhost.adoptopenjdk.net 3.10.0-957.21.3.el7.s390x #1 SMP Fri Jun 14 02:52:25 EDT 2019 s390x s390x s390x GNU/Linux

The failing ones were on

[linux1@adoptopenjdk01 ~]$ uname -a
Linux adoptopenjdk01.novalocal 3.10.0-1062.12.1.el7.s390x #1 SMP Thu Dec 12 06:45:30 EST 2019 s390x s390x s390x GNU/Linux

And the new ones are:

[root@adoptopenjdk03 ~]# uname -a
Linux adoptopenjdk03.novalocal 3.10.0-1062.18.1.el7.s390x #1 SMP Wed Feb 12 09:11:02 EST 2020 s390x s390x s390x GNU/Linux
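
For anyone checking other machines against these three kernel levels, the release strings need to be compared numerically, since a plain string sort would put 1062 before 957. This is a sketch assuming the RHEL 7 release format shown in the `uname -a` output above. `release_key` is a hypothetical helper name.

```python
import re

def release_key(release: str) -> tuple:
    """Turn '3.10.0-1062.18.1.el7.s390x' into (3, 10, 0, 1062, 18, 1)."""
    # Keep only the part before '.el7...' so the distro tag is not parsed as a number.
    numeric_part = release.split(".el", 1)[0]
    return tuple(int(n) for n in re.findall(r"\d+", numeric_part))

releases = [
    "3.10.0-1062.18.1.el7.s390x",  # machines after the update
    "3.10.0-957.21.3.el7.s390x",   # original machine
    "3.10.0-1062.12.1.el7.s390x",  # failing machines
]

for r in sorted(releases, key=release_key):
    print(r)
```

The release of the running kernel is available from `platform.release()` in the standard library. The ordering only reflects what was observed in this issue and is not a statement of which builds contain the fix.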