AAROC / chpctier2

Issues with the CHPC Tier2 Facility

alice eos is down as of 4pm this afternoon #37

Closed bazinski closed 8 years ago

bazinski commented 8 years ago

Tests from ALICE central services are failing on alice::za_chpc::eos.

bazinski commented 8 years ago

After rebooting grid-se2 to be able to get in, it seems to be cut off from the 172.20.100 network.

It can see neither grid-xrootd02 (its data store) nor any other machine on the 172.20.100 network, yet it reports its 172.20.100 interface as up.
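A quick way to confirm this kind of one-sided outage is to try a TCP connection to the data store from grid-se2. The following is a minimal sketch, not what was actually run here: the function name `can_connect` is made up for illustration, and port 1094 is only the usual xrootd default, assumed rather than confirmed for this site.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, unreachable networks
        # and name-resolution failures.
        return False


if __name__ == "__main__":
    # Hostname taken from this thread; port 1094 is an assumption.
    host, port = "grid-xrootd02.chpc.ac.za", 1094
    status = "reachable" if can_connect(host, port) else "NOT reachable"
    print(f"{host}:{port} is {status}")
```

Note that the hostname may resolve to the public address rather than the 172.20.100 one, so a success here does not by itself prove the internal link works.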

bazinski commented 8 years ago

dmesg says:

igb 0000:07:00.0: added PHC on eth2
igb 0000:07:00.0: eth2: (PCIe:5.0Gb/s:Width x2) b8:ca:3a:6c:5d:44
igb 0000:07:00.0: eth2: PBA No: G61346-000
ADDRCONF(NETDEV_UP): eth2: link is not ready

So either the cable was pulled out, the cable is broken, or the port is broken. This will require someone on site to take a look.
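The "configured up but link is not ready" state can also be read directly from sysfs, which separates the administrative state from the physical carrier. This is a minimal sketch under assumptions: the helper name `link_state` is hypothetical, and it relies on the standard Linux `/sys/class/net/<iface>/operstate` and `carrier` files being present on the host.

```python
from pathlib import Path


def link_state(iface, sysfs_root="/sys/class/net"):
    """Report operstate and carrier for a network interface.

    carrier is True/False when readable, or None when the file cannot be
    read (for example when the interface is administratively down).
    """
    base = Path(sysfs_root) / iface
    state = {"iface": iface, "operstate": None, "carrier": None}
    try:
        state["operstate"] = (base / "operstate").read_text().strip()
    except OSError:
        pass
    try:
        state["carrier"] = (base / "carrier").read_text().strip() == "1"
    except OSError:
        pass
    return state


if __name__ == "__main__":
    # eth2 is the interface named in the dmesg output above.
    print(link_state("eth2"))
```

A carrier of False on an interface that is configured up points at the cable, the port or the switch rather than at the IP configuration.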

bazinski commented 8 years ago

So let's attempt to route all traffic over the public interface.

bazinski commented 8 years ago

It appears that EOS hardcodes the IP? It won't boot the FSTs:

fs boot grid-xrootd02.chpc.ac.za:/data/1

returns an error.

bazinski commented 8 years ago

Well, this is not ideal:

xrootd: page allocation failure. order:1, mode:0x20
Pid: 19484, comm: xrootd Not tainted 2.6.32-504.23.4.el6.x86_64
Call Trace:
 ? __alloc_pages_nodemask+0x74a/0x8d0
 ? kmem_getpages+0x62/0x170
 ? fallback_alloc+0x1ba/0x270
 ? cache_grow+0x2cf/0x320
 ? ____cache_alloc_node+0x99/0x160
 ? kmem_cache_alloc+0x123/0x190
 ? sk_prot_alloc+0x48/0x1c0
 ? sk_clone+0x22/0x2e0
 ? inet_csk_clone+0x16/0xd0
 ? tcp_create_openreq_child+0x23/0x470
 ? tcp_v4_syn_recv_sock+0x4d/0x310
 ? tcp_check_req+0x226/0x460
 ? tcp_v4_do_rcv+0x35b/0x490
 ? tcp_v4_rcv+0x532/0x910
 ? ip_local_deliver_finish+0x0/0x2d0
 ? ip_local_deliver_finish+0xdd/0x2d0
 ? ip_local_deliver+0x98/0xa0
 ? ip_rcv_finish+0x12d/0x440
 ? ip_rcv+0x275/0x350
 ? __netif_receive_skb+0x208/0x570
 ? process_backlog+0x9a/0x100
 ? net_rx_action+0x103/0x2f0
 ? __do_softirq+0xc1/0x1e0
 ? call_softirq+0x1c/0x30
 ? do_softirq+0x65/0xa0
 ? local_bh_enable_ip+0x9a/0xb0
 ? _spin_unlock_bh+0x1b/0x20
 ? release_sock+0xe5/0x110
 ? inet_stream_connect+0x183/0x2c0
 ? autoremove_wake_function+0x0/0x40
 ? sys_connect+0xd7/0xf0
 ? fd_install+0x47/0x90
 ? audit_syscall_entry+0x1d7/0x200
 ? __audit_syscall_exit+0x25e/0x290
 ? system_call_fastpath+0x16/0x1b

bazinski commented 8 years ago

IPs are back to internal; rerouting the traffic over the public interface is problematic. The internal connection to grid-se2 is in urgent need of fixing.

bazinski commented 8 years ago

The network cable was loose and its retaining clip is not great; cable re-inserted. It was not completely out.

This just makes the network reconfig in #21 all the more urgent.