cndaqiang / E5-PC-daily

Problems encountered in server cluster management, and lessons learned

Suanpan (算盘) cluster maintenance #5

Closed cndaqiang closed 4 years ago

cndaqiang commented 4 years ago

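Backup ($1 is the archive, $2 the directory to back up; `f` must be the last letter of the option cluster so that the archive name follows it, and the excludes go before the path): tar -czvpf "$1" --exclude="$2/proc" --exclude="$2/lost+found" --exclude="$2/tmp" --exclude="$2/sys" --exclude="$2/media" "$2"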
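
The backup command can be wrapped in a small function. This is a minimal sketch, not the script actually used on the cluster: the name `backup_tree` and the `-C` layout (member names relative to the backed-up root) are assumptions made for illustration. The archive should live outside the tree being backed up.

```shell
#!/bin/sh
# backup_tree ARCHIVE ROOT: archive ROOT, skipping pseudo and scratch directories.
# Hypothetical helper; the exclude list is the one from the note above.
backup_tree() {
    archive=$1
    root=$2
    # -C changes into ROOT first, so member names and excludes are relative ("./proc").
    tar -czpf "$archive" -C "$root" \
        --exclude=./proc --exclude=./lost+found --exclude=./tmp \
        --exclude=./sys --exclude=./media \
        .
}
```

Usage would look like `backup_tree /mnt/backup/root.tar.gz /`, where the destination path is only a placeholder.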

cndaqiang commented 4 years ago

Restoring a compute node

umount /opt
umount /home
mount -t nfs ibmu01:/home /home  #cndaqiang 2019-09-10
mount -t nfs ibmu01:/opt /opt    #cndaqiang 2019-09-10

rm -f /var/lock/subsys/pbs_mom
rm -f /opt/tsce4/torque6/share/${HOSTNAME}/mom_priv/mom.lock
pkill -f pbs
service pbs_mom restart
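
The recovery steps above could be collected into one function. This is a sketch under assumptions: `run`, `recover_compute_node`, and the `DRY_RUN` switch are hypothetical names, and the node name passed in is assumed to match the directory name under `/opt/tsce4/torque6/share/`.

```shell
#!/bin/sh
# run: execute a command, or only print it when DRY_RUN=1 (hypothetical helper).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# recover_compute_node [NFS_SERVER] [NODE_NAME]
recover_compute_node() {
    server=${1:-ibmu01}
    node=${2:-$(hostname)}
    for dir in /opt /home; do
        run umount "$dir" || true      # may already be unmounted
        run mount -t nfs "$server:$dir" "$dir" || return 1
    done
    # Clear stale locks left by the previous pbs_mom, then restart it.
    run rm -f /var/lock/subsys/pbs_mom
    run rm -f "/opt/tsce4/torque6/share/$node/mom_priv/mom.lock"
    run pkill -f pbs || true           # pkill exits 1 when nothing matched
    run service pbs_mom restart
}
```

Setting `DRY_RUN=1` before calling `recover_compute_node ibmu01 cu08` prints the commands without touching the node, which is a cheap way to check the paths first.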

cndaqiang commented 4 years ago

Steps for moving a disk pulled from a machine with a failed motherboard into a healthy node (whose own disk had failed)

[root@cu15 ~]# vi /etc/sysconfig/network-scripts/ifcfg-eth0  # delete the hardware address (HWADDR) and UUID lines
[root@cu15 ~]# vi /etc/sysconfig/network-scripts/ifcfg-ib0
[root@cu15 ~]# rm /etc/udev/rules.d/70-persistent-net.rules
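
The manual `vi` edit can also be done non-interactively. A minimal sketch, assuming the interface files use plain `HWADDR=` and `UUID=` lines; the function name `strip_hw_identity` is made up for this example.

```shell
#!/bin/sh
# strip_hw_identity FILE: drop HWADDR= and UUID= lines from an ifcfg-style file,
# keeping the original as FILE.bak. Hypothetical helper.
strip_hw_identity() {
    file=$1
    cp -p "$file" "$file.bak" || return 1
    sed -e '/^HWADDR=/d' -e '/^UUID=/d' "$file.bak" > "$file"
}
```

It would be called once per interface file, for example `strip_hw_identity /etc/sysconfig/network-scripts/ifcfg-eth0`.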

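Turn cu27, which is no longer in the cluster, into cu03, whose motherboard failed

Change the hostname and the IP addresses

vi /etc/sysconfig/network # add HOSTNAME=cu03
vi /etc/sysconfig/network-scripts/ifcfg-eth0  # change the Ethernet IP
vi /etc/sysconfig/network-scripts/ifcfg-ib0    # change the IB IP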
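
The same edits can be scripted. The sketch below assumes the files hold simple `KEY=value` lines; `set_key` is a hypothetical helper, and values containing `|` or `&` would need escaping before being passed to `sed`.

```shell
#!/bin/sh
# set_key FILE KEY VALUE: replace an existing KEY=... line, or append one.
# Hypothetical helper for KEY=value style config files.
set_key() {
    file=$1
    key=$2
    value=$3
    tmp=$(mktemp) || return 1
    if grep -q "^${key}=" "$file"; then
        sed "s|^${key}=.*|${key}=${value}|" "$file" > "$tmp"
    else
        cat "$file" > "$tmp"
        echo "${key}=${value}" >> "$tmp"
    fi
    # Write back through cat so the original file keeps its owner and mode.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

For the rename above this would be `set_key /etc/sysconfig/network HOSTNAME cu03`, followed by one `set_key ... IPADDR ...` call per interface file with the addresses that belong to cu03.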
cndaqiang commented 4 years ago

Back up a disk

dd if=/dev/sdb of=mbr.bin
In another window, check the progress (GNU dd prints its I/O statistics when it receives SIGUSR1):
watch -n 5 killall -USR1 dd
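
The output name `mbr.bin` suggests that only the master boot record may have been intended, whereas the command as written copies the whole disk. If the MBR alone is wanted, a sketch could look like this; `backup_mbr` is a hypothetical name, and the 512-byte size assumes a classic MBR-partitioned disk.

```shell
#!/bin/sh
# backup_mbr DEVICE OUTFILE: save only the first 512-byte sector of DEVICE.
# Hypothetical helper; assumes a classic MBR layout.
backup_mbr() {
    dev=$1
    out=$2
    # First sector: 446 bytes of boot code, 64-byte partition table, 2-byte signature.
    dd if="$dev" of="$out" bs=512 count=1
}
```

Example: `backup_mbr /dev/sdb mbr.bin`.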

Clone a disk

dd if=/dev/sdb of=/dev/sdc
dd if=/dev/sdb3 of=/dev/sdc3

cndaqiang commented 4 years ago

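cu13: cluster management ran over Ethernet, and the node frequently dropped offline. Suspecting that the Ethernet link was dropping unexpectedly, I switched management to the IB network. Change `11.11.11.13 cu06` to `10.10.10.13 cu06` (the IB IP). On cu13, change the mu01 entry to `10.10.100.1 mu01`.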
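
The note does not say which file holds these entries; the sketch below assumes a hosts-style file (`IP name` per line, such as `/etc/hosts`) and uses a hypothetical helper `set_host_ip`. It only matches the name in the second column, so aliases further along a line are ignored.

```shell
#!/bin/sh
# set_host_ip FILE NAME IP: point NAME at IP in a hosts-style file,
# appending a new line if NAME is not present. Hypothetical helper.
set_host_ip() {
    hosts_file=$1
    name=$2
    ip=$3
    tmp=$(mktemp) || return 1
    awk -v name="$name" -v ip="$ip" '
        $2 == name && $0 !~ /^#/ { $1 = ip; found = 1 }
        { print }
        END { if (!found) print ip, name }
    ' "$hosts_file" > "$tmp"
    cat "$tmp" > "$hosts_file"
    rm -f "$tmp"
}
```

With that assumption, the two changes above become `set_host_ip /etc/hosts cu06 10.10.10.13` and `set_host_ip /etc/hosts mu01 10.10.100.1`.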

cndaqiang commented 4 years ago

View the disk UUIDs:
[root@cu01 grub]# ls -l /dev/disk/by-uuid/
total 0
lrwxrwxrwx 1 root root 10 Sep 11 10:46 42e1b448-9559-4f1b-96a1-4f8bc1e30976 -> ../../sda2
lrwxrwxrwx 1 root root 10 Sep 11 10:46 ad7dc6ed-6f0f-4f12-b52c-a52453ee7c1e -> ../../sda3
lrwxrwxrwx 1 root root 10 Sep 11 10:46 b761f487-e145-4659-a313-077fed2ca742 -> ../../sda1
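
To look up the UUID of one particular partition instead of reading the listing by eye, the symlinks can be resolved in a loop. A minimal sketch; `uuid_of_device` is a hypothetical name, and the optional second argument exists only so the function can be pointed at a different directory.

```shell
#!/bin/sh
# uuid_of_device DEVICE [DIR]: print the UUID whose symlink in DIR
# (default /dev/disk/by-uuid) resolves to DEVICE. Hypothetical helper.
uuid_of_device() {
    dev=$(readlink -f "$1")
    dir=${2:-/dev/disk/by-uuid}
    for link in "$dir"/*; do
        [ -e "$link" ] || continue
        if [ "$(readlink -f "$link")" = "$dev" ]; then
            basename "$link"
            return 0
        fi
    done
    return 1
}
```

On the listing above, `uuid_of_device /dev/sda1` would print the name of the link that points at `sda1`.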

cndaqiang commented 4 years ago

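When a disk's UUID changes, update the UUID of the system to boot in the GRUB configuration, which lives on the partition that holds GRUB.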
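
Swapping the old UUID for the new one is a plain text substitution. The sketch below assumes the old UUID appears literally in the GRUB configuration file (the exact path depends on the installation and is not given in the note); `swap_uuid` is a hypothetical helper that refuses to touch the file when the old UUID is absent.

```shell
#!/bin/sh
# swap_uuid FILE OLD NEW: replace every occurrence of UUID OLD with NEW in FILE,
# keeping the original as FILE.bak. Hypothetical helper.
swap_uuid() {
    file=$1
    old=$2
    new=$3
    if ! grep -q "$old" "$file"; then
        echo "UUID $old not found in $file" >&2
        return 1
    fi
    cp -p "$file" "$file.bak" || return 1
    sed "s/$old/$new/g" "$file.bak" > "$file"
}
```

UUIDs consist of hex digits and dashes only, so they are safe to drop into the `sed` expression unescaped.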

cndaqiang commented 4 years ago

[root@cu08 ~]# service pbs_mom start
Starting TORQUE Mom: pbs_mom: LOG_ERROR::Input/output error (5) in pbs_mom, cannot lock '/opt/tsce4/torque6/share/cu08/mom_priv/mom.lock' - another mom running
cannot lock '/opt/tsce4/torque6/share/cu08/mom_priv/mom.lock' - another mom running
[FAILED]

Remount NFS to fix this.
cndaqiang commented 4 years ago

for i in $(seq 1 1 36); do ping -c 1 -w 1 11.11.11.$i; done | grep ttl
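
The ping sweep can be written as a pair of small functions that print only the addresses that answer. A sketch under assumptions: `probe` and `sweep` are hypothetical names, and `-W 1` is the per-reply timeout in seconds as understood by the Linux iputils `ping`.

```shell
#!/bin/sh
# probe HOST: succeed if HOST answers one ping within about a second.
probe() {
    ping -c 1 -W 1 "$1" >/dev/null 2>&1
}

# sweep PREFIX FIRST LAST: print every address PREFIX.N (FIRST..LAST) that answers.
sweep() {
    prefix=$1
    i=$2
    last=$3
    while [ "$i" -le "$last" ]; do
        if probe "$prefix.$i"; then
            echo "$prefix.$i"
        fi
        i=$((i + 1))
    done
}
```

`sweep 11.11.11 1 36` covers the same range as the one-liner above.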

for i in $(seq 1 1 36)
do 
ssh  11.11.11.$i    poweroff
echo 11.11.11.$i "done"
done
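
The loop above prints "done" even when `ssh` fails. A variant that reports failures separately might look like the sketch below; `poweroff_nodes` is a hypothetical name, and the timeout value is an arbitrary choice.

```shell
#!/bin/sh
# poweroff_nodes PREFIX FIRST LAST: run poweroff on PREFIX.N for N in FIRST..LAST.
# Hypothetical helper; prints failures on stderr.
poweroff_nodes() {
    prefix=$1
    i=$2
    last=$3
    while [ "$i" -le "$last" ]; do
        host="$prefix.$i"
        # -n keeps ssh off stdin; BatchMode fails instead of prompting for a password.
        if ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$host" poweroff; then
            echo "$host done"
        else
            echo "$host FAILED" >&2
        fi
        i=$((i + 1))
    done
}
```

A node that shuts down fast enough to cut the connection can make `ssh` return a non-zero status, so a "FAILED" line is a hint to check that node, not proof that it is still up.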
cndaqiang commented 4 years ago

pbs

Force-delete a job

qdel -p 47848
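
Since `-p` purges the job record without the normal cleanup, it is usually worth trying a plain `qdel` first. The sketch below does that; `force_delete_job` is a hypothetical name, and it assumes that `qstat JOBID` exits non-zero once the server no longer knows the job.

```shell
#!/bin/sh
# force_delete_job JOBID [WAIT_SECONDS]: try a normal qdel, then purge with
# qdel -p if the job is still known afterwards. Hypothetical helper.
force_delete_job() {
    job=$1
    wait_s=${2:-10}
    qdel "$job"
    sleep "$wait_s"
    # Assumption: qstat on a job id fails once the job is gone.
    if qstat "$job" >/dev/null 2>&1; then
        qdel -p "$job"
    fi
}
```

If the server keeps completed jobs visible for a while, `qstat` will still succeed during that window and the purge will run even though the normal delete worked.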