cndaqiang closed this issue 4 years ago
Compute node recovery
umount /opt
umount /home
mount -t nfs ibmu01:/home /home #cndaqiang 2019-09-10
mount -t nfs ibmu01:/opt /opt #cndaqiang 2019-09-10
rm -f /var/lock/subsys/pbs_mom
rm -f /opt/tsce4/torque6/share/${HOSTNAME}/mom_priv/mom.lock
pkill -f pbs
service pbs_mom restart
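The recovery steps above could be wrapped in one function so they run in the same order every time. This is only a sketch: the names `run`, `recover_node` and the `DRY_RUN` switch are made up for illustration, and the defaults (`ibmu01`, the Torque path) are copied from the commands above. It prints the commands by default and executes them only when `DRY_RUN=0`.

```shell
# Print a command (dry run, the default) or execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Remount the NFS shares, clear stale pbs_mom locks, restart pbs_mom.
# $1 = NFS server (default ibmu01), $2 = node name (default: this host).
recover_node() {
  nfs_server="${1:-ibmu01}"
  host="${2:-$(hostname)}"
  run umount /opt
  run umount /home
  run mount -t nfs "$nfs_server:/home" /home
  run mount -t nfs "$nfs_server:/opt" /opt
  run rm -f /var/lock/subsys/pbs_mom
  run rm -f "/opt/tsce4/torque6/share/$host/mom_priv/mom.lock"
  run pkill -f pbs
  run service pbs_mom restart
}
```

Calling `recover_node` with no arguments lists what would be run on the current node; `DRY_RUN=0 recover_node` (as root) actually performs the steps.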
[root@cu15 ~]# vi /etc/sysconfig/network-scripts/ifcfg-eth0 # delete the hardware address (HWADDR) and UUID
[root@cu15 ~]# vi /etc/sysconfig/network-scripts/ifcfg-ib0
[root@cu15 ~]# rm /etc/udev/rules.d/70-persistent-net.rules
Change the hostname and IP
vi /etc/sysconfig/network # add HOSTNAME=cu03
vi /etc/sysconfig/network-scripts/ifcfg-eth0 # change the Ethernet IP
vi /etc/sysconfig/network-scripts/ifcfg-ib0 # change the IB IP
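The manual `vi` edits to the interface files could also be scripted. The sketch below assumes the files use the usual `KEY=value` layout with `HWADDR=`, `UUID=` and `IPADDR=` lines; the function name `fix_ifcfg` is hypothetical. It only rewrites an existing `IPADDR=` line and does not add one if it is missing.

```shell
# Strip HWADDR/UUID from a cloned ifcfg file and set a new IPADDR.
# $1 = path to the ifcfg file, $2 = new IP address.
fix_ifcfg() {
  file="$1"
  new_ip="$2"
  tmp=$(mktemp) || return 1
  if sed -e '/^HWADDR=/d' -e '/^UUID=/d' \
         -e "s/^IPADDR=.*/IPADDR=$new_ip/" "$file" > "$tmp"; then
    # Write back through cat so the original file keeps its permissions.
    cat "$tmp" > "$file"
    rc=$?
  else
    rc=1
  fi
  rm -f "$tmp"
  return "$rc"
}
```

For example, `fix_ifcfg /etc/sysconfig/network-scripts/ifcfg-eth0 11.11.11.3`, where the address is a placeholder for whatever the node should get.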
dd if=/dev/sdb of=mbr.bin
In another window, check the progress:
watch -n 5 killall -USR1 dd
dd if=/dev/sdb of=/dev/sdc
dd if=/dev/sdb3 of=/dev/sdc3
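One thing to watch with the commands above: without `bs` and `count`, `dd if=/dev/sdb of=mbr.bin` copies the whole disk into the file. If the intent was to save only the 512-byte boot sector (an assumption based on the file name), a sketch like the following would do that; `backup_mbr` is a hypothetical helper name.

```shell
# Copy only the first 512-byte sector (MBR) of a disk or image to a file.
# $1 = source device or image, $2 = output file.
backup_mbr() {
  dd if="$1" of="$2" bs=512 count=1 2>/dev/null
}

# For full-disk copies, newer GNU dd can report progress by itself,
# which avoids the second window with killall -USR1:
#   dd if=/dev/sdb of=/dev/sdc status=progress
```

`status=progress` needs a reasonably recent GNU coreutils; on older systems the `killall -USR1 dd` approach from the notes is the way to go.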
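cu13: cluster management was set up over Ethernet, and the node frequently dropped offline. Suspecting that the Ethernet link was dropping unexpectedly, management was switched to the IB network.
Changed 11.11.11.13 cu06 to 10.10.10.13 cu06 (IB IP).
On cu13, changed the mu01 entry to 10.10.100.1 mu01.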
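The note does not say which file holds these address/name pairs. Assuming it is a hosts-format file such as `/etc/hosts`, the change could be sketched as below; `set_host_ip` is a made-up helper that replaces the address on every non-comment line listing the given name.

```shell
# Point a host name at a new address in a hosts-format file.
# $1 = file, $2 = host name, $3 = new IP address.
set_host_ip() {
  file="$1"
  name="$2"
  ip="$3"
  tmp=$(mktemp) || return 1
  if awk -v name="$name" -v ip="$ip" '
      /^[[:space:]]*#/ { print; next }   # keep comment lines untouched
      {
        for (i = 2; i <= NF; i++) {
          if ($i == name) { $1 = ip; break }
        }
        print
      }' "$file" > "$tmp"; then
    cat "$tmp" > "$file"
    rc=$?
  else
    rc=1
  fi
  rm -f "$tmp"
  return "$rc"
}
```

With that assumption, the two changes above would be `set_host_ip /etc/hosts cu06 10.10.10.13` and, on cu13, `set_host_ip /etc/hosts mu01 10.10.100.1`.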
Check the disk UUIDs
[root@cu01 grub]# ls -l /dev/disk/by-uuid/
total 0
lrwxrwxrwx 1 root root 10 Sep 11 10:46 42e1b448-9559-4f1b-96a1-4f8bc1e30976 -> ../../sda2
lrwxrwxrwx 1 root root 10 Sep 11 10:46 ad7dc6ed-6f0f-4f12-b52c-a52453ee7c1e -> ../../sda3
lrwxrwxrwx 1 root root 10 Sep 11 10:46 b761f487-e145-4659-a313-077fed2ca742 -> ../../sda1
When a disk's UUID changes, update the UUID of the system to boot in the GRUB configuration on the partition that holds GRUB.
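As a sketch of the UUID update, the two helpers below pull the UUID for a partition out of an `ls -l /dev/disk/by-uuid/` listing and swap an old UUID for a new one in a config file. Both function names are hypothetical, and the GRUB config path in the usage note is an assumption; use whichever file the node actually boots from.

```shell
# Print the UUID of partition $1 (e.g. sda2) from an
# `ls -l /dev/disk/by-uuid/` listing read on stdin.
uuid_for_partition() {
  awk -v part="$1" '
    NF >= 3 && $(NF-1) == "->" {
      n = split($NF, p, "/")          # ../../sda2 -> "..", "..", "sda2"
      if (p[n] == part) print $(NF-2)
    }'
}

# Replace every occurrence of UUID $2 with UUID $3 in file $1.
replace_uuid() {
  file="$1"
  tmp=$(mktemp) || return 1
  if sed "s/$2/$3/g" "$file" > "$tmp"; then
    cat "$tmp" > "$file"
    rc=$?
  else
    rc=1
  fi
  rm -f "$tmp"
  return "$rc"
}
```

Usage might look like `new=$(ls -l /dev/disk/by-uuid/ | uuid_for_partition sda2)` followed by `replace_uuid /boot/grub/grub.conf "$old" "$new"`, where `$old` is the UUID currently written in the config.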
[root@cu08 ~]# service pbs_mom start
Starting TORQUE Mom: pbs_mom: LOG_ERROR::Input/output error (5) in pbs_mom, cannot lock '/opt/tsce4/torque6/share/cu08/mom_priv/mom.lock' - another mom running
cannot lock '/opt/tsce4/torque6/share/cu08/mom_priv/mom.lock' - another mom running
[FAILED]
Remount NFS to fix this.
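Since the lock file lives under the NFS-mounted `/opt`, it may help to confirm the share is really mounted before starting `pbs_mom`. A minimal sketch, assuming a Linux mount table in the `/proc/mounts` format; `is_nfs_mounted` is a hypothetical helper that reads the table on stdin.

```shell
# Succeed if mount point $1 is listed with filesystem type nfs or nfs4
# in a /proc/mounts-style table read from stdin.
is_nfs_mounted() {
  awk -v mp="$1" '
    $2 == mp && $3 ~ /^nfs4?$/ { found = 1 }
    END { exit !found }'
}
```

For example: `is_nfs_mounted /opt < /proc/mounts || mount -t nfs ibmu01:/opt /opt`.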
for i in $(seq 1 1 36); do ping -c 1 -W 1 11.11.11.$i; done | grep ttl
for i in $(seq 1 1 36)
do
ssh 11.11.11.$i poweroff
echo 11.11.11.$i "done"
done
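The ping sweep and the poweroff loop could be combined so that `ssh` is only attempted on nodes that answer. This is a sketch with made-up function names; the `11.11.11.` prefix and the 1 to 36 range come from the loops above, and it assumes passwordless root `ssh` to the nodes.

```shell
# Print the management IPs for node numbers $1..$2.
node_ips() {
  for i in $(seq "$1" "$2"); do
    echo "11.11.11.$i"
  done
}

# Power off every node in the range that answers a single ping.
poweroff_live_nodes() {
  for ip in $(node_ips "$1" "$2"); do
    if ping -c 1 -W 1 "$ip" > /dev/null 2>&1; then
      ssh "$ip" poweroff && echo "$ip done"
    else
      echo "$ip unreachable, skipped"
    fi
  done
}
```

`poweroff_live_nodes 1 36` then covers the same range as the loop above while skipping dead nodes instead of waiting for `ssh` to time out.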
qdel -p 47848
Backup: tar -czvpf "$1" --exclude="$2/proc" --exclude="$2/lost+found" --exclude="$2/tmp" --exclude="$2/sys" --exclude="$2/media" "$2"
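Two details matter in the backup command. In a bundled option group, `f` takes whatever follows it as the archive name, so `f` has to come last (`-czvfp` would write an archive called `p`). Placing the `--exclude` options before the directory also avoids depending on how a given tar version handles option order. A sketch as a function, with the hypothetical name `backup_tree` and without `-v` to keep it quiet:

```shell
# Create a gzip-compressed archive of a directory tree, preserving
# permissions and skipping pseudo and scratch directories.
# $1 = archive file, $2 = directory to back up.
backup_tree() {
  archive="$1"
  root="${2%/}"   # drop one trailing slash so the exclude paths line up
  tar -czpf "$archive" \
    --exclude="$root/proc" --exclude="$root/lost+found" \
    --exclude="$root/tmp" --exclude="$root/sys" \
    --exclude="$root/media" \
    "${root:-/}"
}
```

For example, `backup_tree /mnt/backup/cu01.tar.gz /`, where the archive path is a placeholder and should sit outside the tree being archived or under one of the excluded directories.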