Martin-Weiss opened 7 years ago
Today I ran into this problem again. Is someone working on this?
One more thing - I realized that during the purge this happens:
```
root 18078 0.0 0.0  11772  2244 ? S  14:29 0:00 /bin/bash -c . /etc/sysconfig/lrbd; /usr/sbin/lrbd $LRBD_OPTIONS -W || :
root 18079 0.0 0.0 626332 32328 ? Sl 14:29 0:00 /usr/bin/python /usr/sbin/lrbd -n client.igw.osd04-p -W
```
--> Could it be that this is part of the uninstall script of the lrbd RPM?
--> Could it be that this hangs forever if the cluster is not reachable?
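If that lrbd call is indeed what blocks the removal, a possible manual workaround (my own sketch, not anything DeepSea ships) is to find the wedged wipe call from the process list above and kill it so the purge can continue:

```
# Find the lrbd teardown ("-W") and, if it is stuck because the
# cluster is unreachable, kill it so the package removal can finish:
pgrep -af 'lrbd.*-W'
pkill -9 -f 'lrbd.*-W'
```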
4-node cluster running M10 with iSCSI gateways on nodes 2 and 4. Then trying to kill the cluster with

```
salt-run disengage.safety
salt-run state.orch ceph.purge
```

hangs forever.
```
admin-p:~ # salt-run disengage.safety
[WARNING ] Although 'dmidecode' was found in path, the current user cannot execute it. Grains output might not be accurate.
safety is now disabled for cluster ceph
admin-p:~ # salt-run state.orch ceph.purge
[WARNING ] Although 'dmidecode' was found in path, the current user cannot execute it. Grains output might not be accurate.
[WARNING ] Although 'dmidecode' was found in path, the current user cannot execute it. Grains output might not be accurate.
[WARNING ] Although 'dmidecode' was found in path, the current user cannot execute it. Grains output might not be accurate.
```
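For anyone reproducing this: the job returns below can be watched live on the Salt event bus with the stock runner (nothing DeepSea-specific assumed):

```
salt-run state.event pretty=True
```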
And it seems that it hangs on two jobs:
```
salt/job/20170810141540766751/ret/osd02-p.ses.intern.thomas-krenn.com {
    "_stamp": "2017-08-10T12:15:40.813980",
    "cmd": "_return",
    "fun": "saltutil.find_job",
    "fun_args": [
        "20170810135712983548"
    ],
    "id": "osd02-p.ses.intern.thomas-krenn.com",
    "jid": "20170810141540766751",
    "retcode": 0,
    "return": {
        "arg": [
            "ceph.rescind",
            {
                "kwarg": true,
                "concurrent": false,
                "queue": false,
                "saltenv": "base"
            }
        ],
        "fun": "state.sls",
        "jid": "20170810135712983548",
        "pid": 7593,
        "ret": "",
        "tgt": "I@cluster:ceph",
        "tgt_type": "compound",
        "user": "salt"
    },
    "success": true
}
salt/job/20170810141540766751/ret/osd04-p.ses.intern.thomas-krenn.com {
    "_stamp": "2017-08-10T12:15:40.814333",
    "cmd": "_return",
    "fun": "saltutil.find_job",
    "fun_args": [
        "20170810135712983548"
    ],
    "id": "osd04-p.ses.intern.thomas-krenn.com",
    "jid": "20170810141540766751",
    "retcode": 0,
    "return": {
        "arg": [
            "ceph.rescind",
            {
                "kwarg": true,
                "concurrent": false,
                "queue": false,
                "saltenv": "base"
            }
        ],
        "fun": "state.sls",
        "jid": "20170810135712983548",
        "pid": 2846,
        "ret": "",
        "tgt": "I@cluster:ceph",
        "tgt_type": "compound",
        "user": "salt"
    },
    "success": true
}
```
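Both returns point at the same still-running jid 20170810135712983548 (state.sls ceph.rescind, pids 7593 and 2846). Plain Salt commands can be used to inspect it further (standard Salt, nothing DeepSea-specific):

```
# What is still executing on the two gateways?
salt 'osd0*' saltutil.running
# Full record of the stuck job:
salt-run jobs.lookup_jid 20170810135712983548
```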
minion log from osd-2:
```
2017-08-10 13:57:15,942 [salt.loaded.ext.module.osd][WARNING ][7593] Forcing OSD removal
2017-08-10 13:57:18,504 [salt.loaded.ext.module.osd][ERROR ][7593] Partition /dev/disk/by-id/nvme-nvme.8086-43564654343231353030375734303042474e-494e54454c205353445045444d443430304734-00000001-part8 does not exist
2017-08-10 13:57:59,208 [salt.loaded.ext.module.osd][WARNING ][7593] Forcing OSD removal
2017-08-10 13:58:42,754 [salt.loaded.ext.module.osd][WARNING ][7593] Forcing OSD removal
2017-08-10 13:59:23,123 [salt.loaded.ext.module.osd][WARNING ][7593] Forcing OSD removal
2017-08-10 14:00:01,318 [salt.loaded.ext.module.osd][WARNING ][7593] Forcing OSD removal
```
minion log from osd-4:
```
2017-08-10 13:57:15,893 [salt.loaded.ext.module.osd][WARNING ][2846] Forcing OSD removal
2017-08-10 13:57:18,448 [salt.loaded.ext.module.osd][ERROR ][2846] Partition /dev/disk/by-id/nvme-nvme.8086-50484654363431303030304c34303042474e-494e54454c205353445045444d443430304734-00000001-part4 does not exist
2017-08-10 13:57:58,545 [salt.loaded.ext.module.osd][WARNING ][2846] Forcing OSD removal
2017-08-10 13:58:41,837 [salt.loaded.ext.module.osd][WARNING ][2846] Forcing OSD removal
2017-08-10 13:59:22,724 [salt.loaded.ext.module.osd][WARNING ][2846] Forcing OSD removal
2017-08-10 14:00:01,144 [salt.loaded.ext.module.osd][WARNING ][2846] Forcing OSD removal
```
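Both minions repeat "Forcing OSD removal" roughly every 40 seconds after the partition lookup fails, so this looks like a retry loop with no upper bound. To see what the stuck worker is actually doing on a minion (generic debugging only; the pid 2846 is taken from the osd-4 log above):

```
# Does the by-id partition from the ERROR line still exist anywhere?
ls -l /dev/disk/by-id/ | grep nvme
# What is the stuck state.sls worker blocked on?
ps -fp 2846
cat /proc/2846/wchan; echo
```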
These two servers are iSCSI gateway and RGW - and it seems that the iSCSI gateway is still up and running:

```
osd04-p:~ # targetcli ls
o- / ............................................................................... [...]
  o- backstores .................................................................... [...]
  | o- fileio .......................................................... [0 Storage Object]
  | o- iblock .......................................................... [0 Storage Object]
  | o- pscsi ........................................................... [0 Storage Object]
  | o- rbd ............................................................ [4 Storage Objects]
  | | o- hdd-hdd-rbd0 ................................... [/dev/rbd/hdd/hdd-rbd0 activated]
  | | o- hdd-hdd-rbd1 ................................... [/dev/rbd/hdd/hdd-rbd1 activated]
  | | o- ssd-ssd-rbd0 ................................... [/dev/rbd/ssd/ssd-rbd0 activated]
  | | o- ssd-ssd-rbd1 ................................... [/dev/rbd/ssd/ssd-rbd1 activated]
  | o- rd_mcp .......................................................... [0 Storage Object]
  o- ib_srpt ................................................................... [0 Targets]
  o- iscsi ...................................................................... [1 Target]
  | o- iqn.2016-11.org.linux-iscsi.igw.x86:sn.ses ................................. [2 TPGs]
  |   o- tpg1 .................................................................... [enabled]
  |   | o- acls ................................................................... [0 ACLs]
  |   | o- luns ................................................................... [4 LUNs]
  |   | | o- lun0 ............................... [rbd/ssd-ssd-rbd0 (/dev/rbd/ssd/ssd-rbd0)]
  |   | | o- lun1 ............................... [rbd/ssd-ssd-rbd1 (/dev/rbd/ssd/ssd-rbd1)]
  |   | | o- lun2 ............................... [rbd/hdd-hdd-rbd0 (/dev/rbd/hdd/hdd-rbd0)]
  |   | | o- lun3 ............................... [rbd/hdd-hdd-rbd1 (/dev/rbd/hdd/hdd-rbd1)]
  |   | o- portals .............................................................. [1 Portal]
  |   |   o- 172.16.1.54:3260 .......................................... [OK, iser disabled]
  |   o- tpg2 ................................................................... [disabled]
  |     o- acls ................................................................... [0 ACLs]
  |     o- luns ................................................................... [4 LUNs]
  |     | o- lun0 ............................... [rbd/ssd-ssd-rbd0 (/dev/rbd/ssd/ssd-rbd0)]
  |     | o- lun1 ............................... [rbd/ssd-ssd-rbd1 (/dev/rbd/ssd/ssd-rbd1)]
  |     | o- lun2 ............................... [rbd/hdd-hdd-rbd0 (/dev/rbd/hdd/hdd-rbd0)]
  |     | o- lun3 ............................... [rbd/hdd-hdd-rbd1 (/dev/rbd/hdd/hdd-rbd1)]
  |     o- portals .............................................................. [1 Portal]
  |       o- 172.16.1.52:3260 .......................................... [OK, iser disabled]
  o- loopback .................................................................. [0 Targets]
  o- qla2xxx ................................................................... [0 Targets]
  o- tcm_fc .................................................................... [0 Targets]
  o- vhost ..................................................................... [0 Targets]
```
--> Could it be that the purge does not stop and remove the iSCSI stuff before it "kills" the cluster?
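If that is the case, a manual tear-down of the iSCSI layer before running ceph.purge might work around it (a sketch, assuming the lrbd unit name from SES; clearconfig is a stock targetcli command):

```
# On each iSCSI gateway, while the cluster is still reachable:
systemctl stop lrbd.service
targetcli clearconfig confirm=true
```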