SUSE / DeepSea

A collection of Salt files for deploying, managing and automating Ceph.
GNU General Public License v3.0

purge does not stop iSCSI gateways before killing the cluster #527

Open Martin-Weiss opened 7 years ago

Martin-Weiss commented 7 years ago

4-node cluster running M10 with iSCSI gateways on nodes 2 and 4. Trying to kill the cluster with

salt-run disengage.safety
salt-run state.orch ceph.purge

hangs forever.

admin-p:~ # salt-run disengage.safety
[WARNING ] Although 'dmidecode' was found in path, the current user cannot execute it. Grains output might not be accurate.
safety is now disabled for cluster ceph
admin-p:~ # salt-run state.orch ceph.purge
[WARNING ] Although 'dmidecode' was found in path, the current user cannot execute it. Grains output might not be accurate.
[WARNING ] Although 'dmidecode' was found in path, the current user cannot execute it. Grains output might not be accurate.
[WARNING ] Although 'dmidecode' was found in path, the current user cannot execute it. Grains output might not be accurate.

And it seems that it hangs in two jobs:

salt/job/20170810141540766751/ret/osd02-p.ses.intern.thomas-krenn.com {
    "_stamp": "2017-08-10T12:15:40.813980",
    "cmd": "_return",
    "fun": "saltutil.find_job",
    "fun_args": [
        "20170810135712983548"
    ],
    "id": "osd02-p.ses.intern.thomas-krenn.com",
    "jid": "20170810141540766751",
    "retcode": 0,
    "return": {
        "arg": [
            "ceph.rescind",
            {
                "__kwarg__": true,
                "concurrent": false,
                "queue": false,
                "saltenv": "base"
            }
        ],
        "fun": "state.sls",
        "jid": "20170810135712983548",
        "pid": 7593,
        "ret": "",
        "tgt": "I@cluster:ceph",
        "tgt_type": "compound",
        "user": "salt"
    },
    "success": true
}
salt/job/20170810141540766751/ret/osd04-p.ses.intern.thomas-krenn.com {
    "_stamp": "2017-08-10T12:15:40.814333",
    "cmd": "_return",
    "fun": "saltutil.find_job",
    "fun_args": [
        "20170810135712983548"
    ],
    "id": "osd04-p.ses.intern.thomas-krenn.com",
    "jid": "20170810141540766751",
    "retcode": 0,
    "return": {
        "arg": [
            "ceph.rescind",
            {
                "__kwarg__": true,
                "concurrent": false,
                "queue": false,
                "saltenv": "base"
            }
        ],
        "fun": "state.sls",
        "jid": "20170810135712983548",
        "pid": 2846,
        "ret": "",
        "tgt": "I@cluster:ceph",
        "tgt_type": "compound",
        "user": "salt"
    },
    "success": true
}
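To see at a glance which minions are still busy, the two `saltutil.find_job` returns above can be reduced to one row per minion. This is only a minimal Python sketch: it assumes the payloads have already been parsed into dictionaries (abridged here to the fields it uses), `summarize_running_jobs` is a hypothetical helper that is not part of Salt or DeepSea, and it treats an empty `return` as "job no longer running on that minion".

```python
# Abridged copies of the two find_job return payloads shown above.
EVENTS = [
    {
        "id": "osd02-p.ses.intern.thomas-krenn.com",
        "return": {
            "arg": ["ceph.rescind", {"__kwarg__": True, "saltenv": "base"}],
            "fun": "state.sls",
            "jid": "20170810135712983548",
            "pid": 7593,
        },
    },
    {
        "id": "osd04-p.ses.intern.thomas-krenn.com",
        "return": {
            "arg": ["ceph.rescind", {"__kwarg__": True, "saltenv": "base"}],
            "fun": "state.sls",
            "jid": "20170810135712983548",
            "pid": 2846,
        },
    },
]


def summarize_running_jobs(events):
    """Return (minion, jid, function, first positional arg, pid) per busy minion.

    Hypothetical helper: an empty or missing "return" is assumed to mean
    the job is not running on that minion, so the event is skipped.
    """
    rows = []
    for event in events:
        job = event.get("return") or {}
        if not job:
            continue
        args = job.get("arg") or []
        # The first positional argument of state.sls is the SLS name.
        first_arg = args[0] if args and isinstance(args[0], str) else ""
        rows.append((event["id"], job["jid"], job["fun"], first_arg, job["pid"]))
    return rows


for row in summarize_running_jobs(EVENTS):
    print(row)
```

For the events in this report, both rows point at the same job: `state.sls ceph.rescind` with jid 20170810135712983548, still running on osd02-p and osd04-p.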

minion log from osd-2:

2017-08-10 13:57:15,942 [salt.loaded.ext.module.osd][WARNING ][7593] Forcing OSD removal
2017-08-10 13:57:18,504 [salt.loaded.ext.module.osd][ERROR ][7593] Partition /dev/disk/by-id/nvme-nvme.8086-43564654343231353030375734303042474e-494e54454c205353445045444d443430304734-00000001-part8 does not exist
2017-08-10 13:57:59,208 [salt.loaded.ext.module.osd][WARNING ][7593] Forcing OSD removal
2017-08-10 13:58:42,754 [salt.loaded.ext.module.osd][WARNING ][7593] Forcing OSD removal
2017-08-10 13:59:23,123 [salt.loaded.ext.module.osd][WARNING ][7593] Forcing OSD removal
2017-08-10 14:00:01,318 [salt.loaded.ext.module.osd][WARNING ][7593] Forcing OSD removal

minion log from osd-4:

2017-08-10 13:57:15,893 [salt.loaded.ext.module.osd][WARNING ][2846] Forcing OSD removal
2017-08-10 13:57:18,448 [salt.loaded.ext.module.osd][ERROR ][2846] Partition /dev/disk/by-id/nvme-nvme.8086-50484654363431303030304c34303042474e-494e54454c205353445045444d443430304734-00000001-part4 does not exist
2017-08-10 13:57:58,545 [salt.loaded.ext.module.osd][WARNING ][2846] Forcing OSD removal
2017-08-10 13:58:41,837 [salt.loaded.ext.module.osd][WARNING ][2846] Forcing OSD removal
2017-08-10 13:59:22,724 [salt.loaded.ext.module.osd][WARNING ][2846] Forcing OSD removal
2017-08-10 14:00:01,144 [salt.loaded.ext.module.osd][WARNING ][2846] Forcing OSD removal

These two servers are iSCSI gateway and RGW, and it seems that the iSCSI gateway is still up and running:

osd04-p:~ # targetcli ls
o- / ...................................................... [...]
  o- backstores ........................................... [...]
  | o- fileio ............................................. [0 Storage Object]
  | o- iblock ............................................. [0 Storage Object]
  | o- pscsi .............................................. [0 Storage Object]
  | o- rbd ................................................ [4 Storage Objects]
  | | o- hdd-hdd-rbd0 ..................................... [/dev/rbd/hdd/hdd-rbd0 activated]
  | | o- hdd-hdd-rbd1 ..................................... [/dev/rbd/hdd/hdd-rbd1 activated]
  | | o- ssd-ssd-rbd0 ..................................... [/dev/rbd/ssd/ssd-rbd0 activated]
  | | o- ssd-ssd-rbd1 ..................................... [/dev/rbd/ssd/ssd-rbd1 activated]
  | o- rd_mcp ............................................. [0 Storage Object]
  o- ib_srpt .............................................. [0 Targets]
  o- iscsi ................................................ [1 Target]
  | o- iqn.2016-11.org.linux-iscsi.igw.x86:sn.ses ......... [2 TPGs]
  |   o- tpg1 ............................................. [enabled]
  |   | o- acls ........................................... [0 ACLs]
  |   | o- luns ........................................... [4 LUNs]
  |   | | o- lun0 ......................................... [rbd/ssd-ssd-rbd0 (/dev/rbd/ssd/ssd-rbd0)]
  |   | | o- lun1 ......................................... [rbd/ssd-ssd-rbd1 (/dev/rbd/ssd/ssd-rbd1)]
  |   | | o- lun2 ......................................... [rbd/hdd-hdd-rbd0 (/dev/rbd/hdd/hdd-rbd0)]
  |   | | o- lun3 ......................................... [rbd/hdd-hdd-rbd1 (/dev/rbd/hdd/hdd-rbd1)]
  |   | o- portals ........................................ [1 Portal]
  |   |   o- 172.16.1.54:3260 ............................. [OK, iser disabled]
  |   o- tpg2 ............................................. [disabled]
  |     o- acls ........................................... [0 ACLs]
  |     o- luns ........................................... [4 LUNs]
  |     | o- lun0 ......................................... [rbd/ssd-ssd-rbd0 (/dev/rbd/ssd/ssd-rbd0)]
  |     | o- lun1 ......................................... [rbd/ssd-ssd-rbd1 (/dev/rbd/ssd/ssd-rbd1)]
  |     | o- lun2 ......................................... [rbd/hdd-hdd-rbd0 (/dev/rbd/hdd/hdd-rbd0)]
  |     | o- lun3 ......................................... [rbd/hdd-hdd-rbd1 (/dev/rbd/hdd/hdd-rbd1)]
  |     o- portals ........................................ [1 Portal]
  |       o- 172.16.1.52:3260 ............................. [OK, iser disabled]
  o- loopback ............................................. [0 Targets]
  o- qla2xxx .............................................. [0 Targets]
  o- tcm_fc ............................................... [0 Targets]
  o- vhost ................................................ [0 Targets]

--> Could it be that the purge does not stop and remove the iSCSI stuff before it "kills" the cluster?

Martin-Weiss commented 6 years ago

Today I ran into this problem again. Is anyone working on this?

Martin-Weiss commented 6 years ago

One more thing: I noticed that this happens during the purge:

root 18078 0.0 0.0 11772 2244 ? S 14:29 0:00 /bin/bash -c . /etc/sysconfig/lrbd; /usr/sbin/lrbd $LRBD_OPTIONS -W || :
root 18079 0.0 0.0 626332 32328 ? Sl 14:29 0:00 /usr/bin/python /usr/sbin/lrbd -n client.igw.osd04-p -W

--> Could it be that this is part of the uninstall script of the lrbd rpm?
--> Could it be that this hangs forever if the cluster is not reachable?
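Whether `lrbd -W` really blocks when the cluster is unreachable is not confirmed in this thread. If it does, one general way for a caller to avoid waiting forever is to put a time limit on the command. The Python sketch below only illustrates that idea: `run_with_timeout` is a hypothetical helper, not DeepSea or lrbd code, and a sleeping Python process stands in for the hanging command.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return its exit code, or None if it exceeds timeout_s seconds.

    Hypothetical helper for illustration only. On timeout, subprocess.run
    kills the child and waits for it before raising TimeoutExpired, so no
    stray process is left behind.
    """
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        return None


# Stand-in for a command that never returns (assumption: a blocked lrbd call
# would behave similarly from the caller's point of view).
hanging_cmd = [sys.executable, "-c", "import time; time.sleep(30)"]

if run_with_timeout(hanging_cmd, timeout_s=1) is None:
    print("command timed out; continuing with the rest of the teardown")
```

The design choice here is to treat a timeout as a reportable condition instead of a fatal error, so that a teardown could carry on even when one step cannot reach the cluster.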