SUSE / DeepSea

A collection of Salt files for deploying, managing and automating Ceph.
GNU General Public License v3.0

deepsea can not create OSD but does not throw an error #1093

Open · Martin-Weiss opened this issue 6 years ago

Martin-Weiss commented 6 years ago

Description of Issue/Question

We deploy a new cluster and we get 7 out of 8 configured OSDs created. One fails continuously to be created but stage.3 does not show any error! We see that with every rerun of stage.3 two additional partitions are created on the NVMe (one for the WAL and one for the RocksDB of the OSD that fails to be created).
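The growing number of partitions can be checked directly on the NVMe device of the affected node, for example with (device name as used in the logs below; the exact output will differ per node):

lsblk /dev/nvme0n1
sgdisk --print /dev/nvme0n1   # every failed stage.3 run leaves one extra 2G WAL and one extra 50G DB partition behind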

Setup

Cluster of 4 OSD nodes + 1 admin node, Super Micro hardware, BlueStore setup with NVMe and spinners. 32 OSDs in total; each OSD node has 2 NVMe devices (NVMe-to-spinner ratio 1:4).

Steps to Reproduce Issue

run / re-run stage.3
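i.e. the usual DeepSea orchestration call for stage 3:

salt-run state.orch ceph.stage.3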

Versions Report

(Provided by running the commands below.)

salt-run deepsea.version

0.8.2+git.0.6b39c2648

rpm -qi salt-minion

dd-ses-adm:~ # rpm -qi salt-minion
Name        : salt-minion
Version     : 2016.11.4
Release     : 46.17.1
Architecture: x86_64
Install Date: Tue Apr 17 13:59:37 2018
Group       : System/Management
Size        : 38128
License     : Apache-2.0
Signature   : RSA/SHA256, Fri Feb 9 15:06:01 2018, Key ID 70af9e8139db7c82
Source RPM  : salt-2016.11.4-46.17.1.src.rpm
Build Date  : Fri Feb 9 15:03:24 2018
Build Host  : sheep26
Relocations : (not relocatable)
Packager    : https://www.suse.com/
Vendor      : SUSE LLC https://www.suse.com/
URL         : http://saltstack.org/
Summary     : The client component for Saltstack
Description : Salt minion is queried and controlled from the master. Listens to the salt master and execute the commands.
Distribution: SUSE Linux Enterprise 12

rpm -qi salt-master

dd-ses-adm:~ # rpm -qi salt-master
Name        : salt-master
Version     : 2016.11.4
Release     : 46.17.1
Architecture: x86_64
Install Date: Tue Apr 17 13:57:40 2018
Group       : System/Management
Size        : 1662872
License     : Apache-2.0
Signature   : RSA/SHA256, Fri Feb 9 15:06:01 2018, Key ID 70af9e8139db7c82
Source RPM  : salt-2016.11.4-46.17.1.src.rpm
Build Date  : Fri Feb 9 15:03:24 2018
Build Host  : sheep26
Relocations : (not relocatable)
Packager    : https://www.suse.com/
Vendor      : SUSE LLC https://www.suse.com/
URL         : http://saltstack.org/
Summary     : The management component of Saltstack with zmq protocol supported
Description : The Salt master is the central server to which all minions connect. Enabled commands to remote systems to be called in parallel rather than serially.
Distribution: SUSE Linux Enterprise 12

Debug from one salt-minion for the failing command:

2018-04-17 17:27:17,704 [salt.loaded.ext.module.osd][DEBUG   ][16550] found partition 14 on device /dev/nvme0n1
2018-04-17 17:27:17,704 [salt.loaded.ext.module.osd][DEBUG   ][16550] Found [] partitions on /dev/sdc
2018-04-17 17:27:17,705 [salt.loaded.ext.module.osd][INFO    ][16550] prepare: PYTHONWARNINGS=ignore ceph-disk -v prepare --bluestore --data-dev --journal-dev --cluster ceph --cluster-uuid b72520db-a5fa-3a77-b236-27f522ca81e4 --block.wal /dev/nvme0n1p13 --block.db /dev/nvme0n1p14 /dev/sdc
2018-04-17 17:27:17,705 [salt.loaded.ext.module.osd][INFO    ][16550] PYTHONWARNINGS=ignore ceph-disk -v prepare --bluestore --data-dev --journal-dev --cluster ceph --cluster-uuid b72520db-a5fa-3a77-b236-27f522ca81e4 --block.wal /dev/nvme0n1p13 --block.db /dev/nvme0n1p14 /dev/sdc
2018-04-17 17:27:18,107 [salt.loaded.ext.module.osd][DEBUG   ][16550] return code: 1
2018-04-17 17:27:18,107 [salt.loaded.ext.module.osd][DEBUG   ][16550]
2018-04-17 17:27:18,108 [salt.loaded.ext.module.osd][DEBUG   ][16550]
2018-04-17 17:27:18,108 [salt.loaded.ext.module.osd][DEBUG   ][16550] ''
2018-04-17 17:27:18,110 [salt.loaded.ext.module.osd][DEBUG   ][16550] "get_dm_uuid: get_dm_uuid /dev/sdc uuid path is /sys/dev/block/8:32/dm/uuid\nset_type: Will colocate block with data on /dev/sdc\ncommand: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup bluestore_block_size\ncommand: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup bluestore_block_db_size\ncommand: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup bluestore_block_size\ncommand: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup bluestore_block_wal_size\nget_dm_uuid: get_dm_uuid /dev/sdc uuid path is /sys/dev/block/8:32/dm/uuid\nget_dm_uuid: get_dm_uuid /dev/sdc uuid path is /sys/dev/block/8:32/dm/uuid\nget_dm_uuid: get_dm_uuid /dev/sdc uuid path is /sys/dev/block/8:32/dm/uuid\ncommand: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_mkfs_type\ncommand: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_fs_type\ncommand: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_mkfs_options_xfs\ncommand: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_fs_mkfs_options_xfs\ncommand: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_mount_options_xfs\ncommand: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_fs_mount_options_xfs\nget_dm_uuid: get_dm_uuid /dev/sdc uuid path is /sys/dev/block/8:32/dm/uuid\nset_data_partition: Creating osd partition on /dev/sdc\nget_dm_uuid: get_dm_uuid /dev/sdc uuid path is /sys/dev/block/8:32/dm/uuid\nptype_tobe_for_name: name = data\nget_dm_uuid: get_dm_uuid /dev/sdc uuid path is /sys/dev/block/8:32/dm/uuid\ncreate_partition: Creating data partition num 1 size 100 on /dev/sdc\ncommand_check_call: Running command: /usr/sbin/sgdisk --new=1:0:+100M --change-name=1:ceph data --partition-guid=1:65bb90b9-5ee6-494f-9c3c-4cb4d63b4374 --typecode=1:89c57f98-2fe5-4dc0-89c1-f3ad0ceff2be --mbrtogpt -- /dev/sdc\n\x07Caution: invalid main GPT header, but valid backup; regenerating main header\nfrom backup!\n\nInvalid partition data!\n'/usr/sbin/sgdisk --new=1:0:+100M --change-name=1:ceph data --partition-guid=1:65bb90b9-5ee6-494f-9c3c-4cb4d63b4374 --typecode=1:89c57f98-2fe5-4dc0-89c1-f3ad0ceff2be --mbrtogpt -- /dev/sdc' failed with status code 2\n"
2018-04-17 17:27:18,111 [salt.loaded.ext.module.osd][DEBUG   ][16550] format: bluestore
2018-04-17 17:27:18,111 [salt.loaded.ext.module.osd][INFO    ][16550] activate: PYTHONWARNINGS=ignore ceph-disk -v activate --mark-init systemd --mount /dev/sdc1
2018-04-17 17:27:18,111 [salt.loaded.ext.module.osd][INFO    ][16550] PYTHONWARNINGS=ignore ceph-disk -v activate --mark-init systemd --mount /dev/sdc1
2018-04-17 17:27:18,244 [salt.loaded.ext.module.osd][DEBUG   ][16550] return code: 1
2018-04-17 17:27:18,245 [salt.loaded.ext.module.osd][DEBUG   ][16550]
2018-04-17 17:27:18,245 [salt.loaded.ext.module.osd][DEBUG   ][16550]
2018-04-17 17:27:18,245 [salt.loaded.ext.module.osd][DEBUG   ][16550] ''
2018-04-17 17:27:18,246 [salt.loaded.ext.module.osd][DEBUG   ][16550] 'main_activate: path = /dev/sdc1\nTraceback (most recent call last):\n  File "/usr/sbin/ceph-disk", line 9, in <module>\n    load_entry_point(\'ceph-disk==1.0.0\', \'console_scripts\', \'ceph-disk\')()\n  File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 5736, in run\n    main(sys.argv[1:])\n  File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 5674, in main\n    args.func(args)\n  File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 3740, in main_activate\n    raise Error(\'%s does not exist\' % args.path)\nceph_disk.main.Error: Error: /dev/sdc1 does not exist\n'
2018-04-17 17:27:18,247 [salt.state       ][INFO    ][16550] {'ret': None}
2018-04-17 17:27:18,247 [salt.state       ][INFO    ][16550] Completed state [osd.deploy] at time 17:27:18.247256 duration_in_ms=13715.02
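The sgdisk error buried in the output above ("invalid main GPT header, but valid backup") can be checked directly on the affected OSD node, for example:

sgdisk --verify /dev/sdc   # reports problems with the GPT structures on the disk
sgdisk --print /dev/sdc    # shows what sgdisk currently sees on the disk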

Pillar for the disk:

/dev/disk/by-id/scsi-35000cca25e49eaec:
  db: /dev/disk/by-id/nvme-INTEL_SSDPEDMD400G4_CVFT730500G6400LGN
  db_size: 50G
  format: bluestore
  wal: /dev/disk/by-id/nvme-INTEL_SSDPEDMD400G4_CVFT730500G6400LGN
  wal_size: 2G

The grain does not contain that disk (neither /dev/sdc nor /dev/disk/by-id/scsi-35000cca25e49eaec).

Any idea what the error 1 in "return code: 1" means for the command that is executed?

PYTHONWARNINGS=ignore ceph-disk -v prepare --bluestore --data-dev --journal-dev --cluster ceph --cluster-uuid b72520db-a5fa-3a77-b236-27f522ca81e4 --block.wal /dev/nvme0n1p13 --block.db /dev/nvme0n1p14 /dev/sdc
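One way to see where the non-zero return code comes from is to re-run the same prepare command by hand on the OSD node and check the shell exit status:

PYTHONWARNINGS=ignore ceph-disk -v prepare --bluestore --data-dev --journal-dev --cluster ceph --cluster-uuid b72520db-a5fa-3a77-b236-27f522ca81e4 --block.wal /dev/nvme0n1p13 --block.db /dev/nvme0n1p14 /dev/sdc
echo $?   # the 1 here just mirrors the sgdisk failure ("failed with status code 2") shown further up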

Martin-Weiss commented 6 years ago

minion.log

Martin-Weiss commented 6 years ago

dd-sesp-01:/dev/disk/by-id # PYTHONWARNINGS=ignore ceph-disk -v prepare --bluestore --data-dev --journal-dev --cluster ceph --cluster-uuid b72520db-a5fa-3a77-b236-27f522ca81e4 --block.wal /dev/nvme0n1p7 --block.db /dev/nvme0n1p8 /dev/sdc
get_dm_uuid: get_dm_uuid /dev/sdc uuid path is /sys/dev/block/8:32/dm/uuid
set_type: Will colocate block with data on /dev/sdc
command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup bluestore_block_size
command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup bluestore_block_db_size
command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup bluestore_block_size
command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup bluestore_block_wal_size
get_dm_uuid: get_dm_uuid /dev/sdc uuid path is /sys/dev/block/8:32/dm/uuid
get_dm_uuid: get_dm_uuid /dev/sdc uuid path is /sys/dev/block/8:32/dm/uuid
get_dm_uuid: get_dm_uuid /dev/sdc uuid path is /sys/dev/block/8:32/dm/uuid
command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_mkfs_type
command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_fs_type
command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_mkfs_options_xfs
command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_fs_mkfs_options_xfs
command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_mount_options_xfs
command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_fs_mount_options_xfs
get_dm_uuid: get_dm_uuid /dev/sdc uuid path is /sys/dev/block/8:32/dm/uuid
set_data_partition: Creating osd partition on /dev/sdc
get_dm_uuid: get_dm_uuid /dev/sdc uuid path is /sys/dev/block/8:32/dm/uuid
ptype_tobe_for_name: name = data
get_dm_uuid: get_dm_uuid /dev/sdc uuid path is /sys/dev/block/8:32/dm/uuid
create_partition: Creating data partition num 1 size 100 on /dev/sdc
command_check_call: Running command: /usr/sbin/sgdisk --new=1:0:+100M --change-name=1:ceph data --partition-guid=1:4576c197-6bc7-4f90-b000-979c6effdace --typecode=1:89c57f98-2fe5-4dc0-89c1-f3ad0ceff2be --mbrtogpt -- /dev/sdc
Caution: invalid main GPT header, but valid backup; regenerating main header
from backup!

Invalid partition data!
'/usr/sbin/sgdisk --new=1:0:+100M --change-name=1:ceph data --partition-guid=1:4576c197-6bc7-4f90-b000-979c6effdace --typecode=1:89c57f98-2fe5-4dc0-89c1-f3ad0ceff2be --mbrtogpt -- /dev/sdc' failed with status code 2

--> It looks like the disk had a GPT partition table at some point in the past that was never cleaned up properly!
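As a workaround, the stale GPT data can be cleared manually before re-running stage.3, for example (destructive for anything still on that disk, so only on the disk that is meant to become the OSD):

ceph-disk zap /dev/sdc      # clears the partition table and wipes the start of the disk
# or, without ceph-disk:
sgdisk --zap-all /dev/sdc
wipefs --all /dev/sdc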

Two questions:

  1. Why does stage.3 not fail with an error / not handle the return code 1?
  2. Could a disk zap be added BEFORE ceph-disk -v prepare to ensure the disk is really clean before the OSD is added?
Martin-Weiss commented 6 years ago

P.S. The error handling is really required, because otherwise every re-run of stage.3 creates additional unused partitions (for the WAL and RocksDB) over and over again, which then have to be cleaned up manually.
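For reference, the manual cleanup of such leftover partitions currently looks something like this (partition numbers 13 and 14 are only the ones from the log above and will differ per node):

sgdisk --delete=13 /dev/nvme0n1   # orphaned WAL partition
sgdisk --delete=14 /dev/nvme0n1   # orphaned RocksDB partition
partprobe /dev/nvme0n1            # let the kernel re-read the partition table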

Maybe we could also add a feature in the future to find and clean up "orphaned" partitions?