add-osd.yaml failing on stable-4.0 and master branches

eugenkoenig commented 5 years ago

Bug Report

What happened:

Cluster was running Mimic with xfs disks (non-containerized)
Cluster was upgraded to Nautilus with rolling update
If we try to add a new OSD with the add-osd.yaml playbook, it fails on all nodes with a GPT header error:

TASK [ceph-config : run 'ceph-volume lvm batch --report' to see how many osds are to be created] *******
Thursday 06 June 2019  14:56:23 +0200 (0:00:00.112)       0:01:33.426 ********* 

fatal: [ceph3]: FAILED! => changed=true 
  cmd:
  - ceph-volume
  - --cluster
  - ceph
  - lvm
  - batch
  - --bluestore
  - --yes
  - /dev/vdb
  - /dev/vdc
  - /dev/vdd
  - /dev/vde
  - --report
  - --format=json
  msg: non-zero return code
  rc: 2
  stderr: |-
    usage: ceph-volume lvm batch [-h] [--db-devices [DB_DEVICES [DB_DEVICES ...]]]
                                 [--wal-devices [WAL_DEVICES [WAL_DEVICES ...]]]
                                 [--journal-devices [JOURNAL_DEVICES [JOURNAL_DEVICES ...]]]
                                 [--no-auto] [--bluestore] [--filestore]
                                 [--report] [--yes] [--format {json,pretty}]
                                 [--dmcrypt]
                                 [--crush-device-class CRUSH_DEVICE_CLASS]
                                 [--no-systemd]
                                 [--osds-per-device OSDS_PER_DEVICE]
                                 [--block-db-size BLOCK_DB_SIZE]
                                 [--block-wal-size BLOCK_WAL_SIZE]
                                 [--journal-size JOURNAL_SIZE] [--prepare]
                                 [--osd-ids [OSD_IDS [OSD_IDS ...]]]
                                 [DEVICES [DEVICES ...]]
    ceph-volume lvm batch: error: GPT headers found, they must be removed on: /dev/vdb
  stderr_lines: <omitted>
  stdout: ''
  stdout_lines: <omitted>
fatal: [ceph2]: FAILED! => changed=true 
  cmd:
  - ceph-volume
  - --cluster
  - ceph
  - lvm
  - batch
  - --bluestore
  - --yes
  - /dev/vdb
  - /dev/vdc
  - /dev/vdd
  - /dev/vde
  - --report
  - --format=json
  msg: non-zero return code
  rc: 2
  stderr: |-
    usage: ceph-volume lvm batch [-h] [--db-devices [DB_DEVICES [DB_DEVICES ...]]]
                                 [--wal-devices [WAL_DEVICES [WAL_DEVICES ...]]]
                                 [--journal-devices [JOURNAL_DEVICES [JOURNAL_DEVICES ...]]]
                                 [--no-auto] [--bluestore] [--filestore]
                                 [--report] [--yes] [--format {json,pretty}]
                                 [--dmcrypt]
                                 [--crush-device-class CRUSH_DEVICE_CLASS]
                                 [--no-systemd]
                                 [--osds-per-device OSDS_PER_DEVICE]
                                 [--block-db-size BLOCK_DB_SIZE]
                                 [--block-wal-size BLOCK_WAL_SIZE]
                                 [--journal-size JOURNAL_SIZE] [--prepare]
                                 [--osd-ids [OSD_IDS [OSD_IDS ...]]]
                                 [DEVICES [DEVICES ...]]
    ceph-volume lvm batch: error: GPT headers found, they must be removed on: /dev/vdb
  stderr_lines: <omitted>
  stdout: ''
  stdout_lines: <omitted>
fatal: [ceph1]: FAILED! => changed=true 
  cmd:
  - ceph-volume
  - --cluster
  - ceph
  - lvm
  - batch
  - --bluestore
  - --yes
  - /dev/vdb
  - /dev/vdc
  - /dev/vdd
  - /dev/vde
  - --report
  - --format=json
  msg: non-zero return code
  rc: 2
  stderr: |-
    usage: ceph-volume lvm batch [-h] [--db-devices [DB_DEVICES [DB_DEVICES ...]]]
                                 [--wal-devices [WAL_DEVICES [WAL_DEVICES ...]]]
                                 [--journal-devices [JOURNAL_DEVICES [JOURNAL_DEVICES ...]]]
                                 [--no-auto] [--bluestore] [--filestore]
                                 [--report] [--yes] [--format {json,pretty}]
                                 [--dmcrypt]
                                 [--crush-device-class CRUSH_DEVICE_CLASS]
                                 [--no-systemd]
                                 [--osds-per-device OSDS_PER_DEVICE]
                                 [--block-db-size BLOCK_DB_SIZE]
                                 [--block-wal-size BLOCK_WAL_SIZE]
                                 [--journal-size JOURNAL_SIZE] [--prepare]
                                 [--osd-ids [OSD_IDS [OSD_IDS ...]]]
                                 [DEVICES [DEVICES ...]]
    ceph-volume lvm batch: error: GPT headers found, they must be removed on: /dev/vdb
  stderr_lines: <omitted>
  stdout: ''
  stdout_lines: <omitted>
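
For anyone triaging the same failure across many hosts, here is a minimal sketch for pulling the rejected device out of the captured stderr. The regular expression is copied from the message shown above and is an assumption about its format; other ceph-volume versions may word it differently. The helper name `rejected_devices` is hypothetical.

```python
import re
from typing import List

# Assumed message format, taken from the stderr in this report.
GPT_ERROR = re.compile(
    r"GPT headers found, they must be removed on: (?P<devices>.+)"
)


def rejected_devices(stderr: str) -> List[str]:
    """Return the device paths named in a 'GPT headers found' error, if any."""
    match = GPT_ERROR.search(stderr)
    if match is None:
        return []
    # Assumption: if several devices are reported, they are separated
    # by commas or whitespace on the same line.
    return [d for d in re.split(r"[,\s]+", match.group("devices").strip()) if d]
```

Note that in the output above only /dev/vdb is named on each host, so the result says which device was reported, not necessarily every device that carries a GPT header.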

How to reproduce it (minimal and precise):

Update your Mimic cluster to Nautilus with the rolling update
Try to add an OSD with the add-osd.yaml playbook

Environment:

OS (e.g. from /etc/os-release): 18.04.2 LTS (Bionic Beaver)
Kernel (e.g. uname -a): Linux ceph1 4.15.0-51-generic
Ansible version (e.g. ansible-playbook --version): ansible 2.8.0
ceph-ansible version (e.g. git head or tag or stable branch): stable-4.0 / master
Ceph version (e.g. ceph -v): ceph version 14.2.1

dsavineau commented 5 years ago

Could you share the content of group_vars/*.yml?

Was the Mimic deployment already using ceph-volume (osd_scenario: lvm)? AFAIK the GPT header is only present on devices deployed by ceph-disk, and ceph-disk is no longer present in Ceph Nautilus / stable-4.0.

eugenkoenig commented 5 years ago

@dsavineau it was deployed with the collocated scenario, so it was using ceph-disk. But how can we use add-osd.yaml then? Is it enough to just replace osd_scenario: collocated with osd_scenario: lvm?

group_vars/ceph.yaml:

ceph_mirror: http://download.ceph.com
ceph_origin: repository
ceph_repository: community
ceph_stable: true
ceph_stable_key: https://download.ceph.com/keys.release.asc
ceph_stable_release: nautilus
ceph_stable_repo: "{{ ceph_mirror }}/debian-{{ ceph_stable_release }}"
upgrade_ceph_packages: True

cluster: ceph
ceph_conf_key_directory: /etc/ceph
fetch_directory: fetch/
ntp_service_enabled: true

configure_firewall: false

mon_group_name: mons
osd_group_name: osds
mgr_group_name: mgrs
rgw_group_name: rgws

ceph_mgr_modules:
  - status
  - dashboard

monitor_interface: ens18
radosgw_interface: ens18

nfs_ganesha_stable: true
nfs_ganesha_stable_branch: V2.7-stable
nfs_ganesha_stable_deb_repo: "[trusted=yes] https://chacra.ceph.com/r/nfs-ganesha-stable/V2.7-stable/2356c3867730696aacc31874357b3499062fc902/ubuntu/bionic/flavors/ceph_nautilus"
nfs_file_gw: false
nfs_obj_gw: true
ceph_nfs_log_file: "/var/log/ganesha/ganesha.log"

group_vars/osds.yaml:

osd_scenario: collocated
devices:
  - '/dev/vdb'
  - '/dev/vdc'
  - '/dev/vdd'
  - '/dev/vde'

osd_mkfs_type: xfs
osd_objectstore: bluestore

guits commented 5 years ago

@andrewschoen shouldn't ceph-volume lvm batch --report simply ignore disks when it sees a GPT header instead of failing like this?

eugenkoenig commented 5 years ago

@guits true, that's the behavior I'd expect. We found that OSDs created with ceph-disk can easily be converted to LVM (ceph-volume) by following the guide in the Ceph docs [1]. That is fine for small clusters, but too much effort for clusters with hundreds of OSDs. Either ignoring those disks or running a dedicated playbook would be a better approach.

[1] http://docs.ceph.com/docs/nautilus/rados/operations/add-or-rm-osds/#replacing-an-osd

andrewschoen commented 5 years ago

@andrewschoen shouldn't ceph-volume lvm batch --report simply ignore disks when it sees GPT header instead of failing like this?

We did this on purpose: batch wants to make sure all drives given to it are usable, and any that are not are rejected. The disks would need to be zapped and/or have all GPT headers removed before being given to lvm batch.
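
To see ahead of time which devices would be affected, a read-only check for the GPT signature can be sketched as follows. This is not how ceph-volume performs its check; it only relies on the fact that a primary GPT header starts with the bytes "EFI PART" at LBA 1. The 512-byte logical sector size is an assumption (some drives use 4096), and the helper names `has_gpt_header` and `split_devices` are hypothetical.

```python
from typing import Iterable, List, Tuple

GPT_SIGNATURE = b"EFI PART"  # first 8 bytes of a GPT header


def has_gpt_header(path: str, sector_size: int = 512) -> bool:
    """Return True if the primary GPT signature is present at LBA 1.

    Assumes the given logical sector size; the backup header at the
    end of the disk is not inspected.
    """
    with open(path, "rb") as dev:
        dev.seek(sector_size)
        return dev.read(len(GPT_SIGNATURE)) == GPT_SIGNATURE


def split_devices(devices: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split devices into (no GPT signature, GPT signature found)."""
    clean: List[str] = []
    flagged: List[str] = []
    for device in devices:
        (flagged if has_gpt_header(device) else clean).append(device)
    return clean, flagged
```

Running `split_devices(["/dev/vdb", "/dev/vdc", "/dev/vdd", "/dev/vde"])` on an OSD node would need read access to the block devices (typically root). The sketch only reads and never modifies a disk; devices in the flagged list that back running ceph-disk OSDs still hold data and must not be zapped.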

guits commented 5 years ago

@andrewschoen IMO, upgrading from Mimic with ceph-disk prepared OSDs to Nautilus shouldn't require users to manually do anything about already deployed OSDs. As @styleart said, that might be acceptable for a small cluster, but what about a large cluster with hundreds of OSDs already deployed? This doesn't ease the ceph-disk to ceph-volume transition.

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

ceph / ceph-ansible

add-osd.yaml failing on stable-4.0 and master branches #4055