SUSE / DeepSea

A collection of Salt files for deploying, managing and automating Ceph.
GNU General Public License v3.0

osd.remove fails to zap devices on ceph version 14.2.3-349 #1747

Open jschmid1 opened 4 years ago

jschmid1 commented 4 years ago

ceph version 14.2.3-349-g7b1552ea82 (7b1552ea827cf5167b6edbba96dd1c4a9dc16937) nautilus (stable)

salt-run osd.remove $id

uses ceph-volume lvm zap --osd-id $id --destroy to zap a disk remotely on the minion.

In previous releases we expected the string "Zapping successful for OSD" in the return message.

With this release we get: --> Zapping: /dev/ceph-a8e4a78d-e3....

Since there are no significant changes that would explain a different return string, I assume it's due to the logging changes in recent commits (mlogger vs terminal.success).

This raises the question of whether invoking shell commands is the right approach when there is a Python API we could use. A couple of things need to be verified first, though (see the sketch after this list):

1) Is the zap command consumable via the API?
2) Does it return meaningful messages?
3) Is it more efficient, given that it needs to be wrapped in a minion module to be called from the master (via a runner)?
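
For reference, a minimal sketch of what a minion-side wrapper could look like if we keep shelling out but stop parsing the output string (module and function names here are illustrative, not DeepSea's actual code):

    # ceph_zap.py -- hypothetical Salt execution module on the minion.
    # Relies on ceph-volume's exit code instead of grepping for
    # "Zapping successful", which moved to stderr in 14.2.3-349.

    def zap_osd(osd_id):
        '''
        Zap the devices backing an OSD by id. Returns True on success.

        CLI Example:
            salt '<minion>' ceph_zap.zap_osd 3
        '''
        result = __salt__['cmd.run_all'](
            'ceph-volume lvm zap --osd-id {} --destroy'.format(osd_id))
        # __salt__ is injected by the minion; cmd.run_all returns retcode,
        # stdout and stderr separately, so the success check does not depend
        # on where ceph-volume decides to log its progress messages.
        return result['retcode'] == 0

The master-side runner would then call this over the usual master-to-minion path, and only the exit code would matter.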

smithfarm commented 4 years ago

Ideally, DeepSea will still work with earlier versions of nautilus even after this is fixed.

jschmid1 commented 4 years ago

maybe @jan--f can confirm my assumption regarding the logging changes.

jan--f commented 4 years ago

In 14.2.3, ceph-volume should only print logging messages to stderr. I guess the runner returns stderr? I'll look into it.
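
A quick way to see exactly which stream those messages land on (and what the exit code is) would be something like the following, run on the salt master; the minion id and device are just placeholders:

    # Uses Salt's Python client API to call cmd.run_all on a minion and
    # print stdout, stderr and the return code separately.
    import salt.client

    local = salt.client.LocalClient()
    ret = local.cmd(
        'data2-6.virt1.home.fajerski.name',           # example minion
        'cmd.run_all',
        ['ceph-volume lvm zap --destroy /dev/vdb'],   # example device
    )
    for minion, result in ret.items():
        print(minion)
        print('retcode:', result['retcode'])
        print('stdout :', result['stdout'])
        print('stderr :', result['stderr'])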

jan--f commented 4 years ago

Hmm, I just saw this: I ran salt '*' cmd.run 'for d in b c d e f; do ceph-volume lvm zap --destroy /dev/vd$d; done' from the salt master and got output like:

data2-6.virt1.home.fajerski.name:
    --> Zapping: /dev/vdb
    --> Zapping: /dev/vdc
    --> Zapping: /dev/vdd
    --> Zapping: /dev/vde
    --> Zapping: /dev/vdf

But a subsequent lsblk revealed that no LVs were actually zapped. Running the same command directly on the minion (minus the salt part, of course) zapped the disks just fine. No idea what is going on here; I will investigate more tomorrow.

jan--f commented 4 years ago

Here is the issue:


[2019-09-17 11:01:39,104][ceph_volume][ERROR ] exception caught by decorator
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in newfunc
    return f(*a, **kw)
  File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 148, in main
    terminal.dispatch(self.mapper, subcommand_args)
  File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 205, in dispatch
    instance.main()
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/main.py", line 40, in main
    terminal.dispatch(self.mapper, self.argv)
  File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 205, in dispatch
    instance.main()
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/zap.py", line 355, in main
    self.zap()
  File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root
    return func(*a, **kw)
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/zap.py", line 233, in zap
    self.zap_lvm_member(device)
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/zap.py", line 198, in zap_lvm_member
    self.zap_lv(Device(lv.lv_path))
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/zap.py", line 144, in zap_lv
    self.unmount_lv(lv)
  File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/zap.py", line 133, in unmount_lv
    mlogger.info("Unmounting %s", lv_path)
  File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 190, in info
    info(record)
  File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 142, in info
    return _Write(prefix=blue_arrow).raw(msg)
  File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 117, in raw
    self.write(string)
  File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 120, in write
    self._writer.write(self.prefix + line + self.suffix)
ValueError: I/O operation on closed file.
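
So the zap run apparently dies inside ceph-volume's terminal writer (the mlogger path) when it tries to write a progress message to a stream that is already closed in the non-interactive salt environment, and it never gets as far as actually wiping the LVs, which would explain the lsblk observation above. The failure mode itself is easy to reproduce standalone:

    # Minimal illustration (independent of ceph-volume): writing to a stream
    # that has already been closed raises the same ValueError.
    import io

    stream = io.StringIO()
    stream.close()
    try:
        stream.write('--> Zapping: /dev/vdb\n')
    except ValueError as err:
        print(err)   # "I/O operation on closed file"
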
smithfarm commented 4 years ago

So it would seem that in the CI tests which are now passing with the temporary fix, the OSDs are being removed but the underlying disks are not really getting zapped (AFAIK the tests do not include any logic for verifying the zap).
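
If the tests ever want to verify the zap, one cheap (if partial) check would be to assert that the wiped device is no longer an LVM physical volume; a rough sketch, with placeholder minion/device names:

    # Hypothetical post-zap verification helper for the CI tests.
    import salt.client

    def device_is_lvm_pv(minion, device):
        '''Return True if the device is still an LVM physical volume.'''
        local = salt.client.LocalClient()
        ret = local.cmd(minion, 'cmd.run_all',
                        ['pvs --noheadings {}'.format(device)])
        # pvs exits non-zero when the named device is not (or no longer) a PV,
        # which is what we expect after a successful "lvm zap --destroy".
        return ret[minion]['retcode'] == 0

    # Example: after zapping, this should be False.
    # device_is_lvm_pv('data2-6.virt1.home.fajerski.name', '/dev/vdb')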

jan--f commented 4 years ago

> So it would seem that in the CI tests which are now passing with the temporary fix, the OSDs are being removed but the underlying disks are not really getting zapped (AFAIK the tests do not include any logic for verifying the zap).

That would be entirely plausible.

jan--f commented 4 years ago

OK, I can confirm that the new ceph build fixes this issue. However, there still seems to be something up with purge, where the OSDs are not stopped; once they are stopped, zapping them works just fine (see the sketch after the output below).


salt 'data1*' cmd.run 'for d in b c d e f; do ceph-volume lvm zap --destroy /dev/vd$d; echo $?; done'
data1-6.virt1.home.fajerski.name:
    --> Zapping: /dev/vdb
    --> Unmounting /var/lib/ceph/osd/ceph-3
    Running command: /bin/umount -v /var/lib/ceph/osd/ceph-3
     stderr: umount: /var/lib/ceph/osd/ceph-3 unmounted
    Running command: /usr/sbin/wipefs --all /dev/ceph-58736349-a5d6-4966-9598-f7ed4082441b/osd-data-bcb8791a-954d-417b-b016-fa08d9a62885
    Running command: /bin/dd if=/dev/zero of=/dev/ceph-58736349-a5d6-4966-9598-f7ed4082441b/osd-data-bcb8791a-954d-417b-b016-fa08d9a62885 bs=1M count=10
    --> Only 1 LV left in VG, will proceed to destroy volume group ceph-58736349-a5d6-4966-9598-f7ed4082441b
    Running command: /usr/sbin/vgremove -v -f ceph-58736349-a5d6-4966-9598-f7ed4082441b
     stderr: Removing ceph--58736349--a5d6--4966--9598--f7ed4082441b-osd--data--bcb8791a--954d--417b--b016--fa08d9a62885 (253:2)
     stderr: Archiving volume group "ceph-58736349-a5d6-4966-9598-f7ed4082441b" metadata (seqno 21).
        Releasing logical volume "osd-data-bcb8791a-954d-417b-b016-fa08d9a62885"
     stderr: Creating volume group backup "/etc/lvm/backup/ceph-58736349-a5d6-4966-9598-f7ed4082441b" (seqno 22).
     stdout: Logical volume "osd-data-bcb8791a-954d-417b-b016-fa08d9a62885" successfully removed
     stderr: Removing physical volume "/dev/vdb" from volume group "ceph-58736349-a5d6-4966-9598-f7ed4082441b"
     stdout: Volume group "ceph-58736349-a5d6-4966-9598-f7ed4082441b" successfully removed
    Running command: /usr/sbin/wipefs --all /dev/vdb
     stdout: /dev/vdb: 8 bytes were erased at offset 0x00000218 (LVM2_member): 4c 56 4d 32 20 30 30 31
    Running command: /bin/dd if=/dev/zero of=/dev/vdb bs=1M count=10
    --> Zapping successful for: <Raw Device: /dev/vdb>
    0
    --> Zapping: /dev/vdc
    --> Unmounting /var/lib/ceph/osd/ceph-9
    Running command: /bin/umount -v /var/lib/ceph/osd/ceph-9
     stderr: umount: /var/lib/ceph/osd/ceph-9 unmounted
    Running command: /usr/sbin/wipefs --all /dev/ceph-df617cae-42bb-43ca-97ba-0d01b8ef2d39/osd-data-b8281429-e4d0-4d6e-ac64-38dae8d1a270
    Running command: /bin/dd if=/dev/zero of=/dev/ceph-df617cae-42bb-43ca-97ba-0d01b8ef2d39/osd-data-b8281429-e4d0-4d6e-ac64-38dae8d1a270 bs=1M count=10
    --> Only 1 LV left in VG, will proceed to destroy volume group ceph-df617cae-42bb-43ca-97ba-0d01b8ef2d39
    Running command: /usr/sbin/vgremove -v -f ceph-df617cae-42bb-43ca-97ba-0d01b8ef2d39
     stderr: Removing ceph--df617cae--42bb--43ca--97ba--0d01b8ef2d39-osd--data--b8281429--e4d0--4d6e--ac64--38dae8d1a270 (253:4)
     stderr: Archiving volume group "ceph-df617cae-42bb-43ca-97ba-0d01b8ef2d39" metadata (seqno 21).
        Releasing logical volume "osd-data-b8281429-e4d0-4d6e-ac64-38dae8d1a270"
     stderr: Creating volume group backup "/etc/lvm/backup/ceph-df617cae-42bb-43ca-97ba-0d01b8ef2d39" (seqno 22).
     stdout: Logical volume "osd-data-b8281429-e4d0-4d6e-ac64-38dae8d1a270" successfully removed
     stderr: Removing physical volume "/dev/vdc" from volume group "ceph-df617cae-42bb-43ca-97ba-0d01b8ef2d39"
     stdout: Volume group "ceph-df617cae-42bb-43ca-97ba-0d01b8ef2d39" successfully removed
    Running command: /usr/sbin/wipefs --all /dev/vdc
     stdout: /dev/vdc: 8 bytes were erased at offset 0x00000218 (LVM2_member): 4c 56 4d 32 20 30 30 31
    Running command: /bin/dd if=/dev/zero of=/dev/vdc bs=1M count=10
    --> Zapping successful for: <Raw Device: /dev/vdc>
    0
    --> Zapping: /dev/vdd
    --> Unmounting /var/lib/ceph/osd/ceph-14
    Running command: /bin/umount -v /var/lib/ceph/osd/ceph-14
     stderr: umount: /var/lib/ceph/osd/ceph-14 unmounted
    Running command: /usr/sbin/wipefs --all /dev/ceph-686e5dc6-00e8-46ba-b109-d93623a9f60d/osd-data-65a0439a-9543-4e7b-a94f-11a2ff373241
    Running command: /bin/dd if=/dev/zero of=/dev/ceph-686e5dc6-00e8-46ba-b109-d93623a9f60d/osd-data-65a0439a-9543-4e7b-a94f-11a2ff373241 bs=1M count=10
    --> Only 1 LV left in VG, will proceed to destroy volume group ceph-686e5dc6-00e8-46ba-b109-d93623a9f60d
    Running command: /usr/sbin/vgremove -v -f ceph-686e5dc6-00e8-46ba-b109-d93623a9f60d
     stderr: Removing ceph--686e5dc6--00e8--46ba--b109--d93623a9f60d-osd--data--65a0439a--9543--4e7b--a94f--11a2ff373241 (253:0)
     stderr: Archiving volume group "ceph-686e5dc6-00e8-46ba-b109-d93623a9f60d" metadata (seqno 21).
        Releasing logical volume "osd-data-65a0439a-9543-4e7b-a94f-11a2ff373241"
     stderr: Creating volume group backup "/etc/lvm/backup/ceph-686e5dc6-00e8-46ba-b109-d93623a9f60d" (seqno 22).
     stdout: Logical volume "osd-data-65a0439a-9543-4e7b-a94f-11a2ff373241" successfully removed
     stderr: Removing physical volume "/dev/vdd" from volume group "ceph-686e5dc6-00e8-46ba-b109-d93623a9f60d"
     stdout: Volume group "ceph-686e5dc6-00e8-46ba-b109-d93623a9f60d" successfully removed
    Running command: /usr/sbin/wipefs --all /dev/vdd
     stdout: /dev/vdd: 8 bytes were erased at offset 0x00000218 (LVM2_member): 4c 56 4d 32 20 30 30 31
    Running command: /bin/dd if=/dev/zero of=/dev/vdd bs=1M count=10
    --> Zapping successful for: <Raw Device: /dev/vdd>
    0
    --> Zapping: /dev/vde
    Running command: /usr/sbin/wipefs --all /dev/vde
     stdout: /dev/vde: 8 bytes were erased at offset 0x00000218 (LVM2_member): 4c 56 4d 32 20 30 30 31
    Running command: /bin/dd if=/dev/zero of=/dev/vde bs=1M count=10
    --> Zapping successful for: <Raw Device: /dev/vde>
    0
    --> Zapping: /dev/vdf
    --> Unmounting /var/lib/ceph/osd/ceph-24
    Running command: /bin/umount -v /var/lib/ceph/osd/ceph-24
     stderr: umount: /var/lib/ceph/osd/ceph-24 unmounted
    Running command: /usr/sbin/wipefs --all /dev/ceph-860f342f-8a0c-4746-a906-0eb32c9847dc/osd-data-9c48ec80-46f9-4184-ad77-fd4ac6e94126
    Running command: /bin/dd if=/dev/zero of=/dev/ceph-860f342f-8a0c-4746-a906-0eb32c9847dc/osd-data-9c48ec80-46f9-4184-ad77-fd4ac6e94126 bs=1M count=10
    --> Only 1 LV left in VG, will proceed to destroy volume group ceph-860f342f-8a0c-4746-a906-0eb32c9847dc
    Running command: /usr/sbin/vgremove -v -f ceph-860f342f-8a0c-4746-a906-0eb32c9847dc
     stderr: Removing ceph--860f342f--8a0c--4746--a906--0eb32c9847dc-osd--data--9c48ec80--46f9--4184--ad77--fd4ac6e94126 (253:1)
     stderr: Archiving volume group "ceph-860f342f-8a0c-4746-a906-0eb32c9847dc" metadata (seqno 21).
        Releasing logical volume "osd-data-9c48ec80-46f9-4184-ad77-fd4ac6e94126"
     stderr: Creating volume group backup "/etc/lvm/backup/ceph-860f342f-8a0c-4746-a906-0eb32c9847dc" (seqno 22).
     stdout: Logical volume "osd-data-9c48ec80-46f9-4184-ad77-fd4ac6e94126" successfully removed
     stderr: Removing physical volume "/dev/vdf" from volume group "ceph-860f342f-8a0c-4746-a906-0eb32c9847dc"
     stdout: Volume group "ceph-860f342f-8a0c-4746-a906-0eb32c9847dc" successfully removed
    Running command: /usr/sbin/wipefs --all /dev/vdf
     stdout: /dev/vdf: 8 bytes were erased at offset 0x00000218 (LVM2_member): 4c 56 4d 32 20 30 30 31
    Running command: /bin/dd if=/dev/zero of=/dev/vdf bs=1M count=10
    --> Zapping successful for: <Raw Device: /dev/vdf>
    0
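
As noted above, the purge problem seems to be just the ordering: once the ceph-osd units are stopped, the zap goes through. A minimal minion-side sketch of that ordering (names are illustrative, not DeepSea's actual purge code):

    def stop_and_zap(osd_id, device):
        '''
        Stop the OSD's systemd unit, then zap its device.

        CLI Example:
            salt '<minion>' ceph_zap.stop_and_zap 3 /dev/vdb
        '''
        # Stop and disable ceph-osd@<id> so the LV can be unmounted cleanly
        # before ceph-volume tries to wipe it.
        __salt__['service.stop']('ceph-osd@{}'.format(osd_id))
        __salt__['service.disable']('ceph-osd@{}'.format(osd_id))

        result = __salt__['cmd.run_all'](
            'ceph-volume lvm zap --destroy {}'.format(device))
        return result['retcode'] == 0
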
smithfarm commented 4 years ago

I can also confirm that this problem doesn't happen with 14.2.4, which would indicate this is just another symptom of the ceph-volume regression that found its way into 14.2.3.

smithfarm commented 4 years ago

. . . and since users on SUSE will not see 14.2.3, there's no reason for DeepSea to do anything special to work around that ceph-volume regression.