ceph / ceph-ansible

Ansible playbooks to deploy Ceph, the distributed filesystem.
Apache License 2.0

Replace a failed OSD Drive procedure. #7282

Closed grharry closed 2 years ago

grharry commented 2 years ago

Hello ceph people!

Help required here. It's not clear at all to me how to replace a failed disk drive (Luminous). In other words, I cannot locate a clear procedure or how-to describing the steps for replacing failed drives in a Ceph storage system using Ansible.

Am I the only one with this problem? Regards, Harry.

jeevadotnet commented 2 years ago

Hi @grharry

I use ceph-ansible on an almost weekly basis to replace one of our thousands of drives.

I'm currently running Pacific, but I started the cluster on Luminous before eventually upgrading it; the process is basically the same.

You will have issues if your inventory's disk placement is not configured 100% correctly. In my case I used manual disk placements rather than osd_autodiscovery from osds.yml (since osd_autodiscovery is not recommended).

E.g. one node in your inventory would look like this:

[nodepool02]
B-02-40-cephosd.maas osd_objectstore=bluestore devices="[ '/dev/sda', '/dev/sdb', '/dev/sdc', '/dev/sdd', '/dev/sde', '/dev/sdf', '/dev/sdg', '/dev/sdh', '/dev/sdi', '/dev/sdj', '/dev/sdk', '/dev/sdl', '/dev/sdm', '/dev/sdn', '/dev/sdo', '/dev/sdp', '/dev/sdq', '/dev/sdr', '/dev/sds', '/dev/sdt', '/dev/sdu', '/dev/sdv', '/dev/sdw', '/dev/sdx' ]" dedicated_devices="[ '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1' ]"

You will require a dedicated_devices entry (the RocksDB WAL/DB offload) for every entry in devices, otherwise you will have issues with the playbook. Note, however, that your cluster must have been built like this from the start: if it is not a 1:1 ratio, configure it the way you initially built it, i.e. your inventory must match your nodes and disks 100% as in the current setup.
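For illustration only, a minimal sketch of the same 1:1 layout on a hypothetical four-disk node (host and device names are placeholders, adjust them to your hardware):

[nodepool03]
B-02-41-cephosd.maas osd_objectstore=bluestore devices="[ '/dev/sda', '/dev/sdb', '/dev/sdc', '/dev/sdd' ]" dedicated_devices="[ '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1' ]"

Every entry in devices has a matching entry in dedicated_devices, so all four OSDs place their RocksDB WAL/DB on the same NVMe device.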

If you're sure everything is correct you can just run:

ansible-playbook -i /opt/ceph-ansible/inventory -e 'ansible_python_interpreter=/usr/bin/python3' infrastructure-playbooks/shrink-osd.yml -e osd_to_kill=1592

Here osd_to_kill is the ID of the faulty OSD. This will format the disk and remove the OSD from the CRUSH map and from Ceph completely; no other action is required.
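If you want to double-check that the OSD is really gone before moving on, something like the following should do (using the example OSD ID from the command above):

# the shrunk OSD should no longer appear in the CRUSH tree
ceph osd tree | grep osd.1592
# and the cluster totals should show one OSD fewer
ceph -s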

If the disk is still readable from time to time, i.e. not completely dead, I generally weigh it out manually with ceph osd reweight osd.x 0, which is the equivalent of marking the OSD out. Ceph will still try to read data from the disk while draining it, which makes backfilling quicker. Once it is drained, I run shrink-osd.yml. If the disk is completely dead, just run shrink-osd.yml as above.
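Roughly, that drain-then-shrink sequence could look like this, using the example OSD ID 1592 from above (the df check is just one way to see when the OSD holds no more PGs):

# mark the OSD out (override weight 0) so Ceph starts draining it while it can still serve reads
ceph osd reweight osd.1592 0
# watch recovery progress and wait until the PGS column for OSD 1592 reaches 0
ceph -s
ceph osd df
# then remove it for good
ansible-playbook -i /opt/ceph-ansible/inventory -e 'ansible_python_interpreter=/usr/bin/python3' infrastructure-playbooks/shrink-osd.yml -e osd_to_kill=1592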

If you don't know which OSD is faulty and only have the hard drive serial number from IPMI/iDRAC/iLO, you can use ceph device ls | grep '<serial number>' to get the corresponding OSD, or run ceph-volume inventory from within one of your osd/mds/mon/mgr containers (docker exec -it ... /bin/bash).
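For example, a quick sketch of mapping a serial number to an OSD ID (the serial and container name below are placeholders; container names depend on how your deployment labels them):

# from a node with an admin keyring
ceph device ls | grep 'PHYS_DRIVE_SERIAL'
# or from inside one of the ceph containers on that host
docker exec -it <osd-container> ceph-volume inventory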

Once the disk has been replaced with a new one, you can run ansible-playbook -i /opt/ceph-ansible/inventory -e 'ansible_python_interpreter=/usr/bin/python3' site-container.yml --limit=HOSTwithNEWdisk, where --limit= restricts the run to the Ceph OSD node with the new/replaced disk.

If you run site-container.yml and it completes successfully but you don't see the new (replacement) OSD added to Ceph, I generally SSH to the host with the new disk, run lsblk -p to find the new disk's /dev value, and run dd if=/dev/zero of=/dev/sdX bs=1M to wipe it.
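A sketch of that wipe step, assuming the new disk came up as /dev/sdx (triple-check the device name first; dd is destructive):

# identify the new, empty disk
lsblk -p
# wipe it so the playbook / ceph-volume will pick it up on the next run
dd if=/dev/zero of=/dev/sdx bs=1M
# (alternatively, wipefs -a /dev/sdx or sgdisk --zap-all /dev/sdx clears old partition tables and signatures without zeroing the whole drive)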

Then run ansible-playbook -i /opt/ceph-ansible/inventory -e 'ansible_python_interpreter=/usr/bin/python3' site-container.yml --limit=HOSTwithNEWdisk again.

ansible_python_interpreter may not be required, depending on your environment.

Hope it helps. There is no clear guide; this comes from a couple of years of figuring stuff out, plus assistance from @guits on IRC from time to time.

Update: (I remembered a couple of things)

Take into account that ceph osd reweight osd.x 0 will cause backfilling, and if you run shrink-osd afterwards it will backfill again, since you drained the disk but did not change the CRUSH map. You can change the CRUSH weight manually with ceph osd crush reweight osd.x 0, but I don't.
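For reference, the two variants side by side (osd.x stands for the failing OSD):

# option described above: zero only the override weight, accept a second backfill after shrink-osd
ceph osd reweight osd.x 0
# alternative: also zero the CRUSH weight so shrink-osd does not trigger another backfill
ceph osd crush reweight osd.x 0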

You can control the number of concurrent backfills with ceph tell osd.* injectargs '--osd_max_backfills 1'. Increasing it will increase the number of PG backfills per OSD, but it puts more strain on the disks, and you might get slow PGs if you take it up too high (high on magnetic disks, for me, is anything over 7). From experience, my magnetic disks start to fall over / crash if it goes higher than 8.

For recovery you can use ceph tell osd.* injectargs '--osd_recovery_max_active 1' and ceph tell osd.* injectargs '--osd_recovery_op_priority 1' (I think this value goes up to 254).
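Putting those throttles together, a sketch with example values (sensible ranges depend on your release and hardware, so treat the numbers as placeholders):

# more concurrent backfills per OSD = faster recovery, more disk strain
ceph tell osd.* injectargs '--osd_max_backfills 2'
ceph tell osd.* injectargs '--osd_recovery_max_active 2'
ceph tell osd.* injectargs '--osd_recovery_op_priority 1'
# drop back to conservative values once the cluster is healthy again
ceph tell osd.* injectargs '--osd_max_backfills 1'
ceph tell osd.* injectargs '--osd_recovery_max_active 1'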

In case you accidentally set it too high and want to revert, a quick way is to run the upmap script from https://gitlab.cern.ch/ceph/ceph-scripts/-/blob/master/tools/upmap/upmap-remapped.py with upmap-remapped.py | sh.
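A sketch of how that script could be run (it prints ceph CLI commands on stdout, so it is worth reviewing the output before piping it to a shell; the raw URL below is assumed from the blob link above):

# fetch the script
curl -O https://gitlab.cern.ch/ceph/ceph-scripts/-/raw/master/tools/upmap/upmap-remapped.py
chmod +x upmap-remapped.py
# dry run: inspect the generated commands
./upmap-remapped.py | less
# apply
./upmap-remapped.py | sh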

In fact, I always run this script after adding a new disk or "resetting my backfills", so that it brings the PGs back to 100% active+clean and gradually weighs in the new disk.

Also use ceph balancer on and ceph balancer status.

grharry commented 2 years ago

Wow !!! At LAST! Thank you so much !!!! I owe U at least a BEER !!! Regards, Harry!

guits commented 2 years ago

E.g. one node in your inventory would look like this:


[nodepool02]
B-02-40-cephosd.maas osd_objectstore=bluestore devices="[ '/dev/sda', '/dev/sdb', '/dev/sdc', '/dev/sdd', '/dev/sde', '/dev/sdf', '/dev/sdg', '/dev/sdh', '/dev/sdi', '/dev/sdj', '/dev/sdk', '/dev/sdl', '/dev/sdm', '/dev/sdn', '/dev/sdo', '/dev/sdp', '/dev/sdq', '/dev/sdr', '/dev/sds', '/dev/sdt', '/dev/sdu', '/dev/sdv', '/dev/sdw', '/dev/sdx' ]" dedicated_devices="[ '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1' ]"

@jeevadotnet I think having as many dedicated_devices as devices is no longer a requirement. This 1:1 relation disappeared after we dropped ceph-disk support (stable-4.0). If you take a look at this task in the role ceph-facts you will see that we use the filter | unique

https://github.com/ceph/ceph-ansible/blob/f288364c5c268e61d49165704900e8e01ca643c8/roles/ceph-facts/tasks/devices.yml#L45-L50

guits commented 2 years ago

@jeevadotnet very useful feedback. I'm considering writing documentation out of it 🙂 Thanks!

jeevadotnet commented 2 years ago

E.g. one node in your inventory would look like this:

[nodepool02]
B-02-40-cephosd.maas osd_objectstore=bluestore devices="[ '/dev/sda', '/dev/sdb', '/dev/sdc', '/dev/sdd', '/dev/sde', '/dev/sdf', '/dev/sdg', '/dev/sdh', '/dev/sdi', '/dev/sdj', '/dev/sdk', '/dev/sdl', '/dev/sdm', '/dev/sdn', '/dev/sdo', '/dev/sdp', '/dev/sdq', '/dev/sdr', '/dev/sds', '/dev/sdt', '/dev/sdu', '/dev/sdv', '/dev/sdw', '/dev/sdx' ]" dedicated_devices="[ '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1', '/dev/nvme0n1' ]"

@jeevadotnet I think having as many dedicated_devices as devices is no longer a requirement. This 1:1 relation disappeared after we dropped ceph-disk support (stable-4.0). If you take a look at this task in the role ceph-facts you will see that we use the filter | unique

https://github.com/ceph/ceph-ansible/blob/f288364c5c268e61d49165704900e8e01ca643c8/roles/ceph-facts/tasks/devices.yml#L45-L50

Will try it today when rebuilding the testbed with your recommendation as per #7283. Maybe I did it wrong previously, but I tested Luminous and Octopus with the many:1 relation and it only created a partition on my /dev/sda.

@jeevadotnet very useful feedback. I'm considering writing documentation out of it 🙂 Thanks!

Haha, pleasure. You're always great at helping out here and teaching me the 'way of the ceph', so over time I have learned enough to be able to reply to other people's issues.

grharry commented 2 years ago

Hello again! I need some clarification. On my Ceph installation (Luminous) I've got 4 OSDs DOWN (waiting for replacement drives),
and my current PG status shows 5 pgs inactive, 3 pgs down, 2 pgs peering, 12 pgs stale.

Querying the down PGs, they seem to be stuck because of the dead OSDs.

Do I proceed with shrink-osd.yml -e osd_to_kill=xxx first, or with ceph pg force_create_pg using the IDs of the down PGs? Thanks again for your help! Harry.

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 2 years ago

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.