CCI-MOC / esi

Elastic Secure Infrastructure project

Investigate Ironic rescue functionality #548

Closed tzumainn closed 2 weeks ago

tzumainn commented 1 month ago

Users may corrupt their nodes and want to salvage whatever they can. Ironic has a rescue feature (https://docs.openstack.org/ironic/latest/admin/rescue.html) that may help with this.

tzumainn commented 1 month ago

Looks like we can get rescue images here: https://docs.openstack.org/ironic/latest/install/deploy-ramdisk.html

I'll test this out when I can!

joachimweyl commented 3 weeks ago

@tzumainn what is the next step for this? Is this still in progress?

tzumainn commented 3 weeks ago

yep, still something I'm working on!

tzumainn commented 3 weeks ago

running into networking issues; talking to the Ironic folks to see if there's a way around them

tzumainn commented 3 weeks ago

The solution for this turns out to be pretty complicated. Because of the delay in Ansible networking, the node is still on the rescue network by the time it has booted the rescue image, so its network interfaces still hold the rescue network IPs. I solved this by creating a new rescue image with a custom change to ironic-python-agent that restarts the network interfaces after a five-minute delay (https://github.com/tzumainn/ironic-python-agent/commit/afff59cc281c3579eac4df7f697f0c31e7fc07dc).
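
To illustrate the idea of a delayed interface restart, here is a minimal Python sketch. It is not the code from the linked commit. The function names (`restart_interfaces`, `schedule_restart`), the injectable `run` parameter, and the use of `ip link set ... down/up` as the "restart" are all assumptions made for illustration. What actually triggers a fresh DHCP lease depends on the network tooling inside the rescue image.

```python
import subprocess
import threading


def restart_interfaces(interfaces, run=subprocess.run):
    """Bring each interface down and back up.

    Hypothetical stand-in for "restart the network interfaces"; the real
    rescue image may need a different command to renew its addresses.
    """
    for name in interfaces:
        run(["ip", "link", "set", name, "down"], check=False)
        run(["ip", "link", "set", name, "up"], check=False)


def schedule_restart(interfaces, delay_seconds=300, run=subprocess.run):
    """Restart the interfaces once, after delay_seconds (default: five minutes).

    The delay gives the network change time to finish, so the interfaces
    come back up on the network the node ends up on.
    Returns the timer so callers can cancel it or wait for it.
    """
    timer = threading.Timer(
        delay_seconds,
        restart_interfaces,
        args=(list(interfaces),),
        kwargs={"run": run},
    )
    timer.daemon = True  # do not block process exit on a pending restart
    timer.start()
    return timer
```

Calling `schedule_restart(["eth0"])` would bounce `eth0` once, five minutes later. Passing a fake `run` callable makes it possible to check the sequence of commands without touching real interfaces.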

The only other change needed is to set the default rescue kernel and ramdisk in ironic.conf.
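
As a rough sketch of that configuration change, the Python snippet below sets the two defaults in an ironic.conf-style file. It assumes the options are `rescue_kernel` and `rescue_ramdisk` in the `[conductor]` section; check the Ironic configuration reference for your release before relying on those names. The helper name `set_rescue_defaults` and the image locations in the usage note are hypothetical.

```python
import configparser
import io


def set_rescue_defaults(conf_text, kernel, ramdisk):
    """Return conf_text with default rescue kernel and ramdisk set.

    Assumes the options live in [conductor] as rescue_kernel and
    rescue_ramdisk. Note that configparser drops comments when it
    rewrites the file, so this is for illustration rather than for
    editing a production ironic.conf in place.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    if not parser.has_section("conductor"):
        parser.add_section("conductor")
    parser.set("conductor", "rescue_kernel", kernel)
    parser.set("conductor", "rescue_ramdisk", ramdisk)
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()
```

For example, `set_rescue_defaults(text, "file:///images/rescue.kernel", "file:///images/rescue.initramfs")` returns the updated configuration text with both values under `[conductor]` and any existing options preserved.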

I still need to formalize this in documentation and updates to esi-pilot.

tzumainn commented 2 weeks ago

Usage documentation: https://github.com/CCI-MOC/esi/pull/563

tzumainn commented 2 weeks ago

Updated playbooks: https://github.com/CCI-MOC/esi-pilot/pull/67