0-complexity / openvcloud

Selfheal disk devicename change preventing ASDs to start automatically #1899

Open FastGeert opened 5 years ago

FastGeert commented 5 years ago

When disks die or are removed, the remaining disks can get a different kernel device name (sda, ...) the next time the machine boots.
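To see which kernel name a given physical disk currently has, one option is to read the stable symlinks that udev maintains under /dev/disk/by-id. The following is a minimal sketch, not part of the asd-manager code; the function name `kernel_names_by_stable_id` is made up for illustration, and it assumes a Linux host where /dev/disk/by-id is populated.

```python
import os


def kernel_names_by_stable_id(by_id_dir='/dev/disk/by-id'):
    """Map each stable id symlink to the kernel name it currently points at."""
    mapping = {}
    for entry in sorted(os.listdir(by_id_dir)):
        path = os.path.join(by_id_dir, entry)
        # Entries in by-id are symlinks such as "wwn-..." -> "../../sda".
        if os.path.islink(path):
            mapping[entry] = os.path.basename(os.path.realpath(path))
    return mapping
```

Comparing the output before and after a reboot shows whether the same stable id now resolves to a different kernel name.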

When this happens, ASDs fail to start and trigger the following health check (HC) errors: image

When this HC is in error, we need to check whether it is caused by a disk device name change:

1. Get a list of backends with ASDs in error: image

2. Get the ASD guid to restart: image

3. Restart the ASD: image

To get the node guid needed for restarting the ASD, list the nodes (https://ovs-be-g8-4.gig.tech/api/alba/nodes/?sort=ip&contents=node_id%2C_relations&discover=false&timestamp=1539180202498) and look up the node's guid using the disk guid.

4. Analyze the result of restarting the ASD

The previous call returns a task guid as its response. Using this task guid, poll for the result with the following call: https://ovs-be-g8-4.gig.tech/api/tasks/cd1e6539-6ee2-42df-9a71-28c50836159c/?timestamp=1539182979530
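The polling could be sketched roughly as below. This is an illustration under assumptions, not the actual selfheal code: `poll_task`, `fetch` and `is_done` are hypothetical names. The HTTP call is injected as `fetch` (a callable returning the response body as text) because authentication and the exact response fields of the tasks endpoint are not described in this issue, and the caller supplies `is_done` to decide when the task has finished.

```python
import json
import time


def poll_task(fetch, task_url, is_done, interval=2.0, timeout=60.0,
              sleep=time.sleep):
    """Fetch task_url until is_done(task) is true or the timeout expires.

    fetch:   callable taking a URL and returning the response body as text
             (hypothetical, supplied by the caller).
    is_done: callable taking the decoded JSON and returning True when the
             task has finished (the field to inspect is an assumption).
    """
    deadline = time.time() + timeout
    while True:
        task = json.loads(fetch(task_url))
        if is_done(task):
            return task
        if time.time() >= deadline:
            raise RuntimeError(
                'task did not finish within {} seconds'.format(timeout))
        sleep(interval)
```

Bounding the poll with a timeout keeps an automated heal from hanging when a task never completes.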

If the response contains "UNIQUE constraint failed: disk.name", as in the response below, then we should run the healing code in step 5: image
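A minimal sketch of that check, assuming the task response is available as raw text. The function name `needs_disk_name_heal` is hypothetical; only the error string comes from this issue.

```python
UNIQUE_DISK_NAME_ERROR = 'UNIQUE constraint failed: disk.name'


def needs_disk_name_heal(task_response_text):
    """Return True when the task response shows the disk.name conflict."""
    return UNIQUE_DISK_NAME_ERROR in task_response_text
```

Matching on the raw text avoids depending on where in the response the error message is reported.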

5. Heal the ASDs with the following piece of Python:

from source.dal.lists.disklist import DiskList

# Rename every stored disk record by appending '_new' to its name
disks = DiskList.get_disks()
for d in disks:
    d.name = '{}_new'.format(d.name)
    d.save()

6. Restart the asd-manager

systemctl restart asd-manager

7. Retrigger the healthcheck

Make sure, though, that we do not end up in an endless loop.
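One way to avoid an endless loop is to cap the number of heal attempts. The sketch below is only an illustration: `heal_with_retry_limit`, `check` and `heal` are hypothetical names, where `check` stands for retriggering the healthcheck (returning True when it passes) and `heal` stands for steps 1 to 6 above.

```python
def heal_with_retry_limit(check, heal, max_attempts=3):
    """Run heal while check fails, at most max_attempts times.

    Returns True if the check eventually passes, False if the attempts
    are exhausted and the check still fails.
    """
    for _ in range(max_attempts):
        if check():
            return True
        heal()
    # Final verdict after the last heal attempt.
    return check()
```

If the function returns False, the error is probably not caused by a device name change and should be escalated instead of healed again.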