OS-8486 Occasional issues when reprovisioning zones

blackwood821 commented 3 years ago

Issue 1:

Command failed: cannot mount 'zones/cores/<vm_uuid>': filesystem already mounted

This was happening frequently on a few of our CNs, so I changed the code that unmounts the cores dataset to use the dataset name rather than the mountpoint. This seems to have resolved the issue: I have been unable to reproduce it since the code change.
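A minimal sketch of the idea, not the actual vmadm change: `zfs unmount` accepts either a filesystem (dataset) name or a mountpoint, and the sketch always passes the dataset name. The helper names `cores_dataset`, `unmount_argv`, and `unmount_cores` are hypothetical, and the `<zpool>/cores/<vm_uuid>` layout is taken from the error message above.

```python
import subprocess


def cores_dataset(zpool, vm_uuid):
    """Return the cores dataset name for a VM, e.g. 'zones/cores/<vm_uuid>'."""
    return "{}/cores/{}".format(zpool, vm_uuid)


def unmount_argv(dataset):
    """Build the command line that unmounts by dataset name, not by mountpoint."""
    return ["zfs", "unmount", dataset]


def unmount_cores(zpool, vm_uuid, run=subprocess.run):
    """Unmount the cores dataset; `run` is injectable so this can be tested without zfs."""
    return run(unmount_argv(cores_dataset(zpool, vm_uuid)), check=True)
```

Addressing the dataset by name avoids depending on whatever path the mountpoint property currently resolves to.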

Issue 2:

Command failed: cannot mount 'zones/<vm_uuid>': mountpoint or dataset is busy
clone successfully created, but not mounted

This happens occasionally. There was already 1 retry in place from this commit: https://github.com/joyent/smartos-live/commit/b705c35be509cb4f0f65fff615ae226d8a02f3ab. Most of the time that 1 retry works around the issue, but occasionally additional retries are required. I modified the code to retry up to 10 times, and it always seems to work within the first 3 retries.
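The bounded retry described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the pull request: `mount` is a hypothetical callable that raises `RuntimeError` carrying the zfs error text, and the zero default delay is a placeholder.

```python
import time

BUSY_MESSAGE = "mountpoint or dataset is busy"


def mount_with_retries(mount, max_retries=10, delay_seconds=0.0):
    """Call mount() and retry only on the 'busy' error, up to max_retries times.

    Returns the number of retries that were needed (0 if the first call worked).
    """
    for retries_used in range(max_retries + 1):
        try:
            mount()
            return retries_used
        except RuntimeError as err:
            # Give up on unrelated errors, or once the retry budget is spent.
            if BUSY_MESSAGE not in str(err) or retries_used == max_retries:
                raise
            time.sleep(delay_seconds)


def make_flaky_mount(failures):
    """Hypothetical stand-in for the real mount: fails `failures` times, then succeeds."""
    state = {"left": failures}

    def mount():
        if state["left"] > 0:
            state["left"] -= 1
            raise RuntimeError("cannot mount 'zones/<vm_uuid>': " + BUSY_MESSAGE)

    return mount


retries_needed = mount_with_retries(make_flaky_mount(2))
```

Retrying only when the error text matches keeps unrelated mount failures from being masked by the loop.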

Issue 3:

Error: Cannot start vm from state "provisioning", must be "stopped".

Occasionally zone_state is still "down" rather than "installed" while state is "provisioning". I added up to 10 retries, and it seems to always work by the first retry.
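One way to picture this retry is as a bounded poll on `zone_state`. The sketch below assumes a hypothetical `get_vm` callable that returns the current VM object as a dict; the function name, the delay, and the use of `TimeoutError` are illustrative choices, not taken from the change itself.

```python
import time


def wait_for_zone_state(get_vm, wanted="installed", max_retries=10, delay_seconds=0.0):
    """Re-read the VM until zone_state equals `wanted`.

    Returns (vm, retries_used); raises TimeoutError if the retry budget runs out.
    """
    for retries_used in range(max_retries + 1):
        vm = get_vm()
        if vm.get("zone_state") == wanted:
            return vm, retries_used
        if retries_used < max_retries:
            time.sleep(delay_seconds)
    raise TimeoutError("zone_state never became {!r}".format(wanted))


# Simulated sequence: the first read still reports "down", the second "installed".
_reads = iter([
    {"state": "provisioning", "zone_state": "down"},
    {"state": "provisioning", "zone_state": "installed"},
])
vm, retries_used = wait_for_zone_state(lambda: next(_reads))
```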

Issue 4:

Command failed: imgadm get: error (ImageNotInstalled): image "<image_uuid>" was not found on zpool "undefined"

Occasionally the VM GET response from vminfod does not include all fields (see https://gist.github.com/blackwood821/3cb90334fc2ead340a91d4fddf9ffb68 for more details). In particular, the zpool field, which is required to validate the image before reprovisioning, can be missing. If the zpool field is unset, I added a call to VM.load() that passes "loadManually": true, which forces the call to bypass vminfod. This seems to fix the issue.
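The fallback can be sketched as below. This is only an illustration in Python: `load` is a hypothetical stand-in for VM.load(), and the only detail taken from the description is the `"loadManually": true` option and the check on the zpool field.

```python
def load_vm_with_fallback(load, vm_uuid):
    """Load a VM normally, then reload it manually if the zpool field is missing.

    `load(vm_uuid, options)` is a hypothetical callable returning the VM as a dict.
    """
    vm = load(vm_uuid, {})
    if not vm.get("zpool"):
        # Incomplete response: ask for a manual load that bypasses the cache.
        vm = load(vm_uuid, {"loadManually": True})
    return vm


def fake_load(vm_uuid, options):
    """Simulated loader: the default path omits zpool, the manual path includes it."""
    vm = {"uuid": vm_uuid}
    if options.get("loadManually"):
        vm["zpool"] = "zones"
    return vm


vm = load_vm_with_fallback(fake_load, "example-uuid")
```

The manual load is only attempted when the first response is incomplete, so the common path is unchanged.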

I have tested thousands of reprovisions with these changes and they hold up in our environment.

blackwood821 commented 3 years ago

In my most recent test I ran 1,442 reprovisions in a row. 478 of those reprovisions encountered issue number 2, and the mount retry that was already in place worked around it. There were also 7 reprovisions where the first mount retry did not succeed and the second retry I added did succeed. This shows that the extra retries are necessary, though infrequently.
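For scale, the counts above can be turned into rates. The arithmetic below uses only the three numbers reported in this comment; the variable names are illustrative.

```python
total_reprovisions = 1442
fixed_by_first_retry = 478   # issue 2, resolved by the pre-existing retry
needed_second_retry = 7      # first retry failed, second retry succeeded

first_retry_rate = round(100 * fixed_by_first_retry / total_reprovisions, 1)
second_retry_rate = round(100 * needed_second_retry / total_reprovisions, 2)

print(first_retry_rate)   # 33.1 (percent of all reprovisions)
print(second_retry_rate)  # 0.49 (percent of all reprovisions)
```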

danmcd commented 1 year ago

Also, OS-8486 might be redundant given https://smartos.org/bugview/TRITON-1441

bahamat commented 12 months ago

Superseded by #1078.

TritonDataCenter / smartos-live

OS-8486 Occasional issues when reprovisioning zones #991