Closed blackwood821 closed 12 months ago
In my most recent test I ran 1,442 reprovisions in a row. 478 of them (about 33%) hit issue 2 and were worked around by the mount retry that was already in place. In 7 reprovisions (about 0.5%), that first mount retry did not succeed and the second retry I added did. This shows that the extra retries are necessary, though rarely needed.
Also, OS-8486 might be redundant given https://smartos.org/bugview/TRITON-1441
Superseded by #1078.
Issue 1:
This was happening frequently on a few of our CNs, so I changed the code that unmounts the cores dataset to use the dataset name rather than the mountpoint. This seems to have resolved the issue: I have been unable to recreate it since the code change.
Issue 2:
This happens occasionally. There was already one retry in place from this commit: https://github.com/joyent/smartos-live/commit/b705c35be509cb4f0f65fff615ae226d8a02f3ab. Most of the time that single retry works around the issue, but occasionally additional retries are required. I modified the code to retry up to 10 times, and it always seems to work within the first 3 retries.
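As a rough illustration of the bounded retry described above, here is a minimal sketch. It is not the code from this PR: `retry`, its parameters, and the fake mount operation are hypothetical names made up for the example, and the real change lives in the existing mount path. The same shape would also fit the `zone_state` check in Issue 3.

```javascript
// Hypothetical helper (not from the PR): run a callback-style operation and
// retry it up to maxRetries more times if it fails.
function retry(operation, maxRetries, delayMs, callback) {
    var retries = 0; // number of retries performed so far

    function attempt() {
        operation(function (err, result) {
            if (!err) {
                // Success: also report how many retries were needed.
                callback(null, result, retries);
                return;
            }
            if (retries >= maxRetries) {
                // Retry budget exhausted: surface the last error.
                callback(err, undefined, retries);
                return;
            }
            retries++;
            if (delayMs > 0) {
                setTimeout(attempt, delayMs);
            } else {
                attempt();
            }
        });
    }

    attempt();
}

// Usage with a stand-in for the mount step that fails twice, then succeeds.
var failuresLeft = 2;
function fakeMount(cb) {
    if (failuresLeft > 0) {
        failuresLeft--;
        cb(new Error('mount failed'));
        return;
    }
    cb(null, 'mounted');
}

retry(fakeMount, 10, 0, function (err, result, retries) {
    if (err) {
        console.error('giving up after ' + retries + ' retries: ' + err.message);
        return;
    }
    console.log(result + ' after ' + retries + ' retries');
});
```

Capping the retries at a fixed number keeps a persistent failure from looping forever while still covering the rare cases where one retry is not enough.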
Issue 3:
Occasionally `zone_state` is still `down` rather than `installed` while `state` is `provisioning`. I added up to 10 retries and it seems to always work by the first retry.
Issue 4:
Occasionally the VM GET response from `vminfod` does not include all fields (see https://gist.github.com/blackwood821/3cb90334fc2ead340a91d4fddf9ffb68 for more details), specifically the `zpool` field, which is required to validate the image before reprovisioning. If the `zpool` field is unset, I added a call to `VM.load()` that passes `"loadManually": true`, which forces the call to bypass `vminfod` and seems to fix the issue.
I have tested thousands of reprovisions with these changes and they hold up in our environment.
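The Issue 4 fallback can be sketched roughly as follows. This is an illustration under assumptions, not the code from this PR: `loadWithZpool` is a hypothetical name, and the loader is injected as a parameter with an assumed `(uuid, options, callback)` shape so the example runs on its own. Only `VM.load()` and the `loadManually` option come from the description above.

```javascript
// Hypothetical wrapper (not from the PR). `load` stands in for a
// VM.load()-style function; its (uuid, options, callback) shape is assumed.
function loadWithZpool(load, uuid, callback) {
    load(uuid, {}, function (err, vmobj) {
        if (err) {
            callback(err);
            return;
        }
        if (vmobj && vmobj.zpool) {
            // Normal case: the first response already includes zpool.
            callback(null, vmobj);
            return;
        }
        // zpool is missing: load again, asking the loader to bypass vminfod.
        load(uuid, { loadManually: true }, callback);
    });
}

// Stand-in loader: omits zpool unless loadManually is set.
function fakeLoad(uuid, options, cb) {
    var vmobj = { uuid: uuid, state: 'provisioning' };
    if (options.loadManually) {
        vmobj.zpool = 'zones';
    }
    cb(null, vmobj);
}

loadWithZpool(fakeLoad, 'example-uuid', function (err, vmobj) {
    if (err) {
        console.error('load failed: ' + err.message);
        return;
    }
    console.log('zpool is ' + vmobj.zpool);
});
```

Doing the manual load only when `zpool` is absent keeps the common path on `vminfod` and pays for the slower load only in the occasional incomplete response.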