cloudfoundry / bosh-azure-cpi-release

BOSH Azure CPI
Apache License 2.0

Refine the retry logic in CPI #443

Closed bingosummer closed 3 years ago

bingosummer commented 6 years ago

The CPI has many retries. Are they all necessary? Some retry timeouts are too long (the values are the ones recommended by the Azure API). Should we reduce the retry timeouts?

ClaudiaBaur commented 4 years ago

If a certain VM size is not available in the Azure DC, the bosh-azure-cpi retries creating the VM 10 times, sleeping 480 seconds after each failure. As a result, the call takes 1h 40 minutes to return. Is there a way to make the number of retries configurable?

See also:

response.body: {"startTime":"2019-12-16T20:51:11.9137023+00:00","endTime":"2019-12-16T20:51:51.429896+00:00","status":"Failed","error":{"code":"ZonalAllocationFailed","message":"Allocation failed. We do not have sufficient capacity for the requested VM size in this zone. Read more about improving likelihood of allocation success at http://aka.ms/allocation-guidance"},"name":"246b7307-0e08-4e60-b2ff-d10c3d1e6af6"}

W, 2019-12-16T20:51:52.212925 #118218 #47386358501200 WARN – [req_id cpi-803040]: check_completion - http_put fails for an AzureAsynInternalError. Will retry after 480 seconds.

I, 2019-12-16T22:33:44.776764 #118218 INFO – [req_id cpi-803040]: Finished create_vm in 6160.84 seconds

Thanks & best, Claudia
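To make the request concrete, here is a minimal sketch of a configurable retry policy. It is not the CPI's code (the CPI is written in Ruby); `RetryPolicy`, `call_with_retry`, and their parameters are hypothetical names chosen for illustration. The defaults mirror the 10 retries and 480-second sleep quoted in the comment above, and `RuntimeError` stands in for whatever error the real code treats as retryable.

```python
import time
from dataclasses import dataclass
from typing import Callable, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class RetryPolicy:
    """Hypothetical settings; defaults mirror the values quoted in the thread."""
    max_retries: int = 10
    sleep_seconds: float = 480.0

    def worst_case_sleep_seconds(self) -> float:
        # Time spent sleeping if every retry is used (excludes request time).
        return self.max_retries * self.sleep_seconds


def call_with_retry(
    operation: Callable[[], T],
    policy: RetryPolicy,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying on RuntimeError up to `policy.max_retries` times.

    RuntimeError is only a stand-in for a retryable error in this sketch.
    """
    retries_used = 0
    while True:
        try:
            return operation()
        except RuntimeError:
            if retries_used >= policy.max_retries:
                raise  # budget exhausted: surface the last error
            retries_used += 1
            sleep(policy.sleep_seconds)


# With the defaults, the sleeps alone add up to 80 minutes.
print(RetryPolicy().worst_case_sleep_seconds() / 60)
```

If the figures in the comment are right, the sleeps account for 4800 of the 6160.84 seconds reported in the log. An operator who prefers to fail fast could pass something like `RetryPolicy(max_retries=2, sleep_seconds=60)` instead.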

jastev commented 4 years ago

We should distinguish a given VM family not being available generally (which can be known up front) from a capacity issue preventing a nominally provisionable request from being satisfied at the moment (which can’t). The former we should detect at request time and abort without retrying, whereas for the latter retries may ultimately succeed (especially in the repave case, where a VM has just been retired).
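The distinction could be sketched roughly as follows. This is an illustration under stated assumptions, not the CPI's implementation: `decide`, `error_code`, the `size_offered` flag (standing in for an up-front availability check), and the `retry_on_capacity` switch are all hypothetical names. The only error code used is `ZonalAllocationFailed`, taken from the log earlier in this thread; which other codes indicate a capacity problem is left open.

```python
import json
from enum import Enum
from typing import Optional


class Action(Enum):
    ABORT = "abort"
    RETRY = "retry"


# Assumption: only the code seen in this thread's log is listed here.
CAPACITY_ERROR_CODES = frozenset({"ZonalAllocationFailed"})


def error_code(response_body: str) -> Optional[str]:
    """Extract error.code from a JSON response body, or None if absent."""
    try:
        body = json.loads(response_body)
    except json.JSONDecodeError:
        return None
    error = body.get("error") if isinstance(body, dict) else None
    return error.get("code") if isinstance(error, dict) else None


def decide(response_body: str, size_offered: bool,
           retry_on_capacity: bool = True) -> Action:
    """Choose between aborting and retrying a failed VM creation.

    size_offered: hypothetical result of an up-front check that the VM
    size is offered at all in the target location.
    """
    if not size_offered:
        # Known up front: retrying cannot help.
        return Action.ABORT
    if error_code(response_body) in CAPACITY_ERROR_CODES and not retry_on_capacity:
        # Transient capacity issue, but the operator opted out of waiting.
        return Action.ABORT
    return Action.RETRY


body = '{"status":"Failed","error":{"code":"ZonalAllocationFailed","message":"Allocation failed."}}'
print(decide(body, size_offered=True))
print(decide(body, size_offered=True, retry_on_capacity=False))
```

The `retry_on_capacity` switch is there because a capacity retry can pay off in the repave case but may still be unwanted when a deployment must fail fast.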


ClaudiaBaur commented 4 years ago

Not retrying in case of an 'out of capacity' issue would also be fine with us. Currently, bosh-azure-cpi blocks everything in that case because the timeout is far too long. If it helps, I can also provide the logs. Thanks, Claudia

bosh-admin-bot commented 3 years ago

This issue was marked as Stale because it has been open for 21 days without any activity. If no activity takes place in the coming 7 days, it will automatically be closed. To prevent this from happening, remove the Stale label or comment below.

bosh-admin-bot commented 3 years ago

This issue was closed because it has been labeled Stale for 7 days without subsequent activity. Feel free to re-open this issue at any time by commenting below.