cloudfoundry / bosh-alicloud-cpi-release

BOSH Alibaba CPI
Apache License 2.0
32 stars 20 forks source link

bosh did not cleanup VM resources on failure #74

Closed OlegGerber closed 3 years ago

OlegGerber commented 5 years ago

Hi, It looks like bosh did not cleanup the resources on create_vm failures. create-vm failed on timeout + failed delete, and second attempt to deploy the same VM failed on "private IP address is duplicated" error. CPI Release Version: v22

Please note I removed the password and keys from the logs/ First attempt: {code} Deploying:

Creating instance 'bosh/0':

Creating VM:

  Creating vm with stemcell cid 'm-uf6b6vtpujfo9zq2104i':

    CPI 'create_vm' method responded with error: CmdError{"type":"Bosh::Clouds::CloudError","message":"input=`{\"method\": \"create_vm\",\"arguments\": [\"dc87ab78-3ba7-4b92-5f30-77bf31aa7c85\",\"m-uf6b6vtpujfo9zq2104i\",{\"availability_zone\": \"cn-shanghai-e\",\"ephemeral_disk\": {\"category\": \"cloud_efficiency\",\"encrypted\": true,\"size\": \"51_200\"},\"instance_type\": \"ecs.c5.xlarge\",\"key_pair_name\": \"bosh-ssh-cf-canary2019050202420793370000000d\",\"system_disk\": {\"category\": \"cloud_efficiency\",\"encrypted\": true,\"size\": \"51_200\"}},{\"private\": {\"cloud_properties\": {\"security_group_ids\": [\"sg-uf6866edq67pmgnxkaz7\"],\"vswitch_id\": \"vsw-uf6jkcjaf400oeb5rgsx7\"},\"default\": [\"dns\",\"gateway\"],\"dns\": [\"100.100.2.136\",\"100.100.2.138\"],\"gateway\": \"10.0.3.1\",\"ip\": \"10.0.3.12\",\"netmask\": \"255.255.255.0\",\"type\": \"manual\"}},[],{\"bosh\": {\"mbus\": {\"cert\": {\"ca\": \"xxx\"}}],\"context\": {\"director_uuid\": \"305e8ff0-2651-488f-4c6c-a5fc711189b4\",\"vm\": {\"stemcell\": {\"api_version\": 2}}},\"api_version\": 1}` update i-uf61h0f16m81g4i1p7q9 failed delete failed [SDK.TimeoutError] The request timed out 6 times(6 for retry), perhaps we should have the threshold raised a little?\ncaused by:\nPost https://ecs-cn-hangzhou.aliyuncs.com/?AccessKeyId=Action=DescribeInstanceAttribute\u0026Format=JSON\u0026InstanceId=i-uf61h0f16m81g4i1p7q9\u0026RegionId=cn-shanghai\u0026SecurityToken=\u0026Signature=xxx!D(MISSING)\u0026SignatureMethod=HMAC-SHA1\u0026SignatureNonce=0c63d8aa681c45c18547ac9213b8cf0b\u0026SignatureType=\u0026SignatureVersion=1.0\u0026Timestamp=2019-05-03T09%!A(MISSING)46%!A(MISSING)38Z\u0026Version=2014-05-26: net/http: request canceled (Client.Timeout exceeded while awaiting headers): input=`{\"method\": \"create_vm\",\"arguments\": [\"dc87ab78-3ba7-4b92-5f30-77bf31aa7c85\",\"m-uf6b6vtpujfo9zq2104i\",{\"availability_zone\": \"cn-shanghai-e\",\"ephemeral_disk\": {\"category\": \"cloud_efficiency\",\"encrypted\": true,\"size\": \"51_200\"},\"instance_type\": \"ecs.c5.xlarge\",\"key_pair_name\": \"bosh-ssh-cf-canary2019050202420793370000000d\",\"system_disk\": {\"category\": \"cloud_efficiency\",\"encrypted\": true,\"size\": \"51_200\"}},{\"private\": {\"cloud_properties\": {\"security_group_ids\": [\"sg-uf6866edq67pmgnxkaz7\"],\"vswitch_id\": \"vsw-uf6jkcjaf400oeb5rgsx7\"},\"default\": [\"dns\",\"gateway\"],\"dns\": [\"100.100.2.136\",\"100.100.2.138\"],\"gateway\": \"10.0.3.1\",\"ip\": \"10.0.3.12\",\"netmask\": \"255.255.255.0\",\"type\": \"manual\"}},[],{\"bosh\": {\"mbus\": {\"cert\\"$xxx/\"}}],\"context\": {\"director_uuid\": \"305e8ff0-2651-488f-4c6c-a5fc711189b4\",\"vm\": {\"stemcell\": {\"api_version\": 2}}},\"api_version\": 1}` change i-uf61h0f16m81g4i1p7q9 to Running failed: [SDK.TimeoutError] The request timed out 6 times(6 for retry), perhaps we should have the threshold raised a little?\ncaused by:\nPost https://ecs-cn-hangzhou.aliyuncs.com/?AccessKeyId=\uAction=DescribeInstanceAttribute\u0026Format=JSON\u0026InstanceId=i-uf61h0f16m81g4i1p7q9\u0026RegionId=cn-shanghai\u0026SecurityToken=\u0026Signature=xxS6vbswq%!B(MISSING)0lQOvunCsyi2he2Go%!D(MISSING)\u0026SignatureMethod=HMAC-SHA1\u0026SignatureNonce=6ce25ba1710b45109e763deacd60695b\u0026SignatureType=\u0026SignatureVersion=1.0\u0026Timestamp=2019-05-03T09%!A(MISSING)40%!A(MISSING)38Z\u0026Version=2014-05-26: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)","ok_to_retry":false}

Exit code 1 {code}

Second attempt to deploy the VM: {code}

Uploading stemcell 'bosh-alicloud-kvm-ubuntu-xenial-go_agent/250.23'... Skipped [Stemcell already uploaded] (00:00:00)

Started deploying

Creating VM for instance 'bosh/0' from stemcell 'm-uf6b6vtpujfo9zq2104i'... Failed (00:03:47)

Failed deploying (00:03:47)

Stopping registry... Finished (00:00:00)

Cleaning up rendered CPI jobs... Finished (00:00:00)

Deploying:

Creating instance 'bosh/0':

Creating VM:

  Creating vm with stemcell cid 'm-uf6b6vtpujfo9zq2104i':

    CPI 'create_vm' method responded with error: CmdError{"type":"Bosh::Clouds::CloudError","message":"input=`{\"method\": \"create_vm\",\"arguments\": [\"23f2cace-a690-4a9c-7063-f710ed3fa596\",\"m-uf6b6vtpujfo9zq2104i\",{\"availability_zone\": \"cn-shanghai-e\",\"ephemeral_disk\": {\"category\": \"cloud_efficiency\",\"encrypted\": true,\"size\": \"51_200\"},\"instance_type\": \"ecs.c5.xlarge\",\"key_pair_name\": \"bosh-ssh-cf-canary2019050202420793370000000d\",\"system_disk\": {\"category\": \"cloud_efficiency\",\"encrypted\": true,\"size\": \"51_200\"}},{\"private\": {\"cloud_properties\": {\"security_group_ids\": [\"sg-uf6866edq67pmgnxkaz7\"],\"vswitch_id\": \"vsw-uf6jkcjaf400oeb5rgsx7\"},\"default\": [\"dns\",\"gateway\"],\"dns\": [\"100.100.2.136\",\"100.100.2.138\"],\"gateway\": \"10.0.3.1\",\"ip\": \"10.0.3.12\",\"netmask\": \"255.255.255.0\",\"type\": \"manual\"}},[],{\"bosh\": {\"mbus\": {\"cert\": {\"ca\": \"}}],\"context\": {\"director_uuid\": \"305e8ff0-2651-488f-4c6c-a5fc711189b4\",\"vm\": {\"stemcell\": {\"api_version\": 2}}},\"api_version\": 1}` create instance failed with input={\"Scheme\":\"HTTPS\",\"Method\":\"POST\",\"Domain\":\"ecs-cn-hangzhou.aliyuncs.com\",\"Port\":\"\",\"RegionId\":\"\",\"AcceptFormat\":\"JSON\",\"QueryParams\":{\"AccessKeyId\":\"\",\"Action\":\"CreateInstance\",\"ClientToken\":\"bosh-cpi-1556877860501762803-4e75e098-b524-41bc-b491-b32d1588e63\",\"DataDisk.1.Category\":\"cloud_efficiency\",\"DataDisk.1.DeleteWithInstance\":\"true\",\"DataDisk.1.Encrypted\":\"true\",\"DataDisk.1.Size\":\"50\",\"Format\":\"JSON\",\"ImageId\":\"m-uf6b6vtpujfo9zq2104i\",\"InstanceChargeType\":\"PostPaid\",\"InstanceType\":\"ecs.c5.xlarge\",\"IoOptimized\":\"optimized\",\"KeyPairName\":\"bosh-ssh-cf-canary2019050202420793370000000d\",\"PrivateIpAddress\":\"10.0.3.12\",\"RegionId\":\"cn-shanghai\",\"SecurityGroupId\":\"sg-uf6866edq67pmgnxkaz7\",\"SecurityToken\":\"\",\"Signature\":\"psCnr1vR1cxyZ/pLIY73196AZ+w=\",\"SignatureMethod\":\"HMAC-SHA1\",\"SignatureNonce\":\"4ee409dbcc8643b0948b1733382a4945\",\"SignatureType\":\"\",\"SignatureVersion\":\"1.0\",\"SpotPriceLimit\":\"0.000000\",\"SystemDisk.Category\":\"cloud_efficiency\",\"SystemDisk.Size\":\"50\",\"Timestamp\":\"2019-05-03T10:08:06Z\",\"UserData\":\"eyJyZWdpc3RyeSI6eyJlbmRwb2ludCI6Imh0dHA6Ly9yZWdpc3RyeTplZWRlNmZmZC1hMWM0LTQzMjMtNzZhOS04MDY5NDg2Yjk2YTFAMTI3LjAuMC4xOjY5MDEifX0=\",\"VSwitchId\":\"vsw-uf6jkcjaf400oeb5rgsx7\",\"Version\":\"2014-05-26\",\"ZoneId\":\"cn-shanghai-e\"},\"Headers\":{\"Accept-Encoding\":\"identity\",\"Content-Type\":\"application/x-www-form-urlencoded\",\"x-sdk-client\":\"golang/1.0.0\",\"x-sdk-core-version\":\"0.0.1\",\"x-sdk-invoke-type\":\"normal\"},\"FormParams\":{},\"Content\":\"\",\"ResourceOwnerId\":\"\",\"HpcClusterId\":\"\",\"SecurityEnhancementStrategy\":\"\",\"KeyPairName\":\"bosh-ssh-cf-canary2019050202420793370000000d\",\"SpotPriceLimit\":\"0.000000\",\"DeletionProtection\":\"\",\"ResourceGroupId\":\"\",\"HostName\":\"\",\"Password\":\"\",\"Tag\":null,\"AutoRenewPeriod\":\"\",\"NodeControllerId\":\"\",\"Period\":\"\",\"DryRun\":\"\",\"OwnerId\":\"\",\"VSwitchId\":\"vsw-uf6jkcjaf400oeb5rgsx7\",\"PrivateIpAddress\":\"10.0.3.12\",\"SpotStrategy\":\"\",\"PeriodUnit\":\"\",\"InstanceName\":\"\",\"AutoRenew\":\"\",\"InternetChargeType\":\"\",\"ZoneId\":\"cn-shanghai-e\",\"InternetMaxBandwidthIn\":\"\",\"UseAdditionalService\":\"\",\"ImageId\":\"m-uf6b6vtpujfo9zq2104i\",\"ClientToken\":\"bosh-cpi-1556877860501762803-4e75e098-b524-41bc-b491-b32d1588e63\",\"VlanId\":\"\",\"SpotInterruptionBehavior\":\"\",\"IoOptimized\":\"optimized\",\"SecurityGroupId\":\"sg-uf6866edq67pmgnxkaz7\",\"InternetMaxBandwidthOut\":\"\",\"Description\":\"\",\"SystemDiskCategory\":\"cloud_efficiency\",\"UserData\":\"eyJyZWdpc3RyeSI6eyJlbmRwb2ludCI6Imh0dHA6Ly9yZWdpc3RyeTplZWRlNmZmZC1hMWM0LTQzMjMtNzZhOS04MDY5NDg2Yjk2YTFAMTI3LjAuMC4xOjY5MDEifX0=\",\"PasswordInherit\":\"\",\"InstanceType\":\"ecs.c5.xlarge\",\"InstanceChargeType\":\"PostPaid\",\"DeploymentSetId\":\"\",\"InnerIpAddress\":\"\",\"ResourceOwnerAccount\":\"\",\"OwnerAccount\":\"\",\"SystemDiskDiskName\":\"\",\"RamRoleName\":\"\",\"DedicatedHostId\":\"\",\"ClusterId\":\"\",\"CreditSpecification\":\"\",\"DataDisk\":[{\"DiskName\":\"\",\"SnapshotId\":\"\",\"Size\":\"50\",\"Encrypted\":\"true\",\"Description\":\"\",\"Category\":\"cloud_efficiency\",\"Device\":\"\",\"DeleteWithInstance\":\"true\"}],\"SystemDiskSize\":\"50\",\"SystemDiskDescription\":\"\"} : over max retry: SDK.ServerError\nErrorCode: InvalidPrivateIpAddress.Duplicated\nRecommend: \nRequestId: 679E0D32-F871-45D8-9D9C-8470CA50751E\nMessage: Specified private IP address is duplicated.","ok_to_retry":false}

Exit code 1

{code}

xiaozhu36 commented 5 years ago

HI @OlegGerber This issue came from network connection failed leads deleting vm failed. I will push the product team to accelerate the specified request domain to avoid the network connection. In addition, I will provide a parameter to remove the left vms when deploying a new bosh director. Does it can meet you?

alexwo commented 5 years ago

Hi @xiaozhu36,

Can you share more insights on the connectivity issue that you see? what is the "specified request domain" that need to be accelerated?

Thanks, Alex

xiaozhu36 commented 5 years ago

When you access ecs or other services, the CPI will get an endpoint based on the request region. When region is cn-shanghai, its endpoint is ecs-cn-hangzhou.aliyuncs.com, and this endpoint will be accelerated when your local host is not in China mainland to avoid some needless network issues.

alexwo commented 5 years ago

Hi @xiaozhu36, We are using hosts located in China main land only to access the "ecs-cn-hangzhou.aliyuncs.com" endpoint. maybe the reason for the error is not network in this case?

xiaozhu36 commented 5 years ago

The local host is not your computer, but the machine that sent the CPI request

alexwo commented 5 years ago

yes but we run the bosh director on a VM deployed in China main land.

xiaozhu36 commented 5 years ago

Can you ping the endpoint ecs-cn-hangzhou.aliyuncs.com in your director? like #76

OlegGerber commented 5 years ago

Hi @xiaozhu36, See the details below: bosh/eff5:~$ nslookup ecs-cn-hangzhou.aliyuncs.com Server: 100.100.2.136 Address: 100.100.2.136#53

ecs-cn-hangzhou.aliyuncs.com canonical name = popunify-public.aliyuncs.com. popunify-public.aliyuncs.com canonical name = popunify-public.aliyuncs.com.gds.alibabadns.com. Name: popunify-public.aliyuncs.com.gds.alibabadns.com Address: 106.11.61.112

ping ecs-cn-hangzhou.aliyuncs.com

PING popunify-public.aliyuncs.com.gds.alibabadns.com (106.11.61.112) 56(84) bytes of data. 64 bytes from 106.11.61.112: icmp_seq=1 ttl=49 time=3.09 ms 64 bytes from 106.11.61.112: icmp_seq=2 ttl=49 time=3.04 ms

lemondlut commented 5 years ago

Hi OlegGerber, Could you provide a test case so that we can reproduce the issue, please? It is difficult for us to reproduce the issue, appreciate your kindly help.

OlegGerber commented 5 years ago

Hi @lemondlut, This issue is not reproduced on every bosh deployment but we see it frequently, almost every day. Please see the error to understand the flow:

  1. The bosh instance update or deployment from scratch fails with timeout issue:
    
    Task 1780

Task 1780 | 04:31:02 | Preparing deployment: Preparing deployment (00:00:01) Task 1780 | 04:31:03 | Preparing deployment: Rendering templates (00:00:01) Task 1780 | 04:31:04 | Preparing package compilation: Finding packages to compile (00:00:00) Task 1780 | 04:31:04 | Updating instance postgresql: postgresql/a372be1a-bf03-4b6f-958e-316d41d10ef5 (0) (canary) (00:11:44) L Error: CPI error 'Bosh::Clouds::CloudError' with message 'input={"method": "create_vm","arguments": ["025ccff7-c09f-4258-99db-ef746bf23fc0","m-uf6216ctez1mji005iyz",{"availability_zone": "cn-shanghai-f","ephemeral_disk": {"encrypted": true,"size": 20480},"instance_type": "ecs.g5.4xlarge"},{"sf_bosh_services": {"type": "manual","ip": "10.11.104.248","netmask": "255.255.192.0","cloud_properties": {"security_group_ids": ["sg-uf62qvrx5zsy47bfccy3","sg-uf6e1i35512nblhamgvi"],"vswitch_id": "vsw-uf6rudkz8c8d03hfanm5a"},"default": ["dns","gateway"],"dns": ["100.100.2.136","100.100.2.138"],"gateway": "10.11.64.1"}},["d-uf6erxni5w5ce6wl9tsm"],{"bosh": {"blobstores": [{"options": {"endpoint": "https://10.0.3.21:25250","password": "","tls": {"cert": {"ca": "-----BEGIN CERTIFICATE----- ----END RSA PRIVATE KEY-----\n"}},"group": "bosh-sf-service-fabrik-1279-ccbfe480-9d34-4909-9c08-f8a3041d8b9d-postgresql","groups": ["bosh-sf","service-fabrik-1279-ccbfe480-9d34-4909-9c08-f8a3041d8b9d","postgresql","bosh-sf-service-fabrik-1279-ccbfe480-9d34-4909-9c08-f8a3041d8b9d","service-fabrik-1279-ccbfe480-9d34-4909-9c08-f8a3041d8b9d-postgresql","bosh-sf--service-fabrik-1279-ccbfe480-9d34-4909-9c08-f8a3041d8b9d-postgresql"]}}],"context": {"director_uuid": "6fc5522b-8dba-4684-986c-e2c2082f2b8c","request_id": "cpi-102736","vm": {"stemcell": {"api_version": 2}}},"api_version": 1} change i-uf6e6nqcgk7ddziglw48 to Running failed: change instance i-uf6e6nqcgk7ddziglw48 to Running timeout' in 'create_vm' CPI method (CPI request ID: 'cpi-102736') .....

Task 1780 Started Mon Jul 1 04:31:02 UTC 2019 Task 1780 Finished Mon Jul 1 04:42:48 UTC 2019 Task 1780 Duration 00:11:46 Task 1780 error

Capturing task '1780' output: Expected task '1780' to succeed but state is 'error'

Exit code 1


2. The second attempt for the same bosh deployment fails with the "ErrorCode: InvalidPrivateIpAddress.Duplicated", which means the VM from the last deployment not deleted from the landscape, despite the fact bosh task failed.

Task 1813

Task 1813 | 05:03:21 | Preparing deployment: Preparing deployment (00:00:01) Task 1813 | 05:03:22 | Preparing deployment: Rendering templates (00:00:01) Task 1813 | 05:03:23 | Preparing package compilation: Finding packages to compile (00:00:00) Task 1813 | 05:03:23 | Creating missing vms: postgresql/a372be1a-bf03-4b6f-958e-316d41d10ef5 (0) (00:03:36) L Error: CPI error 'Bosh::Clouds::CloudError' with message 'input=`{"method": "create_vm","arguments": ["16922489-4467-4fad-ab1a-dad39a49d9b0","m-uf6216ctez1mji005iyz",{"availability_zone": "cn-shanghai-f","ephemeral_disk": {"encrypted": true,"size": 20480},"instance_type": "ecs.g5.4xlarge"},{"sf_bosh_services": {"type": "manual","ip": "10.11.104.248","netmask": "255.255.192.0","cloud_properties": {"security_group_ids": ["sg-uf62qvrx5zsy47bfccy3","sg-uf6e1i35512nblhamgvi"],"vswitch_id": "vsw-uf6rudkz8c8d03hfanm5a"},"default": ["dns","gateway"],"dns": ["100.100.2.136","100.100.2.138"],"gateway": "10.11.64.1"}},["d-uf6erxni5w5ce6wl9tsm"],{"bosh": {"blobstores": [{"options": {"endpoint": "https://10.0.3.21:25250","password": ""," ..... "ResourceOwnerAccount":"","OwnerAccount":"","SystemDiskDiskName":"","RamRoleName":"","DedicatedHostId":"","ClusterId":"","CreditSpecification":"","DataDisk":[{"DiskName":"","SnapshotId":"","Size":"20","Encrypted":"true","Description":"","Category":"cloud_efficiency","KMSKeyId":"","Device":"","DeleteWithInstance":"true"}],"SystemDiskSize":"40","SystemDiskDescription":""} : over max retry: SDK.ServerError ErrorCode: InvalidPrivateIpAddress.Duplicated Recommend: RequestId: DEF2F3D9-010C-4F8F-880C-5C1BA12A20CA Message: Specified private IP address is duplicated.' in 'create_vm' CPI method (CPI request ID: 'cpi-289838') ...

Task 1813 Started Mon Jul 1 05:03:21 UTC 2019 Task 1813 Finished Mon Jul 1 05:06:59 UTC 2019 Task 1813 Duration 00:03:38 Task 1813 error

Capturing task '1813' output: Expected task '1813' to succeed but state is 'error'

Exit code 1

xiaozhu36 commented 5 years ago

HI @OlegGerber Although, the bosh-alicloud- cpi has added rollback when running create vm failed, it can not avoid connection timeout. Now, we have configured global acceleration service for the endpoint ecs-cn-hangzhou.aliyuncs.com and it will avoid the connection timeout error when you access China mainland region service. Besides, I enhanced the rollback and it will retry 10 times when rollback failed. Now, it has been supported by the latest version 24.0.0. Please check it. Thanks.

OlegGerber commented 5 years ago

HI @xiaozhu36 , It seems to be the issue with "InvalidIPAddress.AlreadyUsed" still reproducing on our landscapes during the component update. However, the error message now is: "Message: The specified IPAddress is already used by other resource.' in 'create_vm' CPI method (CPI request ID: 'cpi-465242')"

Task 17061 | 19:40:56 | Compiling packages: elasticsearch/6afebaf8a3637d6e2aae6db61aafaa4abd7655bb28a1c9a25ec9118cb245562e (00:01:43)
Task 17061 | 19:41:53 | Compiling packages: python-3.6/976f472c218622aca411c953bc97d9d47665fd8f (00:02:49)
Task 17061 | 19:41:53 | Compiling packages: curator/29f3fa707032321547ccd95b58846d630eb3ff4bf80de879605796827d602f21 (00:00:10)
Task 17061 | 19:42:22 | Updating instance elasticsearch_master: elasticsearch_master/5133fda1-5ee1-4f65-a852-4d5eba9bbd0a (0) (canary) (00:03:00)
Task 17061 | 19:45:22 | Updating instance elasticsearch_master: elasticsearch_master/d1f98524-bfd1-4c87-894d-b2bd9e72f836 (1) (00:03:13)
Task 17061 | 19:48:35 | Updating instance elasticsearch_master: elasticsearch_master/bfb87525-35ab-451d-a248-b9e8b054eb00 (2) (00:03:26)
Task 17061 | 19:52:01 | Updating instance elasticsearch_data: elasticsearch_data/f4d83e89-6eb2-49f0-a689-54d7166cb5a0 (0) (canary) (00:04:31)
Task 17061 | 19:56:32 | Updating instance elasticsearch_data: elasticsearch_data/05c08b43-65fb-49a7-81b8-3b73e878e992 (1) (00:05:01)
Task 17061 | 20:01:34 | Updating instance zookeeper: zookeeper/3278e6cf-464f-44a3-a553-ef982116db4b (0) (canary) (00:02:45)
Task 17061 | 20:04:19 | Updating instance zookeeper: zookeeper/29eee6d1-f263-4296-adea-e4ebda51d32d (1) (00:04:03)
                     L Error: CPI error 'Bosh::Clouds::CloudError' with message 'input=`{"method": "create_vm","arguments": ["648af04d-48f0-4a0c-b4a3-51464463a962","m-uf6ff6n8gn1fqfg1i2fu",{"availability_zone": "cn-shanghai-f","ephemeral_disk": {"encrypted": true,"size": 20480},"instance_type": "ecs.c5.large"},{"logsearch": {"type": "manual","ip": "10.2.4.14","netmask": "255.255.0.0","cloud_properties": {"security_group_ids": ["sg-uf6076x9fu1dy4dzka87","sg-uf637q6lqzqr22b9tfw0","sg-uf6fuqggczshqf8tx4nq","sg-uf6aixmrq1fmw1ocjdsy","sg-uf65ns0580xqvcvk21jy","sg-uf6825diaozs5g81gqzm","sg-uf67m2mj8kzatnko0brf"],"vswitch_id": "vsw-uf6wf1cqwuseos3x1dpzd"},"default": ["dns","gateway"],"dns": ["100.100.2.136","100.100.2.138"],"gateway": "10.2.0.1"}},["d-uf60jpji0rlh06c6bvw8"],{"bosh": {"blobstores": [{"options": {"endpoint": "https://10.0.3.11:25250","password": \n"}},"group": "bosh-cf-cn40-staging-logsearch-zookeeper","groups": ["bosh-cf-cn40-staging","logsearch","zookeeper","bosh-cf-cn40-staging-logsearch","logsearch-zookeeper","bosh-cf-cn40-staging-logsearch-zookeeper"]}}],"context": {"director_uuid": "113d6236-e1e5-440d-88cc-f3b9824a934d","request_id": "cpi-465242","vm": {"stemcell": {"api_version": 2}}},"api_version": 1}` create instance failed with input={"Scheme":"HTTPS","Method":"POST","Domain":"ecs-cn-hangzhou.aliyuncs.com","Port":"","RegionId":"","ReadTimeout":0,"ConnectTimeout":0,"AcceptFormat":"JSON","QueryParams":{"AccessKeyId":"","Action":"CreateInstance","ClientToken":"Bosh-Cpi-CreateInstance-1564689889-284eb0e6-12c8-4b57-9c70-7e8b2","DataDisk.1.Category":"cloud_efficiency","DataDisk.1.DeleteWithInstance":"true","DataDisk.1.Encrypted":"true","DataDisk.1.Size":"20","Format":"JSON","ImageId":"m-uf6ff6n8gn1fqfg1i2fu","InstanceChargeType":"PostPaid","InstanceType":"ecs.c5.large","IoOptimized":"optimized","PrivateIpAddress":"10.2.4.14","RegionId":"cn-shanghai","SecurityGroupId":"sg-uf6076x9fu1dy4dzka87","SecurityToken":"","Signature":"=","SignatureMethod":"HMAC-SHA1","SignatureNonce":"945d320b331d47c19b68ff14a365c219","SignatureType":"","SignatureVersion":"1.0","SpotPriceLimit":"0.000000","SystemDisk.Category":"cloud_efficiency","SystemDisk.Size":"40","Timestamp":"2019-08-01T20:08:21Z","UserData":"=","VSwitchId":"vsw-uf6wf1cqwuseos3x1dpzd","Version":"2014-05-26","ZoneId":"cn-shanghai-f"},"Headers":{"Accept-Encoding":"identity","Content-Type":"application/x-www-form-urlencoded","x-sdk-client":"golang/1.0.0","x-sdk-core-version":"0.0.1","x-sdk-invoke-type":"normal"},"FormParams":{},"Content":"","ResourceOwnerId":"","HpcClusterId":"","SecurityEnhancementStrategy":"","KeyPairName":"","SpotPriceLimit":"0.000000","DeletionProtection":"","ResourceGroupId":"","HostName":"","Password":"","Tag":null,"AutoRenewPeriod":"","NodeControllerId":"","Period":"","DryRun":"","OwnerId":"","CapacityReservationPreference":"","VSwitchId":"vsw-uf6wf1cqwuseos3x1dpzd","PrivateIpAddress":"10.2.4.14","SpotStrategy":"","PeriodUnit":"","InstanceName":"","AutoRenew":"","InternetChargeType":"","ZoneId":"cn-shanghai-f","InternetMaxBandwidthIn":"","UseAdditionalService":"","ImageId":"m-uf6ff6n8gn1fqfg1i2fu","ClientToken":"Bosh-Cpi-CreateInstance-1564689889-284eb0e6-12c8-4b57-9c70-7e8b2","VlanId":"","SpotInterruptionBehavior":"","IoOptimized":"optimized","SecurityGroupId":"sg-uf6076x9fu1dy4dzka87","InternetMaxBandwidthOut":"","Description":"","SystemDiskCategory":"cloud_efficiency","CapacityReservationId":"","UserData":"=","":"","InstanceType":"ecs.c5.large","InstanceChargeType":"PostPaid","DeploymentSetId":"","InnerIpAddress":"","ResourceOwnerAccount":"","OwnerAccount":"","SystemDiskDiskName":"","RamRoleName":"","DedicatedHostId":"","ClusterId":"","CreditSpecification":"","DataDisk":[{"DiskName":"","SnapshotId":"","Size":"20","Encrypted":"true","Description":"","Category":"cloud_efficiency","KMSKeyId":"","Device":"","DeleteWithInstance":"true"}],"SystemDiskSize":"40","SystemDiskDescription":""} : over max retry: SDK.ServerError
ErrorCode: InvalidIPAddress.AlreadyUsed
Recommend:
RequestId: 9B9AA73D-EFDC-49FF-B42C-2732F6C1CA25
Message: The specified IPAddress is already used by other resource.' in 'create_vm' CPI method (CPI request ID: 'cpi-465242')
xiaozhu36 commented 5 years ago

HI @OlegGerber Your private Ip 10.2.4.14 has been used by the instance i-uf64zjiazflkqpzqtvtl which created in 2019-08-02 14:46:42. Can you check it or your manifest?

OlegGerber commented 5 years ago

Hi @xiaozhu36 , instance ID:i-uf64zjiazflkqpzqtvtl belong to the bosh instance zookeeper.29eee6d1-f263-4296-adea-e4ebda51d32d, which is failed on the bosh update process.

xiaozhu36 commented 5 years ago

HI @OlegGerber Why did the bosh update process failed? What is your update command? Did the failure come from bosh cpi?

OlegGerber commented 5 years ago

Hi @xiaozhu36 , Bosh update process failed on duplicate private IP (which is in use by VM itself) The update is a regular bosh update procedure, which runs as a result of any manifest changes or stemcell update. During the update, old VM may be deleted and a new one created, it seems to be that once VM deleted, it takes time to release the private IP. Few retries fix the issue. Can you please check if it may be the reason for this kind of failures?

xiaozhu36 commented 5 years ago

HI @OlegGerber You are right and I agree with you. But, actually, the CPI has added the retry strategy (times: 20, interval 10s) for this error: https://github.com/cloudfoundry-incubator/bosh-alicloud-cpi-release/blob/master/src/bosh-alicloud-cpi/alicloud/instance_manager.go#L100. Did you use the latest bosh cpi?

xiaozhu36 commented 5 years ago

Hi @OlegGerber Is there any other error before the error InvalidIPAddress.AlreadyUsed? There is one case: connection timeout or request timeout results in the specified instance has been created, but bosh return a timeout error. When you retry bosh command, the IP has been used by the last VM.

xiaozhu36 commented 5 years ago

HI @OlegGerber The latest cpi release 25.0.0 has been published and it has enlarged the retry times for the error InvalidIPAddress.AlreadyUsed. Please check it.

xiaozhu36 commented 5 years ago

HI @OlegGerber The CPI release 26.0.0 has been published and it has the following improvements:

  1. enlarge the timeout to 10 minutes when waiting for instance status, like running, deleted and so on In some extreme case, creating vm may be cost too more time, so enlarging waiting time can avoid some internal error. If there is also timeout, the cpi will rollback and delete the vm. In the rollback, enlarging waiting time can ensure the vm can be deleted successfully. It depends on backend system will kill the vm forcibly after 5 minutes and remove it.

  2. add one method to cleanup the remaining network interface which occupies the vm's private ip In some extreme case, if one vm is deleted in a non-normal ways, the vm's primarily network interface may be left and it will occupy the vm 's private ip. It also will throw an error InvalidIPAddress.AlreadyUsed when recreating a new vm. So, after deleting the instance, check and cleanup the remaining network interface is necessary.

  3. add extra error message when an error happened and it can help us to find the error reason more easily and quickly.

Please use the release 26.0.0 to have a try.

OlegGerber commented 5 years ago

Hi @xiaozhu36 , Unfortunately, this issue reproduced again with version: "26.0.0"


  stemcells:
+ - alias: default
+   os: ubuntu-xenial
+   version: '456.16'
- - alias: default
-   os: ubuntu-xenial
-   version: '456.12'
Task 3596

Task 3596 | 02:05:35 | Preparing deployment: Preparing deployment (00:00:16)
Task 3596 | 02:05:52 | Preparing deployment: Rendering templates (00:00:24)
Task 3596 | 02:06:21 | Preparing package compilation: Finding packages to compile (00:00:01)
Task 3596 | 02:06:24 | Updating instance adapter: adapter/5d37d36c-0e80-410d-9d03-01c81118b4cd (1) (canary) (00:03:56)
Task 3596 | 02:10:20 | Updating instance diego-api: diego-api/ea07b6fe-43d2-45a1-938b-51f00e407b9f (0) (canary) (00:03:19)
Task 3596 | 02:13:39 | Updating instance diego-api: diego-api/7215fb64-8992-4c54-a9fd-d30b95980321 (1) (00:03:17)
Task 3596 | 02:16:56 | Updating instance uaa: uaa/4906ce75-ddbf-4d7c-82cd-79dad44ddec0 (0) (canary) (00:04:10)
Task 3596 | 02:21:06 | Updating instance uaa: uaa/1605c073-986e-4434-a6ea-9e6e0128bb14 (1) (00:04:46)
Task 3596 | 02:25:52 | Updating instance bits: bits/9efbba78-bf85-47e8-bcbe-7ddaaa69e7a9 (0) (canary) (00:03:10)
Task 3596 | 02:29:02 | Updating instance bits: bits/0e294baf-b909-45d1-bdfa-4d772f446a88 (1) (00:03:04)
Task 3596 | 02:32:06 | Updating instance cc-worker: cc-worker/cc4d3f03-7918-4d05-b486-04176d1444e7 (0) (canary)
Task 3596 | 02:32:06 | Updating instance api: api/b1d2f599-f29b-4c4f-9c3f-5c85fc2c89f1 (0) (canary)
Task 3596 | 02:35:41 | Updating instance cc-worker: cc-worker/cc4d3f03-7918-4d05-b486-04176d1444e7 (0) (canary) (00:03:35)
Task 3596 | 02:35:41 | Updating instance cc-worker: cc-worker/01a626a2-0efd-4607-ac78-b148230d95ec (1)
Task 3596 | 02:36:44 | Updating instance api: api/b1d2f599-f29b-4c4f-9c3f-5c85fc2c89f1 (0) (canary) (00:04:38)
Task 3596 | 02:36:44 | Updating instance api: api/e5e8416c-6afd-45a1-80cf-2538966f2130 (2)
Task 3596 | 02:38:56 | Updating instance cc-worker: cc-worker/01a626a2-0efd-4607-ac78-b148230d95ec (1) (00:03:15)
Task 3596 | 02:40:50 | Updating instance api: api/e5e8416c-6afd-45a1-80cf-2538966f2130 (2) (00:04:06)
Task 3596 | 02:40:50 | Updating instance api: api/14f3a6b5-ebcb-47a4-b74c-07261d602ade (3) (00:04:05)
Task 3596 | 02:44:55 | Updating instance api: api/7353dd97-8e3d-4123-8791-ac3653594767 (1) (00:04:28)
Task 3596 | 02:49:23 | Updating instance scheduler: scheduler/ab98aec9-af7f-4c8d-ac2a-0e36a0d4846c (0) (canary) (00:06:02)
                     L Error: CPI error 'Bosh::Clouds::CloudError' with message 'input=`{"method": "create_vm","arguments": ["3f225874-fc19-456a-8e79-5c003aaaedac","m-uf639zkz5hy5p22rz231",{"availability_zone": "cn-shanghai-e","ephemeral_disk": {"encrypted": true,"size": 20480},"instance_type": "ecs.c5.xlarge"},{"cf": {"type": "manual","ip": "10.1.1.71","netmask": "255.255.0.0","cloud_properties": {"security_group_ids": ["sg-uf68jc3nwxdjnixbvrw7","sg-uf66oinxyjtague1d1ox","sg-uf65ns0580xqvcvk21jy","sg-uf66oinxyjtague1d1ot","sg-uf6fuqggczshqh7v16ng","sg-uf6825diaozs5g81gqzm","sg-uf67m2mj8kzatnko0brf"],"vswitch_id": "vsw-uf6didcjh497z4wa8bpkc"},"default": ["dns","gateway"],"dns": ["100.100.2.136","100.100.2.138"],"gateway": "10.1.0.1"}},[],{"bosh": {"blobstores": [{"options": {"endpoint": "https://10.0.3.11:25250",","group": "bosh-cf-cn40-staging-cf-scheduler","groups": ["bosh-cf-cn40-staging","cf","scheduler","bosh-cf-cn40-staging-cf","cf-scheduler","bosh-cf-cn40-staging-cf-scheduler"]}}],"context": {"director_uuid": "e9a039b7-0a89-4d52-a3fd-71a8dc125170","request_id": "cpi-198350","vm": {"stemcell": {"api_version": 2}}},"api_version": 1}` create instance failed with input={"Scheme":"HTTPS","Method":"POST","Domain":"ecs-cn-hangzhou.aliyuncs.com","Port":"","RegionId":"","ReadTimeout":0,"ConnectTimeout":0,"AcceptFormat":"JSON","QueryParams":{"AccessKeyId":"LTAIPdGJTly9Iyo3","Action":"CreateInstance","ClientToken":"Bosh-Cpi-CreateInstance-1567478989-6177f87c-0084-4128-969f-85523","DataDisk.1.Category":"cloud_efficiency","DataDisk.1.DeleteWithInstance":"true","DataDisk.1.Encrypted":"true","DataDisk.1.Size":"20","Format":"JSON","ImageId":"m-uf639zkz5hy5p22rz231","InstanceChargeType":"PostPaid","InstanceType":"ecs.c5.xlarge","IoOptimized":"optimized","PrivateIpAddress":"10.1.1.71","RegionId":"cn-shanghai","SecurityGroupId":"sg-uf68jc3nwxdjnixbvrw7","SecurityToken":"","=","SignatureMethod":"HMAC-SHA1","SignatureNonce":"","SignatureType":"","SignatureVersion":"1.0","SpotPriceLimit":"0.000000","SystemDisk.Category":"cloud_efficiency","SystemDisk.Size":"40","Timestamp":"2019-09-03T02:55:25Z","UserData":""VSwitchId":"vsw-uf6didcjh497z4wa8bpkc","Version":"2014-05-26","ZoneId":"cn-shanghai-e"},"Headers":{"Accept-Encoding":"identity","Content-Type":"application/x-www-form-urlencoded","x-sdk-client":"golang/1.0.0","x-sdk-core-version":"0.0.1","x-sdk-invoke-type":"normal"},"FormParams":{},"Content":"","ResourceOwnerId":"","HpcClusterId":"","SecurityEnhancementStrategy":"","KeyPairName":"","SpotPriceLimit":"0.000000","DeletionProtection":"","ResourceGroupId":"","HostName":"","":"","StorageSetPartitionNumber":"","Tag":null,"AutoRenewPeriod":"","NodeControllerId":"","Period":"","DryRun":"","OwnerId":"","CapacityReservationPreference":"","VSwitchId":"vsw-uf6didcjh497z4wa8bpkc","PrivateIpAddress":"10.1.1.71","SpotStrategy":"","PeriodUnit":"","InstanceName":"","AutoRenew":"","InternetChargeType":"","ZoneId":"cn-shanghai-e","InternetMaxBandwidthIn":"","UseAdditionalService":"","Affinity":"","ImageId":"m-","ClientToken":"Bosh-Cpi-CreateInstance-1567478989-6177f87c-0084-4128-969f-85523","VlanId":"","SpotInterruptionBehavior":"","IoOptimized":"optimized","SecurityGroupId":"sg-uf68jc3nwxdjnixbvrw7","InternetMaxBandwidthOut":"","Description":"","SystemDiskCategory":"cloud_efficiency","CapacityReservationId":"","SystemDiskPerformanceLevel":"","UserData":"":"","InstanceType":"ecs.c5.xlarge","Arn":null,"InstanceChargeType":"PostPaid","DeploymentSetId":"","InnerIpAddress":"","ResourceOwnerAccount":"","OwnerAccount":"","Tenancy":"","SystemDiskDiskName":"","RamRoleName":"","DedicatedHostId":"","ClusterId":"","CreditSpecification":"","DataDisk":[{"DiskName":"","SnapshotId":"","Size":"20","Encrypted":"true","PerformanceLevel":"","Description":"","Category":"cloud_efficiency","KMSKeyId":"","Device":"","DeleteWithInstance":"true"}],"StorageSetId":"","SystemDiskSize":"40","SystemDiskDescription":""} : over max retry: SDK.ServerError

ErrorCode: InvalidIPAddress.AlreadyUsed
Recommend: 
Eileen-HHY-Aliyun commented 5 years ago

As discussed in the call right now, @xiaozhu36 will have a deep-diving here to solve it. He may bother you later @OlegGerber .

ksanajoe commented 5 years ago

Do we have any progress, we are using version 27, still has this error.

xiaozhu36 commented 3 years ago

Fixed by version v32.0.0. Please check it.