Closed dejanangelov86 closed 2 years ago
Hi @dejanangelov86,
I don't think an extended timeout will help here, because the connection is being reset. Regarding retries: the error can happen with any BOSH operation, and actions like create_vm or create_disk can't be retried because you don't know whether the VM or disk was created successfully. The disk snapshots in your case are an example: they are created successfully even though the call reports an error, so retrying would produce orphaned resources. I would suggest analysing the connectivity between the source machine and the Google API, as already suggested in the GCP support ticket. Analysing network issues is not easy, but it should be possible to find out where the connection gets reset. Is the source machine where the BOSH CPI runs also running in GCP? If so, you could use GCP tools to analyse the network traffic when the error happens.
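To illustrate the point about retries, here is a minimal Python sketch of a wrapper that retries on a connection reset only for read-only actions and lets everything else fail on the first error. It is not how the BOSH Google CPI is implemented. The names `IDEMPOTENT_ACTIONS`, `get_operation`, `find_disk`, and `call_with_retry` are hypothetical and chosen for illustration; only create_vm and create_disk come from the discussion above.

```python
import errno
import time

# Hypothetical set of read-only actions that are safe to repeat.
# This classification is an assumption for illustration, not taken from the CPI.
IDEMPOTENT_ACTIONS = {"get_operation", "find_disk", "has_vm"}


def is_connection_reset(exc):
    """Return True if the exception represents 'connection reset by peer'."""
    return isinstance(exc, ConnectionResetError) or (
        isinstance(exc, OSError) and exc.errno == errno.ECONNRESET
    )


def call_with_retry(action, func, attempts=3, delay=0.0):
    """Call func(); retry on connection reset only if the action is idempotent.

    Actions such as create_vm or create_disk are never retried here, because
    a lost response does not tell us whether the resource was created.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except OSError as exc:
            retryable = action in IDEMPOTENT_ACTIONS and is_connection_reset(exc)
            if not retryable or attempt == attempts:
                raise
            time.sleep(delay * attempt)  # simple linear backoff between tries
```

With this shape, `call_with_retry("get_operation", poll)` would repeat a failed status poll up to three times, while `call_with_retry("create_vm", create)` surfaces the first error so that an operator can check for a half-created resource before doing anything else.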
There has been no activity on this issue for more than 2 weeks. If you still experience this, please reopen this issue.
Hello team,
We are seeing network issues during some BOSH operations on GCP landscapes.
The calls to the Google API themselves succeed: during snapshot creation, for example, the snapshot is created successfully, but the response is invalid, as you can see in the errors below:
Details of any failed request(s) (e.g. URL or API method, date and time, input parameters, response code, error message, screenshots):
[RetryTransport] 2022/08/11 09:27:38 INFO - net.Error was not retryable: read tcp 10.201.2.6:49892->216.58.196.74:443: read: connection reset by peer
[GoogleOperationService] 2022/08/11 09:27:38 DEBUG - Google Operation 'operation-1660210032949-5e5f3c5443814-c0fa8562-e1427761' finished with an error: &url.Error {Op:"Get", URL:"https://compute.googleapis.com/compute/v1/projects/gcp/zones/asia-south1-a/operations/operation-1660210032949-5e5f3c5443814-c0fa8562-e1427761?alt=json&prettyPrint=false", Err:(*net.OpError)(0xc0000a60a0)}
What was the observed behavior?
- Disk snapshot creation failed.
Any other relevant details about your implementation or issue?
- We are using BOSH CPI https://bosh.io/d/github.com/cloudfoundry/bosh-google-cpi-release?v=42.0.0
We opened a ticket with GCP support containing the details above, and their answer was: “Said that, it seems that the snapshot is getting created successfully, but it seems that the bosh CPI is failing to get the operation information due to some network problems. Moreover, I see different snapshot creations for the same disk and they all were successful. As an example, there was another creation for a snapshot from the same disk at [2022-08-11 11:32:17 UTC] and the operation was successful as well. The snapshot is created by bosh CPI as well and is still there under the snapshot page list. I would suggest to check the connectivity between the source machine that is making the requests and the google API platform, as this seems to be an issue out of the scope of Google Cloud since the snapshots are being created as expected.”
GCP Timeout Errors:
Task 2243498 | 11:53:39 | Updating instance audit_broker: audit_broker/4f6bd68d-ec89-4183-ac9f-6980f32c7c5e (1) (00:04:00)
L Error: CPI error 'Bosh::Clouds::CloudError' with message 'Deleting vm 'vm-5195cbb9-d0fa-4686-72e6-b5742e2bb156': Failed to remove Google Instance "vm-5195cbb9-d0fa-4686-72e6-b5742e2bb156" from Target Pool: Failed to remove Google Instance 'vm-5195cbb9-d0fa-4686-72e6-b5742e2bb156' from Target Pool 'broker-gcp-eu30-pool': Google Operation 'operation-1656590145313-5e2a8f32659a1-5e78ca25-dacfd856' finished with an error: Get "https://compute.googleapis.com/compute/v1/projects/gcp/regions/europe-west3/operations/operation-1656590145313-5e2a8f32659a1-5e78ca25-dacfd856?alt=json&prettyPrint=false": read tcp 10.201.2.6:35042->142.250.185.138:443: read: connection reset by peer' in 'delete_vm' CPI method (CPI request ID: 'cpi-855603')
Task 2243498 | 11:57:39 | Error: CPI error 'Bosh::Clouds::CloudError' with message 'Deleting vm 'vm-5195cbb9-d0fa-4686-72e6-b5742e2bb156': Failed to remove Google Instance "vm-5195cbb9-d0fa-4686-72e6-b5742e2bb156" from Target Pool: Failed to remove Google Instance 'vm-5195cbb9-d0fa-4686-72e6-b5742e2bb156' from Target Pool 'broker-gcp-eu30-pool': Google Operation 'operation-1656590145313-5e2a8f32659a1-5e78ca25-dacfd856' finished with an error: Get "https://compute.googleapis.com/compute/v1/projects/gcp/regions/europe-west3/operations/operation-1656590145313-5e2a8f32659a1-5e78ca25-dacfd856?alt=json&prettyPrint=false": read tcp 10.201.2.6:35042->142.250.185.138:443: read: connection reset by peer' in 'delete_vm' CPI method (CPI request ID: 'cpi-855603')
=================================================
L Error: CPI error 'Bosh::Clouds::CloudError' with message 'Attaching disk 'disk-e56f3e21-8eb3-45a8-54f9-c84c51021507' to vm 'vm-8e717d9e-febd-47e3-4bbc-52d622fb486a': Failed to find Google Disk 'disk-e56f3e21-8eb3-45a8-54f9-c84c51021507': Get "https://compute.googleapis.com/compute/v1/projects/gcp/aggregated/disks?alt=json&filter=name+eq+.%2Adisk-e56f3e21-8eb3-45a8-54f9-c84c51021507&prettyPrint=false": oauth2: cannot fetch token: Post "https://oauth2.googleapis.com/token": dial tcp 142.250.66.10:443: i/o timeout' in 'attach_disk' CPI method (CPI request ID: 'cpi-117739')
Task 2341495 | 13:53:52 | Error: CPI error 'Bosh::Clouds::CloudError' with message 'Attaching disk 'disk-e56f3e21-8eb3-45a8-54f9-c84c51021507' to vm 'vm-8e717d9e-febd-47e3-4bbc-52d622fb486a': Failed to find Google Disk 'disk-e56f3e21-8eb3-45a8-54f9-c84c51021507': Get "https://compute.googleapis.com/compute/v1/projects/gcp/aggregated/disks?alt=json&filter=name+eq+.%2Adisk-e56f3e21-8eb3-45a8-54f9-c84c51021507&prettyPrint=false": oauth2: cannot fetch token: Post "https://oauth2.googleapis.com/token": dial tcp 142.250.66.10:443: i/o timeout' in 'attach_disk' CPI method (CPI request ID: 'cpi-117739')
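The logs above contain two distinct failure shapes: a reset on an established connection ("read tcp ... connection reset by peer") and a failure to connect at all ("dial tcp ... i/o timeout"). As a rough aid for the network analysis, the following Python sketch tallies both per remote endpoint from collected task output. The function names `classify` and `summarize` are made up for this example, and the regular expressions assume only the message formats visible in these logs.

```python
import re
from collections import Counter

# Patterns for the two network error shapes that appear in the logs above.
RESET_RE = re.compile(
    r"read tcp (?P<src>[\d.]+:\d+)->(?P<dst>[\d.]+:\d+): "
    r"read: connection reset by peer"
)
DIAL_TIMEOUT_RE = re.compile(r"dial tcp (?P<dst>[\d.]+:\d+): i/o timeout")


def classify(line):
    """Return (kind, remote_endpoint) for a log line, or None if nothing matches."""
    match = RESET_RE.search(line)
    if match:
        return ("connection_reset", match.group("dst"))
    match = DIAL_TIMEOUT_RE.search(line)
    if match:
        return ("dial_timeout", match.group("dst"))
    return None


def summarize(lines):
    """Count failures per (kind, remote endpoint) across many log lines."""
    results = (classify(line) for line in lines)
    return Counter(result for result in results if result is not None)
```

Feeding the saved task output line by line into `summarize` gives a count per failure kind and remote address, which can help show whether the failures cluster on particular Google front-end IPs or on particular times of day if timestamps are grouped separately.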
Is there a way to extend the timeout period or add a retry mechanism, so that even if the update takes longer, it eventually completes without failing? Or do you have other suggestions for working around this kind of network issue?
Best Regards, Deyan