cloudbase / cloudbase-init

Cross-platform instance initialization
http://openstack.org
Apache License 2.0
414 stars 150 forks source link

Azure WireServer endpoint not found #94

Open jsturtevant opened 2 years ago

jsturtevant commented 2 years ago

Overview

We are seeing random VM timeouts in our k8s cluster api tests, an example is: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-capz-master-containerd-windows/1519329033286848512

We are using the UserData plugin to execute run commands and set up ssh keys. Here is an example output from a successful run. If understand correctly, everything that is executed during UserData is running during Azure's "Creating" phase of the vm because it runs during Azures VM provisioning phase.

I've found that this VM is timing out during this Azure VM provisioning phase. When the time out occurs the response from azure is the following and the vm never finishes provision so we are able to get logs.

{\"code\":\"OSProvisioningTimedOut\",\"message\":\"OS Provisioning for VM 'capz-conf-jjd9j' did not finish in the allotted time. The VM may still finish provisioning successfully. Please check provisioning state later. For details on how to check current provisioning state of Windows VMs, refer to https://aka.ms/WindowsVMLifecycle and Linux VMs, refer to https://aka.ms/LinuxVMLifecycle.\"}]}}",

what is happening

Was able to get a vm that was in this state and found:

2022-04-28 15:24:14.315 4296 DEBUG cloudbaseinit.metadata.services.azureservice [-] Azure WireServer endpoint not found load C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\metadata\services\azureservice.py:457
2022-04-28 15:24:14.316 4296 ERROR cloudbaseinit.init [-] No metadata service found: cloudbaseinit.exception.MetadataNotFoundException: No available service found
2022-04-28 15:24:14.316 4296 INFO cloudbaseinit.init [-] Plugins execution done
2022-04-28 15:24:14.316 4296 INFO cloudbaseinit.init [-] Stopping Cloudbase-Init service
2022-04-28 15:24:17.316 4296 DEBUG cloudbaseinit.osutils.windows [-] Stopping service cloudbase-init stop_service C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\osutils\windows.py:1140

This looks to be triggering from https://github.com/cloudbase/cloudbase-init/blob/39e2f6faf4f8062394bc61ccd0a833e7c8b76917/cloudbaseinit/metadata/services/azureservice.py#L452-L458

which is calling a failing to reach the endpoint over dhcp:

https://github.com/cloudbase/cloudbase-init/blob/39e2f6faf4f8062394bc61ccd0a833e7c8b76917/cloudbaseinit/metadata/services/azureservice.py#L68-L85

It seems because of this exception the service is never setting the failed/succeed of the provisioning step: https://github.com/cloudbase/cloudbase-init/blob/39e2f6faf4f8062394bc61ccd0a833e7c8b76917/cloudbaseinit/metadata/services/azureservice.py#L217-L224

questions

The metadata service ip is well known at 169.254.169.254 why is it querying for it via dhcp? Also it get the wireservice ip of 168.63.129.16 in the success case.

If it can't get it via dhcp, then could if fall back to the well known ips for last try?

When this exception happens shouldn't the failed status be set on the VM provisioning? This way it doesn't get stuck in the Azure Creating step.

Thanks for the help!

additional info

full logs:

cloudbase-init.log

``` 2022-04-28 15:14:06.975 4244 WARNING cloudbaseinit.utils.log [-] Serial port: COM1,115200,N,8 could not be opened: serial.serialutil.SerialException: could not open port 'COM1': FileNotFoundError(2, 'The system cannot find the file specified.', None, 2) 2022-04-28 15:14:06.976 4244 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.osutils.windows.WindowsUtils' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:14:07.190 4244 DEBUG cloudbaseinit.osutils.windows [-] Checking if service exists: cloudbase-init check_service_exists C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\osutils\windows.py:1092 2022-04-28 15:14:07.193 4244 DEBUG cloudbaseinit.osutils.windows [-] Getting service username: cloudbase-init get_service_username C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\osutils\windows.py:1225 2022-04-28 15:14:07.193 4244 DEBUG cloudbaseinit.osutils.windows [-] Resetting password for service user: .\cloudbase-init reset_service_password C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\osutils\windows.py:1248 2022-04-28 15:14:07.240 4244 DEBUG cloudbaseinit.osutils.windows [-] Setting service credentials: cloudbase-init set_service_credentials C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\osutils\windows.py:1208 2022-04-28 15:14:07.250 4244 INFO cloudbaseinit.init [-] Respawning current process with updated credentials. 2022-04-28 15:14:07.250 4244 DEBUG cloudbaseinit.osutils.windows [-] Creating logon session for user: .\cloudbase-init create_user_logon_session C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\osutils\windows.py:649 2022-04-28 15:14:07.255 4244 DEBUG cloudbaseinit.osutils.windows [-] Executing process as user, command line: ['C:\\Program Files\\Cloudbase Solutions\\Cloudbase-Init\\Python\\Scripts\\cloudbase-init', '--config-file', 'C:\\Program Files\\Cloudbase Solutions\\Cloudbase-Init\\conf\\cloudbase-init.conf', '--noreset_service_password'] execute_process_as_user C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\osutils\windows.py:703 2022-04-28 15:14:07.791 4296 WARNING cloudbaseinit.utils.log [-] Serial port: COM1,115200,N,8 could not be opened: serial.serialutil.SerialException: could not open port 'COM1': FileNotFoundError(2, 'The system cannot find the file specified.', None, 2) 2022-04-28 15:14:07.791 4296 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.osutils.windows.WindowsUtils' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:14:07.997 4296 INFO cloudbaseinit.init [-] Cloudbase-Init version: 1.1.2 2022-04-28 15:14:07.998 4296 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.windows.createuser.CreateUserPlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:14:08.085 4296 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.common.setuserpassword.SetUserPasswordPlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:14:08.171 4296 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.windows.extendvolumes.ExtendVolumesPlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:14:08.224 4296 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.common.userdata.UserDataPlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:14:08.346 4296 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.common.ephemeraldisk.EphemeralDiskPlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:14:08.369 4296 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.windows.azureguestagent.AzureGuestAgentPlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:14:08.391 4296 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.common.mtu.MTUPlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:14:08.409 4296 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.common.sethostname.SetHostNamePlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:14:08.457 4296 INFO cloudbaseinit.init [-] Executing plugins for stage 'PRE_NETWORKING': 2022-04-28 15:14:08.458 4296 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.windows.createuser.CreateUserPlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:14:08.458 4296 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.common.setuserpassword.SetUserPasswordPlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:14:08.458 4296 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.windows.extendvolumes.ExtendVolumesPlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:14:08.459 4296 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.common.userdata.UserDataPlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:14:08.459 4296 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.common.ephemeraldisk.EphemeralDiskPlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:14:08.459 4296 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.windows.azureguestagent.AzureGuestAgentPlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:14:08.460 4296 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.common.mtu.MTUPlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:14:08.460 4296 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.common.sethostname.SetHostNamePlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:14:08.460 4296 INFO cloudbaseinit.init [-] Executing plugins for stage 'PRE_METADATA_DISCOVERY': 2022-04-28 15:14:08.460 4296 INFO cloudbaseinit.init [-] Executing plugin 'MTUPlugin' 2022-04-28 15:14:08.461 4296 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.osutils.windows.WindowsUtils' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:14:08.474 4296 DEBUG cloudbaseinit.plugins.common.mtu [-] Could not obtain the MTU configuration via DHCP for interface "60:45:BD:0F:3D:F9" execute C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\plugins\common\mtu.py:46 2022-04-28 15:14:08.475 4296 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.metadata.services.azureservice.AzureService' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:14:08.800 4296 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.osutils.windows.WindowsUtils' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:24:14.315 4296 DEBUG cloudbaseinit.metadata.services.azureservice [-] Azure WireServer endpoint not found load C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\metadata\services\azureservice.py:457 2022-04-28 15:24:14.316 4296 ERROR cloudbaseinit.init [-] No metadata service found: cloudbaseinit.exception.MetadataNotFoundException: No available service found 2022-04-28 15:24:14.316 4296 INFO cloudbaseinit.init [-] Plugins execution done 2022-04-28 15:24:14.316 4296 INFO cloudbaseinit.init [-] Stopping Cloudbase-Init service 2022-04-28 15:24:17.316 4296 DEBUG cloudbaseinit.osutils.windows [-] Stopping service cloudbase-init stop_service C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\osutils\windows.py:1140 ```

cloudbase-init-unattend.log

``` 2022-04-28 15:13:46.682 1476 WARNING cloudbaseinit.utils.log [-] Serial port: COM1,115200,N,8 could not be opened: serial.serialutil.SerialException: could not open port 'COM1': FileNotFoundError(2, 'The system cannot find the file specified.', None, 2) 2022-04-28 15:13:46.682 1476 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.osutils.windows.WindowsUtils' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:13:49.318 1476 DEBUG cloudbaseinit.osutils.windows [-] Checking if service exists: cloudbase-init check_service_exists C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\osutils\windows.py:1092 2022-04-28 15:13:49.318 1476 DEBUG cloudbaseinit.osutils.windows [-] Getting service username: cloudbase-init get_service_username C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\osutils\windows.py:1225 2022-04-28 15:13:49.318 1476 DEBUG cloudbaseinit.osutils.windows [-] Resetting password for service user: .\cloudbase-init reset_service_password C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\osutils\windows.py:1248 2022-04-28 15:13:49.365 1476 DEBUG cloudbaseinit.osutils.windows [-] Setting service credentials: cloudbase-init set_service_credentials C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\osutils\windows.py:1208 2022-04-28 15:13:49.381 1476 DEBUG cloudbaseinit.init [-] No need to respawn process. Current user: capz-conf-r7bm9$. Service user: cloudbase-init _reset_service_password_and_respawn C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\init.py:141 2022-04-28 15:13:49.381 1476 INFO cloudbaseinit.init [-] Cloudbase-Init version: 1.1.2 2022-04-28 15:13:49.381 1476 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.common.mtu.MTUPlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:13:49.428 1476 INFO cloudbaseinit.init [-] Executing plugins for stage 'PRE_NETWORKING': 2022-04-28 15:13:49.428 1476 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.common.mtu.MTUPlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:13:49.428 1476 INFO cloudbaseinit.init [-] Executing plugins for stage 'PRE_METADATA_DISCOVERY': 2022-04-28 15:13:49.428 1476 INFO cloudbaseinit.init [-] Executing plugin 'MTUPlugin' 2022-04-28 15:13:49.428 1476 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.osutils.windows.WindowsUtils' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:13:49.428 1476 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.metadata.services.base.EmptyMetadataService' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:13:49.475 1476 INFO cloudbaseinit.init [-] Metadata service loaded: 'EmptyMetadataService' 2022-04-28 15:13:49.475 1476 DEBUG cloudbaseinit.init [-] Instance id: None configure_host C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\init.py:202 2022-04-28 15:13:49.475 1476 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.common.mtu.MTUPlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:13:49.475 1476 INFO cloudbaseinit.init [-] Executing plugins for stage 'MAIN': 2022-04-28 15:13:49.475 1476 INFO cloudbaseinit.init [-] Plugins execution done ```

ader1990 commented 2 years ago

Hello @jsturtevant,

I think we can go forward with solving this issue by using a fallback magic IP in case the DHCP option is not present in the DHCP response.

Although I am afraid this might lead to the same end result, as the OS state might not be ready (basically this issue might happen because of a lazy first boot) or some other networking issue with Azure or the Windows image itself.

We can try a very broad solving approach, where we increase the retry count, improve logging and use the known fallback magic IP.

I can prepare a patch for this purpose as soon as possible. Are you able to build the MSI / new image with a new patch?

Thank you, Adrian

jsturtevant commented 2 years ago

We can try a very broad solving approach, where we increase the retry count, improve logging and use the known fallback magic IP.

sounds good

as the OS state might not be ready (basically this issue might happen because of a lazy first boot) or some other networking issue with Azure or the Windows image itself.

I was wondering the same thing my self. Might even be a race in same way... It only started happening frequently in the last month or so but has gotten to a point where it is troublesome.

We are doing the kubeadm bootstrap work at provisioning time. Is there another approach with cloudbase init, that we could do it just after provision finishes so that it would be avoid doing that work during normal Azure boot process? We also leverage the things like ssh, user creation, password setting from cloudbase init which seems to rely on this Service so even if we move that work out this could still happen if I understand correctly.

Are you able to build the MSI / new image with a new patch?

Are there instructions on how to build the msi? I can easily build a new azure image with a URL to the MSI.

jsturtevant commented 2 years ago

@claudiubelu thought maybe adding a networking service dependency might help https://github.com/kubernetes/kubernetes/issues/109444#issuecomment-1113815615

As to why it didn't detect it needs the Azure Metadata Service: basically, it tries to detect if it can contact the Azure Wire Server by making a DHCP request, but the socket timeout is 5 seconds, and I assume that timed out. I also notice that this is the only dependency for the cloudbase-init: DEPENDENCIES : Winmgmt, which is not Networking related. I think there's a possibility that cloudbase-init started before the networking started, resulting in this issue. A solution might be to add another networking dependency to cloudbase-init, ensuring that it will start afterwards. Other than that, the 5 second timeout could be touched, IMO (maybe the Azure Security Group rules are not applied yet before the VM starts?).

ader1990 commented 2 years ago

Hello @jsturtevant,

I have fixed the github actions in order to produce a cloudbase-init MSI on request if there is the need for it. You can find in the artifacts tab of this github action - https://github.com/ader1990/cloudbase-init-installer-1/actions/runs/2262765578, the MSI with the following patches:

In short, I have added more retries, more logging in case we need it, a fallback to the default wireserver IP and to have no chance of transient lazy first boot failures, delayed-start on the cloudbase-init service.

Please let me know how it works and if all in order, I can clean up the patches for upstream merge.

Thank you, Adrian.

jsturtevant commented 2 years ago

Thanks! I am going to build a new vm image for our WS2022 tests where we have seen this so we can test this and the other mitigation I've added to capz.

I will run it for the next week or so and report back.

jsturtevant commented 2 years ago

Just an update, I haven't seen this issue since I've updated the cloudbase-init to use this patch. We've had a few other unrelated issues in CI so going to continue to run for a while longer.

jsturtevant commented 2 years ago

@ader1990 Just got an instance of the issue. This patch seems to have helped but the time out before cutting over to the fall back was longer than our timeout in the tests. The VM didn't get suck in Azure's Creating step.

Job that failed: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-capz-master-containerd-windows-2022/1527248958894641152

You can see it fails and starts to collect logs at 12:07:08.273: Capz suite:

STEP: Waiting for the workload nodes to exist
[AfterEach] Conformance Tests
  /home/prow/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/conformance_test.go:231
STEP: Dumping logs from the "capz-conf-ew9isz" workload cluster
STEP: Dumping workload cluster capz-conf-ew9isz/capz-conf-ew9isz logs
May 19 12:07:08.273: INFO: Collecting logs for Linux node capz-conf-ew9isz-control-plane-pdqw7 in cluster capz-conf-ew9isz in namespace capz-conf-ew9isz

May 19 12:07:20.205: INFO: Collecting boot logs for AzureMachine capz-conf-ew9isz-control-plane-pdqw7

May 19 12:07:22.425: INFO: Unable to collect logs as node doesn't have addresses
May 19 12:07:22.425: INFO: Collecting logs for Windows node capz-conf-ew9isz-md-win-sj9hc in cluster capz-conf-ew9isz in namespace capz-conf-ew9isz

May 19 12:07:29.660: INFO: Collecting boot logs for AzureMachine capz-conf-ew9isz-md-win-sj9hc

STEP: Redacting sensitive information from logs

• Failure [1898.993 seconds]
Conformance Tests
/home/prow/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/conformance_test.go:48
  conformance-tests [Measurement]
  /home/prow/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/conformance_test.go:103

  Timed out after 1500.000s.
  Expected
      <int>: 0
  to equal
      <int>: 2

  /home/prow/go/pkg/mod/sigs.k8s.io/cluster-api/test@v1.1.2/framework/machinedeployment_helpers.go:121

  Full Stack Trace
  sigs.k8s.io/cluster-api/test/framework.WaitForMachineDeploymentNodesToExist({0x3a3f8a8?, 0xc00013e008}, {{0x7f27b0f7f208?, 0xc000343810?}, 0xc00088aa80?, 0xc0006c0290?}, {0xc0008c12e0, 0x2, 0x2})

In the cloudbase-init logs you can see the DCHP fail and the fallback successfully at 2022-05-19 12:12:20.363 which lets the vm provision properly but not in time for the test which has a generous time out for nodes to start up (25mins)

022-05-19 12:12:20.363 4428 DEBUG cloudbaseinit.metadata.services.azureservice [-] Cannot find Azure WireServer endpoint address _get_wire_server_endpoint_address C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\metadata\services\azureservice.py:83
2022-05-19 12:12:20.363 4428 DEBUG cloudbaseinit.metadata.services.azureservice [-] Azure WireServer endpoint not found. Using default endpoint 168.63.129.16. load C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\metadata\services\azureservice.py:460
2022-05-19 12:12:20.363 4428 DEBUG cloudbaseinit.metadata.services.base [-] Executing http request GET at http://168.63.129.16?comp=Versions _http_request C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\metadata\services\base.py:315
2022-05-19 12:12:20.473 4428 DEBUG urllib3.connectionpool [-] Starting new HTTP connection (1): 168.63.129.16:80 _new_conn C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\urllib3\connectionpool.py:230
2022-05-19 12:12:20.473 4428 DEBUG urllib3.connectionpool [-] http://168.63.129.16:80 "GET /?comp=Versions HTTP/1.1" 200 510 _make_request C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\urllib3\connectionpool.py:442
2022-05-19 12:12:20.535 4428 DEBUG cloudbaseinit.metadata.services.azureservice [-] ovs-env path: E:\ovf-env.xml _get_ovf_env_path C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\metadata\services\azureservice.py:341
2022-05-19 12:12:20.551 4428 INFO cloudbaseinit.init [-] Metadata service loaded: 'AzureService'
2022-05-19 12:12:20.551 4428 INFO cloudbaseinit.init [-] Reporting provisioning started
2022-05-19 12:12:20.551 4428 DEBUG cloudbaseinit.metadata.services.base [-] Executing http request GET at http://168.63.129.16/machine?comp=goalstate _http_request C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\metadata\services\base.py:315
2022-05-19 12:12:20.551 4428 DEBUG urllib3.connectionpool [-] Starting new HTTP connection (1): 168.63.129.16:80 _new_conn C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\urllib3\connectionpool.py:230
2022-05-19 12:12:20.561 4428 DEBUG urllib3.connectionpool [-] http://168.63.129.16:80 "GET /machine?comp=goalstate HTTP/1.1" 200 2154 _make_request C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\urllib3\connectionpool.py:442
2022-05-19 12:12:20.567 4428 DEBUG cloudbaseinit.metadata.services.azureservice [-] Health data: b"<?xml version='1.0' encoding='utf-8'?>\n<Health><GoalStateIncarnation>1</GoalStateIncarnation><Container><ContainerId>d478084b-e06f-419b-9a36-9ae68ca62648</ContainerId><RoleInstanceList><Role><InstanceId>0c07e5b6-255e-4f12-9fd0-e6f42d491599._capz-conf-sj9hc</InstanceId><Health><State>NotReady</State><Details><SubStatus>Provisioning</SubStatus><Description>Cloudbase-Init is preparing your computer for first use...</Description></Details></Health></Role></RoleInstanceList></Container></Health>" _post_health_status C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\metadata\services\azureservice.py:210
2022-05-19 12:12:20.567 4428 DEBUG cloudbaseinit.metadata.services.base [-] Executing http request POST at http://168.63.129.16/machine?comp=health _http_request C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\metadata\services\base.py:315
2022-05-19 12:12:20.567 4428 DEBUG urllib3.connectionpool [-] Starting new HTTP connection (1): 168.63.129.16:80 _new_conn C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\urllib3\connectionpool.py:230
2022-05-19 12:12:20.567 4428 DEBUG urllib3.connectionpool [-] http://168.63.129.16:80 "POST /machine?comp=health HTTP/1.1" 200 0 _make_request C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\urllib3\connectionpool.py:442
2022-05-19 12:12:20.567 4428 DEBUG cloudbaseinit.init [-] Instance id: 0c07e5b6-255e-4f12-9fd0-e6f42d491599._capz-conf-sj9hc configure_host C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\init.py:202
2022-05-19 12:12:20.582 4428 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.windows.createuser.CreateUserPlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27
2022-05-19 12:12:20.582 4428 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.common.setuserpassword.SetUserPasswordPlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27
2022-05-19 12:12:20.582 4428 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.windows.extendvolumes.ExtendVolumesPlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27
2022-05-19 12:12:20.582 4428 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.common.userdata.UserDataPlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27
2022-05-19 12:12:20.582 4428 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.common.ephemeraldisk.EphemeralDiskPlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27
2022-05-19 12:12:20.582 4428 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.windows.azureguestagent.AzureGuestAgentPlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27
2022-05-19 12:12:20.582 4428 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.common.mtu.MTUPlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27
2022-05-19 12:12:20.598 4428 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.common.sethostname.SetHostNamePlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27

full log here: https://gist.github.com/jsturtevant/2ed46f8cc9f52fe2a767a2f698e45377

jsturtevant commented 2 years ago

/cc @marosset

ader1990 commented 2 years ago

Hello,

The wait time now is around 20 minutes just for the retries to get the dhcp option, I can make another patch with way faster falling back to the default IP (maybe /4 the time to 5 minutes?). Would that be preferable for your case?

jsturtevant commented 2 years ago

What would be the long term fix for cloudbase-init? It seems like the delay-start for the service helped. Would taking a dependency on a networking service as suggested with https://github.com/kubernetes/kubernetes/issues/109444#issuecomment-1113815615 be viable?

My main concern with using a 5 minute timeout is that this seems the dhcp has a failure mode that is not recoverable the way it is executed now, so if we do hit this edge case we have an outlying VM startup time that will take 5+ minutes to provision which isn't a great experience for the end user.

jsturtevant commented 2 years ago

@ader1990 any thoughts here?

marosset commented 2 years ago

I think taking a dependency on is a good idea. It seems safer and more deterministic vs waits + retries.

ader1990 commented 2 years ago

Hello,

I have created a new MSI installer using the github actions, downloadable from here https://github.com/ader1990/cloudbase-init-installer-1/actions/runs/2426984970 .

Implemented changes:

Branches:

@jsturtevant we have already given this version of the MSI some runs on Azure and it works as expected. Once we get a confirmation it works well, we can merge the two fixes upstream asap.

Thank you, Adrian Vladu

jsturtevant commented 2 years ago

I will get an image into testing and report back but otherwise looks good to me

jsturtevant commented 2 years ago

The CI now has the patch, we can let it run over the weekend and check in.

jsturtevant commented 2 years ago

@ader1990 We are still seeing failures with node coming online with the latest patches. It is happening more frequently then the prior patch with had the delayed start to the service. Seems that might be the best approach?

I am trying to collect logs but it seems the serial com port logging is not working. Is this the correct configuration https://github.com/kubernetes-sigs/image-builder/pull/875? Also can you confirm the api where we can get the logs? We are using https://docs.microsoft.com/en-us/azure/virtual-machines/boot-diagnostics via https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/0ecbf62ecca79af77d1d902a9e8d46c0a0ba38c1/test/e2e/azure_logcollector.go#L416

thanks for the help!

ader1990 commented 2 years ago

Hello @jsturtevant,

First of all, maybe there are some other issues that we need to investigate, there might be a posibility the entire Windows networking is not available till a point later in the boot sequence, but we need logs to get to the bottom of it.

COM2 is the port opened for serial logging, and it should work AFAIK, so should the azure collector logs.

What we can do is to try to reproduce this bug in isolation. In order to do this, can you please share the VM flavor being used and the region of Azure, if it s possible.

In the meantime, maybe you can leverage the older MSI / image with higher timeouts, but if I remember correctly, there was the need to increase a timeout in the testing framework too?

Thank you, Adrian Vladu

jsturtevant commented 2 years ago

We run the tests in the following random regions: "canadacentral" "centralus" "eastus" "eastus2" "northeurope" "uksouth" "westeurope" "westus2" "westus3". I've seen it in all the regions but recent instances where in eastus, eastus2, centralus, and westus3

vm size: Standard_D4s_v3

In the meantime, maybe you can leverage the older MSI / image with higher timeouts, but if I remember correctly, there was the need to increase a timeout in the testing framework too?

Yes I am going to try this.

jsturtevant commented 2 years ago

COM2 is the port opened for serial logging, and it should work AFAIK, so should the azure collector logs.

I am not seeing the logs show up with by using that API, when you do some further testing let me know if you are seeing them there or if I am using the wrong API.

ader1990 commented 2 years ago

I have also created another MSI with automatic delayed start re-enabled. I think this hybrid approach would be best while we work to reproduce the issue in isolation. https://github.com/ader1990/cloudbase-init-installer-1/actions/runs/2576044017 .

I will try to reproduce this issue and will verify the COM port behavious asap.

Thanks.

jsturtevant commented 2 years ago

@ader1990 any luck reproducing?

Since taking the build from https://github.com/ader1990/cloudbase-init-installer-1/actions/runs/2576044017 I haven't seen any provisioning errors related to this.

marosset commented 1 year ago

@jsturtevant @ader1990 We've been running with these changes in CAPZ for about a month now and have seen a reduction in cluster provisioning errors. Can we get a release with these changes in it? Thanks!

ader1990 commented 1 year ago

Hello,

There is work in progress to get the cloudbase-init patch merged here:

https://review.opendev.org/c/x/cloudbase-init/+/856306

If you can take a look and give a +1, it would be great. I expect no other functional additions to it, maybe some nitpicking on the code review, which should not take long.

Thank you, Adrian Vladu

jsturtevant commented 1 year ago

I don't seem to be able to plus one there but I'll give it here. We've been running this patch for a while now and no longer see the issues.

marosset commented 1 year ago

Same, +1 for me!

ader1990 commented 1 year ago

Hello @jsturtevant @marosset,

The wireserver patch has been merged, also a few other patches. We are switching to Python 3.10, I will update here once the new MSI containing the patches and Py3.10 is released.

https://github.com/cloudbase/cloudbase-init-installer/pull/28

Thank you, Adrian Vladu

jsturtevant commented 1 year ago

@ader1990 Any update on the release?

yashwanth-mannem commented 1 year ago

@ader1990, @jsturtevant How could I download the .msi for the above patch? The github actions artifact is expired on the above mentioned links

ader1990 commented 1 year ago

Hello @yashwanth-mannem, the beta cloudbase-init installer contains the stable Python version [3.10.7] and the required patch for this issue: https://www.cloudbase.it/downloads/CloudbaseInitSetup_x64.msi

@jsturtevant can you please check that the latest version works for you?

Thank you, Adrian Vladu

jsturtevant commented 1 year ago

We ran it in our test suite and looks to be passing! Let's ship it 🚢!

yashwanth-mannem commented 1 year ago

@ader1990 The beta cloudbase-init installer provided above is failing to provision the vm successfully as the vmagent on the vm is in 'NotReady' status. Do you have insights into how to fix the issue?

jsturtevant commented 1 year ago

looks like this got released in https://github.com/cloudbase/cloudbase-init/releases/tag/1.1.4

@ader1990 can we close this?

marosset commented 1 year ago

+1 for closing this issue!

rahulgupta999 commented 1 year ago

The issue is frequently reproducible in my environment. I am using v1.1.4 with Windows Server 2022 core on Azure. The CAPZ extension timeout is 300s, which is lower than what Cloud-init retries take to fetch the wire server address. Each attempt by cloud-init is taking 7sec.
I talked to Azure support and they mentioned that these retries should not be required. The wire server address is fixed.

jsturtevant commented 1 year ago

@rahulgupta999 I don't know that we are seeing this issue any more. Is it something different?

rahulgupta999 commented 1 year ago

@jsturtevant I could see this issue reproducing around 4/10 times with cloudbase-init v1.1.4 build. Finally, as a workaround, I modified the cloudbase-init installation to reduce the number of wire server endpoint detection. This way the fallback to the known IP happens faster and the CAPZ bootstrap VM extension does not timeout. After this change, I have not seen this issue pop up in over 100 VM launches. I am not sure why this logic exists in the first place. If there is a need we should look for some other alternative. One may be simply query/ping the fixed wire server address.

I also debugged this issue with Azure Premier support and they suggested no need for such logic to detect the wire server endpoint as that is a fixed IP.

ader1990 commented 1 year ago

Hello,

If the problem persists, I can prepare a new patch that removes the logic to detect the wire server endpoint. The logic was there due to legacy reasons / compatibility with cloud-init implementation / flexibility for testing.

Thank you, Adrian Vladu