bfrankMS / CreateHypervVms


Keep getting this error while deploying HCI: `ALM was unable to converge on (virtual) node` #3

Open kCyborg opened 4 months ago

kCyborg commented 4 months ago

Hi there Frank.

This is not an issue, but a cry for help because I really don't know where to ask for help.

First of all, I wanna say thanks so much for your work and for publishing it so we can test it. I have learned A LOT just by reading your PS scripts!!! Great and amazing work!!!

Welp, I have been following your YouTube series about the Virtual Azure HCI Stack environment (you have a new subscriber btw) and I tried to replicate it, but with a few small tweaks. I will try to explain it, hoping you can help me debug where my error is.


My lab environment

In my home lab I don't have Windows Server on the node to virtualize the VMs, but an open source alternative called OpenNebula. It works well, and I have been using it for the last 3 years to run my private cloud with few troubles. I'm not advertising, I just wanna let you know that the hypervisor works without problems.

On my private cloud I created a VM with Windows Server 2019 (ServerStandard edition), and I am using this VM to run the PS scripts from your guide.

Here is a little map of what I have, which is pretty similar to what you propose:

[image: map of the lab environment]


My error

I watched your videos and read this GitHub repo more than a couple of times. The ISOs I'm using to create the .vhdx images are the same ones you suggest in your videos (Windows Server 2022 and HCI 23H2), all of the VMs can reach the internet (ping to ibm.com works correctly), and I disabled IPv6 on the HCI nodes as suggested...
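Both claims (internet reachability and IPv6 being off) can be re-checked on each node before retrying. A minimal sketch, assuming the standard `Get-NetAdapterBinding` and `Test-NetConnection` cmdlets are available on the node; `Get-Ipv6EnabledAdapter` is a hypothetical helper name, not something from the repo:

```powershell
# Sketch: list adapters on which the IPv6 binding (ms_tcpip6) is still enabled.
function Get-Ipv6EnabledAdapter {
    param(
        # Defaults to the live bindings; objects can be passed in to try it offline.
        [object[]]$Binding = (Get-NetAdapterBinding -ComponentID ms_tcpip6)
    )
    $Binding | Where-Object { $_.Enabled } | ForEach-Object { $_.Name }
}

# Usage on a node:
#   Get-Ipv6EnabledAdapter                              # empty output: IPv6 is unbound everywhere
#   Test-NetConnection -ComputerName ibm.com -Port 443  # checks name resolution plus outbound TCP
```

An empty result from the first call only says that the IPv6 binding is disabled on every adapter; it does not cover other ways of turning IPv6 off.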

The deployment, if we use your video as reference, stops with an error at this point.

I followed all your suggestions and instructions, but the deployment keeps failing at the same point, near the end of the deployment. When I try to deploy from the Azure Arc Dashboard I keep getting:

[image: deployment error shown in the Azure Arc Dashboard]

And if I click on View Details I get:

[image: error details shown after clicking View Details]

Now, if I log into the mentioned HCI (virtual) node and check the logs I get:

PS C:\CloudDeployment\Logs> Get-Content .\CloudDeployment.2024-05-11.08-45-13.0.log -Tail 100 -Wait

2024-05-11 11:41:07 Verbose  [AgentLifecycleManager:InstallAllAgents] Invoke-ScriptBlockWithRetries: retry #0 of 5, retry sleep time is 5 seconds.
2024-05-11 11:41:09 Verbose  [AgentLifecycleManager:InstallAllAgents] Disabled Agent Update Trigger at 'C:\Agents\Orchestration\AgentLifecycleManagement\AgentUpdateTriggerDirectory_BareMetal_HCI\AgentUpdateTrigger.2024.511.1610.18047.json' on Node 00-HCI-101
2024-05-11 11:41:14 Verbose  [AgentLifecycleManager:InstallAllAgents] Invoke-ScriptBlockWithRetries: retry #0 of 5, retry sleep time is 5 seconds.
2024-05-11 11:41:16 Verbose  [AgentLifecycleManager:InstallAllAgents] Disabled Agent Update Trigger at 'C:\Agents\Orchestration\AgentLifecycleManagement\AgentUpdateTriggerDirectory_BareMetal_HCI\AgentUpdateTrigger.2024.511.1610.18047.json' on Node 00-HCI-102
2024-05-11 11:41:17 Verbose  [AgentLifecycleManager:InstallAllAgents] ECE PowerShellHost Script End
2024-05-11 11:41:17 Verbose  Interface: PowerShell job terminal state is Failed for interface 'InstallAllAgents'.
2024-05-11 11:41:17 Verbose  [PSTask Concurrency] [52>AgentLifecycleManager:InstallAllAgents] Exiting throttling instance. Priority '0', Weight '10'.
2024-05-11 11:41:17 Verbose  [PSTask Concurrency] [52>AgentLifecycleManager:InstallAllAgents] Parallel task count decreased by '10' to '0'.
2024-05-11 11:41:17 Warning  Task: Invocation of interface 'InstallAllAgents' of role 'Cloud\Fabric\AgentLifecycleManager' failed:

Type 'InstallAllAgents' of Role 'AgentLifecycleManager' raised an exception:

ALM was unable to converge on 00-HCI-102. Error reading subsequent results from C:\Agents\Orchestration\AgentLifecycleManagement\AgentInstallationSuccessIndicatorDirectory_BareMetal_HCI\AgentInstallationResults-00-HCI-102.2024.511.1610.18047.json Object reference not set to an instance of an object.
Command                     Arguments
-------                     ---------
Report-AgentFailures        {FailedNodeNames=00-HCI-102 00-HCI-101, RoleName=BareMetal, AgentInternalRoleName=BareMet...
                            {}
                            {}
<ScriptBlock>               {CloudEngine.Configurations.EceInterfaceParameters, AgentLifecycleManager, InstallAllAgen...
Invoke-EceInterfaceInternal {CloudDeploymentModulePath=C:\NugetStore\Microsoft.AzureStack.Solution.Deploy.CloudDeploy...

#(.......)

2024-05-11 11:41:17 Verbose  Action: Action plan reached terminal state due to one of the steps. Finish running all remaining steps that may still be currently in progress before exiting.
2024-05-11 11:41:17 Verbose  Draining all steps that are still in progress. The following steps are still in progress or just completed: '0.31.1'.
2024-05-11 11:41:17 Verbose  Action: Action plan 'SecondPhaseDeploymentBareMetal' failed.
2024-05-11 11:41:17 Error    Action: Invocation of step 0.31.1 failed. Stopping invocation of action plan.
2024-05-11 11:41:17 Verbose  Action: Status of 'SecondPhaseDeploymentBareMetal' is 'Error'.
2024-05-11 11:41:17 Verbose  Task: Status of action 'SecondPhaseDeploymentBareMetal' of role 'Cloud\Fabric\AgentLifecycleManager' is 'Error'.
2024-05-11 11:41:17 Verbose  Step: Status of step '0.31 - Complete the update orchestrator installation' is 'Error'.
2024-05-11 11:41:17 Verbose  Checking if any of the in progress steps are complete. The following steps are currently in progress: '0.31'.
2024-05-11 11:41:17 Verbose  Action: Action plan reached terminal state due to one of the steps. Finish running all remaining steps that may still be currently in progress before exiting.
2024-05-11 11:41:17 Verbose  Draining all steps that are still in progress. The following steps are still in progress or just completed: '0.31'.
2024-05-11 11:41:17 Verbose  Action: Action plan 'FullCloudDeployment' failed.
2024-05-11 11:41:17 Error    Action: Invocation of step 0.31 failed. Stopping invocation of action plan.
2024-05-11 11:41:17 Verbose  Action: Status of 'FullCloudDeployment' is 'Error'.
2024-05-11 11:41:17 Verbose  Task: Status of action 'CustomAction' of role 'Cloud' is 'Error'.
2024-05-11 11:41:17 Verbose  ConditionalTask: Status of action 'CustomAction' of role 'Cloud' is 'Error'.
2024-05-11 11:41:17 Verbose  Step: Status of step '0 - Deploy Azure Stack HCI' is 'Error'.
2024-05-11 11:41:17 Verbose  Checking if any of the in progress steps are complete. The following steps are currently in progress: '0'.
2024-05-11 11:41:17 Verbose  Action: Action plan reached terminal state due to one of the steps. Finish running all remaining steps that may still be currently in progress before exiting.
2024-05-11 11:41:17 Verbose  Draining all steps that are still in progress. The following steps are still in progress or just completed: '0'.
2024-05-11 11:41:17 Verbose  Action: Action plan 'CloudDeployment' failed.
2024-05-11 11:41:17 Error    Action: Invocation of step 0 failed. Stopping invocation of action plan.
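
The exception in the excerpt names a specific results file that ALM could not read. One thing worth checking is whether that file exists, is empty, or holds invalid JSON. A minimal sketch, assuming the path from the log message is reachable from the node where this runs; `Test-AgentResultsFile` is a hypothetical helper name and not part of the HCI tooling:

```powershell
# Sketch: classify the state of an agent installation results file.
function Test-AgentResultsFile {
    param([Parameter(Mandatory)][string]$Path)

    if (-not (Test-Path -LiteralPath $Path)) { return 'Missing' }

    # -Raw returns the whole file as one string ($null for a zero-byte file).
    $raw = Get-Content -LiteralPath $Path -Raw
    if ([string]::IsNullOrWhiteSpace($raw)) { return 'Empty' }

    try {
        $null = $raw | ConvertFrom-Json -ErrorAction Stop
        return 'ValidJson'
    } catch {
        return 'InvalidJson'
    }
}

# Usage, with the path copied from the error message:
#   Test-AgentResultsFile -Path 'C:\Agents\Orchestration\AgentLifecycleManagement\AgentInstallationSuccessIndicatorDirectory_BareMetal_HCI\AgentInstallationResults-00-HCI-102.2024.511.1610.18047.json'
```

This only narrows down what "Error reading subsequent results" might mean; it is a diagnostic step, not a known fix.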

The logs are way too big to put here and I would like to keep this as clean as possible, but if you need more logs, please let me know.
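
Because the full log is large, it can help to share only the Warning and Error entries. A rough sketch, assuming every entry starts with the `date time Level` layout seen in the excerpt; `Select-DeploymentProblem` is a made-up helper name:

```powershell
# Sketch: pass through only log lines whose level column is Warning or Error.
function Select-DeploymentProblem {
    param(
        [Parameter(Mandatory, ValueFromPipeline)]
        [AllowEmptyString()]   # Get-Content emits empty strings for blank lines
        [string]$Line
    )
    process {
        # Anchor on the timestamp so 'Error' inside a Verbose message does not match.
        if ($Line -match '^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\s+(Warning|Error)\s') {
            $Line
        }
    }
}

# Usage on the node, with the file name from the post:
#   Get-Content .\CloudDeployment.2024-05-11.08-45-13.0.log | Select-DeploymentProblem
```

Continuation lines without a timestamp, such as the exception text and the command table, are dropped by this filter, so the surrounding lines still have to be copied by hand when an entry looks relevant.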


I honestly don't have much experience in the Windows world, but I would love to get this thing working.

So can you lend me a hand?

Franco-Sparrow commented 4 months ago

I can confirm your issue, @kCyborg. The same happened in my lab. Waiting for feedback to solve this ASAP, please :)

fareedhayat1 commented 1 week ago

Hi @kCyborg, can you please help me out with the IP configuration in the file 3_postinstallscript? Should I take the ipconfig output of the host VM and enter the values there accordingly, or what should I do? Currently I took the IP of the external switch, entered an IP address within that range, and kept the default gateway the same as the one shown for the external switch in ipconfig on the host VM. But none of the VMs get a response back from the ping ibm.com command. Any help would be really appreciated. @kCyborg, @Franco-Sparrow, @mbrat2005, @bfrankMS
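
One mistake that is easy to rule out when picking a static address "in the range" of the external switch is an address and gateway that do not share a subnet for the chosen prefix length. A minimal sketch for IPv4; the helper names and the sample addresses are hypothetical and are not taken from 3_postinstallscript:

```powershell
# Sketch: check that a static IPv4 address and its gateway are in the same subnet.
function ConvertTo-IPv4Number {
    param([Parameter(Mandatory)][string]$Address)
    $bytes = [System.Net.IPAddress]::Parse($Address).GetAddressBytes()
    if ($bytes.Length -ne 4) { throw "Not an IPv4 address: $Address" }
    [long]$number = 0
    foreach ($b in $bytes) { $number = $number * 256 + $b }   # most significant octet first
    return $number
}

function Test-SameSubnet {
    param(
        [Parameter(Mandatory)][string]$Address,
        [Parameter(Mandatory)][string]$Gateway,
        [Parameter(Mandatory)][ValidateRange(0, 32)][int]$PrefixLength
    )
    # Netmask as a 32-bit value held in a 64-bit integer (4294967295 = 2^32 - 1).
    $mask = ([long]4294967295 -shl (32 - $PrefixLength)) -band [long]4294967295
    $addressNetwork = (ConvertTo-IPv4Number $Address) -band $mask
    $gatewayNetwork = (ConvertTo-IPv4Number $Gateway) -band $mask
    return $addressNetwork -eq $gatewayNetwork
}

# Usage with made-up values:
#   Test-SameSubnet -Address 192.168.1.50 -Gateway 192.168.1.1 -PrefixLength 24
```

A passing check only rules out this one class of mistake. If ping to ibm.com still fails, pinging a public IP address directly can help separate a DNS problem from a routing problem.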