
Cannot deploy AksHci RC on 4-node cluster #96

Closed cloudmadlab closed 2 years ago

cloudmadlab commented 3 years ago

Description of bug I was running the March preview of AksHci on my 4-node cluster for the last month without any issues. I wanted to uninstall the March release and re-install with the April Release Candidate using Windows Admin Center. The installation of the management cluster with the April RC has failed on two occasions.

Steps to repro The first time I attempted to run Uninstall-AksHci, I did so from the PowerShell tool in WAC connected to the first node in my cluster. This failed to connect to the other nodes in the cluster to clean up correctly. I posed the question to the "AKS on AzSHCI Public Community" Teams' General channel for guidance. I was told to follow these manual steps to clean up the prior installation:

1. In the registry, delete:

   HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\AksHciPS
   HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\MocPS
   HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\KvaPS

2. Delete the following folders and subfolders on the host machine:

   C:\AKSHCI
   C:\Program Files\AksHci

3. Under the user who installed AKS-HCI, delete these folders as well: .AksHci, .kube, .Kva, .Moc, .ssh, .wssd

4. Delete all VMs created by AKS-HCI, if any are running.
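For reference, the manual steps above could be scripted. The sketch below is purely illustrative, not an official tool: the home-directory path is a hypothetical example, the `reg delete`/`rmtree` actions are Windows-only and guarded behind a dry-run flag, and (as noted later in the thread) a successful Uninstall-AksHci should make this unnecessary.

```python
import shutil
import subprocess

# Registry keys and folders named in the manual cleanup steps above.
REG_KEYS = [
    r"HKLM\SOFTWARE\Microsoft\AksHciPS",
    r"HKLM\SOFTWARE\Microsoft\MocPS",
    r"HKLM\SOFTWARE\Microsoft\KvaPS",
]
FOLDERS = [r"C:\AKSHCI", r"C:\Program Files\AksHci"]
PROFILE_DIRS = [".AksHci", ".kube", ".Kva", ".Moc", ".ssh", ".wssd"]

def plan_cleanup(home=r"C:\Users\admin", dry_run=True):
    """Build the cleanup action list; only execute it when dry_run=False.

    The default `home` is a hypothetical example profile path.
    """
    actions = []
    for key in REG_KEYS:
        actions.append(("reg-delete", key))
        if not dry_run:
            subprocess.run(["reg", "delete", key, "/f"], check=False)
    # Backslash join keeps Windows-style paths even when sketched elsewhere.
    for folder in FOLDERS + [home + "\\" + d for d in PROFILE_DIRS]:
        actions.append(("rmdir", folder))
        if not dry_run:
            shutil.rmtree(folder, ignore_errors=True)
    return actions

# Dry run: list what would be removed without touching anything.
for kind, target in plan_cleanup():
    print(kind, target)
```

The dry-run default makes it safe to preview the eleven actions (three registry keys, eight folders) before committing to any of them.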

After following these steps, I was able to attempt the installation of the April RC package in WAC, but it failed with the error shown in the following screenshot.

(screenshot)

I was instructed to perform another Uninstall-AksHci. I did this by RDP'ing to the first node in the cluster and executing the command. This time, the Uninstall-AksHci command succeeded. I spot-checked every node and did not find any of the folders above, and the wssdagent and wssdcloudagent services were not found. I also uninstalled all the AksHci, Moc, Kva, and DownloadSdk PowerShell modules from each node.

After running the installation in WAC again, I received the same result as previously.

Environment

Collect log files

I have these logs but am unable to get them attached here. Will try again after submitting the issue.

madhanrm commented 3 years ago

Can you please retry with the GA build?

You don't have to clean up each and every file. Just run Uninstall-AksHci; that should clean up everything.

cloudmadlab commented 3 years ago

I retried with the GA build. This time I used PowerShell instead of WAC.

I believe I am seeing the same issue that I saw when attempting the deployment with WAC. The installation fails during Install-AksHci with "Nodes have not reached Active state." After the "Importing Configuration Completed" step, the installation hangs showing the following on the screen: "Moc - Waiting for cloud nodes to be active..."

I did confirm that the wssdagent and wssdcloudagent were installed on all the nodes in the cluster during the Install-AksHci execution.

(screenshot)

cloudmadlab commented 3 years ago

Here are the exact commands that I ran:

    Import-Module AksHci
    Get-Command -Module AksHci
    Initialize-AksHciNode -Verbose

    New-Item -Path "C:\ClusterStorage\AXNode02\" -Name "AksHci" -ItemType "directory" -Force -Verbose
    New-Item -Path "C:\ClusterStorage\AXNode02\AksHci\" -Name "Images" -ItemType "directory" -Force -Verbose
    New-Item -Path "C:\ClusterStorage\AXNode02\AksHci\" -Name "Config" -ItemType "directory" -Force -Verbose

    $vnet = New-AksHciNetworkSetting -Name "mgmtvnet" -vSwitchName "Management" -gateway "192.168.101.1" -dnsservers "192.168.101.15" -ipaddressprefix "192.168.101.0/24" -k8snodeippoolstart "192.168.101.100" -k8snodeippoolend "192.168.101.135" -vipPoolStart "192.168.101.136" -vipPoolEnd "192.168.101.176"

    Set-AksHciConfig -vnet $vnet -imageDir "C:\ClusterStorage\AXNode02\AksHci\Images" -cloudConfigLocation "C:\ClusterStorage\AXNode02\AksHci\Config" -cloudservicecidr 192.168.101.24/24 -Verbose

    Set-AksHciRegistration -subscriptionId "" -resourceGroupName "AksHciGA-rg" -UseDeviceAuthentication

    Install-AksHci -Verbose

Elektronenvolt commented 3 years ago

I'm installing it on a two-node Windows Server 2019 failover cluster. We fixed another issue by granting Create Computer permissions for the cluster account (accountname$). After this, I get stopped at the same error:

(screenshot)

nwoodmsft commented 3 years ago

One quick helper - instead of adding "-Verbose" to each cmdlet, please try setting the following in your PowerShell session:

$VerbosePreference = "continue"

You will then receive the maximum amount of logging available from not only the primary AksHci module cmdlet, but also any other modules/cmdlets it invokes. This should help to provide more insight into the operations being attempted and will certainly provide more logging.

In this case (for the "Nodes have not reached Active state" error) I believe we are polling for all of your nodes/nodeagents to successfully register themselves to the cloudagent service. When you receive this error, it means that one or more of your nodes did not successfully register with the cloudagent, which could indicate a connectivity issue such as your nodes being unable to communicate with the cloudagent IP (or even a DNS resolution issue, if the FQDN of the cloudagent role is not resolvable from one or more nodes).
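The connectivity check described above can be approximated from each node: the cloudagent FQDN must resolve in DNS and its port (65000 in this thread) must accept TCP connections. A minimal sketch of that two-part check (Test-NetConnection does the same job from PowerShell; the hostnames below are examples):

```python
import socket

def check_endpoint(host, port, timeout=5.0):
    """Return (resolves, connects) for a cloudagent-style endpoint."""
    try:
        ip = socket.gethostbyname(host)  # DNS resolution check
    except socket.gaierror:
        return (False, False)
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return (True, True)  # TCP handshake succeeded
    except OSError:
        return (True, False)  # name resolves, but the port is unreachable

# "localhost" resolves everywhere; a reserved .invalid name never does.
print(check_endpoint("localhost", 9))
print(check_endpoint("ca-someguid.invalid", 65000))
```

A (False, False) result points at DNS, while (True, False) points at routing or firewalling between the node and the cloudagent.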

Logs (Get-AksHciLogs) will certainly be the best way for us to diagnose further as I believe we will need to see logs from the nodeagents on each node.

cloudmadlab commented 3 years ago

Thanks for the tip on the verbose logging. I will definitely do that going forward.

Here are the logs from the current run. If you need me to Uninstall-AksHci and start again with the $VerbosePreference set, just let me know.

I am attempting to upload the compressed file with the logs, but GitHub gives me an "is not included in the list" error. Please advise.

Elektronenvolt commented 3 years ago

@cloudmadlab - I've sent Nick logs from my setup. Maybe it's the same root cause and it covers your issue too. Let's wait for an answer...

cloudmadlab commented 3 years ago

Thanks, @Elektronenvolt.

@nwoodmsft - Please do let me know if you need my logs or anything else from me.

Elektronenvolt commented 3 years ago

@cloudmadlab You don't use a proxy for AKS-HCI, but does the 4-node cluster it's running on have direct Internet access, or do you use a proxy?

I don't have full Internet access and have a proxy configured on the host. There is no issue with Azure login or image download, but wssdagent.log shows:

    Failed to join CloudAgent ... Starting NodeAgent Stand Alone : Login failed with error: rpc error: code = Unavailable desc = connection error

The connection from the host to the installed cloud service ca-someGuid.domain.com:65000 ends up at the proxy, and our proxy rejects it. The proxy config on the host is set to bypass local IPs, so this request should not reach our proxy at all.

Might you have the same issue?

cloudmadlab commented 3 years ago

@Elektronenvolt - No I do not use a proxy. My nodes have direct Internet access.

My ca-someGuid.domain.com:65000 cloud service is not even being created. No VMs get created whatsoever.

Elektronenvolt commented 3 years ago

@cloudmadlab - I first got stopped at a permission issue, which showed up in the Failover Cluster Manager cluster events:

(screenshot)

We solved it by granting Create Computer permissions to the cluster account (clustername$).

cloudmadlab commented 3 years ago

Here's what I did over the weekend to push this forward.

  1. I ran an Uninstall-AksHci to clean up the previous setup. That succeeded without issue.

  2. @Elektronenvolt - Though I didn't see the error you indicated in my Failover Cluster Manager cluster events, I went ahead and granted Create Computer permissions to the cluster account as you suggested.

  3. I re-ran the PowerShell commands that I listed earlier in this bug and added $VerbosePreference = "continue" per your instructions, @nwoodmsft.

  4. I observed that a ca- role is being created in Failover Cluster Manager during the installation. I also confirmed that all the nodes in the cluster can successfully communicate with this CA role by IP and DNS hostname via a Test-NetConnection to port 65000.

  5. I still show Moc waiting for cloud nodes to be active for about 30 minutes, and then the installation fails.

I notice that the CA role gets removed when the installation fails. Get-AksHciLogs tries to access it to grab logs but can't. Also, Get-AksHciLogs tries to access the Config\log folder, which also gets deleted after the installation fails. I'm hoping the logs that are collected are enough to give us the necessary clues.

@nwoodmsft - Since I can't seem to upload logs to this bug, please provide the instructions for shipping these logs to you.

Elektronenvolt commented 3 years ago

@cloudmadlab - To get all logs, you must run Get-AksHciLogs in a second PowerShell window before the 30-minute timeout.

I have a working installation since today, behind a proxy. During all the failed setups I noticed that a Test-NetConnection to the CA on port 65000 always worked. I'm running behind a proxy, and Nick wrote to me that the cloudagent and nodeagents use proxy values from environment variables; I needed to set them with Set-DownloadSdkProxy. Test-NetConnection uses different host proxy settings than the AKS-HCI setup does.

My last issue today showed up in akshcilogs\moc\cloudlogs\wssdcloudagent.log. Requests to the hostnames ended up at the proxy too:

    The following error was encountered while trying to retrieve the URL: hostname:45000

I fixed it by setting the proxy on the two host machines with:

    Set-DownloadSdkProxy -Scope Machine -Http http://proxyserver:port -Https http://proxyserver:port -NoProxy "localhost,127.0.0.1,.svc,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16,hostname1,hostname2,.ourdomain.com"
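The -NoProxy list matters because bypass matching is done per entry: exact hostnames, leading-dot domain suffixes, and CIDR ranges. As a rough, simplified model of that matching logic (this is not the exact algorithm the Download SDK uses, and CIDR entries are skipped in this sketch):

```python
def bypasses_proxy(host, no_proxy):
    """Rough model of a NO_PROXY-style bypass list: exact host entries,
    or domain-suffix entries starting with a dot. CIDR blocks are ignored."""
    for entry in (e.strip() for e in no_proxy.split(",")):
        if not entry or "/" in entry:  # skip empty and CIDR entries here
            continue
        if entry.startswith("."):
            # ".ourdomain.com" matches any host under that domain
            if host.endswith(entry) or host == entry[1:]:
                return True
        elif host == entry:
            return True
    return False

NO_PROXY = ("localhost,127.0.0.1,.svc,10.0.0.0/8,172.16.0.0/12,"
            "192.168.0.0/16,hostname1,hostname2,.ourdomain.com")
print(bypasses_proxy("ca-someguid.ourdomain.com", NO_PROXY))  # True: suffix match
print(bypasses_proxy("example.com", NO_PROXY))                # False: goes via proxy
```

This illustrates why adding .ourdomain.com to the list keeps requests to the cloudagent FQDN away from the proxy.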

You don't use a proxy, so you may be running into a firewall drop instead. My firewall settings: port requirements

cloudmadlab commented 3 years ago

Thanks, @Elektronenvolt. I now have all the logs collected during the installation prior to timeout.

I have no firewalls in the infrastructure. My hosts are on the same VLAN as my VMs, so it's likely not a networking issue.

I've provided access to a OneDrive to some engineering folks for review.

madhanrm commented 3 years ago

@cloudmadlab I see 2 nodes (01, 03) that successfully joined the cloud. Looking at the logs to understand what's going on.

madhanrm commented 3 years ago

@cloudmadlab Are all the nodes synced up in time? I see the below in the logs of the node 02 and 04 nodeagents:

    transport: authentication handshake failed: x509: certificate has expired or is not yet valid: current time 2021-06-01T10:56:10-05:00 is before 2021-06-01T15:57:10Z
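For anyone hitting this later: the log line above is the classic clock-skew signature. Converting the node's local time to UTC (10:56:10-05:00 is 15:56:10Z) shows it is exactly one minute before the certificate's notBefore time (15:57:10Z), so the x509 validity check fails. A small sketch of that comparison, using the timestamps from the log:

```python
from datetime import datetime, timezone, timedelta

# Timestamps copied from the nodeagent error above.
current = datetime.fromisoformat("2021-06-01T10:56:10-05:00")
not_before = datetime(2021, 6, 1, 15, 57, 10, tzinfo=timezone.utc)

skew = not_before - current.astimezone(timezone.utc)
valid = current >= not_before  # a cert is only valid from notBefore onward
print(f"node clock is {skew} behind the cert's notBefore; valid={valid}")
```

Even a one-minute skew between a node and the machine that issued the certificate is enough to break the TLS handshake, which is why time sync across cluster nodes matters here.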
cloudmadlab commented 3 years ago

@madhanrm - I am checking on my time sync right now and will report back shortly. Thank you.

cloudmadlab commented 3 years ago

@madhanrm - You were absolutely correct. This was due to a time sync issue in my lab. Thank you for the very prompt response and resolution.

@Elektronenvolt - Thanks for your guidance along the way.

One additional question. After deploying with PowerShell, it appears WAC will not see AKS installed on the Azure Stack HCI cluster. Is this correct? I can check the other GitHub issues to research this further, too.

madhanrm commented 3 years ago

WAC should be able to see the installed AKS cluster, provided the same user is used to log in.

cloudmadlab commented 3 years ago

I deployed the cluster with the same account I use to authenticate to WAC. I would think that I would at least see the management cluster in the Azure Kubernetes Service compute tool in WAC, but I still see the following.

(screenshot)

abhilashaagarwala commented 3 years ago

Hey Michael, can you please try the latest build - https://github.com/Azure/aks-hci/releases/tag/AKS-HCI-2106 and let us know if this issue persists?

cloudmadlab commented 3 years ago

Hi @abhilashaagarwala - I updated my AKS-HCI cluster with the June update on Friday, and WAC is still not able to see it. I would prefer not to fully re-deploy with a new service host because we are doing a lot of testing right now with a couple different workload clusters. Am I correct in assuming that WAC should be able to see a GA cluster that was deployed with PowerShell?

abhilashaagarwala commented 3 years ago

Yes, WAC should be able to! Tagging @madhanrm and @mattatmsft to further help you here!

cloudmadlab commented 3 years ago

To the point made by @madhanrm earlier: "WAC should be able to see the installed AKS cluster, provided same user is used to login." I'm not sure what you mean here. I only use one account for everything in my environment, and it's a domain admin. I deployed the AKS-HCI service host with it and log in to WAC with it.

mkostersitz commented 3 years ago

@cloudmadlab Hello, I assume this issue is resolved. If so please close it.

sushmadharm commented 3 years ago

@cloudmadlab If you deployed AKS-HCI with PowerShell, did you specify the -workingDir parameter? If you did not, then the config files are not guaranteed to be present on every node, and in that case WAC is unable to see the deployment.

BartRoels commented 3 years ago

Just checking: is this issue still relevant with the latest AksHci version? I'm having a similar issue, also on a 4-node HCI cluster.

sushmadharm commented 3 years ago

@BartRoels If you deployed AKS-HCI with PowerShell and are then using WAC to see your cluster - yes, the issue is still relevant. Did you specify the -workingDir parameter while using PowerShell? If you did not, then the config files are not guaranteed to be present on every node, and in that case WAC is unable to see the deployment.

Elektronenvolt commented 3 years ago

I have the same issue with the latest AKS-HCI-2108 release on a multi-node cluster. It was installed via PowerShell, and if I try to connect from WAC I see:

(screenshot)

But it works on my single-node setup:

(screenshot)

sushmadharm commented 3 years ago

@Elektronenvolt please see my response above. ^^

madhanrm commented 3 years ago

What does the following return?

    (Get-MocConfig).WorkingDir

It should point to a CSV path for multinode.

Elektronenvolt commented 3 years ago

It points to C:\AksHci on both. The single-node setup has the image store at C:\AksHci\AksHciImageStore. Is that the reason why it works there?

madhanrm commented 3 years ago

In multinode, by keeping workingDir on a CSV, all the nodes can see the AksHci configuration and can recognize the installation state. If we keep it in local storage on one of the nodes, only that node can see it. For WAC to work well, it can go to any node and check the status of the AksHci installation, not necessarily the node that the admin kickstarted the installation on.
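In other words, a working directory is only visible cluster-wide when it lives under the Cluster Shared Volumes mount point, C:\ClusterStorage. A hypothetical helper expressing that check (a sketch, not how WAC actually detects it):

```python
from pathlib import PureWindowsPath

def is_csv_path(path):
    """Heuristic: CSV volumes surface under C:\\ClusterStorage\\... on every node."""
    parts = PureWindowsPath(path).parts
    return (len(parts) >= 2
            and parts[0].upper() == "C:\\"
            and parts[1].lower() == "clusterstorage")

print(is_csv_path(r"C:\ClusterStorage\Volume1\ImageStore"))  # True: cluster-wide
print(is_csv_path(r"C:\AksHci"))                             # False: node-local
```

The two example paths mirror the two WorkingDir values reported in this thread: the fresh multinode install on a CSV works with WAC, while the node-local C:\AksHci does not.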

madhanrm commented 3 years ago

From the Aug release, we have mandated that -workingDir be on a CSV for multinode fresh installs. Let me see if we can provide a way to modify or migrate an existing config from a local path to a CSV path.

Elektronenvolt commented 3 years ago

Aaah.. I didn't do a fresh install with the August release on the multi node setup, only an update. I'll re-create the cluster and check again.

Elektronenvolt commented 3 years ago

@madhanrm - After a fresh August setup, (Get-MocConfig).WorkingDir returns c:\ClusterStorage\Volume1\ImageStore. I see the AKS-HCI VMs, but the AKS extension doesn't find a cluster in WAC:

(screenshot)

madhanrm commented 3 years ago

Are you able to try the PS cmdlets, like Get-AksHciCluster, from the other HCI nodes?

Elektronenvolt commented 3 years ago

Yes, this works:

PS C:\Windows\system32> Get-AksHciCluster
ProvisioningState     : provisioned
KubernetesVersion     : v1.20.7-kvapkg.0
NodePools             : {linuxnodepool, windowsnodepool}
WindowsNodeCount      : 20
LinuxNodeCount        : 3
ControlPlaneNodeCount : 3
Name                  : dev2

It's a WS2019 failover cluster with 2 nodes.

madhanrm commented 3 years ago

@sushmadharm any ideas here?

sushmadharm commented 3 years ago

@madhanrm We detect an existing setup via Get-AksHciConfig. @Elektronenvolt ... does Get-AksHciConfig work from both nodes?

Elektronenvolt commented 3 years ago

@sushmadharm - Yes, it works from both nodes. And the cluster shows up in WAC as well now. It took a while; it's now around two hours since the initial setup was done.

(screenshot)

madhanrm commented 3 years ago

Thanks @Elektronenvolt. I'll try to provide a script to migrate the workingDir from a local path to CSV without redeploying.

Elektronenvolt commented 3 years ago

@madhanrm Thanks a lot! I don't need it anymore; both of my clusters have already been re-created. The nodepool / -enableMonitoring issue didn't get fixed with a cluster update either. But maybe someone else needs to fix an existing installation.

madhanrm commented 3 years ago

Can we close this bug and open a new one if installing AksHci still fails on a 4-node cluster?

abhilashaagarwala commented 2 years ago

Closing this bug due to inactivity; it's now a year old.