Can you please retry with the GA build?
You don't have to clean up each and every file. Just run Uninstall-AksHci; that should clean up everything.
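For reference, that cleanup is a single cmdlet run from one of the nodes (module and cmdlet names as used elsewhere in this thread):

```powershell
# Removes the AKS-HCI deployment and its artifacts across the cluster
Import-Module AksHci
Uninstall-AksHci
```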
I retried with the GA build. This time I used PowerShell instead of WAC.
I believe I am seeing the same issue that I saw when attempting the deployment with WAC. The installation fails during Install-AksHci with "Nodes have not reached Active state." After the "Importing Configuration Completed" step, the installation hangs showing the following on the screen: "Moc - Waiting for cloud nodes to be active..."
I did confirm that the wssdagent and wssdcloudagent were installed on all the nodes in the cluster during the Install-AksHci execution.
Here are the exact commands that I ran:
Import-Module AksHci
Get-Command -Module AksHci
Initialize-AksHciNode -Verbose
New-Item -Path "C:\ClusterStorage\AXNode02\" -Name "AksHci" -ItemType "directory" -Force -Verbose
New-Item -Path "C:\ClusterStorage\AXNode02\AksHci\" -Name "Images" -ItemType "directory" -Force -Verbose
New-Item -Path "C:\ClusterStorage\AXNode02\AksHci\" -Name "Config" -ItemType "directory" -Force -Verbose
$vnet = New-AksHciNetworkSetting -Name "mgmtvnet" -vSwitchName "Management" -gateway "192.168.101.1" -dnsservers "192.168.101.15" -ipaddressprefix "192.168.101.0/24" -k8snodeippoolstart "192.168.101.100" -k8snodeippoolend "192.168.101.135" -vipPoolStart "192.168.101.136" -vipPoolEnd "192.168.101.176"
Set-AksHciConfig -vnet $vnet -imageDir "C:\ClusterStorage\AXNode02\AksHci\Images" -cloudConfigLocation "C:\ClusterStorage\AXNode02\AksHci\Config" -cloudservicecidr 192.168.101.24/24 -Verbose
Set-AksHciRegistration -subscriptionId "
Install-AksHci -Verbose
I'm installing it on a 2-node Windows Server 2019 failover cluster.
We fixed another issue by granting Create Computer permissions for the cluster account accountname$. After this, I get stopped at the same error:
One quick helper - instead of adding "-Verbose" to each cmdlet, please try setting the following in your PowerShell session:
$VerbosePreference = "continue"
You will then receive the maximum amount of logging available from not only the primary AksHci module cmdlet, but also any other modules/cmdlets it invokes. This should help to provide more insight into the operations being attempted and will certainly provide more logging.
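As a minimal sketch of what that session setup looks like (cmdlet names as used in this thread):

```powershell
# Make all subsequent cmdlets in this session emit verbose output
$VerbosePreference = "Continue"

# No -Verbose needed anymore; AksHci and the modules it calls inherit the preference
Install-AksHci
```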
In this case (for the "Nodes have not reached Active state" error) I believe we are polling for all of your nodes/nodeagents to successfully register themselves with the cloudagent service. When you receive this error, it means that one or more of your nodes did not successfully register with the cloudagent, which could indicate a connectivity issue such as your nodes being unable to communicate with the cloudagent IP (or even a DNS resolution issue, if the FQDN of the cloudagent role is not resolvable from one or more nodes).
Logs (Get-AksHciLogs) will certainly be the best way for us to diagnose further as I believe we will need to see logs from the nodeagents on each node.
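A rough pre-check of the connectivity and DNS resolution described above, run from each node (the cloudagent FQDN below is a placeholder; port 65000 is the one mentioned later in this thread; note that a passing Test-NetConnection only rules out basic reachability, not proxy interference):

```powershell
# Placeholder FQDN - substitute the ca-<guid> name of your cloudagent role
$cloudAgentFqdn = "ca-someGuid.domain.com"

# Every node must be able to resolve the cloudagent's name...
Resolve-DnsName $cloudAgentFqdn

# ...and reach its endpoint on the cloudagent port
Test-NetConnection $cloudAgentFqdn -Port 65000
```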
Thanks for the tip on the verbose logging. I will definitely do that going forward.
Here are the logs from the current run. If you need me to Uninstall-AksHci and start again with the $VerbosePreference set, just let me know.
I am attempting to upload the compressed file with the logs, but GitHub gives me an "is not included in the list" error. Please advise.
@cloudmadlab - I've sent Nick logs from my setup. Maybe it's the same root cause and it covers your issue too. Let's wait for an answer...
Thanks, @Elektronenvolt.
@nwoodmsft - Please do let me know if you need my logs or anything else from me.
@cloudmadlab you don't use a proxy for AKS-HCI, but does the 4-node cluster it's running on have direct Internet access, or do you use a proxy?
I don't have full Internet access and have a proxy configured on the host. There is no issue with az login or image download, but the wssdagent.log shows:
Failed to join CloudAgent ... Starting NodeAgent Stand Alone : Login failed with error: rpc error: code = Unavailable desc = connection error
The connection from the host to the installed cloud service ca-someGuid.domain.com:65000 ends up at the proxy, and our proxy rejects it. The proxy config on the host is set to bypass local IPs, so this request should not reach our proxy at all.
Maybe you have the same issue?
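A quick way to see which proxy settings are in play on a host (a hypothetical check; the agents may honor environment variables rather than the WinHTTP settings):

```powershell
# System-wide WinHTTP proxy and its bypass list
netsh winhttp show proxy

# Proxy-related environment variables, which gRPC-based agents typically read
Get-ChildItem Env: | Where-Object { $_.Name -match 'proxy' }
```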
@Elektronenvolt - No I do not use a proxy. My nodes have direct Internet access.
My ca-someGuid.domain.com:65000 cloud service is not even being created. No VMs get created whatsoever.
@cloudmadlab - I first got stopped at a permission issue (it showed up in the Failover Cluster Manager under Cluster Events). We solved that by granting Create Computer permissions for the cluster account clustername$.
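For anyone hitting the same thing: granting that permission can be done with dsacls against the OU containing the cluster objects (the OU path and domain below are placeholders):

```powershell
# Grant the cluster name object (CNO) the right to create computer objects in this OU
dsacls "OU=HciCluster,DC=contoso,DC=com" /G "CONTOSO\clustername$:CC;computer"
```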
Here's what I did over the weekend to push this forward.
I ran an Uninstall-AksHci to clean up the previous setup. That succeeded without issue.
@Elektronenvolt - Though I didn't see the error you indicated in my Failover Cluster Manager cluster events, I went ahead and granted Create Computer permissions to the cluster account as you suggested.
I re-ran the PowerShell commands that I listed earlier in this bug and added $VerbosePreference = "continue" per your instructions, @nwoodmsft.
I observed that a ca-
I still see "Moc - Waiting for cloud nodes to be active..." for about 30 minutes, and then the installation fails.
I notice that the CA role gets removed when the installation fails; Get-AksHciLogs tries to access it to grab logs but can't. Get-AksHciLogs also tries to access the Config\log folder, which likewise gets deleted after the installation fails. I'm hoping the logs that were collected are enough to give us the necessary clues.
@nwoodmsft - Since I can't seem to upload logs to this bug, please provide the instructions for shipping these logs to you.
@cloudmadlab - to get all logs, you must run Get-AksHciLogs in a second PowerShell window before this 30-minute timeout.
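That is, something like this in a separate elevated session while the install is still polling:

```powershell
# Run while Install-AksHci is still in "Waiting for cloud nodes to be active..."
Import-Module AksHci
Get-AksHciLogs
```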
I have a working installation as of today, behind a proxy. During all the failed setups I noticed that a Test-NetConnection to the CA on port 65000 always worked. I'm running behind a proxy, and Nick wrote me that the cloudagent and nodeagents use proxy values from environment variables; I needed to set them with Set-DownloadSdkProxy. Test-NetConnection uses different proxy settings on the host than the AKS-HCI setup does.
My last issue today showed up in the akshcilogs\moc\cloudlogs\wssdcloudagent.log.
Requests to the hostnames ended up in the proxy too:
The following error was encountered while trying to retrieve the URL: hostname:45000
I fixed it by setting the proxy on the two host machines with:
Set-DownloadSdkProxy -Scope Machine -Http http://proxyserver:port -Https http://proxyserver:port -NoProxy "localhost,127.0.0.1,.svc,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16,hostname1,hostname2,.ourdomain.com"
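One way to double-check that the values actually landed at machine scope (a standard .NET API, nothing AKS-HCI specific; this assumes Set-DownloadSdkProxy writes the conventional HTTPS_PROXY / NO_PROXY variables):

```powershell
# Should echo back the proxy and bypass list set above
[Environment]::GetEnvironmentVariable("HTTPS_PROXY", "Machine")
[Environment]::GetEnvironmentVariable("NO_PROXY", "Machine")
```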
You don't use a proxy, so maybe you're running into a firewall drop now. My firewall settings follow the documented port requirements.
Thanks, @Elektronenvolt. I now have all the logs collected during the installation prior to timeout.
I have no firewalls in the infrastructure. My hosts are on the same VLAN as my VMs, so it's likely not a networking issue.
I've provided access to a OneDrive to some engineering folks for review.
@cloudmadlab I see 2 nodes (01, 03) successfully joined the cloud. Looking at the logs to understand what's going on.
@cloudmadlab are all the nodes synced up in time? I see the below in the logs of the 02 and 04 nodeagents:
transport: authentication handshake failed: x509: certificate has expired or is not yet valid: current time 2021-06-01T10:56:10-05:00 is before 2021-06-01T15:57:10Z
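A quick sketch for checking clock skew across the nodes (node names below are placeholders):

```powershell
# Certificates are only valid within their notBefore/notAfter window,
# so all node clocks must agree; compare them side by side
Invoke-Command -ComputerName AXNode01, AXNode02, AXNode03, AXNode04 -ScriptBlock { Get-Date }

# And check each node's Windows Time service sync source and status
w32tm /query /status
```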
@madhanrm - I am checking on my time sync right now and will report back shortly. Thank you.
@madhanrm - You were absolutely correct. This was due to a time sync issue in my lab. Thank you for the very prompt response and resolution.
@Elektronenvolt - Thanks for your guidance along the way.
One additional question. After deploying with PowerShell, it appears WAC will not see AKS installed on the Azure Stack HCI cluster. Is this correct? I can check the other GitHub issues to research this further, too.
WAC should be able to see the installed AKS cluster, provided the same user is used to log in.
I deployed the cluster with the same account I use to authenticate to WAC. I would think that I would at least see the management cluster in the Azure Kubernetes Service compute tool in WAC, but I still see the following.
Hey Michael, can you please try the latest build - https://github.com/Azure/aks-hci/releases/tag/AKS-HCI-2106 and let us know if this issue persists?
Hi @abhilashaagarwala - I updated my AKS-HCI cluster with the June update on Friday, and WAC is still not able to see it. I would prefer not to fully re-deploy with a new service host because we are doing a lot of testing right now with a couple different workload clusters. Am I correct in assuming that WAC should be able to see a GA cluster that was deployed with PowerShell?
Yes, WAC should be able to! Tagging @madhanrm and @mattatmsft to further help you here!
To the point made by @madhanrm earlier, "WAC should be able to see the installed AKS cluster, provided the same user is used to log in" - I'm not sure what you mean here. I only use one account for everything in my environment, and it's a domain admin. I deployed the AKS-HCI service host with it and log in to WAC with it.
@cloudmadlab Hello, I assume this issue is resolved. If so, please close it.
@cloudmadlab If you deployed AKS-HCI with powershell, did you specify the -workingDir parameter? If you did not, then the config files are not guaranteed to be present on every node and in this case, WAC is unable to see the deployment.
Just checking: is this issue still relevant with the latest AKSHCI version? I'm having a similar issue on a 4-node HCI cluster.
@BartRoels If you deployed AKS-HCI with powershell and then using WAC to see your cluster - yes, the issue is still relevant. Did you specify the -workingDir parameter while using powershell? If you did not, then the config files are not guaranteed to be present on every node and in this case, WAC is unable to see the deployment.
I have the same issue with the latest AKS-HCI-2108 release on a multi-node cluster. Installed by PowerShell, and if I try to connect from WAC I see:
But it works on my single-node setup:
@Elektronenvolt please see my response above. ^^
What does the below return?
(Get-MocConfig).WorkingDir
This should point to a CSV path for multi-node.
It points to C:\AksHci on both. The single-node setup has the image store at C:\AksHci\AksHciImageStore. Is this the reason why it works there?
In multi-node, by keeping workingDir on a CSV, all the nodes can see the AksHci configuration and can recognize the installation state. If we keep it in local storage on one of the nodes, only that node can see it. For WAC to work well, it should be able to go to any node and check the status of the AksHci installation, not necessarily the node that the admin kickstarted the installation on.
From the August release, we have mandated that -workingDir be on a CSV for multi-node fresh installs. Let me see if we can provide a way to modify or migrate the existing config from a local path to a CSV path.
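For a fresh multi-node install, that means passing a CSV path explicitly, along these lines (the paths are examples; the other parameters follow the ones used earlier in this thread):

```powershell
# Keep the working directory on a Cluster Shared Volume so every node can read the config
Set-AksHciConfig -vnet $vnet `
    -workingDir "C:\ClusterStorage\Volume1\AksHci\WorkingDir" `
    -imageDir "C:\ClusterStorage\Volume1\AksHci\Images" `
    -cloudConfigLocation "C:\ClusterStorage\Volume1\AksHci\Config"
```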
Aaah... I didn't do a fresh install with the August release on the multi-node setup, only an update. I'll re-create the cluster and check again.
@madhanrm - after a fresh August setup I get c:\ClusterStorage\Volume1\ImageStore for (Get-MocConfig).WorkingDir.
I see the AKS-HCI VMs, but the AKS extension doesn't find a cluster in WAC.
Are you able to try the PS cmdlets like Get-AksHciCluster from the other HCI nodes?
Yes, this works:
PS C:\Windows\system32> Get-AksHciCluster
ProvisioningState : provisioned
KubernetesVersion : v1.20.7-kvapkg.0
NodePools : {linuxnodepool, windowsnodepool}
WindowsNodeCount : 20
LinuxNodeCount : 3
ControlPlaneNodeCount : 3
Name : dev2
It's a WS2019 failover cluster with 2 nodes.
@sushmadharm any ideas here?
@madhanrm we detect an existing setup via Get-AksHciConfig. @Elektronenvolt, does Get-AksHciConfig work from both nodes?
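E.g., a quick check from one session (node names below are placeholders):

```powershell
# Both nodes should return the configuration rather than an error
Invoke-Command -ComputerName Node01, Node02 -ScriptBlock {
    Import-Module AksHci
    Get-AksHciConfig
}
```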
@sushmadharm - yes, it works from both nodes. And the cluster shows up in WAC as well now. It took a while; it's now around two hours after the initial setup was done.
Thanks @Elektronenvolt. I'll try to provide a script to migrate the workingDir from a local path to CSV, without redeploying.
@madhanrm Thanks a lot! I don't need it anymore; both clusters I have had already been re-created.
The nodepool / -enableMonitoring issue didn't get fixed with a cluster update either. But maybe someone else needs to fix an existing installation.
Can we close this bug and open a new one if installing AKSHCI still fails on a 4-node cluster?
Closing this bug due to inactivity; it's now a year old.
Description of bug
I was running the March preview of AksHci on my 4-node cluster for the last month without any issues. I wanted to uninstall the March release and re-install with the April Release Candidate using Windows Admin Center. The installation of the management cluster with the April RC has failed on two occasions.
Steps to repro
The first time I attempted to run Uninstall-AksHci, I did so from the PowerShell tool in WAC connected to the first node in my cluster. This failed to connect to the other nodes in the cluster to correctly clean up. I posed the question to the "AKS on AzSHCI Public Community" Teams' General channel for guidance. I was told to follow these manual steps to clean up the prior installation:
1. In the registry, delete:
   HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\AksHciPS
   HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\MocPS
   HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\KvaPS
2. Delete these folders:
   C:\AKSHCI
   C:\Program Files\AksHci
3. Under the user who installed AKS-HCI there are a few folders to delete as well:
   .AksHci, .kube, .Kva, .Moc, .ssh, .wssd
4. Delete all VMs created by AKS-HCI if any are running.
After following these steps, I was able to attempt the installation of the April RC package in WAC, but it failed with the error shown in the following screenshot.
I was instructed to perform another Uninstall-AksHci. I did this by RDP'ing to the first node in the cluster and executing the command. This time, the Uninstall-AksHci command succeeded. I spot-checked every node and did not find any of the folders above, and the wssdagent and wssdcloudagent services were not found. I also uninstalled all the AksHci, Moc, Kva, and DownloadSdk PowerShell modules from each node.
After running the installation in WAC again, I received the same result as previously.
Environment
Collect log files
I have these logs but am unable to get them attached here. Will try again after submitting the issue.