raghavendra-nataraj opened this issue 2 years ago
Should this be deleted on all nodes?
Only on the node seeing the certificate expiry issue.
The contents of the file $env:UserProfile\.wssd\kvactl\cloudconfig are in this format:
Do you mean that we have to delete the block <SOME-CERTIFICATE-DATA>? I tried this solution, and the above error is shown:
There is no other certificate file at this location.
Can you please delete the entire file and try again?
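For reference, a minimal PowerShell sketch of that step (same path as above; run it only on the node showing the certificate expiry error):
$cloudconfig = "$env:UserProfile\.wssd\kvactl\cloudconfig"
if (Test-Path $cloudconfig) { Remove-Item $cloudconfig -Force }   # delete the expired certificate file, then rerun Update-AksHci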
OK, so I deleted the "cloudconfig" file. After Update-AksHci, a new "cloudconfig" file is generated, and the below error is shown:
Looking at the other error in the bug you posted, it looks like the KMS is down because it crossed the 60-day expiry. Can you verify that by running the command below?
kubectl --kubeconfig (Get-KvaConfig).Kubeconfig get secrets -A
If the command does not work, it could be that the KMS is down. Can you try this workaround?
Yes, I have tried this workaround, but on step 3, as you can see in the screenshot, the kms-plugin container has status Exited, so I cannot start it. Many other containers are in the same status. This happened after 60 days.
I am thinking of running Restart-AksHci, but I think this will also remove the workload cluster, which is working fine. Am I correct?
Yes, it would remove the workload cluster. We can try to understand what is causing the KMS plugin to crash by looking at the logs.
$ ssh -i (Get-AksHciConfig).Moc.sshPrivateKey clouduser@<ip> "sudo docker container ls | grep 'kms'"
The output should have the below fields
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
The output should look something like this
4169c10c9712 f111078dbbe1 "/kms-plugin" 22 minutes ago Up 22 minutes k8s_kms-plugin_kms-plugin-moc-lll6bm1plaa_kube-system_d4544572743e024431874e6ba7a9203b_1
We can get the logs with
$ ssh -i (Get-AksHciConfig).Moc.sshPrivateKey clouduser@<ip> "sudo docker container logs <Container ID>"
Hello,
Because the container runtime is containerd, not Docker, the right command is: crictl stop <Container ID> && crictl start <Container ID>
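For reference, the containerd equivalents of the docker commands above would look something like this (crictl ps -a and crictl logs are standard crictl subcommands; the SSH invocation mirrors the one used earlier):
$ ssh -i (Get-AksHciConfig).Moc.sshPrivateKey clouduser@<ip> "sudo crictl ps -a | grep 'kms'"
$ ssh -i (Get-AksHciConfig).Moc.sshPrivateKey clouduser@<ip> "sudo crictl logs <Container ID>"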
The issue now is resolved.
Now the problem is that it cannot communicate with the workload cluster:
Also, when trying to run Update-AksHci, the error message is:
This message has been appearing for about 12 hours, so I assume that a process has gotten stuck.
The workload cluster is working fine; I checked it with kubectl cluster-info, and all the workloads and the etcd leader are up.
Should I open a new issue, or can you suggest what to look at for troubleshooting?
So is the management cluster API server accessible? Can you try running this command:
kubectl --kubeconfig (Get-KvaConfig).Kubeconfig get nodes
Yes, the management cluster API server is accessible. I have attached the commands that were run for both the management and workload clusters:
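For reference, the same check against the workload cluster can be done after pulling its kubeconfig with Get-AksHciCredential (the cluster name below is a placeholder):
Get-AksHciCredential -name <workload-cluster-name>   # retrieves the workload cluster kubeconfig
kubectl get nodes
kubectl cluster-info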
Hi @gianniskt, I was in the mud like you with another issue (148). With a bit of help from Zach I got through it with the following steps:
kubectl get machines -A --kubeconfig (Get-AksHciConfig).Kva.kubeconfig
showed me the old management cluster VM stuck in a deleting state, with kube-apiserver and etcd in a crash loop.
I did a manual shutdown of the old VM in Hyper-V. After that it was deleted automatically, but I still had the new management cluster VM; Get-AksHciVersion showed me the old version 1.0.4, and Update-AksHci gave the same error message, 'update in progress'.
Zach asked me to set the state manually to "UpdateFailed" to overcome this exception, so I modified AksHci.psm1 accordingly.
Running Update-AksHci again then successfully finished the update (without creating a new management cluster VM).
@raghavendra-nataraj, maybe this helps here as well to finalize the update?
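For readers following along: the exact edit to AksHci.psm1 is not shown in the thread. Purely as an illustration of the idea (the function name and return value below are assumptions, not the real module internals), the change amounts to making the module report a failed rather than in-progress update so that Update-AksHci is allowed to run again:
# Hypothetical illustration only; not the actual AksHci.psm1 code
function Get-InstallState {
    return "UpdateFailed"   # forces the 'update in progress' guard to let Update-AksHci retry
}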
Hello @Elektronenvolt
The error for my case is different:
As I understand it, it is trying to provision a new management control-plane VM and is stuck in provisioning. I cannot see on the hypervisor any new VM being created, or any job related to this VM name.
I also tried your workaround, but the result of Update-AksHci is the same: Update is already in progress.
@gianniskt I see; I thought you were hanging on tearing down the old VM.
Thanks @gianniskt, we may need logs to debug this issue further. We are working on the process for sharing logs. @Elektronenvolt, thanks for the steps. It looks like the issue here is that the VM is not getting created, so it may not be the right workaround here.
@raghavendra-nataraj Do you need the log zip file generated by the command Get-AksHciLogs?
The file is almost 2.5 GB. Is there maybe a specific folder of the log file that you need?
Yes, I would need that. We are looking at the process to get it. Meanwhile, can you check the caph logs in the folder "\Kva\clusterlogs\caph-system\caph-controller-manager-*\logs"?
Can you also run kubectl get event --kubeconfig (Get-AksHciConfig).Kva.kubeconfig and paste the output?
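For reference, once the Get-AksHciLogs zip is extracted, those caph logs can be dumped with something like the following (the extraction path C:\Temp\AksHciLogs is only an example):
Get-ChildItem "C:\Temp\AksHciLogs\Kva\clusterlogs\caph-system\caph-controller-manager-*\logs" -Recurse -File |
    ForEach-Object { Write-Host "=== $($_.FullName) ==="; Get-Content $_.FullName -Tail 100 }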
kubectl get event --kubeconfig (Get-AksHciConfig).Kva.kubeconfig -A
You can find attached the output of the above command.
As far as I can see, the image file needed for provisioning the management control-plane VM does not exist.
Does this image exist: C:\ClusterStorage\Volume1\AksHCI\ImageStore\dfa6f14521777d5\Linux_k8s_1-21-2.vhdx?
Can you get the output of the cmdlet Get-MocGalleryImage -Location MocLocation?
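A quick way to check for that file from PowerShell, using the path quoted above:
Test-Path "C:\ClusterStorage\Volume1\AksHCI\ImageStore\dfa6f14521777d5\Linux_k8s_1-21-2.vhdx"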
The output of the command is:
However, I cannot see the image file at the path C:\ClusterStorage\Volume1\AksHCI\ImageStore\dfa6f14521777d5\Linux_k8s_1-21-2.vhdx.
This is the image that exists in the folder:
Not sure how that image went missing, but let's try to restore it so the upgrade can continue. Can you check whether Linux_k8s_1-21-2.vhdx is in the parent directory?
If so, can you copy it here: C:\ClusterStorage\Volume1\AksHCI\ImageStore\dfa6f14521777d5\Linux_k8s_1-21-2.vhdx
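For reference, a PowerShell sketch of that search-and-copy (the ImageStore root is taken from the path above; adjust it if your install uses a different volume):
$found = Get-ChildItem "C:\ClusterStorage\Volume1\AksHCI\ImageStore" -Recurse -Filter "Linux_k8s_1-21-2.vhdx" | Select-Object -First 1
if ($found) { Copy-Item $found.FullName "C:\ClusterStorage\Volume1\AksHCI\ImageStore\dfa6f14521777d5\Linux_k8s_1-21-2.vhdx" }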
Yes, that was my first action after checking this error, but I cannot find this file anywhere.
Okay, let's try the below then:
$VerbosePreference = "continue"
Remove-MocGalleryImage -location MocLocation -Name Linux_k8s_1-21-2
ipmo kva.psm1
Add-KvaGalleryImage -kubernetesVersion v1.21.2
If this succeeds, you should see the upgrade progressing and new VMs getting created. The upgrade should be done in about 15 minutes.
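To watch the new VMs come up, the machine list used earlier in the thread can be polled (--watch is a standard kubectl flag):
kubectl get machines -A --kubeconfig (Get-AksHciConfig).Kva.kubeconfig --watch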
If you are stuck in "Upgrade is in Progress", take a look at the issue below to see how to fix it and try again.
https://github.com/Azure/aks-hci/issues/152
But do this only if there isn't any upgrade window in progress via PS or WAC.
The download of the image has been completed successfully.
The old management cluster VM has been removed automatically, and a new one has been created:
However, the same error is displayed:
The new log file is attached:
Also, I tried the workaround from https://github.com/Azure/aks-hci/issues/152. Now the output is:
Can you check if the below command works? Share the output only if the command fails with an error.
kubectl get secrets --kubeconfig (Get-AksHciConfig).Kva.kubeconfig
@raghavendra-nataraj, do you know what this error means? The appliance is reachable, but these cmds are failing.
@madhanrm Yes, the command kubectl get secrets --kubeconfig (Get-AksHciConfig).Kva.kubeconfig is working.
Trying to find out what the next steps are.
Also, to provide more information, this is the output of the command Get-AksHciBillingStatus:
However, the two clusters (management and workload) are shown as connected in Azure Arc.
Furthermore, you can see the structure of the AksHci folder:
Version 1.0.3.10901 is the current version of AksHci, and 1.0.4.10928 is the next version that Update-AksHci was trying to apply.
Also, the old management cluster VM still appears in Hyper-V Manager, but it is not shown by the command kubectl get machines -A --kubeconfig (Get-AksHciConfig).Kva.kubeconfig.
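For reference, the Hyper-V side can be cross-checked from PowerShell on the host (the name filter below is only a guess at the management VM naming; adjust it to match the VM shown in Hyper-V Manager):
Get-VM | Where-Object { $_.Name -like '*control-plane*' } | Select-Object Name, State, Uptime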
@gianniskt to share the logs, can you use your Azure subscription to create a support request: https://docs.microsoft.com/en-us/azure/azure-portal/supportability/how-to-create-azure-support-request. Make sure you choose Azure Kubernetes Service on Azure Stack HCI as the service type so that your issue is correctly routed. In Step 3: Additional details, you'll have the option to upload a file.
OK, the support request has been created.
I also posted a question on Q&A: https://docs.microsoft.com/en-us/answers/questions/635831/cannot-update-akshci-cluster-even-all-nodes-are-up.html
@gianniskt could you share the support ticket # if you still have it, by any chance?
@baziwane Support request number: 2111220050001803
We decided to uninstall AKS-HCI and install it again, so the issue will be closed.
@mattbriggs
Describe the bug: The KVA certificate expires after 60 days if no upgrade is performed.
Expected behavior: Update-AksHci, and any command involving kvactl, will throw the below error.
Error: failed to get new provider: failed to create azurestackhci session: Certificate has expired: Expired
Environment:
Solution: Delete the expired certificate file at the below location and try the update again: $env:UserProfile\.wssd\kvactl\cloudconfig