
# Welcome to the Azure Kubernetes Service enabled by Azure Arc (AKS Arc) repo

This is where the AKS Arc team will track features and issues with AKS Arc. We will monitor this repo in order to engage with our community and discuss questions, customer scenarios, or feature requests. Check out our Projects tab to see the roadmap for AKS Arc!
MIT License

[BUG] KVA certificate needs to be deleted if it has expired after 60 days #146

Open raghavendra-nataraj opened 2 years ago

raghavendra-nataraj commented 2 years ago

Describe the bug
The KVA certificate expires after 60 days if no upgrade is performed.

Expected behavior
Update-AksHci, or any command involving kvactl, will throw the following error:

```
Error: failed to get new provider: failed to create azurestackhci session: Certificate has expired: Expired
```

Environment:

Solution
Delete the expired certificate file at the location below and try Update-AksHci again:

```
$env:UserProfile\.wssd\kvactl\cloudconfig
```
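The workaround above can be sketched as a short PowerShell sequence. This is only a sketch based on this thread: it assumes the `.wssd` directory sits directly under the user profile, and that deleting the whole file (not just the certificate block inside it) is the intended fix, as confirmed later in the thread.

```powershell
# Path from this thread: the kvactl cloud config that caches the expired KVA certificate.
$cloudConfig = Join-Path $env:UserProfile ".wssd\kvactl\cloudconfig"

if (Test-Path $cloudConfig) {
    # Delete the entire file, not just the certificate block inside it.
    Remove-Item $cloudConfig -Force
}

# Rerun the upgrade; a fresh cloudconfig is generated on the next call.
Update-AksHci
```

Per the discussion below, this should only be done on the node that is actually seeing the certificate-expiry error.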

madhanrm commented 2 years ago

Should this be deleted on all nodes?

raghavendra-nataraj commented 2 years ago

Only on the node seeing the certificate expiry issue.

gianniskt commented 2 years ago

The contents of the file $env:UserProfile\.wssd\kvactl\cloudconfig are in this format:

[screenshot]

Do you mean that we have to delete the <SOME-CERTIFICATE-DATA> block? I tried this solution, and the following error is shown:

[screenshot]

There is no other certificate file at this location.

raghavendra-nataraj commented 2 years ago

Can you please delete the entire file and try again?

gianniskt commented 2 years ago

OK, so I deleted the "cloudconfig" file. After running Update-AksHci, a new "cloudconfig" file was generated, and the below error is shown:

[screenshot]

raghavendra-nataraj commented 2 years ago

Looking at the other error you posted in the bug, it looks like the KMS is down because it crossed the 60-day expiry. Can you verify that by running the command below?

```
kubectl --kubeconfig (Get-KvaConfig).Kubeconfig get secrets -A
```

If the command does not work, it could be that the KMS is down. Can you try this workaround?

gianniskt commented 2 years ago

Yes, I have tried this workaround, but on step 3, as you can see in the screenshot, the kms-plugin container has status Exited:

[screenshot]

So I cannot start the container. Many other containers are in the same status as well. This happened after 60 days. I am thinking of running Restart-AksHci, but I think this will also remove the workload cluster, which is working fine. Am I correct?

raghavendra-nataraj commented 2 years ago

Yes, it would remove the workload cluster. We can try to understand what is causing the KMS plugin to crash by looking at the logs:

```
ssh -i (Get-AksHciConfig).Moc.sshPrivateKey clouduser@<ip> "sudo docker container ls | grep 'kms'"
```

The output should have the fields CONTAINER ID, IMAGE, COMMAND, CREATED, STATUS, PORTS, and NAMES, and should look something like this:

```
4169c10c9712   f111078dbbe1   "/kms-plugin"   22 minutes ago   Up 22 minutes   k8s_kms-plugin_kms-plugin-moc-lll6bm1plaa_kube-system_d4544572743e024431874e6ba7a9203b_1
```

We can then get the logs with:

```
ssh -i (Get-AksHciConfig).Moc.sshPrivateKey clouduser@<ip> "sudo docker container logs <Container ID>"
```

gianniskt commented 2 years ago

Hello,

Because the container runtime is containerd and not Docker, the right command is:

```
crictl stop <Container ID> && crictl start <Container ID>
```

The issue is now resolved. The problem now is that it cannot communicate with the workload cluster:
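For reference, the earlier Docker-based diagnostics translate to containerd roughly as follows. This is a sketch, not a verified procedure: it assumes crictl is available on the node (its `ps -a`, `logs`, `stop`, and `start` subcommands are standard cri-tools), and the `<ip>` and `<Container ID>` placeholders are specific to your environment.

```shell
# List kms containers on the node, including exited ones
ssh -i (Get-AksHciConfig).Moc.sshPrivateKey clouduser@<ip> "sudo crictl ps -a | grep kms"

# Inspect the logs of the failing container
ssh -i (Get-AksHciConfig).Moc.sshPrivateKey clouduser@<ip> "sudo crictl logs <Container ID>"

# Restart the exited container
ssh -i (Get-AksHciConfig).Moc.sshPrivateKey clouduser@<ip> "sudo crictl stop <Container ID> && sudo crictl start <Container ID>"
```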

```
Get-AksHciUpdates
```

Also, when trying to run Update-AksHci, the error message is:

[screenshot]

This message has been appearing for about 12 hours, so I assume that a process is stuck.

The workload cluster is working fine; I checked it with kubectl cluster-info, and all the workloads and the etcd leader are up.

Should I open a new issue? Or can you suggest what to look at for troubleshooting?

raghavendra-nataraj commented 2 years ago

So, is the management cluster API server accessible? Can you try running this command?

```
kubectl --kubeconfig (Get-KvaConfig).Kubeconfig get nodes
```

gianniskt commented 2 years ago

Yes, the management cluster API server is accessible. I attach the commands that I ran against both the management and workload clusters:

[screenshot]

Elektronenvolt commented 2 years ago

Hi @gianniskt

I was stuck in the mud like you, with another issue (#148). With a bit of help from Zach I made it through with the following steps:

```
kubectl get machines -A --kubeconfig (Get-AksHciConfig).Kva.kubeconfig
```

showed me the old management cluster VM stuck at deleting, with kube-apiserver and etcd in a crash loop:

[screenshot]

I did a manual shutdown of the old VM in Hyper-V. After that, it was deleted automatically, but I still had the new management cluster VM; Get-AksHciVersion showed me the old version 1.0.4, and Update-AksHci gave the same error message, 'update in progress'.

Zach asked me to set the state manually to "UpdateFailed" to overcome this exception, so I modified AksHci.psm1:

[screenshot]

Running Update-AksHci again successfully finished the update (without creating a new management cluster VM). @raghavendra-nataraj, might this help here as well to finalize the update?

gianniskt commented 2 years ago

Hello @Elektronenvolt

The error in my case is different:

[screenshot]

As I understand it, it is trying to provision a new management control-plane VM and is stuck in provisioning. I cannot see any new VM being created on the hypervisor, or any job related to this VM name.

I also tried your workaround, but the result of Update-AksHci is the same: Update is already in progress.

Elektronenvolt commented 2 years ago

@gianniskt I see; I thought you were hanging on tearing down the old VM.

raghavendra-nataraj commented 2 years ago

Thanks @gianniskt, we may need logs to debug this issue further. We are working on the process for sharing logs. @Elektronenvolt, thanks for the steps. It looks like the issue here is that the VM is not getting created, so this may not be the right workaround here.

gianniskt commented 2 years ago

@raghavendra-nataraj Do you need the log zip file produced by the Get-AksHciLogs command? The file is almost 2.5 GB. Is there a specific folder of the log file that you need?

raghavendra-nataraj commented 2 years ago

Yes, I would need that. We are looking at the process to get that. Meanwhile, can you check the caph logs in the folder "\Kva\clusterlogs\caph-system\caph-controller-manager-*\logs"? Can you also run

```
kubectl get event --kubeconfig (Get-AksHciConfig).Kva.kubeconfig
```

and paste the output?

madhanrm commented 2 years ago

```
kubectl get event --kubeconfig (Get-AksHciConfig).Kva.kubeconfig -A
```

gianniskt commented 2 years ago

aks_hci_event_logs.txt

You can find the output of the above command attached.

As far as I can see, the image file used for provisioning the management control-plane VM does not exist.

madhanrm commented 2 years ago

Does this image exist: C:\ClusterStorage\Volume1\AksHCI\ImageStore\dfa6f14521777d5\Linux_k8s_1-21-2.vhdx?

madhanrm commented 2 years ago

Can you get the output of the cmdlet below?

```
Get-MocGalleryImage -Location MocLocation
```

gianniskt commented 2 years ago

The output of the command is:

[screenshot]

However, I cannot see the image file at the path C:\ClusterStorage\Volume1\AksHCI\ImageStore\dfa6f14521777d5\Linux_k8s_1-21-2.vhdx.

This is the image that exists in the folder:

[screenshot]

madhanrm commented 2 years ago

Not sure how that image went missing, but let's try to restore it so the upgrade can continue. Can you check whether Linux_k8s_1-21-2.vhdx is in the parent directory?

If so, can you copy it to C:\ClusterStorage\Volume1\AksHCI\ImageStore\dfa6f14521777d5\Linux_k8s_1-21-2.vhdx?

gianniskt commented 2 years ago

Yes, that was my first action after seeing this error, but I cannot find this file anywhere.

madhanrm commented 2 years ago

Okay, let's try the below then:

```powershell
$VerbosePreference = "continue"
Remove-MocGalleryImage -location MocLocation -Name Linux_k8s_1-21-2
ipmo kva.psm1
Add-KvaGalleryImage -kubernetesVersion v1.21.2
```
madhanrm commented 2 years ago

If this succeeds, you should see the upgrade progressing and new VMs getting created. The upgrade should be done in 15 minutes.

madhanrm commented 2 years ago

If you are stuck at "Upgrade is in Progress", take a look at the issue below to see how to fix it, and try again.

https://github.com/Azure/aks-hci/issues/152

But do this only if there isn't an upgrade actually in progress via PowerShell or WAC.

gianniskt commented 2 years ago

The download of the image completed successfully.

The old management cluster VM was removed automatically, and a new one was created:

[screenshot]

However, the same error is displayed:

[screenshot]

The new log file is attached:

aks_hci_new_event_logs.txt

Also, I tried the workaround https://github.com/Azure/aks-hci/issues/152. Now the output is:

[screenshot]

madhanrm commented 2 years ago

Can you check whether the command below works? Share the output only if the command fails with an error.

```
kubectl get secrets --kubeconfig (Get-AksHciConfig).Kva.kubeconfig
```
madhanrm commented 2 years ago

@raghavendra-nataraj, do you know what this error means? The appliance is reachable, but these commands are failing.

gianniskt commented 2 years ago

@madhanrm Yes, the command `kubectl get secrets --kubeconfig (Get-AksHciConfig).Kva.kubeconfig` is working.

madhanrm commented 2 years ago

Trying to find out what the next steps are.

gianniskt commented 2 years ago

Also, to provide more information, this is the output of the Get-AksHciBillingStatus command:

[screenshot]

However, the two clusters (management and workload) are shown as connected in Azure Arc.

Furthermore, you can see the structure of the AksHci folder:

[screenshot]

1.0.3.10901 is the current version of AksHci, and 1.0.4.10928 is the next version that Update-AksHci was trying to apply.

Also, the old management cluster VM still appears in Hyper-V Manager, but it is not shown by the command `kubectl get machines -A --kubeconfig (Get-AksHciConfig).Kva.kubeconfig`:

[screenshot]

baziwane commented 2 years ago

@gianniskt to share the logs, can you use your Azure subscription to create a support request: https://docs.microsoft.com/en-us/azure/azure-portal/supportability/how-to-create-azure-support-request. Make sure you choose Azure Kubernetes Service on Azure Stack HCI as the service type so that your issue is correctly routed. In Step 3: Additional details, you'll have the option to upload a file.

gianniskt commented 2 years ago

OK, the support request has been created.

I also posted a question on Q&A: https://docs.microsoft.com/en-us/answers/questions/635831/cannot-update-akshci-cluster-even-all-nodes-are-up.html

baziwane commented 2 years ago

@gianniskt could you share the support ticket number, if you still have it by any chance?

gianniskt commented 2 years ago

@baziwane Support request number: 2111220050001803

gianniskt commented 2 years ago

We decided to uninstall AKS-HCI and install it again, so the issue will be closed.

mattbriggs commented 2 years ago

@mattbriggs

in-progress