Azure / aksArc


[BUG] AksHci upgrade hangs when the upgrade is initiated after adding a physical node to the setup #288

Open in3xes opened 1 year ago

in3xes commented 1 year ago

Describe the bug

The AksHci upgrade hangs if it is initiated after a new physical node has been added to an existing AksHci setup. The hang typically looks like the attached screenshot (upgrade_hang).

The CSI controller pod logs show the error captured in the attached screenshot (csi_logs).

To Reproduce

Steps to reproduce the behavior:

  1. Install AksHci
  2. Add a new physical node
  3. Run the AksHci upgrade with Update-AksHci. The upgrade hangs during this step.

Expected behavior

The upgrade should complete without any issues.

Mitigation

  1. Drain the node using the failover cluster UI (screenshot: fc_tsg). Alternatively, run Suspend-ClusterNode -Name <nodeName> -Drain to drain the node.
  2. Use Remove-AksHciNode -nodeName <nodeName> to remove the machine from the AksHci setup.
  3. Use Remove-ClusterNode -Name <nodeName> to remove the machine from the failover cluster.
  4. Run Update-AksHci to trigger the upgrade.

Note: The node can also be removed while the upgrade is hanging. Once it is removed, the upgrade proceeds.
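
For convenience, the mitigation steps above can be strung together as a short PowerShell sketch. It uses only the commands named in the steps. The $nodeName variable is a placeholder introduced here for illustration, and the sketch assumes an elevated PowerShell session on a cluster node where the failover clustering and AksHci cmdlets are available. It has not been validated against a live setup, so treat it as a starting point rather than a verified script.

```powershell
# Stop at the first failing step instead of continuing with a half-removed node.
$ErrorActionPreference = 'Stop'

# Placeholder (assumption): replace with the name of the newly added physical node.
$nodeName = '<nodeName>'

# Step 1: drain the node in the failover cluster.
Suspend-ClusterNode -Name $nodeName -Drain

# Step 2: remove the machine from the AksHci setup.
Remove-AksHciNode -nodeName $nodeName

# Step 3: remove the machine from the failover cluster.
# This cmdlet may prompt for confirmation before it evicts the node.
Remove-ClusterNode -Name $nodeName

# Step 4: trigger the upgrade (not needed if an upgrade is already hanging,
# since per the note above it proceeds once the node is removed).
Update-AksHci
```

Running the commands one at a time and checking the node state after each is likely safer than running the whole script unattended.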