Azure / AKS-Edge

Welcome to the Azure Kubernetes Service (AKS) Edge repo.
MIT License

[Question] Windows node networking #177

Open tmyroadctfig opened 7 months ago

tmyroadctfig commented 7 months ago

Describe scenario: I've set up a single-machine k3s cluster with one Linux and one Windows worker node, and deployed the sample applications:

kubectl apply -f https://raw.githubusercontent.com/Azure/AKS-Edge/main/samples/others/linux-sample.yaml
kubectl apply -f https://raw.githubusercontent.com/Azure/AKS-Edge/main/samples/others/win-sample.yaml

Question: I can't access the Windows sample container service, either from the host machine or from the Linux pods. Is any further setup required to get networking on the Windows worker node set up correctly?

PS C:\Users\luke> kubectl.exe get services -o wide
NAME               TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE   SELECTOR
azure-vote-back    ClusterIP      10.43.43.226    <none>        6379/TCP       18h   app=azure-vote-back
azure-vote-front   LoadBalancer   10.43.225.42    <pending>     80:30010/TCP   18h   app=azure-vote-front
kubernetes         ClusterIP      10.43.0.1       <none>        443/TCP        18h   <none>
sample             NodePort       10.43.124.239   <none>        80:31230/TCP   3s    app=sample
PS C:\Users\luke> curl -UseBasicParsing -Uri http://192.168.0.3:31230
curl : Unable to connect to the remote server
At line:1 char:1
+ curl -UseBasicParsing -Uri http://192.168.0.3:31230
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidOperation: (System.Net.HttpWebRequest:HttpWebRequest) [Invoke-WebRequest], WebException
    + FullyQualifiedErrorId : WebCmdletWebResponseException,Microsoft.PowerShell.Commands.InvokeWebRequestCommand

PS C:\Users\luke> Get-AksEdgeNodeAddr -NodeType Windows

[04/09/2024 00:04:31] Querying IP and MAC addresses from virtual machine (luke-testing-wedge)

 - Virtual machine MAC: 00:15:5d:00:05:0b
 - Virtual machine IP : 192.168.0.3 retrieved directly from virtual machine

Name                           Value
----                           -----
IpAddress                      192.168.0.3
MacAddress                     00:15:5d:00:05:0b

FWIW, a web request from the Windows node to the Linux node works without any issues:

PS C:\inetpub\wwwroot> Invoke-WebRequest -Uri http://192.168.0.2:30010 -UseBasicParsing

StatusCode        : 200
StatusDescription : OK
Content           : <!DOCTYPE html>
                    <html xmlns="http://www.w3.org/1999/xhtml">
                    <head>
                        <link rel="stylesheet" type="text/css" href="/static/default.css">
                        <title>Azure Voting App</title>

                        <script language="Jav...
RawContent        : HTTP/1.1 200 OK
                    Connection: keep-alive
                    Content-Length: 950
                    Content-Type: text/html; charset=utf-8
                    Date: Mon, 08 Apr 2024 23:55:30 GMT
                    Server: nginx/1.13.7

                    <!DOCTYPE html>
                    <html xmlns="http://w...
Forms             :
Headers           : {[Connection, keep-alive], [Content-Length, 950], [Content-Type, text/html; charset=utf-8], [Date, Mon, 
                    08 Apr 2024 23:55:30 GMT]...}
Images            : {}
InputFields       : {}
Links             : {}
ParsedHtml        :
RawContentLength  : 950
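
To compare the two NodePorts side by side without the noise of the full HTTP response, a plain TCP connect test is enough. The following is a minimal sketch, not part of AKS Edge tooling: the helper name `probe_tcp` is hypothetical, and the addresses in the comment are simply the ones from the output above.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


# Example targets taken from the session above (adjust to your own cluster):
#   probe_tcp("192.168.0.2", 30010)  # Linux node, azure-vote-front NodePort
#   probe_tcp("192.168.0.3", 31230)  # Windows node, sample NodePort
```

If the Linux NodePort connects and the Windows one does not, the failure is at the TCP level on the Windows node rather than in the sample application itself.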

haodeon commented 2 months ago

I am encountering the same issue when I restart the Windows node.

I cannot connect to the NodePort on the Windows node, but the NodePort on the Linux node works. Looking into it further, HNS endpoints appear to be missing after the node is restarted. Only the outbound NAT endpoint for routing to the Linux node's pod network is present, which explains why the Windows node can still reach the Linux NodePort.
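
One way to confirm which pods lost their endpoints is to compare pod IPs against the HNS endpoint list. This is a rough sketch under several assumptions: the function name is hypothetical, the pod data is the output of `kubectl get pods -o json`, the endpoint data is `Get-HnsEndpoint | ConvertTo-Json` saved from the Windows node, each endpoint object is assumed to expose an `IPAddress` field, and each Windows pod is assumed to need an endpoint carrying its pod IP.

```python
import json


def pods_missing_hns_endpoint(pods_json: str, hns_json: str, node_name: str) -> list:
    """List pods scheduled on `node_name` whose pod IP has no matching HNS endpoint."""
    pods = json.loads(pods_json)["items"]
    endpoints = json.loads(hns_json)
    # ConvertTo-Json emits a bare object (not an array) when there is only one result.
    if isinstance(endpoints, dict):
        endpoints = [endpoints]
    # Assumption: each HNS endpoint object has an "IPAddress" field.
    endpoint_ips = {e.get("IPAddress") for e in endpoints}

    missing = []
    for pod in pods:
        if pod["spec"].get("nodeName") != node_name:
            continue  # only pods on the Windows node are of interest
        pod_ip = pod.get("status", {}).get("podIP")
        if pod_ip and pod_ip not in endpoint_ips:
            missing.append(f'{pod["metadata"]["name"]} ({pod_ip})')
    return missing
```

An empty result would suggest the endpoints are present and the problem lies elsewhere; a non-empty one lists the pods to look at first.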

haodeon commented 2 months ago

Tested k8s with Calico.

At first, Windows pod networking didn't work at all, so I applied the registry fix from https://github.com/microsoft/Windows-Containers/issues/516

I then tested restarting the Windows node, and the HNS endpoints were missing. I found https://github.com/projectcalico/calico/issues/5164, deleted the service and deployment, and reapplied the manifest, after which the endpoints came back.

Retested k3s + flannel with the registry fix. Networking is still broken after a node restart, regardless of whether the resources are deleted and reapplied.