antrea-io / antrea

Kubernetes networking based on Open vSwitch
https://antrea.io
Apache License 2.0
1.67k stars 370 forks

Running antrea on a windows node gets stuck while waiting for data path #6568

Open mkaring opened 4 months ago

mkaring commented 4 months ago

I'm trying to use antrea to get the networking in my cluster going.

I already got some help on Slack. For reference, the thread is here.

I currently have two nodes:

I currently have Antrea v2.1.0 running on the cluster. Kubernetes is installed at version 1.29.7. The installation followed the Antrea documentation. The control-plane part was installed using Helm with the default values. The Windows node was set up using the Prepare-Node.ps1 script and the antrea-windows-with-ovs.yml from the v2.1.0 release. The only change made to this file is setting kubeAPIServerOverride.

I was not entirely sure how to set this last option. Currently I have it set to https://111.222.333.444:6443, which is the exact output of kubectl config view -o jsonpath='{.clusters[0].cluster.server}'. The documentation here shows an example without the https:// part, so I'm not sure which is correct. I tried both, and it does not seem to make a difference.

The issue I'm seeing is the antrea-agent-windows pod on the Windows Node not starting up and showing the following error: https://gist.github.com/mkaring/d49ade0daa1a03f58a5b919ede6829f2

According to @antoninbas (thanks again for the help in Slack), it may be helpful to look at the conf.db that is created by Open vSwitch. I attached it just in case: conf.db.txt. Accessing the information directly using ovs-vsctl seems to be difficult with Open vSwitch running inside a container.

The cluster I'm running is just for testing right now. If you want me to try anything to get to the bottom of this, just tell me. If you need any additional information, I'll gladly provide.

Thank you in advance, Martin

antoninbas commented 4 months ago

Could you confirm the following:

Also, could you share the log files which are under C:\openvswitch\var\log\openvswitch?

I am also experiencing a similar issue with Windows Server 2022, so I have asked @wenyingd for some info. Edit: I had a misconfiguration in my test environment. Things are working as expected.

wenyingd commented 4 months ago

Accessing the information directly using ovs-vsctl seems to be difficult with Open vSwitch running inside a container.

You could try this PowerShell command to ensure the OVS utilities path is added to the current shell:

$env:PATH=[System.Environment]::GetEnvironmentVariable("PATH", "Machine")

Then you can try to run OVS commands like ovs-vsctl.exe show or ovs-ofctl.exe -OOpenFlow15 dump-flows br-int

mkaring commented 4 months ago

@antoninbas

Logs: C:\openvswitch\var\logs does not exist. var\ only contains run as subdirectory.

@wenyingd

The command is in the PATH. The cause is the one you mentioned in Slack: the command does not work over a PowerShell remoting connection. It works fine over an RDP connection.

> ovs-vsctl.exe show
0416ab53-99a7-4ac0-b537-ae62ec5bf25e
    Bridge br-int
        datapath_type: system
    ovs_version: "3.0.5"
> ovs-ofctl.exe -OOpenFlow15 dump-flows br-int
ovs-ofctl: br-int is not a bridge or a socket

I'm not sure about the second message. I can see the br-int interface in the adapter overview of windows, also in the Hyper-V manager.

antoninbas commented 4 months ago

@mkaring as long as you are using the OVS driver provided by Antrea (hosted at https://downloads.antrea.io), you will need to enable test-signed drivers. We do not provide a driver signed with a certificate from a trusted root authority. This could explain why initialization is failing.
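As a rough way to check the two settings discussed here, the following PowerShell sketch reports whether Secure Boot is enabled and whether test-signed drivers are currently allowed. It assumes an elevated session on the Windows node and an English-language bcdedit output (the match on "Yes" is an assumption that would not hold on a localized system). It only reads state and does not change anything.

```powershell
# Sketch only: run in an elevated PowerShell session on the Windows node.

# Confirm-SecureBootUEFI returns $true when Secure Boot is enabled.
# It raises an error on machines that do not use UEFI, hence the try/catch.
try {
    $secureBoot = Confirm-SecureBootUEFI -ErrorAction Stop
} catch {
    $secureBoot = $false
}

# bcdedit prints a "testsigning    Yes" line for the current boot entry
# when test-signed drivers are allowed (assumes English output).
$match = bcdedit.exe /enum '{current}' | Select-String -Pattern 'testsigning\s+Yes'
$testSigning = [bool]$match

"Secure Boot enabled : $secureBoot"
"Test signing enabled: $testSigning"
```

If test signing is reported as disabled, it is normally turned on with Bcdedit.exe -set TESTSIGNING ON followed by a reboot. On a machine with Secure Boot enabled, Windows rejects that change, because the value is protected by Secure Boot policy.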

antoninbas commented 3 months ago

@mkaring as long as you are using the OVS driver provided by Antrea (hosted at https://downloads.antrea.io), you will need to enable test-signed drivers. We do not provide a driver signed with a certificate from a trusted root authority. This could explain why initialization is failing.

@wenyingd is there a way to confirm that this is causing the issue? I was assuming that Install-OVS.ps1 would fail in that case (causing the initContainer to fail), but maybe that's not the case.

mkaring commented 3 months ago

This is going to be a real problem. Are there any known providers for OVS that provide WHQL signed drivers? My server is set to boot using secure boot and that is something I can't change. Test-signed drivers and secure boot do not like each other.

wenyingd commented 3 months ago

@wenyingd is there a way to confirm that this is causing the issue? I was assuming that Install-OVS.ps1 would fail in that case (causing the initContainer to fail), but maybe that's not the case.

If I remember correctly, the step that installs the OVS driver can succeed, but the workload is impacted when Antrea tries to enable the extension (ovsext) on a VMSwitch on the host. I can give it a try.
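One possible way to inspect the extension state described above is to list the OVS extension on every VMSwitch and look at its Enabled and Running flags. This is only a sketch: it assumes the Hyper-V PowerShell module is available on the node and that the extension is registered under the display name "Open vSwitch Extension" (the name used in the upstream OVS Windows documentation, which may differ for other driver builds).

```powershell
# Sketch only: run in an elevated PowerShell session on the Windows node.
# Assumption: the OVS extension display name is 'Open vSwitch Extension'.
$extensionName = 'Open vSwitch Extension'

# For each virtual switch, show whether the OVS extension is enabled and running.
Get-VMSwitch | ForEach-Object {
    Get-VMSwitchExtension -VMSwitchName $_.Name |
        Where-Object { $_.Name -eq $extensionName } |
        Select-Object SwitchName, Name, Enabled, Running
}
```

An extension that shows Enabled as True but Running as False, or that is missing from the output entirely, would be consistent with the driver not loading, though other causes are possible.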

antoninbas commented 3 months ago

This is going to be a real problem. Are there any known providers for OVS that provide WHQL signed drivers? My server is set to boot using secure boot and that is something I can't change. Test-signed drivers and secure boot do not like each other.

AFAIK, not for free / with an open license. VMware provides one for customers. Antrea also expects a specific OVS driver version, which we test with (and which sometimes includes some necessary patches). You can also get your own release signature.

antoninbas commented 3 months ago

@wenyingd it would be good to fail early, or at least have a way to clearly identify that this is the issue.

github-actions[bot] commented 3 weeks ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days