Open jomeier opened 3 years ago
https://bugzilla.redhat.com/show_bug.cgi?id=1852106 https://bugzilla.redhat.com/show_bug.cgi?id=1853750
seems to be related to this issue.
If you are using a bridge network you may need to create a slave device for your NIC and disable the configuration you had for the primary NIC.
Thanks to Charro for that idea.
On Mar 4, 2021, at 12:30 PM, Josef Meier notifications@github.com wrote:
https://bugzilla.redhat.com/show_bug.cgi?id=1852106 https://bugzilla.redhat.com/show_bug.cgi?id=1853750
seems to be related to this issue.
@bdlink Hi Bruce,
I don't completely understand what you mean.
How can we make FCOS (and its NetworkManager) work together with OVNKubernetes, the default Network Plugin for OKD 4?
By "bridge" I mean the br-ex bridge interface created by OVNKubernetes, not an external bridge, in case that is what you meant here.
Greetings,
Josef
Sorry, I misunderstood which bridge you were referring to. I was referring to an external bridge.
Bruce
On Mar 4, 2021, at 12:54 PM, Josef Meier notifications@github.com wrote:
I don't completely understand what you mean.
How can we make FCOS (and its NetworkManager) work together with OVNKubernetes, the default Network Plugin for OKD 4?
By "bridge" I mean the br-ex bridge interface created by OVNKubernetes, not an external bridge, in case that is what you meant here.
Greetings,
Josef
I think this is an NM bug fixed by https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/merge_requests/557. The fix landed in NM 1.27, so it will reach FCOS when we rebase to F34, which contains NM 1.30.
Will this fix be available for FCOS 33?
@lucab Current OKD 4 versions use NM 1.26 and have this bug. It would be awesome to have a fixed NM in FCOS 33 also.
I understand from @vrutkovs that a newer NM package would have to land in the F33 RPM repos. Is that possible?
The problem with random MAC addresses occurs frequently with promiscuous mode turned on. I've not seen it in my test setup with promiscuous mode turned off.
I'm having a similar issue, except I don't seem to have the bouncing MAC addresses. What I see is that occasionally I do not get a DHCP address at all, or I get only an IPv6 address. If I shut down the instance for a while (2-5 minutes), I usually get the correct IP address back, although sometimes it takes several iterations of shutdowns or restarts.
This is OKD 4.7
This happens on both control plane and worker nodes
$ rpm -qa 'NetworkManager*'
NetworkManager-libnm-1.26.6-1.fc33.x86_64
NetworkManager-1.26.6-1.fc33.x86_64
NetworkManager-tui-1.26.6-1.fc33.x86_64
NetworkManager-team-1.26.6-1.fc33.x86_64
Actually, I think I am seeing the multiple MAC address issue:
In the case below (a failure), the source MAC in the first line does not match the Client-Ethernet-Address (00:50:56:84:da:8a != d2:7e:ee:6d:01:78):
17:28:14.971637 B 00:50:56:84:da:8a ethertype IPv4 (0x0800), length 326: (tos 0xc0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 310)
0.0.0.0.bootpc > 255.255.255.255.bootps: [udp sum ok] BOOTP/DHCP, Request from d2:7e:ee:6d:01:78, length 282, xid 0xa1b1d4da, secs 192, Flags [none] (0x0000)
Client-Ethernet-Address d2:7e:ee:6d:01:78
Vendor-rfc1048 Extensions
Magic Cookie 0x63825363
DHCP-Message Option 53, length 1: Discover
Client-ID Option 61, length 7: ether d2:7e:ee:6d:01:78
Parameter-Request Option 55, length 17:
Subnet-Mask, Time-Zone, Domain-Name-Server, Hostname
Domain-Name, MTU, BR, Classless-Static-Route
Default-Gateway, Static-Route, YD, YS
NTP, Option 119, Classless-Static-Route-Microsoft, Option 252
RP
MSZ Option 57, length 2: 576
Requested-IP Option 50, length 4: 10.102.5.152
In the second case (successful DHCP), the MAC addresses match (00:50:56:84:da:8a == 00:50:56:84:da:8a):
17:29:14.969901 B 00:50:56:84:da:8a ethertype IPv4 (0x0800), length 326: (tos 0xc0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 310)
0.0.0.0.bootpc > 255.255.255.255.bootps: [udp sum ok] BOOTP/DHCP, Request from 00:50:56:84:da:8a, length 282, xid 0xf7b6dbcc, secs 1, Flags [none] (0x0000)
Client-Ethernet-Address 00:50:56:84:da:8a
Vendor-rfc1048 Extensions
Magic Cookie 0x63825363
DHCP-Message Option 53, length 1: Request
Client-ID Option 61, length 7: ether 00:50:56:84:da:8a
Parameter-Request Option 55, length 17:
Subnet-Mask, Time-Zone, Domain-Name-Server, Hostname
Domain-Name, MTU, BR, Classless-Static-Route
Default-Gateway, Static-Route, YD, YS
NTP, Option 119, Classless-Static-Route-Microsoft, Option 252
RP
MSZ Option 57, length 2: 576
Requested-IP Option 50, length 4: 10.102.5.152
The bad MAC address does not appear in the logs at all.
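The comparison made by eye in the two captures above can be automated. The following is a minimal sketch, assuming the capture text has the same layout as the tcpdump output quoted here (a frame line containing the source MAC followed by `ethertype`, then a `Client-Ethernet-Address` line). The function name `find_mac_mismatches` and the shortened `SAMPLE` text are illustrative and not part of any tool mentioned in this thread.

```python
import re

# Colon-separated 48-bit MAC address, e.g. 00:50:56:84:da:8a.
MAC = r"(?:[0-9a-f]{2}:){5}[0-9a-f]{2}"

# Frame line as quoted above: "<time> <flag> <source MAC> ethertype ...".
FRAME_RE = re.compile(rf"^\S+\s+\S+\s+({MAC})\s+ethertype", re.IGNORECASE)
# BOOTP field holding the hardware address the client reports for itself.
CHADDR_RE = re.compile(rf"Client-Ethernet-Address\s+({MAC})", re.IGNORECASE)


def find_mac_mismatches(capture_text):
    """Return (frame_mac, client_mac) pairs where the two addresses differ."""
    mismatches = []
    frame_mac = None
    for raw_line in capture_text.splitlines():
        line = raw_line.strip()
        frame = FRAME_RE.match(line)
        if frame:
            # Remember the source MAC until the matching BOOTP field shows up.
            frame_mac = frame.group(1).lower()
            continue
        chaddr = CHADDR_RE.search(line)
        if chaddr and frame_mac is not None:
            client_mac = chaddr.group(1).lower()
            if client_mac != frame_mac:
                mismatches.append((frame_mac, client_mac))
            frame_mac = None
    return mismatches


# Shortened copy of the two packets quoted above (option lines omitted).
SAMPLE = """\
17:28:14.971637 B 00:50:56:84:da:8a ethertype IPv4 (0x0800), length 326:
    0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from d2:7e:ee:6d:01:78
      Client-Ethernet-Address d2:7e:ee:6d:01:78
17:29:14.969901 B 00:50:56:84:da:8a ethertype IPv4 (0x0800), length 326:
    0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:50:56:84:da:8a
      Client-Ethernet-Address 00:50:56:84:da:8a
"""

for frame_mac, client_mac in find_mac_mismatches(SAMPLE):
    print(f"mismatch: frame source {frame_mac}, DHCP client address {client_mac}")
```

Run against a saved capture, this would list only the packets in which the DHCP client address differs from the frame's source MAC, which is the failure pattern described above.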
https://bugzilla.redhat.com/show_bug.cgi?id=1936961 has been opened
@lucab @LorbusChris I'm looking at the source RPM package: https://kojipkgs.fedoraproject.org//packages/NetworkManager/1.26.4/1.fc33/src/NetworkManager-1.26.4-1.fc33.src.rpm
and it seems that the mentioned fix is already included in NM 1.26 and therefore also in FCOS 33.
Could someone double check that, please?
If the fix is already included, it seems either not to work (under all circumstances) or we have a different problem.
Here are two log files with logs I got with
journalctl -u NetworkManager
correct-ip.txt: shows the logs when the node gets the correct IP address, sending the correct MAC address of the vSphere VM to the DHCP server
wrong-ip.txt: shows the logs when the node gets the wrong IP address, sending a random MAC address to the DHCP server
In my setup, writing this file to all nodes seems to solve the problem:
# sudo vi /etc/systemd/network/98-ovs-mac.link
[Match]
Driver=openvswitch
[Link]
MACAddressPolicy=none
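Since FCOS nodes are normally provisioned through Ignition rather than edited by hand, the same file could be delivered as part of an Ignition config. The following is a minimal sketch that only builds the JSON fragment. The spec version `3.2.0`, the `overwrite` setting, and the function name `build_ignition_config` are assumptions for illustration, and how the config reaches the nodes (for example, wrapped in a MachineConfig on OKD) is not covered here.

```python
import base64
import json

# Path and contents are taken from the workaround above.
LINK_FILE_PATH = "/etc/systemd/network/98-ovs-mac.link"
LINK_FILE_BODY = """\
[Match]
Driver=openvswitch

[Link]
MACAddressPolicy=none
"""


def build_ignition_config(path=LINK_FILE_PATH, body=LINK_FILE_BODY,
                          spec_version="3.2.0"):
    """Build an Ignition config dict that writes `body` to `path`.

    spec_version is an assumption; use the version your nodes accept.
    """
    encoded = base64.b64encode(body.encode("utf-8")).decode("ascii")
    return {
        "ignition": {"version": spec_version},
        "storage": {
            "files": [
                {
                    "path": path,
                    "mode": 0o644,  # serialized as decimal 420 in JSON
                    "overwrite": True,
                    "contents": {
                        "source": f"data:text/plain;charset=utf-8;base64,{encoded}"
                    },
                }
            ]
        },
    }


print(json.dumps(build_ignition_config(), indent=2))
```

The file contents are embedded as a base64 data URL so that the INI-style section headers and newlines survive JSON encoding unchanged.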
Hi my dear Fedoraoers,
today we found a problem with FCOS and OKD 4 that drove us crazy, but it explains a lot of the effects we have seen over the last few days.
*OKD 4.6 - OVNKubernetes on vSphere UPI: DHCP enabled, FCOS 33.20210117.3.2*
We had the problem that the API server was sometimes unreachable after rebooting some of the masters. This morning we rebooted another master and the cluster went down completely. After rebooting one of the masters, the API was back again.
What happened?
We SSHed into one of the masters and saw that etcd was unhealthy. We checked the members and found that one member's IP address did not belong to any of the three masters.
A colleague looked at the logs of our company's DHCP server and saw that the master VMs we had rebooted made DHCP requests with alternating MAC addresses on each reboot. The DHCP server served two alternating IP addresses to the VM, and etcd became unhealthy. The OVNKubernetes masters also could not start because they tried to connect to the faulty IP address as well.
It seems that not only the ens192 network interface requests an IP address from the DHCP server, but possibly some bridges as well. It may be a race condition in NetworkManager.
Because promiscuous mode is enabled in our vSphere network the "false" MAC address is not discarded.
Do you have any ideas how we can overcome this problem? It regularly breaks our clusters. At least this time OVNKubernetes seems not to be guilty.
Thanks a lot and greetings,
Josef
@darkmuggle @dustymabe