Fixed the title. Predictable names are based on path, device, etc. (i.e. en....), while persistent names are the "legacy" names such as eth0, eth1, etc.
@jaawasth you say OS upgrade, can you be more specific please? Is this an upgrade between service packs, i.e. SLES 15 to SLES 15 SP1, or is this a distribution upgrade, i.e. SLES 12 SP5 to SLES 15 SP1?
Names are set by udev rules, and udev is part of systemd, so we need to know the starting and end points. Also please file a bug in Bugzilla so we can involve other teams at SUSE if this turns out to be a udev/systemd issue.
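As a side note (not from the original report): the candidate names udev computes for a device can be inspected with udevadm's net_id builtin; the interface name below is just an example.

```
# Prints the ID_NET_NAME_* properties (path, slot, onboard, mac)
# that the naming policies choose from
udevadm test-builtin net_id /sys/class/net/eth4
```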
Thanks
@rjschwei thanks for the update!! I still don't have complete clarity; I'm getting it from the customer engagement team. From what they have described, the customer patched the system (I believe they did a kernel upgrade/refresh); I'll update with the exact process that was followed. The upgrade was on a SLES 15 SP1 VLI image (I think it was just OS patching).
> Names are set by udev rules and udev is part of systemd.
But do the udev rules always write out a 70-persistent-net.rules file? If ID_NET_NAME_PATH is being added in some other way, is it still safe to add this rule in the persistent-net.rules file?
So it does look like the customer just did a kernel upgrade. After the kernel upgrade, the interface names switched to:
From dmesg:

```
[  162.098760] mlx5_core 0000:41:00.0 ens2370f0: renamed from eth4
[  162.153688] mlx5_core 0000:41:00.1 ens2370f1: renamed from eth5
[  162.364046] mlx5_core 0001:c1:00.0 enP1s2372f0: renamed from eth6
[  162.434419] mlx5_core 0001:c1:00.1 enP1s2372f1: renamed from eth7
```
Please note, we expect the names to be predictable, so the interface names should be:
```
enp65s0f0    - eth4
enp65s0f1    - eth5
enP1p193s0f0 - eth6
enP1p193s0f1 - eth7
```
The kernel was upgraded from 4.12.14-197.45-default to 4.12.14-197.56-default.
As a workaround, to allow the customer to continue having access to the servers, the following rules were added to the 70-persistent-net.rules file:
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?", ATTR{address}=="mac-addresses1", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth", NAME="eth4" SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?", ATTR{address}=="mac-addresses2", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth", NAME="eth5" SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?", ATTR{address}=="mac-addresses3", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth", NAME="eth6" SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?", ATTR{address}=="mac-addresses4", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth", NAME="eth7"
Please note, mac-addresses1 through mac-addresses4 stand in for the MAC addresses of the network devices.
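To fill in those placeholders, the MAC of each interface can be read straight from sysfs, e.g. (interface names taken from the dmesg output above):

```
# Print the MAC to substitute for each mac-addressesN placeholder
for i in eth4 eth5 eth6 eth7; do
    echo "$i: $(cat /sys/class/net/$i/address)"
done
```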
The issue is reproducible locally in my environment now: if I do a zypper ref && zypper up, the interface names change. What they were before:
```
sdflex01:~ # ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: enp195s0f0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 08:00:69:18:13:00 brd ff:ff:ff:ff:ff:ff
3: enp195s0f1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 08:00:69:18:13:01 brd ff:ff:ff:ff:ff:ff
4: enp195s0f2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 08:00:69:18:13:02 brd ff:ff:ff:ff:ff:ff
5: enp195s0f3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 08:00:69:18:13:03 brd ff:ff:ff:ff:ff:ff
6: enp65s0f0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP group default qlen 1000
    link/ether b8:83:03:94:a7:74 brd ff:ff:ff:ff:ff:ff
7: enp65s0f1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond1 state UP group default qlen 1000
    link/ether b8:83:03:94:a7:75 brd ff:ff:ff:ff:ff:ff
8: enP1p193s0f0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP group default qlen 1000
    link/ether b8:83:03:94:a7:74 brd ff:ff:ff:ff:ff:ff
9: enP1p193s0f1: <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP> mtu 9000 qdisc mq master bond1 state DOWN group default qlen 1000
    link/ether b8:83:03:94:a7:75 brd ff:ff:ff:ff:ff:ff
10: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
    link/ether b8:83:03:94:a7:74 brd ff:ff:ff:ff:ff:ff
11: bond1: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
    link/ether b8:83:03:94:a7:75 brd ff:ff:ff:ff:ff:ff
12: vlan210@bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether b8:83:03:94:a7:74 brd ff:ff:ff:ff:ff:ff
    inet 10.100.0.179/24 brd 10.100.0.255 scope global vlan210
       valid_lft forever preferred_lft forever
13: vlan211@bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
    link/ether b8:83:03:94:a7:74 brd ff:ff:ff:ff:ff:ff
    inet 10.20.211.179/24 brd 10.20.211.255 scope global vlan211
       valid_lft forever preferred_lft forever
14: vlan213@bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
    link/ether b8:83:03:94:a7:74 brd ff:ff:ff:ff:ff:ff
    inet 10.20.213.179/24 brd 10.20.213.255 scope global vlan213
       valid_lft forever preferred_lft forever
15: vlan212@bond1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
    link/ether b8:83:03:94:a7:75 brd ff:ff:ff:ff:ff:ff
    inet 10.20.212.179/24 brd 10.20.212.255 scope global vlan212
       valid_lft forever preferred_lft forever
```
What they changed to:
```
sdflex01:~ # ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: enp195s0f0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 08:00:69:18:13:00 brd ff:ff:ff:ff:ff:ff
3: enp195s0f1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 08:00:69:18:13:01 brd ff:ff:ff:ff:ff:ff
4: enp195s0f2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 08:00:69:18:13:02 brd ff:ff:ff:ff:ff:ff
5: enp195s0f3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 08:00:69:18:13:03 brd ff:ff:ff:ff:ff:ff
6: ens2498f0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether b8:83:03:94:a7:74 brd ff:ff:ff:ff:ff:ff
7: ens2498f1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether b8:83:03:94:a7:75 brd ff:ff:ff:ff:ff:ff
8: enP1s2500f0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether b8:83:03:94:97:9c brd ff:ff:ff:ff:ff:ff
9: enP1s2500f1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether b8:83:03:94:97:9d brd ff:ff:ff:ff:ff:ff
```
Please note only the "UP" interfaces are impacted, not the down ones, so the "UP" interfaces are changing names, which is why the network is going down. Interestingly, the "DOWN" interfaces are still named based on the PCI bus path (enp...), whereas the "UP" interfaces switched to slot-based names (ens...).
Hmm, maybe an update of udev is causing this. Our images come with the following setting:
/usr/lib/systemd/network/99-default.link:

```
[Link]
NamePolicy=path
MACAddressPolicy=persistent
```
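To see which .link file systemd-udevd actually applied to a given interface, the net_setup_link builtin can be used (the interface name is an example):

```
# Shows the chosen .link config file and the resulting interface name
udevadm test-builtin net_setup_link /sys/class/net/enp65s0f0
```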
Is that file changed after you ran the zypper up?
If yes, please open a bug against systemd-maintainers@suse.de and ask how to configure the system permanently for the setting above such that it will survive an update process.
Thanks
Thanks @schaefi, I did see both systemd and udev being upgraded, but I didn't look at this file afterwards. Let me check again; I'll update you soon, I will need to set up the system again.
@schaefi, I have created a bug against systemd-maintainers.
The name policy is getting changed to kernel:

```
[Link]
NamePolicy=kernel database onboard slot path
MACAddressPolicy=persistent
```
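As a side check (standard rpm queries, not something from the bug report), one can confirm which package owns the file and when the packages were last updated:

```
rpm -qf /usr/lib/systemd/network/99-default.link   # which package owns the file
rpm -q --last udev systemd                         # when they were last updated
```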
ok, so it's clear why the issue happened. Now we only need a solution. If you get further information from the bug, please post a short notice in this report; I'm not looking at Bugzilla as often as I look at / get notified here. Thanks
@schaefi I think changing the naming policy to "path" would be fine for fixing this issue temporarily; I have tried it and it works fine, but I just want to ensure that's the only thing we need to do.
`NamePolicy=path`
Thanks !!
> want to ensure that's the only thing we need to do.
Yes, that's the fix you can apply manually to make the system work again. A real fix, however, must be done differently. I saw there was no response to the bug you created so far; I guess we have to wait a little longer.
@schaefi , thanks, also, I'm assuming the fixes will be in 2 parts.
@schaefi I'm afraid this is ours:
https://www.freedesktop.org/software/systemd/man/systemd.link.html
@jaiawasthi
There is nothing that can be done to prevent this on update. Before the update, users should run, as root:
```
mkdir -p /etc/systemd/network
cp /usr/lib/systemd/network/99-default.link /etc/systemd/network
```
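For background: files in /etc/systemd/network shadow same-named files in /usr/lib/systemd/network, which is why the copy survives the package update. A quick sanity check after a later zypper up (assuming the copy was made) might look like:

```
# The /etc copy should still say NamePolicy=path even if the
# /usr/lib file was replaced by the udev update
grep -H NamePolicy /etc/systemd/network/99-default.link \
                   /usr/lib/systemd/network/99-default.link
```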
@schaefi @rjschwei just to be sure, this will also prevent interface names from changing across future updates?
@jaiawasthi yes, your thinking is correct. I will merge the changes from #244 today. This will result in new testing images which I can test in our environment, independent of your data center. If there is confirmation that the change really fixed it, I will submit the images to the production (SUSE) namespace and create another production image release.
Do you want a full set of new testing images to test in your space as well ?
Thanks
@schaefi since the changes essentially involve running just the 2 commands below:

```
mkdir -p /etc/systemd/network
cp /usr/lib/systemd/network/99-default.link /etc/systemd/network
```
I'll apply these manually in our current setup and see if it's working. Meanwhile please test in your setup and provide us the prod image directly. Thanks!!
ok :+1:
@schaefi @rjschwei I tried applying the steps mentioned, before an upgrade:

```
mkdir -p /etc/systemd/network
cp /usr/lib/systemd/network/99-default.link /etc/systemd/network
```
```
sdflexOptanePart1:~ # cat /etc/systemd/network/99-default.link
[Link]
NamePolicy=path
MACAddressPolicy=persistent
```
But the interfaces are not coming back up. Also, the interfaces are named differently now:

```
sdflexOptanePart1:~ # ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 08:00:69:18:1b:ac brd ff:ff:ff:ff:ff:ff
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 08:00:69:18:1b:ad brd ff:ff:ff:ff:ff:ff
4: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 08:00:69:18:1b:ae brd ff:ff:ff:ff:ff:ff
5: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 08:00:69:18:1b:af brd ff:ff:ff:ff:ff:ff
6: eth4: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether b8:83:03:92:32:84 brd ff:ff:ff:ff:ff:ff
7: eth5: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether b8:83:03:92:32:85 brd ff:ff:ff:ff:ff:ff
8: eth6: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether b8:83:03:92:22:d4 brd ff:ff:ff:ff:ff:ff
9: eth7: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether b8:83:03:92:22:d5 brd ff:ff:ff:ff:ff:ff
```
@schaefi, since this is not fixed, can we please reopen this issue? I will not be able to do so myself.
yes of course, sorry should have done this already
Since there was an OBS build service outage yesterday, I will test the image with the change today. I expect the suggested fix not to work, but wanted to double-check.
I can confirm the fix did not work. The interface name in my test was 'ens3' where it should be based on the MAC address. The same applies to the images that use the path policy.
I'm going to revert everything done here.
sure, that's what I'm currently doing. But your change broke everything because the name policy as we need it is now not taken into account at all. I don't want to keep broken images around.
@schaefi @rjschwei did we find any solution?
Solution was found. The bug is in dracut. For details see
I have added the changes from our side in PR #246. But now we have to wait until dracut gets fixed and a new dracut package gets released. That's why I set the labels accordingly on the open pull request. A merge can only be done after a dracut update; otherwise the change in the open PR has no effect.
Thanks @schaefi !!
Note that users cannot update until dracut has been fixed (this previously read "grub", which was incorrect), and then they need to follow the cp instructions posted earlier.
@rjschwei I'm confused; how is grub related to this report?
@rjschwei you probably meant dracut, not grub; that's noted on PR #246 and is the reason the blocked label is set. I hope Thomas will provide a testing package so that we can target the branch build for the testing images as long as the release is not done.
@schaefi yes, comment fixed
We will fix this on the image description level without a dracut update.
Thanks Marcus, but do we need the settings for LI as well?
Yes, everywhere. In LI we have NamePolicy set to mac, in VLI we have NamePolicy set to path. For all descriptions we used /usr/lib/systemd/network/99-default.link. This means all images have the potential to be broken on an update of udev when this file gets overwritten :/
So I will update all images for all SLE
@schaefi, is there any interim solution the customer can apply before upgrading their OSes so that they don't run into this issue (without actually upgrading to the new SLES image which you will be releasing)?
yes. The current interim solution is:
```
cp /usr/lib/systemd/network/99-default.link /etc/systemd/network
echo 'install_items+=" /etc/systemd/network/99-default.link "' > /etc/dracut.conf.d/03-systemd.conf
dracut -f
```
This makes the setting permanent and update-safe.
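One way to double-check the dracut part worked is dracut's lsinitrd tool:

```
# Confirm the .link file was actually baked into the current initrd
lsinitrd | grep 99-default.link
```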
But please hold off on communicating this because our systemd maintainers don't like it. We are currently discussing other options.
@schaefi, sure, though I thought we had reached some agreement and you raised a PR for that?
Yes and all this is working. The PR is based on what I wrote in the interim solution and has been tested and also submitted to the SUSE namespace for a production release. I just need to click the button and send it to you.
But now people from the systemd maintainer team claimed that using /etc/systemd/network is not the right place to keep the config because it should be used for local modifications only. They suggested using a file with a higher-priority name in /usr/lib/systemd/network. I prepared a test appliance and demonstrated that this does not work. This is where we are right now.
I'd like to give them another day or two to elaborate on this, and once it's clear that I didn't do something completely stupid, I will go and start the production release process with the solution from here.
If this is blocking you in some way please let me know
Thanks
ok, here is the result of my conversation with the systemd people. @jaawasth you can use the following as an interim solution until we roll out new production images:
```
cp /usr/lib/systemd/network/99-default.link /usr/lib/systemd/network/80-azure-li-net.link
dracut -f
```
NOTE: It's absolutely mandatory that you stick with the name 80-azure-li-net.link because that makes the sort order correct for systemd.
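For background, as I understand it: systemd-udevd applies the first matching .link file in lexical sort order, so the 80- prefix makes our copy win over udev's own 99-default.link:

```
ls -1 /usr/lib/systemd/network/
# 80-azure-li-net.link   <- evaluated first, wins
# 99-default.link        <- owned by udev, may be replaced on update
```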
Thanks Marcus, I'll test it out and let you know as well. I will then recommend these steps to the customers.
@schaefi I tried testing today, sorry, all the servers were busy with other testing. I tried it with the image SLES15-SP1-SAP-Azure-VLI-BYOS.x86_64-1.0.5-Production-Build1.127.raw.xz. Is the behavior not consistent across images?
The behavior is interesting: we don't see the name policy as path in the image but rather kernel. I still created the file /usr/lib/systemd/network/80-azure-li-net.link with the name policy set to path and updated the server; it still lost the interfaces.
```
interface: enp65s0f0    mtu: 9000
interface: enp65s0f1    mtu: 9000
interface: enP1p193s0f0 mtu: 9000
interface: enP1p193s0f1
```
```
sdflex02:~ # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: enp195s0f0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 08:00:69:18:13:00 brd ff:ff:ff:ff:ff:ff
3: enp195s0f1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 08:00:69:18:13:01 brd ff:ff:ff:ff:ff:ff
4: enp195s0f2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 08:00:69:18:13:02 brd ff:ff:ff:ff:ff:ff
5: enp195s0f3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 08:00:69:18:13:03 brd ff:ff:ff:ff:ff:ff
sdflex02:~ # cat /usr/lib/systemd/network/80-azure-li-net.link
[Link]
NamePolicy=path database onboard slot path
MACAddressPolicy=persistent
sdflex02:~ # cat /usr/lib/systemd/network/99-default.link
[Link]
NamePolicy=kernel database onboard slot path
MACAddressPolicy=persistent
sdflex02:~ # uname -a
Linux sdflex02 4.12.14-197.61-default #1 SMP Thu Oct 8 11:04:16 UTC 2020 (b98c600) x86_64 x86_64 x86_64 GNU/Linux
```
As a side note I have released new production images today that will fix the issue, you should have an e-mail with further details
Back to the issue here, I think there is a misunderstanding. Your file 80-azure-li-net.link looks wrong to me. The suggested interim solution is based on a system that has not yet run "zypper up", meaning a system that has not yet replaced 99-default.link via a udev update. If your system has already installed the udev update, you can't copy 99-default.link, as it has already been replaced with a version that is not sufficient for our use case.
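A quick way to tell which situation you are in before copying anything (if the udev update already landed, the policy will read kernel):

```
# "NamePolicy=path"        -> safe to copy the file forward
# "NamePolicy=kernel ..."  -> already replaced; create the file by
#                             hand as described below
grep NamePolicy /usr/lib/systemd/network/99-default.link
```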
So in case your system already has a udev update installed, the following needs to be done to fix up the settings:
1a) On LI systems create /usr/lib/systemd/network/80-azure-li-net.link with the following content:

```
[Link]
NamePolicy=mac
MACAddressPolicy=persistent

[Match]
OriginalName=*
```

1b) On VLI systems create /usr/lib/systemd/network/80-azure-vli-net.link with the following content:

```
[Link]
NamePolicy=path
MACAddressPolicy=persistent

[Match]
OriginalName=*
```
2) Call dracut
```
dracut -f
```
3) reboot the system
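For convenience, here is the same fixup condensed into a single copy-paste block for a VLI system; on LI use the file name 80-azure-li-net.link and NamePolicy=mac instead:

```
cat > /usr/lib/systemd/network/80-azure-vli-net.link <<'EOF'
[Link]
NamePolicy=path
MACAddressPolicy=persistent

[Match]
OriginalName=*
EOF
dracut -f    # rebuild the initrd so early-boot udev also sees the policy
reboot
```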
The easiest solution is to deploy the images that were released today. But if you need to fix up running servers, it should be done in the above way.
Hope this helps
@schaefi this is the system state after the update.
What I had done: I have pasted the command sequence from the host itself below.
```
 1  2020-06-29 13:11:44  uname -a
 2  2020-06-29 13:12:52  /usr/lib/systemd/network/99-default.link
 3  2020-06-29 13:12:56  cat /usr/lib/systemd/network/99-default.link
 4  2020-06-29 13:13:15  ip a
 5  2020-06-29 13:18:00  cat /usr/lib/systemd/network/99-default.link
 6  2020-06-29 13:19:12  cp /usr/lib/systemd/network/99-default.link /usr/lib/systemd/network/80-azure-li-net.link
 7  2020-06-29 13:19:16  vi /usr/lib/systemd/network/80-azure-li-net.link
 8  2020-06-29 13:19:43  dracut -f
 9  2020-06-29 13:20:35  reboot
10  2020-06-29 13:33:40  exit
11  2020-10-14 01:37:19  zypper ref -s && zypper up
12  2020-10-14 01:48:49  reboot
```
The other thing is: why is the default policy kernel for VLIs instead of path, and even if it is, how were the interfaces getting correct names in that scenario?
@schaefi, any updates?
@schaefi, were you able to reproduce this at your end as well? Can we reopen this bug?
Sorry I'm completely confused.
I sent out new production images a week or two ago which fixed all of the interface name policy setup. Did you get those? Did you have a chance to test them? I haven't received any feedback on my mail with the SAS URLs for those.
All images have changed according to their policy setup. You saw the PRs here and you reviewed them. I have no idea why the behavior should be different between images.
All LI images use NamePolicy=mac, all VLI images use NamePolicy=path. Nothing has changed in this regard. Do you require anything different?
The procedure provided in my last comment in this report worked on all systems I have tested.
I'm sorry, all this is more than confusing to me, and I don't understand how reopening this would help.
Can we please be more specific about what exactly is not working as you expect? At best, provide me ssh access to a machine where you think something is not correct.
Thanks
@schaefi, there were 2 parts to the problem, one of them being fixing existing systems.
ok, thanks. You said the procedure to fix existing systems is not consistent, or does not work at all? I've tested the procedure here again in a VM and could not find a problem. Do you have a system I can ssh to for further checking?
@schaefi
Sorry for the late response, I had to make some progress with work in the kiwi area.
I have tested the image you mentioned and I can explain the difference. The image you tested is from 24.4.2020, but in that image the rule rewrite to the PATH policy was still done in a different way; for details see commit 43e5664c946a20e99ef8c6c4ab953fd2125a44b9. The setup procedure using the systemd config file as described here came later.
In the image you have tested, the policy is applied via a udev rule. See the following file on your system:
/usr/lib/udev/rules.d/81-net-setup-link.rules
This file rewrites the interface names at the udev level, not at the systemd level. This rewrite is however not the best solution, as the renaming should be done once by a correct setup of the link policy through systemd, which is the reason it was changed in the images.
The good news is that on systems which rewrite the interfaces through this extra 81-net-setup-link.rules file you should not see the issue we have with the systemd config file.
Everything should just be ok on this system, before and after the update of udev.
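For illustration only, a rename rule of this kind typically looks something like the sketch below; the exact rule shipped in those images is in the referenced commit, so treat this as an assumption, not the literal file content:

```
# Hypothetical sketch of a path-policy rename rule, NOT necessarily the
# literal content of 81-net-setup-link.rules (see commit 43e5664)
ACTION=="add", SUBSYSTEM=="net", ENV{ID_NET_NAME_PATH}!="", NAME="$env{ID_NET_NAME_PATH}"
```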
I'm sorry, I didn't think about production images still being out there that use the 81-net-setup-link.rules file.
Does this make sense to you, or did I confuse you?
Thanks
In short customers running a production system that has /usr/lib/udev/rules.d/81-net-setup-link.rules should not see an issue
Hi Marcus,
For all our VLI images we have persistent names enabled, but we are seeing that on an OS upgrade all these names are getting lost and the interfaces are named like ens2370f0. This should not be the case; the interface names should always be created with the same name. I was going through some commands and we do apply ID_NET_NAME_PATH on the interfaces, but I couldn't find where the rules are getting generated. We also have net.ifnames=1 set in the boot parameters.
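For reference, the udev name properties and the boot parameter can be checked like this (the interface name is the example from above):

```
udevadm info /sys/class/net/ens2370f0 | grep ID_NET_NAME   # udev name properties
grep -o 'net.ifnames=[01]' /proc/cmdline                   # confirm the boot parameter
```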
Would it be more feasible to just add a 70-persistent-net.rules file to the udev rules? Should it not be automatically populated by the udev rule generator?