SUSE-Enceladus / azure-li-services

Azure Large Instance Services
GNU General Public License v3.0

[VLI] Predictable interface names are lost after upgrading OS #243

Closed jaiawasthi closed 4 years ago

jaiawasthi commented 4 years ago

Hi Marcus,

For All our VLI images we have persistent names enabled,

```
2: enP1p1s0f0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP group default qlen 1000
    link/ether 00:e0:ed:90:c6:be brd ff:ff:ff:ff:ff:ff
3: enp7s0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 08:00:69:17:c7:73 brd ff:ff:ff:ff:ff:ff
4: enP1p1s0f1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP group default qlen 1000
    link/ether 00:e0:ed:90:c6:be brd ff:ff:ff:ff:ff:ff
5: enP2p1s0f0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond1 state UP group default qlen 1000
    link/ether 00:e0:ed:90:57:4d brd ff:ff:ff:ff:ff:ff
6: enP2p1s0f1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond1 state UP group default qlen 1000
```

But we are seeing that on an OS upgrade all these names are lost and the interfaces are renamed to names like ens2370f0. This should not be the case; the interface names should always come up the same. I went through some commands and we do apply ID_NET_NAME_PATH on the interfaces, but I couldn't find where the rules are generated. We also have net.ifnames=1 set in the boot parameters.

```
azsollabdsm35:~ # udevadm test-builtin net_id /sys/class/net/enp7s0/
calling: test-builtin
=== trie on-disk ===
tool version:          228
file size:         6753794 bytes
header size             80 bytes
strings            1734970 bytes
nodes              5018744 bytes
Load module index
Found container virtualization none
timestamp of '/usr/lib/systemd/network' changed
Configuration file /usr/lib/systemd/network/99-default.link is marked world-writable. Please remove world writability permission bits. Proceeding anyway.
Parsed configuration file /usr/lib/systemd/network/99-default.link
Created link configuration context.
ID_NET_NAME_MAC=enx08006917c773
ID_OUI_FROM_DATABASE=SILICON GRAPHICS INC.
ID_NET_NAME_PATH=enp7s0
Unload module index
Unloaded link configuration context.
```

Would it be more feasible to just add a 70-persistent-net.rules file to the udev rules? Shouldn't it be populated automatically by the udev rule generator?

rjschwei commented 4 years ago

Fixed the title: predictable names are based on path, device, etc., i.e. en...., while persistent names are the "legacy" names such as eth0, eth1, etc.

@jaawasth you say OS upgrade; can you be more specific please? Is this an upgrade between service packs, i.e. SLES 15 to SLES 15 SP1, or a distribution upgrade, i.e. SLES 12 SP5 to SLES 15 SP1?

Names are set by udev rules, and udev is part of systemd, so we need to know the starting and end point. Also, please file a bug in Bugzilla so we can involve other teams at SUSE if this turns out to be a udev/systemd issue.

Thanks

jaiawasthi commented 4 years ago

@rjschwei thanks for the update!! I still don't have complete clarity; I'm getting details from the customer engagement team. From what they have described, the customer patched the system [I believe they did a kernel upgrade/refresh]; I'll update with the exact process once I have it. So the upgrade was on a SLES 15 SP1 VLI image [I think it was just OS patching].

Names are set by udev rules and udev is part of systemd.

But do the udev rules always write out a 70-persistent-net.rules file? If we are adding ID_NET_NAME_PATH some other way, is it still safe to add this rule to the persistent-net.rules file?

jaiawasthi commented 4 years ago

So it does look like the customer just did a kernel upgrade. After the kernel upgrade, the interface names switched to:

From dmesg:

```
[  162.098760] mlx5_core 0000:41:00.0 ens2370f0: renamed from eth4
[  162.153688] mlx5_core 0000:41:00.1 ens2370f1: renamed from eth5
[  162.364046] mlx5_core 0001:c1:00.0 enP1s2372f0: renamed from eth6
[  162.434419] mlx5_core 0001:c1:00.1 enP1s2372f1: renamed from eth7
```

Please note, we expect the names to be predictable, so the interface names should be

```
enp65s0f0    - eth4
enp65s0f1    - eth5
enP1p193s0f0 - eth6
enP1p193s0f1 - eth7
```

The kernel was upgraded from 4.12.14-197.45-default to 4.12.14-197.56-default.

As a workaround to allow the customer to keep access to the servers, the following rules were added to the 70-persistent-net.rules file:

```
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="mac-addresses1", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth4"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="mac-addresses2", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth5"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="mac-addresses3", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth6"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="mac-addresses4", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth7"
```

Please note these are the mac addresses of the network devices.
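Since the workaround rules all follow the same fixed pattern, a small helper can emit them from a MAC-to-name mapping. This is only a sketch: the MAC addresses and target names in the here-document are placeholders (not the customer's real values), and the generated file would still need to be reviewed before being installed as /etc/udev/rules.d/70-persistent-net.rules.

```shell
#!/bin/sh
# Sketch: generate 70-persistent-net.rules entries from "MAC name" pairs.
# The MAC addresses and names below are placeholders for illustration.
rules_file=$(mktemp)
while read -r mac name; do
    printf 'SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="%s", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="%s"\n' \
        "$mac" "$name"
done >"$rules_file" <<'EOF'
00:11:22:33:44:55 eth4
00:11:22:33:44:56 eth5
EOF
cat "$rules_file"
```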

jaiawasthi commented 4 years ago

The issue is now reproducible locally in my environment: if I do a zypper ref && zypper up, the interface names change. What they were before:

```
sdflex01:~ # ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: enp195s0f0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 08:00:69:18:13:00 brd ff:ff:ff:ff:ff:ff
3: enp195s0f1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 08:00:69:18:13:01 brd ff:ff:ff:ff:ff:ff
4: enp195s0f2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 08:00:69:18:13:02 brd ff:ff:ff:ff:ff:ff
5: enp195s0f3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 08:00:69:18:13:03 brd ff:ff:ff:ff:ff:ff
6: enp65s0f0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP group default qlen 1000
    link/ether b8:83:03:94:a7:74 brd ff:ff:ff:ff:ff:ff
7: enp65s0f1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond1 state UP group default qlen 1000
    link/ether b8:83:03:94:a7:75 brd ff:ff:ff:ff:ff:ff
8: enP1p193s0f0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP group default qlen 1000
    link/ether b8:83:03:94:a7:74 brd ff:ff:ff:ff:ff:ff
9: enP1p193s0f1: <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP> mtu 9000 qdisc mq master bond1 state DOWN group default qlen 1000
    link/ether b8:83:03:94:a7:75 brd ff:ff:ff:ff:ff:ff
10: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
    link/ether b8:83:03:94:a7:74 brd ff:ff:ff:ff:ff:ff
11: bond1: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
    link/ether b8:83:03:94:a7:75 brd ff:ff:ff:ff:ff:ff
12: vlan210@bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether b8:83:03:94:a7:74 brd ff:ff:ff:ff:ff:ff
    inet 10.100.0.179/24 brd 10.100.0.255 scope global vlan210
       valid_lft forever preferred_lft forever
13: vlan211@bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
    link/ether b8:83:03:94:a7:74 brd ff:ff:ff:ff:ff:ff
    inet 10.20.211.179/24 brd 10.20.211.255 scope global vlan211
       valid_lft forever preferred_lft forever
14: vlan213@bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
    link/ether b8:83:03:94:a7:74 brd ff:ff:ff:ff:ff:ff
    inet 10.20.213.179/24 brd 10.20.213.255 scope global vlan213
       valid_lft forever preferred_lft forever
15: vlan212@bond1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
    link/ether b8:83:03:94:a7:75 brd ff:ff:ff:ff:ff:ff
    inet 10.20.212.179/24 brd 10.20.212.255 scope global vlan212
       valid_lft forever preferred_lft forever
```

What they changed to:

```
sdflex01:~ # ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: enp195s0f0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 08:00:69:18:13:00 brd ff:ff:ff:ff:ff:ff
3: enp195s0f1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 08:00:69:18:13:01 brd ff:ff:ff:ff:ff:ff
4: enp195s0f2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 08:00:69:18:13:02 brd ff:ff:ff:ff:ff:ff
5: enp195s0f3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 08:00:69:18:13:03 brd ff:ff:ff:ff:ff:ff
6: ens2498f0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether b8:83:03:94:a7:74 brd ff:ff:ff:ff:ff:ff
7: ens2498f1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether b8:83:03:94:a7:75 brd ff:ff:ff:ff:ff:ff
8: enP1s2500f0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether b8:83:03:94:97:9c brd ff:ff:ff:ff:ff:ff
9: enP1s2500f1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether b8:83:03:94:97:9d brd ff:ff:ff:ff:ff:ff
```

Please note only the "UP" interfaces are impacted, not the "DOWN" ones; the "UP" interfaces are changing names, which is why the network goes down. Interestingly, the "DOWN" interfaces are still named from the PCI bus path (enp...), while the renamed ones got slot-based names (ens...).

schaefi commented 4 years ago

Hmm, maybe an update of udev is causing this. Our images come with the following setting:

/usr/lib/systemd/network/99-default.link

```
[Link]
NamePolicy=path
MACAddressPolicy=persistent
```

Is that file changed after you ran zypper up?

If yes, please open a bug against systemd-maintainers@suse.de and ask how to configure the system permanently with the setting above such that it survives an update.

Thanks

jaiawasthi commented 4 years ago

Thanks @schaefi, I did see both systemd and udev being upgraded, but I didn't look at this file afterwards. Let me check again; I'll update you soon, I will need to set up the system again.

jaiawasthi commented 4 years ago

@schaefi, I have created a bug against systemd-maintainers.

The name policy is getting changed to kernel:

```
[Link]
NamePolicy=kernel database onboard slot path
MACAddressPolicy=persistent
```

https://bugzilla.suse.com/show_bug.cgi?id=1176738

schaefi commented 4 years ago

OK, so it's clear why the issue happened. Now we only need a solution. If you get further information from the bug, please leave a short note in this report; I don't look at Bugzilla as often as I look at or get notified about things here. Thanks

jaiawasthi commented 4 years ago

@schaefi I think changing the naming policy to "path" would be fine to fix this issue temporarily; I have tried it and it works fine, but I just want to ensure that's the only thing we need to do.

NamePolicy=path

Thanks !!

schaefi commented 4 years ago

want to ensure that's the only thing we need to do.

Yes, that's the fix you can apply manually to make the system work again. A real fix, however, must be done differently. I saw there was no response yet to the bug you created; I guess we have to wait a little longer.

jaiawasthi commented 4 years ago

@schaefi, thanks. Also, I'm assuming the fixes will come in two parts:

  1. a patch fix in systemd, which will be available to customers who patch their current OS
  2. inclusion of these patches in all our VLI images.

rjschwei commented 4 years ago

@schaefi I'm afraid this is ours:

https://www.freedesktop.org/software/systemd/man/systemd.link.html

rjschwei commented 4 years ago

@jaiawasthi

There is nothing that can be done to prevent this on update. Before the update users should run, as root:

```
mkdir -p /etc/systemd/network
cp /usr/lib/systemd/network/99-default.link /etc/systemd/network
```

jaiawasthi commented 4 years ago

@schaefi @rjschwei just to be sure,

  1. Users who already have their systems upgraded need to run the two steps below:

     mkdir -p /etc/systemd/network
     cp /usr/lib/systemd/network/99-default.link /etc/systemd/network

Also, will this prevent the interface names from changing across future updates?

  2. For all new images you provide, the rules file will be added in the correct location, so that it persists across updates and we always get the same interface names?

schaefi commented 4 years ago

@jaiawasthi yes, your thinking is correct. I will merge the changes from #244 today. This will result in new testing images which I can test in our environment, as it's independent of your data center. If there is confirmation that the change really fixed it, I will submit the images to the production (SUSE) namespace and create another production image release.

Do you want a full set of new testing images to test in your space as well ?

Thanks

jaiawasthi commented 4 years ago

@schaefi since the changes essentially involve running just the two commands below:

```
mkdir -p /etc/systemd/network
cp /usr/lib/systemd/network/99-default.link /etc/systemd/network
```

I'll add these manually in our current setup and see if it works. Meanwhile, please test in your setup and provide us the prod image directly. Thanks!!

schaefi commented 4 years ago

ok :+1:

jaiawasthi commented 4 years ago

@schaefi @rjschwei I tried applying the steps mentioned before an upgrade:

```
mkdir -p /etc/systemd/network
cp /usr/lib/systemd/network/99-default.link /etc/systemd/network
```

```
sdflexOptanePart1:~ # cat /etc/systemd/network/99-default.link
[Link]
NamePolicy=path
MACAddressPolicy=persistent
```

But the interfaces are not coming back up. Also, the interfaces are named differently now:

```
sdflexOptanePart1:~ # ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 08:00:69:18:1b:ac brd ff:ff:ff:ff:ff:ff
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 08:00:69:18:1b:ad brd ff:ff:ff:ff:ff:ff
4: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 08:00:69:18:1b:ae brd ff:ff:ff:ff:ff:ff
5: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 08:00:69:18:1b:af brd ff:ff:ff:ff:ff:ff
6: eth4: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether b8:83:03:92:32:84 brd ff:ff:ff:ff:ff:ff
7: eth5: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether b8:83:03:92:32:85 brd ff:ff:ff:ff:ff:ff
8: eth6: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether b8:83:03:92:22:d4 brd ff:ff:ff:ff:ff:ff
9: eth7: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether b8:83:03:92:22:d5 brd ff:ff:ff:ff:ff:ff
```

jaawasth commented 4 years ago

@schaefi, since this is not fixed, can we please reopen this issue? I will not be able to do so myself.

schaefi commented 4 years ago

Yes, of course; sorry, I should have done this already.

schaefi commented 4 years ago

Since there was an OBS build service outage yesterday, I will test the image with the change today. I expect the suggested fix not to work, but wanted to double-check.

schaefi commented 4 years ago

I can confirm the fix did not work. The interface name in my test was 'ens3' where it should be based on the MAC address. The same applies to the images that use the path policy.

I'm going to revert everything done here.

schaefi commented 4 years ago

Sure, that's what I'm currently doing. But your change broke everything, because the name policy we need is now not taken into account at all. I don't want to keep broken images around.

jaawasth commented 4 years ago

@schaefi @rjschwei did we find any solution?

schaefi commented 4 years ago

The solution was found. The bug is in dracut. For details see

I have added the changes from our side in PR #246. But now we have to wait until dracut gets fixed and a new dracut package gets released. That's why I set the labels accordingly on the open pull request. A merge can only be done after a dracut update; otherwise the change in the open PR has no effect.

jaawasth commented 4 years ago

Thanks @schaefi !!

rjschwei commented 4 years ago

Note that users cannot update until dracut (fixed; I wrote grub previously, which was incorrect) has been fixed, and then they need to follow the cp instructions posted earlier.

schaefi commented 4 years ago

Note that users cannot update until the grub has been fixed and then they need to follow the cp instructions posted earlier.

@rjschwei I'm confused ?? how is grub related to this report ??

schaefi commented 4 years ago

@rjschwei you probably meant dracut, not grub; that's noted on PR #246 and is the reason the blocked label is set. I hope Thomas will provide a testing package so we can target the branch build for the testing images as long as the release is not done.

rjschwei commented 4 years ago

@schaefi yes, comment fixed

schaefi commented 4 years ago

We will fix this on the image description level without a dracut update.

jaawasth commented 4 years ago

Thanks Marcus, but do we need the settings for LI as well?

schaefi commented 4 years ago

Yes, everywhere. In LI we have NamePolicy set to mac; in VLI we have NamePolicy set to path. For all descriptions we used /usr/lib/systemd/network/99-default.link. This means all images have the potential to be broken by an update of udev, when this file gets overwritten :/

So I will update all images, for all SLE versions.

jaawasth commented 4 years ago

@schaefi, is there any interim solution the customer can apply before upgrading their OS so that they don't run into this issue [without actually upgrading to the new SLES image you will be releasing]?

schaefi commented 4 years ago

yes. The current interim solution is:

```
cp /usr/lib/systemd/network/99-default.link /etc/systemd/network
echo 'install_items+=" /etc/systemd/network/99-default.link "' > /etc/dracut.conf.d/03-systemd.conf
dracut -f
```

This makes the setting permanent and update-safe.
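To see the moving parts of the interim solution without modifying a live server, the same steps can be dry-run against a staging tree. This is a sketch under assumptions: $ROOT and the sample file content are for demonstration only; on a real server the paths are /etc/systemd/network and /etc/dracut.conf.d, and `dracut -f` must be run afterwards.

```shell
#!/bin/sh
# Sketch: stage the interim-solution files in a throwaway tree so they
# can be inspected. Real paths are under /, followed by "dracut -f".
ROOT=$(mktemp -d)
mkdir -p "$ROOT/etc/systemd/network" "$ROOT/etc/dracut.conf.d"
# Stand-in for copying the shipped 99-default.link into /etc:
printf '[Link]\nNamePolicy=path\nMACAddressPolicy=persistent\n' \
    > "$ROOT/etc/systemd/network/99-default.link"
# The dracut drop-in that pulls the copied .link file into the initrd:
conf="$ROOT/etc/dracut.conf.d/03-systemd.conf"
echo 'install_items+=" /etc/systemd/network/99-default.link "' > "$conf"
cat "$conf"
```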

But please hold off on communicating this, because our systemd maintainers don't like it. We are currently discussing other options.

jaawasth commented 4 years ago

@schaefi, sure. I thought we had reached some agreement and you raised a PR for that?

schaefi commented 4 years ago

Yes, and all this is working. The PR is based on what I wrote in the interim solution, has been tested, and has also been submitted to the SUSE namespace for a production release. I just need to click the button and send it to you.

But now people from the systemd maintainer team claim that using /etc/systemd/network is not the right place to keep the config, because it should be used for local modifications only. They suggested using a higher-priority file name and putting it in /usr/lib/systemd/network. I prepared a test appliance and demonstrated that this does not work. This is where we are right now.

I'd like to give them another day or two to elaborate on this, and once it's clear that I didn't do something completely stupid, I will start the production release process with the solution from here.

If this is blocking you in some way please let me know

Thanks

schaefi commented 4 years ago

OK, here is the result of my conversation with the systemd people. @jaawasth you can use the following as an interim solution until we roll out new production images:

```
cp /usr/lib/systemd/network/99-default.link /usr/lib/systemd/network/80-azure-li-net.link
dracut -f
```

NOTE: It's absolutely mandatory that you stick with the name 80-azure-li-net.link, because that gives the correct sort order for systemd.
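The reason the name matters: .link files are considered in lexical order of their filenames, and for each interface the first file whose [Match] section applies wins, so the 80- prefix makes the override take precedence over 99-default.link. The ordering itself can be checked with nothing more than sort:

```shell
# The file that sorts first is considered first by systemd-udevd,
# so 80-azure-li-net.link must come before 99-default.link.
first=$(printf '%s\n' 99-default.link 80-azure-li-net.link | sort | head -n1)
echo "$first"   # 80-azure-li-net.link
```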

jaawasth commented 4 years ago

Thanks Marcus, I'll test out and let you know as well. Will then recommend these to the customers.

jaawasth commented 4 years ago

@schaefi I tried testing today; sorry, all the servers were busy with other testing. I tried it with image SLES15-SP1-SAP-Azure-VLI-BYOS.x86_64-1.0.5-Production-Build1.127.raw.xz. Is the behavior not consistent across images?

The behavior is interesting: we don't see the name policy as path in the image, but rather kernel. I still created the file /usr/lib/systemd/network/80-azure-li-net.link with the name policy set to path and updated the server; it still lost the interfaces.

```
interface: enp65s0f0    mtu: 9000
interface: enp65s0f1    mtu: 9000
interface: enP1p193s0f0 mtu: 9000
interface: enP1p193s0f1
```

```
sdflex02:~ # ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: enp195s0f0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 08:00:69:18:13:00 brd ff:ff:ff:ff:ff:ff
3: enp195s0f1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 08:00:69:18:13:01 brd ff:ff:ff:ff:ff:ff
4: enp195s0f2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 08:00:69:18:13:02 brd ff:ff:ff:ff:ff:ff
5: enp195s0f3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 08:00:69:18:13:03 brd ff:ff:ff:ff:ff:ff
sdflex02:~ # cat /usr/lib/systemd/network/80-azure-li-net.link
[Link]
NamePolicy=path database onboard slot path
MACAddressPolicy=persistent
sdflex02:~ # cat /usr/lib/systemd/network/99-default.link
[Link]
NamePolicy=kernel database onboard slot path
MACAddressPolicy=persistent
sdflex02:~ # uname -a
Linux sdflex02 4.12.14-197.61-default #1 SMP Thu Oct 8 11:04:16 UTC 2020 (b98c600) x86_64 x86_64 x86_64 GNU/Linux
```

schaefi commented 4 years ago

As a side note, I have released new production images today that will fix the issue; you should have an e-mail with further details.

Back to the issue here: I think there is a misunderstanding. Your file 80-azure-li-net.link looks wrong to me. The suggested interim solution assumes a system that has not yet run "zypper up", meaning a system that has not yet replaced 99-default.link via a udev update. If your system has already installed the udev update, you can't copy 99-default.link, because it has already been replaced with a version that is not sufficient for our use case.

So in case your system already has the udev update installed, the following needs to be done to fix up the settings:

1a) On LI systems create /usr/lib/systemd/network/80-azure-li-net.link with the following content:

   [Link]
   NamePolicy=mac
   MACAddressPolicy=persistent
   [Match]
   OriginalName=* 

1b) On VLI systems create /usr/lib/systemd/network/80-azure-vli-net.link with the following content:

   [Link]
   NamePolicy=path
   MACAddressPolicy=persistent
   [Match]
   OriginalName=* 

2) Call dracut

```
dracut -f
```

3) reboot the system

The easiest solution is to deploy the images that were released today. But if you need to fix up running servers, it should be done as above.
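Steps 1a/1b above can be scripted for both flavors. This is only a sketch: the MODE switch and the OUTDIR staging directory are my assumptions for demonstration; on a real system the file goes to /usr/lib/systemd/network, followed by `dracut -f` and a reboot as described in steps 2 and 3.

```shell
#!/bin/sh
# Sketch: write the override .link file from step 1a (LI, MAC-based
# names) or 1b (VLI, path-based names). MODE and OUTDIR are demo
# assumptions; the real target directory is /usr/lib/systemd/network.
MODE="${MODE:-vli}"
OUTDIR=$(mktemp -d)
case "$MODE" in
    li)  policy=mac;  file="$OUTDIR/80-azure-li-net.link"  ;;
    vli) policy=path; file="$OUTDIR/80-azure-vli-net.link" ;;
    *)   echo "unknown MODE: $MODE" >&2; exit 1 ;;
esac
printf '[Link]\nNamePolicy=%s\nMACAddressPolicy=persistent\n[Match]\nOriginalName=*\n' \
    "$policy" > "$file"
cat "$file"
```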

Hope this helps

jaawasth commented 4 years ago

@schaefi this is the system state after the update.

What i had done

  1. cp /usr/lib/systemd/network/99-default.link /usr/lib/systemd/network/80-azure-vli-net.link
  2. modify the policy to path here [this is vli] in the 80-azure-vli-net.link file
  3. dracut -f
  4. reboot
  5. update system
  6. share info with you about updated system [in the final state]

I have updated the command sequence from the host itself below.

```
1  2020-06-29 13:11:44 uname -a
2  2020-06-29 13:12:52 /usr/lib/systemd/network/99-default.link
3  2020-06-29 13:12:56 cat /usr/lib/systemd/network/99-default.link
4  2020-06-29 13:13:15 ip a
5  2020-06-29 13:18:00 cat /usr/lib/systemd/network/99-default.link
6  2020-06-29 13:19:12 cp /usr/lib/systemd/network/99-default.link /usr/lib/systemd/network/80-azure-li-net.link
7  2020-06-29 13:19:16 vi /usr/lib/systemd/network/80-azure-li-net.link
8  2020-06-29 13:19:43 dracut -f
9  2020-06-29 13:20:35 reboot
10 2020-06-29 13:33:40 exit
11 2020-10-14 01:37:19 zypper ref -s && zypper up
12 2020-10-14 01:48:49 reboot
```

The other thing is: why is the default policy kernel for VLIs instead of path, and even if it is, how are the interfaces getting correct names in that scenario?

jaawasth commented 4 years ago

@schaefi , any updates ?

jaawasth commented 4 years ago

@schaefi, were you able to reproduce this at your end as well? Can we reopen this bug?

schaefi commented 4 years ago

Sorry, all this is more than confusing to me, and I don't understand how re-opening this could help.

Can you please be more specific about what exactly is not working as you expect? At best, provide me ssh access to a machine where you think something is not correct.

Thanks

jaawasth commented 4 years ago

@schaefi, there were 2 parts to the problem:

  1. fixing new images
  2. fixing existing systems

The 2nd point is the one I was testing, using the build version of the image I mentioned, on an existing system. I hope it's clearer now? Else let's sync up over a call.

schaefi commented 4 years ago

fixing in existing systems

OK, thanks. You said the procedure to fix existing systems is not consistent or does not work at all? I've tested the procedure again in a VM and could not find a problem. Do you have a system I can ssh to for further checking?

jaawasth commented 4 years ago

@schaefi

  1. I was trying to test on the latest image, which does have a path policy defined. But just for testing, I picked an older version of the image; as mentioned earlier, that version is SLES15-SP1-SAP-Azure-VLI-BYOS.x86_64-1.0.5-Production-Build1.127.raw.xz. Interestingly, this image has no path policy defined [still the network is configured properly].
  2. It's on this image that I tried testing the workaround; is it possible for you to test on the same image and see if you can reproduce the issue?
  3. Yes, I have a system; please let me know a time [preferably next week] when you would want to test it, and I can prepare a system beforehand.
schaefi commented 4 years ago

Sorry for the late response; I had to make some progress with work in the kiwi area.

I have tested the image you mentioned and I can explain the difference. The image you tested is from 2020-04-24, and in that image the rule rewrite to the path policy was still done in a different way. For details see commit 43e5664c946a20e99ef8c6c4ab953fd2125a44b9. The setup procedure using the systemd config file as described here came later.

In the image you tested, the policy is applied using a udev rule. See the following file on your system:

/usr/lib/udev/rules.d/81-net-setup-link.rules

This file rewrites the interface names at the udev level, not at the systemd level. This rewrite is however not the best solution, as it should be done once by a correct setup of the link policy through systemd, which is the reason it was changed in the images.

The good news is that on systems which rewrite the interface names through this extra 81-net-setup-link.rules file, you should not see the issue we have with the systemd config file.

Everything should just be ok on this system, before and after the update of udev.

I'm sorry, I didn't think about production images still being out there that use the 81-net-setup-link.rules file.

Does this make sense to you, or did I confuse you?

Thanks

schaefi commented 4 years ago

In short: customers running a production system that has /usr/lib/udev/rules.d/81-net-setup-link.rules should not see an issue.