SUSE-Enceladus / azure-li-services

Azure Large Instance Services
GNU General Public License v3.0

LI/Gen4: Ethernet Port/Mac-address Pairings #115

Closed jeffaco closed 5 years ago

jeffaco commented 5 years ago

In Gen3, we supply a YAML, boot the system, and bingo, the network is up.

In Gen4, this is not the case. After investigation, we figured out that this is tied to file /etc/udev/rules.d/70-persistent-net.rules. For some reason, the pairings always seem to be correct in Gen3, but always seem to be incorrect in Gen4. I'm not quite sure why this is.

A "corrected" file is like this:

# This file was automatically generated by the /usr/lib/udev/write_net_rules
# program, run by the persistent-net-generator.rules rules file.
#
# You can modify it, as long as you keep each rule on a single
# line, and change only the value of the NAME= key.

# PCI device 0x1137:0x0043 (enic)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:25:b5:1a:00:0e", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth4"

# PCI device 0x1137:0x0043 (enic)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:25:b5:1a:00:0d", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"

# PCI device 0x1137:0x0043 (enic)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:25:b5:1b:00:38", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth3"

# PCI device 0x1137:0x0043 (enic)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:25:b5:1b:00:37", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth1"

# PCI device 0x1137:0x0043 (enic)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:25:b5:1b:00:0e", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth5"

# PCI device 0x1137:0x0043 (enic)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:25:b5:1a:00:0f", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth2"

As per the instructions, we change the NAME field to be correct for the system, reboot, and we're good.
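For illustration, that manual correction could also be scripted. A minimal sketch (not something we actually run; the MAC-to-name map here is only an example) that rewrites the NAME= values in that rules file:

import re

# Example mapping only; the real pairings come from the fabric configuration.
WANTED_NAMES = {
    "00:25:b5:1a:00:0d": "eth0",
    "00:25:b5:1b:00:37": "eth1",
}

def fix_persistent_net_rules(path="/etc/udev/rules.d/70-persistent-net.rules",
                             wanted=WANTED_NAMES):
    """Rewrite NAME="..." on every rule whose ATTR{address} is in `wanted`."""
    with open(path) as rules:
        lines = rules.readlines()
    fixed = []
    for line in lines:
        match = re.search(r'ATTR\{address\}=="([0-9a-f:]+)"', line)
        if match and match.group(1) in wanted:
            line = re.sub(r'NAME="[^"]*"',
                          'NAME="{0}"'.format(wanted[match.group(1)]), line)
        fixed.append(line)
    with open(path, "w") as rules:
        rules.writelines(fixed)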

So, questions:

  1. Once modified, is the only option to reboot? Is there some way to dynamically re-read this file?
  2. How come the file is generated properly (100% of the time) on Gen3, but not on Gen4?
  3. How can we modify the YAML to generate a correct file the first time so that the network is reachable after first boot?
jeffaco commented 5 years ago

I can now answer question 2 above:

We made a change to the switch fabric in Gen4 where ethernet interfaces are tied to specific purposes. We did that for traffic routing purposes, so we can more easily control exactly what traffic goes over what port. The network design folks never tell me about these things in advance - sigh.

I'd still like answers to questions 1 and 3 above, though. I'm hoping for an optional mac_address entry in the network configuration to allow specification of the MAC address, perhaps something like this:

networking:
  -
    interface: eth0
    mac_address: "00:25:b5:1a:00:0d"
    vlan: 50
    vlan_mtu: 1500
    ip: 172.16.34.31
    gateway: 172.16.34.1
    subnet_mask: 255.255.255.0
    mtu: 1500

I'm open to suggestions, though. Thoughts?

schaefi commented 5 years ago

Once modified, is the only option to reboot? Is there some way to dynamically re-read this file?

Persistent udev network rules are only applied at udev start, and in this case on kernel network driver load. So no, there is no other way than a reboot. Those rules need to exist at reboot time, and I can only offer to add them to the image description for the Very Large Instance build. Or do you see this on the Large Instance image deployed to you, too?

How can we modify the YAML to generate a correct file the first time so that the network is reachable after first boot?

This should not happen at the yaml level. You request interface names per MAC address, and creating the interface names is purely a udev/kernel level matter.

Last but not least, you marked this as a bug on our side, but I don't see how this is a bug. Your fabric changed and that influenced the enic rules. We can adapt our image build, but to be honest this type of change is quite painful for any system that deals with persistent network interface names.

Deleting the bug flag and setting the discussion flag, as I think we can only help out on the udev rule setup, which I'm not sure you want. Any future change in your fabric will cause this trouble again.

rjschwei commented 5 years ago

Some more info on this.

The eth* names are assigned at boot time by the kernel, as Marcus already pointed out. The names eth0 to ethN are assigned based on the order of discovery, or stated otherwise, based on when a udev event is triggered. A udev event is triggered when the device becomes available to the kernel from the HW side. The order may be different at every boot, so hard coding the MAC address into the ifcfg-ethX file is risky: the next time, the hard coded MAC address might not match the interface name, which means the interface will not be brought up by wicked.

One way to avoid this problem is to switch to predictable names [1] for network interfaces. This scheme has a different set of problems, but those would not apply in our use case. The image is always deployed on the same hardware, Gen 3 or Gen 4, and based on that it is known where the interfaces are and what they should be named. Thus, if the YAML config writes the interface names as predictable names and we switch the image build to use predictable names, we can avoid the problem stated in the original posting. Instead of

interface: eth0

you would then have

interface: ens1

for example. If we switch to predictable names we do not need to make any changes to the initialization code or the YAML syntax to handle both Gen 3 and Gen 4 in a consistent way.

  1. Once modified, is the only option to reboot? Is there some way to dynamically re-read this file?

Yes

  1. How come the file is generated properly (100% of the time) on Gen3, but not on Gen4?

Luck

How can we modify the YAML to generate a correct file the first time so that the network is reachable after first boot?

We cannot; the file you are listing is not touched by the initialization code. The initialization code only writes the ifcfg-eth* files. Those files determine how wicked brings up the interfaces.

[1] https://www.freedesktop.org/wiki/Software/systemd/PredictableNetworkInterfaceNames/
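For orientation, those ifcfg files are plain sysconfig files keyed by the interface value from the YAML. A rough sketch of that shape (not the actual initialization code; only the obvious keys are shown):

import os

# Rough sketch only; the real initialization code handles more keys
# (vlan, gateway, routes, ...) than shown here.
def write_ifcfg(entry, network_dir="/etc/sysconfig/network"):
    """Write a static ifcfg file for one 'networking' entry from the YAML."""
    config = [
        "BOOTPROTO='static'",
        "STARTMODE='auto'",
        "IPADDR='{0}'".format(entry["ip"]),
        "NETMASK='{0}'".format(entry["subnet_mask"]),
        "MTU='{0}'".format(entry.get("mtu", 1500)),
    ]
    ifcfg = os.path.join(network_dir, "ifcfg-{0}".format(entry["interface"]))
    with open(ifcfg, "w") as handle:
        handle.write("\n".join(config) + "\n")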

jeffaco commented 5 years ago

Ooh, shudder. I didn't realize this was such a sticky issue. This, by the way, is a high priority issue for us since it affects LI, which affects the bulk of our deployments. (We don't have VLI images yet anyway.)

I gave this issue a "bug" label because the network was not coming up properly. Please accept my apology if you don't consider it a bug.

I read over the posting on Predictable Network Interface Names, so I understand that much, at least. This will bleed over to customer visibility, and perhaps software (not sure), so I'll need to drag other folks in to weigh in on any potential solutions.

This is a general change in Gen4, so whatever we come up with should handle both LI and VLI, as both will be affected.

I have a number of questions and some comments:

What we've done in the past (before SUSE-generated images):

  1. On first boot, the interfaces are named as something random. We fix this in the /etc/udev/rules.d/70-persistent-net.rules and reboot, which allows the network to come up.
  2. After that, we delete the /etc/udev/rules.d/70-persistent-net.rules file. But even with that file deleted, the network names are extremely consistent after that. I don't personally understand how this could be, unless udev writes to some other file allowing predictable ordering once the names are established (as long as underlying hardware doesn't change).
  3. Because the ordering of interfaces is now consistent, we deploy the image with the /etc/udev/rules.d/70-persistent-net.rules file removed, and on first boot of that new image (on new hardware), the interface names are always correct. This, honestly, is a little beyond my understanding, but the network folks said this was so, so I believe them, even if I don't fully understand how.

So, given all this, suggestions? How can we predictably tie the O/S interfaces to the network fabric interfaces if not using some common piece of information (i.e. mac address)?

rjschwei commented 5 years ago

Question: Does the interface name in the O/S really matter?

Yes and No

From software, other than configuring the network, does it matter what the name of the interface is?

No

Regarding interface name changes (i.e. ens1), this is customer visible

Yes, if the customer runs "ip a" or looks at /etc/sysconfig/network. I would not necessarily call this customer visible.

would a "quick reboot" suffice

Yes, a kexec will re-enumerate all the devices.

How can we predictably tie the O/S interfaces to the network fabric interfaces if not using some common piece of information (i.e. mac address)?

That's the problem predictable network interface names solve, as implied by your use of "predictably" in the question. The interface on a given bus in a given slot always has the same name, no matter in which order the interfaces are detected. Based on your explanation of UCS, that would apply: UCS should apply the same configuration to a new blade, i.e. the same network interface with the same MAC shows up in the same position (bus and slot) on a new blade.

The "MAC address" while common knowledge is not anything the kernel uses. The kernel couldn't care less what the MAC address of a device is when an interface is first detected as present.

If we do not use predictable names and we add the MAC address to the YAML then we will end up in the following situation.

That the removal of the /etc/udev/rules.d/70-persistent-net.rules file works is luck, nothing else. The code that writes this file runs on every boot, and for every interface that has no rule in /etc/udev/rules.d/70-persistent-net.rules a new rule will be generated, i.e. the file gets re-created on every boot if it does not exist. The generation code also runs when a new network interface is added.

So if we have to stick to persistent names, i.e. use the "ethX" naming, then the initialization code can (and has to) modify /etc/udev/rules.d/70-persistent-net.rules and in that file assign the names based on MAC address as we want them. This would then be persistent across all boots, as the information from /etc/udev/rules.d/70-persistent-net.rules is applied on every boot and device names are properly renamed to match it. This approach comes at a cost, though.

The real solution to the problem described is to use predictable interface names. Sticking with persistent names just creates a cascading effect of workaround solutions that introduce a potential race condition. Every race condition will eventually be hit for unexplained reasons, and then the system will have no working network.

I strongly suggest that we move to predictable names for this setup.

jeffaco commented 5 years ago

Thanks for the detailed reply.

To be clear, if you guys feel that predictable names make sense, I'm fine with that (and will push that solution on our end, even though stakeholders may not like the fact that there is a new naming convention).

However, I'm still a little vague on what ties this to the network fabric. If the network fabric has a notion of eth0-eth5, each with a specific MAC address (I know the kernel doesn't really care about MAC addresses), what ties the network fabric's notion of eth0 to the O/S notion of ens1?

Here's the network fabric's notion of a blade:

(screenshot: 2019-02-11 - ucs eth ports)

So, in this example, what ties the network fabric's notion of eth0 to ens1, if we move to predictable names? As I said, I am not averse to predictable names at all. I just would like to understand how it would work ...

Could we get a test image to see if it actually works in our environment?

rjschwei commented 5 years ago

OK, I see the confusion. Again, these are just names. The network fabric happens to use eth? names, but that really has (should have) nothing to do with what the kernel names the devices.

What would make sense, and I am not saying that UCS is implemented that way, is that you give a network interface a certain name in the fabric setup and you associate that with a certain MAC address. Then the fabric assigns that MAC address to a given network card in the blade.

So let's say we have a blade system with 3 NICs. From a hardware perspective all blades are the same and the NICs are all attached to the same bus in the same order. Meaning the network fabric has some notion that the NIC in slot one is "eth0", the NIC in slot two is "eth1" and the NIC in slot three is "eth2". So when you assign "mac-A" to eth0, this means that the NIC in slot 1 will get "mac-A" from the fabric, "mac-B" would go to the NIC in slot 2 and "mac-C" to the NIC in slot 3. However, this should have nothing to do with the way the kernel perceives the NICs. As far as the kernel is concerned, the NIC in slot 1 gets a name according to the rules we use, "persistent" (eth?) or "predictable" (ens?). If we use persistent names on the kernel side, the name may or may not match what's set up in the fabric. As discussed, that depends on the order of udev events, i.e. jitter in the HW initialization. If the NIC in slot 2 happens to show up first, it will get the "eth0" name in the kernel while having the "eth1" name in the fabric. Basically what you observed, and what started this discussion.

Now if the YAML is generated based on what's in the fabric, things will be cross-wired; again, this is consistent with the observed behavior.

In the above I say "should" because I do not know what the Cisco kernel driver modules do and whether or not they establish some correlation to the fabric. Although that would be weird.

Given the behavior you showed at the beginning of this thread I would say the Cisco drivers do not establish such weird connections to the fabric.

So from a configuration perspective it is actually easier for you to generate the YAML based on the fabric information because you can establish the mapping that "eth0" in the fabric is the NIC in the first slot on some bus in the blade and that will always be the same and so if we use persistent names there is very little that can go wrong.

jeffaco commented 5 years ago

So from a configuration perspective it is actually easier for you to generate the YAML based on the fabric information because you can establish the mapping that "eth0" in the fabric is the NIC in the first slot on some bus in the blade and that will always be the same and so if we use persistent names there is very little that can go wrong.

That sounds awesome to me. So, if I'm understanding you properly, if we move to persistent names, then regardless of startup "jitter", ens1 will always be what the network fabric sees as eth0, like this:

| Fabric NIC | OS persistent name |
| --- | --- |
| eth0 | ens1 |
| eth1 | ens2 |
| eth2 | ens3 |
| eth3 | ens4 |
| eth4 | ens5 |
| eth5 | ens6 |

If that's the case, that sounds awesome, and would solve the issue completely.

I guess I have some remaining questions:

  1. Is this a change to the O/S, or just a YAML change? If the latter, what do I change? If the former, then:

  2. Test image possible? I'm thinking "yes" since you are the professionals in image building 😃

  3. Out of curiosity, why weren't existing interfaces eth0 ... eth(n) just set up as persistent names by default in the Linux kernel, thus avoiding this issue to begin with? Backwards compatibility issues?

rjschwei commented 5 years ago

So from a configuration perspective it is actually easier for you to generate the YAML based on the fabric information because you can establish the mapping that "eth0" in the fabric is the NIC in the first slot on some bus in the blade and that will always be the same and so if we use persistent names there is very little that can go wrong.

That sounds awesome to me. So, if I'm understanding you properly, if we move to persistent names,

Oops my bad, we would switch to "predictable" names.

then regardless of startup "jitter", ens1 will always be what the network fabric sees as eth0, like this:

Fabric NIC OS persistent name eth0 ens1 eth1 ens2 eth2 ens3 eth3 ens4 eth4 ens5 eth5 ens6

Yes that would be the map. We just have to determine whether the interfaces are really "ens" or some other name.

If that's the case, that sounds awesome, and would solve the issue completely.

I guess I have some remaining questions:

  1. Is this a change to the O/S, or just a YAML change?

It is a change to the image itself (the boot configuration that enables predictable names), but NOT a change to the YAML schema or setup code.

If the latter, what do I change?

You would write

interface: ens1

instead of

interface: eth0

If the former, then:

  1. Test image possible? I'm thinking "yes" since you are the [professionals in image building]

You'll have a SLES 12 SP4 For SAP test image tomorrow (https://github.com/SUSE-Enceladus/azure-li-services/pull/116#issuecomment-462470148) 😃

  1. Out of curiosity, why weren't existing interfaces eth0 ... eth(n) just set up as persistent names by default in the Linux kernel, thus avoiding this issue to begin with? Backwards compatibility issues?

Well, the debate about "persistent" (eth0) vs. "predictable" (ens1) names is long and it is a stony road. The argument for predictable names and why they make sense is clearly displayed here. However, "predictable" is not predictable ahead of time, meaning I cannot tell you what the names of the interfaces in the UCS blades will actually be. One has to know the internals of the HW to predict the names. So I cannot predict what the interface on a given piece of HW will be named until I have done at least one installation and let the "predictable" name logic figure it out. Yes, there is also the compatibility issue and many years of scripts that do special stuff based on the "knowledge" that there will be something called "eth0". Shudder, but that is reality. Also, in many environments there is only one NIC, so it's always eth0 and jitter doesn't matter.

schaefi commented 5 years ago

Lots of information since I left the desk yesterday. So yes, predictable network interface names are the solution to this problem. I will move the image descriptions now to activate net.ifnames properly, and I think we are good with the other open PR. So the image will have both issues addressed. Whether it all works out of the box is something I can't tell; we need your help and feedback to come to the final solution.

stay tuned

schaefi commented 5 years ago

Devel Images have all been updated to use predictable network interface names

jeffaco commented 5 years ago

Predictable names are - well - yucky, particularly in the fact that they are not all that predictable. The problem with predictable names:

On a Gen4 UCS Test Node:

Sollabdsm31:~ # for i in 0 1 2 3 4 5; do echo "----- For eth$i -----"; udevadm test-builtin net_id /sys/class/net/eth$i 2>/dev/null | grep '^ID_NET_NAME_'; done
----- For eth0 -----
ID_NET_NAME_MAC=enx0025b51b000e
ID_NET_NAME_PATH=enp72s0
----- For eth1 -----
ID_NET_NAME_MAC=enx0025b51a000e
ID_NET_NAME_PATH=enp73s0
----- For eth2 -----
ID_NET_NAME_MAC=enx0025b51b0038
ID_NET_NAME_PATH=enp74s0
----- For eth3 -----
ID_NET_NAME_MAC=enx0025b51a000f
ID_NET_NAME_PATH=enp80s0
----- For eth4 -----
ID_NET_NAME_MAC=enx0025b51b0037
ID_NET_NAME_PATH=enp81s0
----- For eth5 -----
ID_NET_NAME_MAC=enx0025b51a000d
ID_NET_NAME_PATH=enp82s0
Sollabdsm31:~ #

So if we're avoiding the MAC-based interface names, that leaves the enp* interfaces (72s0-74s0 and 80s0-82s0). Okay.

However, on a second Gen4 UCS test node:

Sollabdsm32:~ # for i in 0 1 2 3 4 5; do echo "----- For eth$i -----"; udevadm test-builtin net_id /sys/class/net/eth$i 2>/dev/null | grep '^ID_NET_NAME_'; done
----- For eth0 -----
ID_NET_NAME_MAC=enx0025b51b0012
ID_NET_NAME_PATH=enp200s0
----- For eth1 -----
ID_NET_NAME_MAC=enx0025b51b0011
ID_NET_NAME_PATH=enp201s0
----- For eth2 -----
ID_NET_NAME_MAC=enx0025b51a0014
ID_NET_NAME_PATH=enp202s0
----- For eth3 -----
ID_NET_NAME_MAC=enx0025b51a0013
ID_NET_NAME_PATH=enp208s0
----- For eth4 -----
ID_NET_NAME_MAC=enx0025b51a0012
ID_NET_NAME_PATH=enp209s0
----- For eth5 -----
ID_NET_NAME_MAC=enx0025b51a0011
ID_NET_NAME_PATH=enp210s0
Sollabdsm32:~ #

This is totally unexpected, as the interface names are different from blade to blade. This is awful, and implies that we would need to put an O/S on each and every blade just to figure out what the interfaces would be. And if we needed to move a profile to a different blade (due to blade failure, say, which UCS supports), networking would be problematic. Yuck!

Why, on one node, is eth0 -> enp72s0, while on the other node eth0 -> enp200s0? I really expected these to be consistent from UCS blade to UCS blade ...

I'm actually thinking that MAC-address based names might be better, although I'm having different issues with that approach (I think those issues can be resolved, but I'll need to check with the networking folks). At least I can reasonably predict the name if I know the MAC address ...

Is there a way that predictable names can be predictable from UCS blade to UCS blade?

schaefi commented 5 years ago

This is totally unexpected, as the interface names are different from blade to blade.

It means your blades are different.

eth4: enp209s0
eth4: enp81s0

The card on one system is at PCI bus location 209 (slot 0) and the card on the other system is at PCI bus location 81 (slot 0). The fact that both got assigned eth4 in the past is pure luck, because the device ordering as seen by the kernel is non-deterministic.
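To make the relation between those names and the PCI location concrete, a rough illustration (not project code, and assuming the common enp<bus>s<slot> form without PCI domain or function suffix):

def pci_to_predictable_name(pci_address):
    """Map a PCI address like '0000:51:00.0' to a path-based name like 'enp81s0'."""
    _domain, bus, devfn = pci_address.split(":")
    device, _function = devfn.split(".")
    return "enp{0}s{1}".format(int(bus, 16), int(device, 16))

# The two blades above correspond to PCI buses 0x51 and 0xd1:
print(pci_to_predictable_name("0000:51:00.0"))  # enp81s0
print(pci_to_predictable_name("0000:d1:00.0"))  # enp209s0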

Predictable names for the network cards assume you know where on the PCI bus the card is plugged in. If there is no logic for us to know that, it's gonna be very hard. The logic you used before is unstable, as Robert already explained.

So the question here is: is the bus location of the network cards the same from one blade to another?

If the answer is that it differs from blade to blade, then we need host specific information in the yaml file. Which in other words means you need to know the interface name or the bus location per instance. I guess this is a cluster, and the selection of the blade that actually runs the system is at another level of the infrastructure?

Houston we have a problem

rjschwei commented 5 years ago

Well, we can try and write ifcfg-enx* files, but I do not know what wicked does in that case, so that needs to be tested. The only thing we apparently know is the MAC address, as that gets assigned by the UCS network fabric software. Then the YAML would contain

interface: enx0025b51b0012

for example, and the setup code would write "ifcfg-enx0025b51b0012". But I have no idea if wicked will find that interface based on this name. With predictable names an entry for "enp200s0" will exist in /sys/class/net, but no entry based on the MAC address will exist. I don't think the MAC based identification will work, but it is worth a test.

schaefi commented 5 years ago

I asked Marius about this...

However, how would that help? Any card has a unique MAC address. This would also mean each blade has to have a dedicated yaml file with interface: enx0025b51b0012, or we create a udev rule that maps any potentially existing MAC address to an interface name. None of this looks like a nice solution to me.

rjschwei commented 5 years ago

The MAC address is set by the network fabric of the UCS system and is known, as @jeffaco has shown in an earlier comment.

Anyway, I had another idea this morning about how to solve this problem. We can rewrite the persistent net rules file in a safe way. Here is my proposal: https://github.com/SUSE-Enceladus/azure-li-services/blob/netEnum/azure_li_services/nic_enumeration.py

I only implemented the core logic. Full integration with tests, service setup etc. needs to be completed if we agree on this approach. The YAML would then get a "mac" entry.

schaefi commented 5 years ago

The MAC address is set by the network fabric of the UCS system and is known, as @jeffaco has shown in an earlier comment.

OK, so that means the MAC is the stable factor here and will become a mandatory setting in the network section.

I got further information from Marius. In theory (untested) we could provide a udev rule in our images that does this

/etc/udev/rules.d/79-net-rename-mac.rules

SUBSYSTEM=="net", ACTION=="add", NAME=="", ENV{ID_NET_NAME_MAC}!="", NAME="$env{ID_NET_NAME_MAC}"

This should allow us to use the MAC based interface name. A rewrite of the net rules to ethX would then not be required, and the interface would explicitly be named according to its MAC. I haven't tested this, but if it works I would prefer this approach.

Will do some testing

schaefi commented 5 years ago

Ok I had some success in my testing. I would go the following way:

  1. In our image descriptions we adapt 80-net-setup-link.rules to create interfaces based on the MAC

    Move from ID_NET_NAME to ID_NET_NAME_MAC

    This will result in interface names looking like this:

    2: enxdeadbeefb8c2: <BROADCAST,MULTICAST> mtu 9000 qdisc pfifo_fast state DOWN group default qlen 1000
         link/ether de:ad:be:ef:b8:c2 brd ff:ff:ff:ff:ff:ff

    This allows us to create ifcfg- configurations based on the MAC address name.

  2. The yaml file needs to specify the network configuration like this:

    networking:
      interface: enxdeadbeefb8c2

I have tested this and it worked well for me. So no rewriting of rules is required in my opinion, and there is no potential race condition waiting for the network interface names to be rewritten.
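As a side note, the enx name is derived mechanically from the MAC address (separators stripped, lowercased), so the YAML values could be generated straight from the fabric's MAC list. A tiny illustrative helper:

def mac_to_enx(mac):
    """Derive the MAC-based predictable interface name from a MAC address."""
    return "enx" + mac.lower().replace(":", "").replace("-", "")

assert mac_to_enx("00:25:B5:1A:00:0D") == "enx0025b51a000d"
assert mac_to_enx("de:ad:be:ef:b8:c2") == "enxdeadbeefb8c2"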

Thoughts?

jeffaco commented 5 years ago

I was going to try using MAC addresses as the interface name. Robert had asked for that to be tested, and it was on my list (right behind testing why, on Gen4, the disk wasn't resized - I didn't forget about that).

He was skeptical that would work, however.

So I guess this mechanism definitely works? But we'd need a new image to test with (because of the image description change)? I did get a test image to work with predictable names (like enp209s0, although this turns out to not be predictable at all). Would that image work, or is this change somewhat more involved?

jeffaco commented 5 years ago

I heard back from Cisco (I asked them why their predictable network device names weren't predictable, referring them to this post).

They responded:

Thanks for the background. Just curious if you have tried the consistent device name (CDN) in the vNIC template or directly in the service profile? Here is the documentation for it.

You also need to enable CDN control in the BIOS policy before enabling CDN in the VNIC template/service profile. Reboot the server to take effect.

If you search for CDN on that page, you get to the relevant stuff. And sure enough, that page discusses the very thing that I'm encountering:

When there is no mechanism for the Operating System to label Ethernet interfaces in a consistent manner, it becomes difficult to manage network connections with server configuration changes. Consistent Device Naming (CDN), introduced in Cisco UCS Manager Release 2.2(4), allows Ethernet interfaces to be named in a consistent manner. This makes Ethernet interface names more persistent when adapter or other configuration changes are made.

The default behavior for CDN on UCS is disabled, and that's what we're currently using. This can be changed, however. BUT: I noted on Cisco's link that SLES doesn't appear to be supported? The supported O/S list is:

It's interesting that RHEL is supported but SUSE is not, since they use common kernels (although the kernel configuration might be different).

CDN might fit the bill better because we wouldn't have to worry about doing automation to get the MAC addresses. Of course, CDN isn't an option on our VLI systems, where we'd presumably need to use predictable names there anyway. But I suspect that, on VLI, the names will actually be predictable. I can't be certain yet since I only have access to one Gen4 VLI system.

This terminology is a little numbing - regular network names, predictable names (based on MAC address or other factors), consistent names ...

So, some questions:

  1. Why are CDN (Consistent Device Names) not supported by SLES? Will they work, or does this take engineering effort that RedHat did that SUSE did not?

  2. Are CDN names a better option for us (since the names sound predictable from blade to blade, and move with the profile)? It sounds like they are, but only if they'll work for SLES ...

Thanks for your thoughts!

schaefi commented 5 years ago

He was skeptical that would work, however. So I guess this mechanism definitely works?

yes, it does

But we'd need a new image to test with

correct, the one we gave you uses predictable names, but based on PCI location, not based on MAC. As you nicely explained, based on PCI location is not really predictable. So yes, a new image would be needed.

Why are CDN (Consistent Device Names) not supported by SLES?

I can't answer this question. Either Robert knows why or I will be asking the right people next week when I'm in NUE

Are CDN names a better option for us

It would be less work on your side. As you said, if we go with the MAC assignment, the "which MAC per interface" information is something you would need to provide in the yaml per instance. If you can trust the system with CDN to provide the same interface names for any instance, that would lower the work on your side. However, as we are in the enterprise business, I would not use features from the kernel that are flagged as not supported. I know many technologies work no matter what their official support status is, but we should not go that route imho.

rjschwei commented 5 years ago

I think we should stick with the predictable names based on MAC address; a test image is now available. That is straightforward and does not need any features on the system that have questionable support status.

As far as CDN is concerned, I suspect that may simply be a documentation or a test issue. Since CDN is determined by a BIOS setting, this should apply equally to all Linux distributions. After all, reading the firmware information is pretty much standard.

schaefi commented 5 years ago

JFI: Had a conversation with Marius about CDN (BIOS device names) and he also recommended not to use them. The reason here is simple: the names are presented by the BIOS to the system. This means it depends on the BIOS itself whether we get them, and it also depends on the BIOS whether they are correct. Any change on that level will run through the system and cause harm to our implementation.

schaefi commented 5 years ago

I have activated ID_NET_NAME_MAC in our devel image builds. It is done with an additional rule, 81-net-setup-link.rules, which comes directly after 80-net-setup-link and rewrites the interfaces to their MAC based representation. I also tested the setup and adapted my integration test build.

So from my perspective all coding work for this issue is done.

jeffaco commented 5 years ago

I booted what I believe is the latest test image. Networking was not up at all. Here's the output of ifconfig -a:

enx0025b5 Link encap:Ethernet  HWaddr 00:25:B5:1A:00:15  
          BROADCAST MULTICAST  MTU:9000  Metric:1
          RX packets:0 errors:0 dropped:18 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:144 (144.0 b)  TX bytes:0 (0.0 b)

enx0025b5 Link encap:Ethernet  HWaddr 00:25:B5:1A:00:16  
          BROADCAST MULTICAST  MTU:9000  Metric:1
          RX packets:0 errors:0 dropped:18 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:144 (144.0 b)  TX bytes:0 (0.0 b)

enx0025b5 Link encap:Ethernet  HWaddr 00:25:B5:1A:00:17  
          BROADCAST MULTICAST  MTU:9000  Metric:1
          RX packets:0 errors:0 dropped:18 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:144 (144.0 b)  TX bytes:0 (0.0 b)

enx0025b5 Link encap:Ethernet  HWaddr 00:25:B5:1A:00:18  
          BROADCAST MULTICAST  MTU:9000  Metric:1
          RX packets:0 errors:0 dropped:18 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:144 (144.0 b)  TX bytes:0 (0.0 b)

enx0025b5 Link encap:Ethernet  HWaddr 00:25:B5:1B:00:0F  
          BROADCAST MULTICAST  MTU:9000  Metric:1
          RX packets:0 errors:0 dropped:18 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:144 (144.0 b)  TX bytes:0 (0.0 b)

enx0025b5 Link encap:Ethernet  HWaddr 00:25:B5:1B:00:10  
          BROADCAST MULTICAST  MTU:9000  Metric:1
          RX packets:0 errors:0 dropped:18 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:144 (144.0 b)  TX bytes:0 (0.0 b)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:10 errors:0 dropped:0 overruns:0 frame:0
          TX packets:10 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:660 (660.0 b)  TX bytes:660 (660.0 b)

I'm pretty sure I'm using the right image due to the enx devices in the above output.

I ended up grabbing the network configuration files in /etc/sysconfig/network, and those weren't what I expected at all. Here's an ifcfg.tar.gz with the contents of that directory. It was still configuring the eth* devices, which I didn't expect - I thought those files would be mentioning the MAC addresses.

In case it's relevant, here's the suse_firstboot_config.yaml file.

Let me know if you need additional information, thanks.

jeffaco commented 5 years ago

By the way: Because networking failed to come up, there was a deployment error in storage (couldn't mount storage devices).

I would have expected to see an error file from this in the config LUN, but did not:

Sollabdsm33:~ # ls /mnt/yaml3/
lost+found  rpms  scripts  ssh  suse_firstboot_config.yaml
Sollabdsm33:~ #

The console clearly showed a deployment error, however. What happened here? Why no logging details? The configuration disk was clearly mounted since other things were set (accounts, etc).

rjschwei commented 5 years ago

@jeffaco what did the YAML look like for the attempt with the failed network setup? Also, since the names of the interfaces are now long, please use the ip a command so we can see the full device names and not the names shortened by ifconfig.

jeffaco commented 5 years ago

I included the YAML in my original message with the results, I think you missed that.

Output from ip a command:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: enx0025b51a0015: <BROADCAST,MULTICAST> mtu 9000 qdisc noop state DOWN group default qlen 1000
    link/ether 00:25:b5:1a:00:15 brd ff:ff:ff:ff:ff:ff
3: enx0025b51a0016: <BROADCAST,MULTICAST> mtu 9000 qdisc noop state DOWN group default qlen 1000
    link/ether 00:25:b5:1a:00:16 brd ff:ff:ff:ff:ff:ff
4: enx0025b51a0017: <BROADCAST,MULTICAST> mtu 9000 qdisc noop state DOWN group default qlen 1000
    link/ether 00:25:b5:1a:00:17 brd ff:ff:ff:ff:ff:ff
5: enx0025b51a0018: <BROADCAST,MULTICAST> mtu 9000 qdisc noop state DOWN group default qlen 1000
    link/ether 00:25:b5:1a:00:18 brd ff:ff:ff:ff:ff:ff
6: enx0025b51b000f: <BROADCAST,MULTICAST> mtu 9000 qdisc noop state DOWN group default qlen 1000
    link/ether 00:25:b5:1b:00:0f brd ff:ff:ff:ff:ff:ff
7: enx0025b51b0010: <BROADCAST,MULTICAST> mtu 9000 qdisc noop state DOWN group default qlen 1000
    link/ether 00:25:b5:1b:00:10 brd ff:ff:ff:ff:ff:ff

Let me know what else you need, thanks.

rjschwei commented 5 years ago

Thanks, OK those things match up, so we need to look at what wicked did. @jeffaco we'll need

ls /etc/sysconfig/network

then we'll want to know what wicked produced as far as messages are concerned. Use journalctl -u for each of these services:

wicked.service wickedd-auto4.service wickedd-dhcp4.service wickedd-dhcp6.service wickedd-nanny.service wickedd.service

jeffaco commented 5 years ago

In the above message, I said:

I ended up grabbing the network configuration files in /etc/sysconfig/network, and those weren't what I expected at all. Here's an ifcfg.tar.gz with the contents of that directory. It was still configuring the eth* devices, which I didn't expect - I thought those files would be mentioning the MAC addresses.

If I understand what you're after, couldn't you have gotten the results of the ls command from the ifcfg.tar.gz file? And the contents of the files, too, in case that was relevant? Or maybe I don't fully understand what you're after.

In any case, here's the output from each of the commands you asked for. I did the ls command with -l to ensure you could differentiate directories from files.

Output from ls -l /etc/sysconfig/network:

total 92
-rw-r--r-- 1 root root  9692 Feb 16 12:15 config
-rw-r--r-- 1 root root 13520 Feb 16 12:16 dhcp
drwxr-xr-x 2 root root     6 Jun 27  2017 if-down.d
drwxr-xr-x 2 root root    27 Feb 16 12:15 if-up.d
-rw-r--r-- 1 root root    85 Feb 21 00:12 ifcfg-eth0
-rw-r--r-- 1 root root   146 Feb 21 00:12 ifcfg-eth0.250
-rw-r--r-- 1 root root    87 Feb 21 00:12 ifcfg-eth1
-rw-r--r-- 1 root root   148 Feb 21 00:12 ifcfg-eth1.251
-rw-r--r-- 1 root root    87 Feb 21 00:12 ifcfg-eth2
-rw-r--r-- 1 root root   148 Feb 21 00:12 ifcfg-eth2.252
-rw-r--r-- 1 root root    87 Feb 21 00:12 ifcfg-eth3
-rw-r--r-- 1 root root   148 Feb 21 00:12 ifcfg-eth3.253
-rw------- 1 root root   147 Dec  5 14:14 ifcfg-lo
-rw-r--r-- 1 root root 21738 Oct 14  2016 ifcfg.template
-rw-r--r-- 1 root root    29 Feb 21 00:12 ifroute-eth0.250
drwx------ 2 root root     6 Jun 27  2017 providers
drwxr-xr-x 2 root root    97 Feb 16 12:15 scripts

Output from journalctl -u wicked.service:

-- Logs begin at Thu 2019-02-21 20:17:04 UTC, end at Thu 2019-02-21 20:30:01 UTC. --
Feb 21 20:18:02 Sollabdsm31 systemd[1]: Starting wicked managed network interfaces...
Feb 21 20:18:32 Sollabdsm31 wicked[4686]: lo              up
Feb 21 20:18:32 Sollabdsm31 wicked[4686]: eth0            no-device
Feb 21 20:18:32 Sollabdsm31 wicked[4686]: eth0.250        no-device
Feb 21 20:18:32 Sollabdsm31 wicked[4686]: eth1            no-device
Feb 21 20:18:32 Sollabdsm31 wicked[4686]: eth1.251        no-device
Feb 21 20:18:32 Sollabdsm31 wicked[4686]: eth2            no-device
Feb 21 20:18:32 Sollabdsm31 wicked[4686]: eth2.252        no-device
Feb 21 20:18:32 Sollabdsm31 wicked[4686]: eth3            no-device
Feb 21 20:18:32 Sollabdsm31 wicked[4686]: eth3.253        no-device
Feb 21 20:18:32 Sollabdsm31 systemd[1]: Started wicked managed network interfaces.

Output from journalctl -u wickedd-auto4.service:

-- Logs begin at Thu 2019-02-21 20:17:04 UTC, end at Thu 2019-02-21 20:30:01 UTC. --
Feb 21 20:18:01 Sollabdsm31 systemd[1]: Starting wicked AutoIPv4 supplicant service...
Feb 21 20:18:01 Sollabdsm31 systemd[1]: Started wicked AutoIPv4 supplicant service.

Output from journalctl -u wickedd-dhcp4.service:

-- Logs begin at Thu 2019-02-21 20:17:04 UTC, end at Thu 2019-02-21 20:30:01 UTC. --
Feb 21 20:18:01 Sollabdsm31 systemd[1]: Starting wicked DHCPv4 supplicant service...
Feb 21 20:18:01 Sollabdsm31 systemd[1]: Started wicked DHCPv4 supplicant service.

Output from journalctl -u wickedd-dhcp6.service:

-- Logs begin at Thu 2019-02-21 20:17:04 UTC, end at Thu 2019-02-21 20:30:01 UTC. --
Feb 21 20:18:01 Sollabdsm31 systemd[1]: Starting wicked DHCPv6 supplicant service...
Feb 21 20:18:01 Sollabdsm31 systemd[1]: Started wicked DHCPv6 supplicant service.

Output from journalctl -u wickedd-nanny.service:

-- Logs begin at Thu 2019-02-21 20:17:04 UTC, end at Thu 2019-02-21 20:30:01 UTC. --
Feb 21 20:18:02 Sollabdsm31 systemd[1]: Starting wicked network nanny service...
Feb 21 20:18:02 Sollabdsm31 systemd[1]: Started wicked network nanny service.

Output from journalctl -u wickedd.service:

-- Logs begin at Thu 2019-02-21 20:17:04 UTC, end at Thu 2019-02-21 20:30:01 UTC. --
Feb 21 20:18:01 Sollabdsm31 systemd[1]: Starting wicked network management service daemon...
Feb 21 20:18:02 Sollabdsm31 systemd[1]: Started wicked network management service daemon.

Let me know if you need more information, thanks so much for your help!

rjschwei commented 5 years ago

That's why the network is not up: the files are named ifcfg-ethX, but the interface names are of the enx0025b51a0015 form. Not sure why the service generated the wrong interface file names. What should be in /etc/sysconfig/network is

ifcfg-enx0025b51a0015

for example.
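A quick way to spot this kind of mismatch is to compare the ifcfg files against the interfaces the kernel actually has. An illustrative snippet (not project code):

import os

def unmatched_ifcfg(network_dir="/etc/sysconfig/network"):
    """List ifcfg files that have no matching interface in /sys/class/net."""
    interfaces = set(os.listdir("/sys/class/net"))
    stale = []
    for name in sorted(os.listdir(network_dir)):
        if name.startswith("ifcfg-") and name != "ifcfg-lo":
            base = name[len("ifcfg-"):].split(".")[0]  # strip VLAN suffix
            if base not in interfaces:
                stale.append(name)
    return stale

print(unmatched_ifcfg())  # e.g. ['ifcfg-eth0', 'ifcfg-eth0.250', ...]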

jeffaco commented 5 years ago

Yeah, I saw that and noted that in my quote above.

Any idea why that happened? A problem with my YAML (misunderstanding of what to put, perhaps?), or a problem with the code?

rjschwei commented 5 years ago

@jeffaco Sorry for making you do double work; yes, I missed half of what you said in the earlier comment :( I guess I was too distracted by the truncation from ifconfig.

Anyway I looked at the code and it doesn't care what the name is. The code creates the name of the ifcfg- files based on the value of "interface". In theory things should match up.

jeffaco commented 5 years ago

Okay, I'll let you or Marcus take a closer look at the code to figure out why theory doesn't match up with reality.

I'm trying to bring up the interfaces (at least the client network) by:

  1. Renaming ifcfg-eth0 to ifcfg-enx0025b51a0015
  2. Renaming ifcfg-eth0.250 to ifcfg-enx0025b51a0015.250 and editing both DEVICE and ETHERDEVICE in that file, and
  3. Renaming ifroute-eth0.250 to ifroute-enx0025b51a0015 and editing the file to contain:
default 10.60.0.1 - enx0025b51a0015.250

After that, I restart wicked and the network doesn't come up. Makes it a pain in the tush as I'm tied to the console:

(screenshot: screen shot 2019-02-21 at 1 30 58 pm)

I didn't change the other interfaces (I thought I'd do that when I could connect via SSH). Any idea what I'm doing wrong?

rjschwei commented 5 years ago

After renaming the files, just run "ifup enx0025b51a0015" and "ifstatus enx0025b51a0015"

jeffaco commented 5 years ago

It's not coming up properly: ifup enx0025b51a0015 yields:

wicked: Rejecting suspect interface name: enx0025b51a0015.250
enx0025b51a0015 up

But then output of ip a is as above, and I don't have a network where I can ping the gateway. If I try to ping the gateway, I get: connect: Network is unreachable. Any suggestions?

rjschwei commented 5 years ago

I've reached out to the networking team

schaefi commented 5 years ago

Just saw your e-mail, let's wait for feedback from Marius. I haven't seen this error in my kvm based test which is interesting. The vlan id in my tests is '0' or '1'...

schaefi commented 5 years ago

Some refactoring in our network code is required due to a size limit in network interface names. I will adapt the code according to the thread we had with Marius
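For context on that limit: the kernel restricts interface names to 15 characters (IFNAMSIZ is 16 including the terminating NUL), and the MAC based names already use all 15, so appending a VLAN suffix such as .250 pushes the name over the limit. That would also explain wicked rejecting enx0025b51a0015.250 as a suspect name. Illustrative arithmetic:

# Illustrative check for the name length limit mentioned above.
IFNAMSIZ_USABLE = 15  # kernel IFNAMSIZ (16) minus the terminating NUL

for name in ("enx0025b51a0015", "enx0025b51a0015.250"):
    verdict = "ok" if len(name) <= IFNAMSIZ_USABLE else "too long"
    print(name, len(name), verdict)
# enx0025b51a0015 15 ok
# enx0025b51a0015.250 19 too long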

jeffaco commented 5 years ago

Hey, another issue came up with this test image:

This test image doesn't work with old YAML files. That is, if a YAML file specifies eth0 (for old behavior), it appears to still pick some sort of MAC address, and then doesn't come up for networking.

Old behavior should be consistent (with an old YAML file) to still work in Gen3. I'd like the new behavior (setting up interfaces like enx0025b51a0015) to only take place if I specify an interface like that in the YAML.

Is this possible?

schaefi commented 5 years ago

This test image doesn't work with old YAML files. That is, if a YAML file specifies eth0 (for old behavior), it appears to still pick some sort of MAC address, and then doesn't come up for networking. Old behavior should be consistent (with an old YAML file)

The yaml file is not the driver here. Whether you get eth0 vs. enx interface names is controlled by the kernel boot option:

net.ifnames=1

The image we (Robert) sent you has this option enabled to get you started with mac based interface names. If you want to go back you need to pass:

net.ifnames=0

To the kernel when it boots. There is no way for us to control this on the yaml level.

I'm concerned that you are considering going back to non-predictable names. That would actually only be a safe choice if only one interface exists on the machine.

It would also be nice if we stay focused on the issue here, meaning dealing with multiple interfaces and assigning them correctly. I'm in the process of creating an image for you which includes fixes for all the reported issues, and I hope testing on your side will consistently use the MAC based setup. I'd like to get this fixed and tested. Once all is good we can talk about which systems you want predictable interface names for and which you don't.

Makes sense?

schaefi commented 5 years ago

#125 is open for review and is the last bit in the queue before a new image will be ready for testing. Let's make that happen first.

jeffaco commented 5 years ago

Makes sense, Marcus. The concern I have with predictable names is that I'm required to offer the MAC address. I think predictable names will likely work fine as is (without MAC addresses) on VLI systems, and regular network names (eth0, etc) always worked fine in Gen3 (which we still support).

Thus, my thought was this:

| Platform | Network Name Convention |
| --- | --- |
| Gen3 | Previous (working) config: eth0 ... ethX |
| Gen4 (LI/UCS) | Predictable names with MAC addresses |
| Gen4 (VLI) | Predictable names without MAC addresses (enp...) |

Those are my thoughts. Any objections? If we do this, how can we specify the kernel boot parameters? Or will we have to use different images in these cases (I'd really prefer not to do that)?

rjschwei commented 5 years ago

Using predictable names on VLI based on location rather than MAC is not an issue.

jeffaco commented 5 years ago

Sure, but: how do we manage Gen3 platforms?

How can we specify the kernel boot parameters? Or will we have to use different images in Gen3 vs. Gen4 (I'd really prefer not to do that)?

jeffaco commented 5 years ago

Let's leave this issue open until we completely understand how this will work in Gen3 (LI) and Gen4 (LI/VLI) given the above table ...

rjschwei commented 5 years ago

Gen3 LI will have to transition to the new MAC based scheme

jeffaco commented 5 years ago

Ooh, I'm not sure that's possible. Is that our only option? I'll need to take this to the team - we may end up not using this capability if that's the case.

Let me know if this is the only option, thanks. It would be super awesome if, somehow, the YAML could be used to determine the naming scheme ...

rjschwei commented 5 years ago

As @schaefi said, it is a kernel configuration option and has nothing to do with the YAML. What you are asking for is to build separate images for Gen3 and Gen4, so double the image count for LI. Sorry, that's not an option.

schaefi commented 5 years ago

Gen3 Previous (working) config: eth0 ... ethx

You could have been on a lucky path here so far. There is no guarantee that the order of interfaces is persistent between boot cycles of the machine. As I said, ethX naming is imho only a safe option if there is only one interface available. As soon as there are more, you are on an unstable path with eth naming, as it is done by the order in which the devices appear to the kernel, and that order is not guaranteed.

Gen3 LI will have to transition to the new MAC based scheme

From my point of view that is the only stable solution.