coreos / fedora-coreos-tracker

Issue tracker for Fedora CoreOS
https://fedoraproject.org/coreos/

DNS issues on Azure #356

Closed arithx closed 4 years ago

arithx commented 4 years ago

Issue Report

Bug

Fedora CoreOS Version

31.20200113.3.1

Expected Behavior

Working DNS

Actual Behavior

When spawning FCOS machines on Azure there is no DNS. The machines do seem to have working networking otherwise.

Reproduction Steps

  1. boot machine on Azure
  2. ping google.com

Other Information

I haven't managed to get a machine booted on Azure with working DNS, whether spawned manually via the CLI or via kola.

jlebon commented 4 years ago

That's odd. What does /etc/resolv.conf say? Logs from NetworkManager?

Hmm, but clearly this has to be working on RHCOS. One difference I can think of is that RHCOS does check-in from the initrd, though I don't think checking in would be related to DNS.

arithx commented 4 years ago

/etc/resolv.conf doesn't exist (likely because we aren't bringing down the networking in the initramfs like RHCOS is).

I've included /run/initramfs/state/etc/resolv.conf as well as the journal for NetworkManager (note that I manually restarted NetworkManager earlier on via sudo systemctl restart NetworkManager to see if that resolved it).

[core@networktest ~]$ cat /etc/resolv.conf
cat: /etc/resolv.conf: No such file or directory
[core@networktest ~]$ ls /etc/
adjtime                        csh.cshrc                fedora-release  hosts          libnl           multipath          pkcs11            rpm               sssd                tmpfiles.d
aliases                        csh.login                filesystems     idmapd.conf    libreport       netconfig          pkgconfig         rpm-ostreed.conf  statetab.d          trusted-key.key
alternatives                   dbus-1                   fuse.conf       inittab        libssh          NetworkManager     pki               rsyncd.conf       subgid              udev
bash_completion.d              default                  gcrypt          inputrc        libuser.conf    networks           pm                rwtab.d           subgid-             virc
bashrc                         depmod.d                 gnupg           iproute2       login.defs      nfs.conf           polkit-1          samba             subuid              X11
bindresvport.blacklist         dhcp                     GREP_COLORS     iscsi          logrotate.conf  nfsmount.conf      popt.d            sasl2             subuid-             xattr.conf
binfmt.d                       DIR_COLORS               group           issue          logrotate.d     nftables           prelink.conf.d    security          sudoers             xdg
chrony.conf                    DIR_COLORS.256color      group-          issue.d        lvm             nsswitch.conf      printcap          selinux           sudoers.d           yum.repos.d
chrony.keys                    DIR_COLORS.lightbgcolor  grub2.cfg       issue.net      machine-id      nsswitch.conf.bak  profile           services          swid                zincati
cifs-utils                     dnf                      grub2-efi.cfg   kernel         magic           openldap           profile.d         sestatus.conf     sysconfig
cni                            dracut.conf              grub.d          krb5.conf      mke2fs.conf     opt                protocols         shadow            sysctl.conf
console-login-helper-messages  dracut.conf.d            gshadow         krb5.conf.d    modprobe.d      os-release         rc.d              shadow-           sysctl.d
containerd                     environment              gshadow-        ld.so.cache    modules-load.d  ostree             redhat-release    shells            systemd
containers                     ethertypes               gss             ld.so.conf     motd            pam.d              request-key.conf  skel              system-release
cron.d                         exports                  host.conf       ld.so.conf.d   motd.d          passwd             request-key.d     ssh               system-release-cpe
crypto-policies                fedora-coreos-pinger     hostname        libaudit.conf  mtab            passwd-            rpc               ssl               terminfo
[core@networktest ~]$ cat /run/initramfs/state/etc/resolv.conf 
nameserver 168.63.129.16
search  u5e2tmrol1sebjifwcberhsgzf.dx.internal.cloudapp.net
[core@networktest ~]$ journalctl -t NetworkManager --no-pager
-- Logs begin at Tue 2020-01-28 20:03:24 UTC, end at Tue 2020-01-28 21:28:33 UTC. --
Jan 28 20:04:35 networktest NetworkManager[1003]: <info>  [1580241875.0456] NetworkManager (version 1.20.8-1.fc31) is starting... (for the first time)
Jan 28 20:04:35 networktest NetworkManager[1003]: <info>  [1580241875.0459] Read config: /etc/NetworkManager/NetworkManager.conf (lib: 10-disable-default-plugins.conf, 20-client-id-from-mac.conf) (run: 10-dracut-dhclient.conf)
Jan 28 20:04:35 networktest NetworkManager[1003]: <info>  [1580241875.6094] bus-manager: acquired D-Bus service "org.freedesktop.NetworkManager"
Jan 28 20:04:35 networktest NetworkManager[1003]: <info>  [1580241875.6509] manager[0x56149f4c0130]: monitoring kernel firmware directory '/lib/firmware'.
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.4818] hostname: hostname: using hostnamed
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.4818] hostname: hostname changed from (none) to "networktest"
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.4824] dns-mgr[0x56149f4a3240]: init: dns=default,systemd-resolved rc-manager=symlink
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.5263] manager[0x56149f4c0130]: rfkill: Wi-Fi hardware radio set enabled
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.5264] manager[0x56149f4c0130]: rfkill: WWAN hardware radio set enabled
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.6368] manager: rfkill: Wi-Fi enabled by radio killswitch; enabled by state file
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.6369] manager: rfkill: WWAN enabled by radio killswitch; enabled by state file
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.6370] manager: Networking is enabled by state file
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.6724] dhcp-init: Using DHCP client 'dhclient'
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.6725] settings: Loaded settings plugin: keyfile (internal)
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.6900] device (lo): carrier: link connected
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.6903] manager: (lo): new Generic device (/org/freedesktop/NetworkManager/Devices/1)
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.6910] device (eth0): carrier: link connected
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.6914] manager: (eth0): new Ethernet device (/org/freedesktop/NetworkManager/Devices/2)
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.7868] settings: (eth0): created default wired connection 'Wired connection 1'
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.7913] device (eth0): state change: unmanaged -> unavailable (reason 'connection-assumed', sys-iface-state: 'external')
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.7922] device (eth0): state change: unavailable -> disconnected (reason 'connection-assumed', sys-iface-state: 'external')
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.7931] device (eth0): Activation: starting connection 'eth0' (4a30bc0c-48d3-49d2-a508-2edd429eaba7)
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.8076] device (eth0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'external')
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.8080] device (eth0): state change: prepare -> config (reason 'none', sys-iface-state: 'external')
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.8083] device (eth0): state change: config -> ip-config (reason 'none', sys-iface-state: 'external')
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.8085] device (eth0): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'external')
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.8195] device (eth0): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'external')
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.8197] device (eth0): state change: secondaries -> activated (reason 'none', sys-iface-state: 'external')
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.8200] manager: NetworkManager state is now CONNECTED_LOCAL
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.8207] device (eth0): Activation: successful, device activated.
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.8212] manager: NetworkManager state is now CONNECTED_GLOBAL
Jan 28 20:04:37 networktest NetworkManager[1003]: <info>  [1580241877.8215] manager: startup complete
Jan 28 20:06:43 networktest NetworkManager[1003]: <info>  [1580242003.3939] caught SIGTERM, shutting down normally.
Jan 28 20:06:43 networktest NetworkManager[1003]: <info>  [1580242003.3953] manager: NetworkManager state is now CONNECTED_LOCAL
Jan 28 20:06:43 networktest NetworkManager[1003]: <info>  [1580242003.4758] exiting (success)
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.5124] NetworkManager (version 1.20.8-1.fc31) is starting... (after a restart)
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.5125] Read config: /etc/NetworkManager/NetworkManager.conf (lib: 10-disable-default-plugins.conf, 20-client-id-from-mac.conf) (run: 10-dracut-dhclient.conf)
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.5201] bus-manager: acquired D-Bus service "org.freedesktop.NetworkManager"
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.5350] manager[0x555befdfe130]: monitoring kernel firmware directory '/lib/firmware'.
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8577] hostname: hostname: using hostnamed
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8578] hostname: hostname changed from (none) to "networktest"
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8580] dns-mgr[0x555befde3240]: init: dns=default,systemd-resolved rc-manager=symlink
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8583] manager[0x555befdfe130]: rfkill: Wi-Fi hardware radio set enabled
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8583] manager[0x555befdfe130]: rfkill: WWAN hardware radio set enabled
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8597] manager: rfkill: Wi-Fi enabled by radio killswitch; enabled by state file
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8598] manager: rfkill: WWAN enabled by radio killswitch; enabled by state file
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8599] manager: Networking is enabled by state file
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8600] dhcp-init: Using DHCP client 'dhclient'
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8601] settings: Loaded settings plugin: keyfile (internal)
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8615] device (lo): carrier: link connected
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8617] manager: (lo): new Generic device (/org/freedesktop/NetworkManager/Devices/1)
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8624] device (eth0): carrier: link connected
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8628] manager: (eth0): new Ethernet device (/org/freedesktop/NetworkManager/Devices/2)
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8648] device (eth0): state change: unmanaged -> unavailable (reason 'connection-assumed', sys-iface-state: 'external')
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8657] device (eth0): state change: unavailable -> disconnected (reason 'connection-assumed', sys-iface-state: 'external')
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8665] device (eth0): Activation: starting connection 'eth0' (794b0119-9912-453b-b991-246d38a41599)
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8678] device (eth0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'external')
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8681] device (eth0): state change: prepare -> config (reason 'none', sys-iface-state: 'external')
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8684] device (eth0): state change: config -> ip-config (reason 'none', sys-iface-state: 'external')
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8686] device (eth0): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'external')
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8831] device (eth0): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'external')
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8833] device (eth0): state change: secondaries -> activated (reason 'none', sys-iface-state: 'external')
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8836] manager: NetworkManager state is now CONNECTED_LOCAL
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8841] device (eth0): Activation: successful, device activated.
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8845] manager: NetworkManager state is now CONNECTED_GLOBAL
Jan 28 20:06:43 networktest NetworkManager[2116]: <info>  [1580242003.8847] manager: startup complete
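
To check whether a machine is in the same state as the output above (no /etc/resolv.conf, but a usable copy left behind by the initramfs), something like the following might help. It is only a sketch: the helper name list_nameservers is made up for illustration and is not part of FCOS, and the two paths are simply the ones shown in this comment.

```shell
#!/bin/sh
# Hypothetical helper (not an FCOS tool): print the nameserver entries of a
# resolv.conf-style file, one per line. Returns 1 if the file does not exist
# (a dangling symlink also counts as missing, since -f follows symlinks).
list_nameservers() {
    [ -f "$1" ] || return 1
    # Only lines whose first field is exactly "nameserver" are printed.
    awk '$1 == "nameserver" && NF >= 2 { print $2 }' "$1"
}

# Compare the real-root file with the copy kept under /run/initramfs.
for f in /etc/resolv.conf /run/initramfs/state/etc/resolv.conf; do
    if servers="$(list_nameservers "$f")"; then
        echo "$f: ${servers:-<no nameserver lines>}"
    else
        echo "$f: missing"
    fi
done
```

On an affected machine the first path would be reported as missing while the second lists the nameserver handed out during the initramfs DHCP run.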

jlebon commented 4 years ago

likely because we aren't bringing down the networking in the initramfs like RHCOS is

Ahhh hmm yeah, that's a big delta. I don't quite remember now why we don't do this in FCOS too. Maybe we expected it to be unnecessary with the switch to NM in the initrd?

jomeier commented 4 years ago

https://github.com/coreos/fedora-coreos-tracker/issues/148#issuecomment-565830139

That's a showstopper because we have to reboot each machine a few times manually before it has an internet connection.

Is it a big task to resolve this in the official fcos image?

lucab commented 4 years ago

@jlebon I think https://github.com/coreos/ignition-dracut/issues/119 is related.

jomeier commented 4 years ago

Hi folks. Do you have any updates for this one?

LorbusChris commented 4 years ago

xref: https://github.com/coreos/ignition-dracut/issues/119 and https://github.com/dracutdevs/dracut/issues/694

dustymabe commented 4 years ago

cross referencing this with https://github.com/coreos/fedora-coreos-tracker/issues/394

jomeier commented 4 years ago

https://bugzilla.redhat.com/show_bug.cgi?id=1809641

simongottschlag commented 4 years ago

Hitting this issue as well. Any updates?

dustymabe commented 4 years ago

yes, once https://github.com/coreos/ignition-dracut/pull/159 and https://github.com/coreos/fedora-coreos-config/pull/310 are merged and into a release we think this should be taken care of.

simongottschlag commented 4 years ago

Hi,

FYI, I used this in the ignition to work around the issue. Seems to be working:

systemd:
  units:
    - name: azure-restart-network.service
      enabled: true
      contents: |
        [Service]
        Type=oneshot
        ExecStart=/bin/bash -c '\
          /usr/bin/cp /run/initramfs/state/etc/resolv.conf /etc/resolv.conf; \
          /usr/bin/systemctl restart NetworkManager'

        [Install]
        WantedBy=multi-user.target
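
A slightly more defensive take on the copy step in the unit above, as a sketch only and not tested on Azure: it copies the initramfs file only when nothing exists at the destination, so it would not clobber a resolv.conf (or symlink) that NetworkManager has already written. The function name restore_resolv_conf and its exit codes are invented for illustration.

```shell
#!/bin/sh
# Hypothetical helper: copy SRC to DST only when DST is absent and SRC exists.
# Exit status: 0 = copied, 1 = nothing to do (DST or a symlink already there),
#              2 = SRC missing.
restore_resolv_conf() {
    src="$1"
    dst="$2"
    # Treat any existing entry, including a dangling symlink, as "present"
    # so that a link managed by NetworkManager is never written through.
    if [ -e "$dst" ] || [ -L "$dst" ]; then
        return 1
    fi
    [ -f "$src" ] || return 2
    cp "$src" "$dst"
}

# Possible use from the unit's ExecStart (paths taken from the unit above):
#   restore_resolv_conf /run/initramfs/state/etc/resolv.conf /etc/resolv.conf \
#       && systemctl restart NetworkManager
```

With this shape, NetworkManager is only restarted when a file was actually copied, so the unit becomes a no-op on images where the underlying problem is fixed.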

dustymabe commented 4 years ago

@jomeier @simongottschlag - care to test https://builds.coreos.fedoraproject.org/prod/streams/testing-devel/builds/31.20200323.20.0/x86_64/fedora-coreos-31.20200323.20.0-azure.x86_64.vhd.xz to see if that fixes the problem?

simongottschlag commented 4 years ago

@jomeier @simongottschlag - care to test https://builds.coreos.fedoraproject.org/prod/streams/testing-devel/builds/31.20200323.20.0/x86_64/fedora-coreos-31.20200323.20.0-azure.x86_64.vhd.xz to see if that fixes the problem?

I'm having issues deploying our production VMs right now (capacity in West Europe), meaning I need to prioritise that before tests. Sorry!

jomeier commented 4 years ago

Strike!

I will try that out today. Give me a few hours, please.

jomeier commented 4 years ago

@dustymabe @vrutkovs @LorbusChris

Ok guys ... it looks good.

I installed OKD 4.4 successfully without manual interaction from my side. Everything is green in the web ui -> ok.

For your information: I had to resize, convert and upload the FCOS test image to Azure, but I'm sure that's expected behaviour for this test. I used a helper VM, which I patched into the OKD installer, to do the work.

Good job !

dustymabe commented 4 years ago

We are now using NetworkManager in the initramfs and also propagating network information from the initramfs (kargs) when appropriate, which we think fixes this issue.

See https://github.com/coreos/fedora-coreos-tracker/issues/394#issuecomment-604598128 and the preceding discussion for more details.