danderson / netboot

Packages and utilities for network booting
Apache License 2.0

Boot fails confusingly when Pixiecore is competing with another PXE-enabled DHCP server #17

Open ryanmoon opened 7 years ago

ryanmoon commented 7 years ago

When running waitron and Pixiecore in API mode in Docker, I can see waitron place a machine in build mode, and Pixiecore then talks to waitron and receives the boot info. However, the PXE boot process fails with the iPXE error "Could not start download: Operation Not Supported".

Waitron is returning:

{
  "kernel": "http://192.168.1.1/image/ubuntu-installer/amd64/linux",
  "initrd": [
    "http://192.168.1.1/image/ubuntu-installer/amd64/initrd.gz"
  ],
  "cmdline": "interface=auto url=http://127.0.0.1:9090/test.example.com/preseed/7ab940c8-8cb8-4f9b-8758-8aa25b2fa512 ramdisk_size=10800 root=/dev/rd/0 rw auto hostname=test.example.com console-setup/ask_detect=false console-setup/layout=USA console-setup/variant=USA keyboard-configuration/layoutcode=us localechooser/translation/warn-light=true localechooser/translation/warn-severe=true locale=en_US"
}

This appears to be a valid response according to API.md, and the URLs for both the kernel and the initrd are valid and working.
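A minimal sketch of how a response like the one above could be sanity-checked before handing it to Pixiecore. It only checks the shape visible in this thread (a "kernel" string, an "initrd" list of strings, a "cmdline" string); the function name check_boot_response is hypothetical, and the rules in API.md remain the authority on what Pixiecore actually accepts.

```python
import json


def check_boot_response(raw):
    """Return a list of problems with a boot response; an empty list means the shape looks fine.

    Assumption: the expected shape is the one shown in this thread, not a
    restatement of Pixiecore's API.md.
    """
    problems = []
    try:
        spec = json.loads(raw)
    except ValueError as err:
        return ["response is not valid JSON: %s" % err]
    if not isinstance(spec, dict):
        return ["response must be a JSON object"]

    # "kernel" is a single non-empty URL string.
    kernel = spec.get("kernel")
    if not isinstance(kernel, str) or not kernel:
        problems.append("kernel must be a non-empty string")

    # "initrd" is a list of URL strings (treated as optional here).
    initrd = spec.get("initrd", [])
    if not isinstance(initrd, list) or not all(isinstance(u, str) for u in initrd):
        problems.append("initrd must be a list of strings")

    # "cmdline" is one string of kernel arguments (treated as optional here).
    if not isinstance(spec.get("cmdline", ""), str):
        problems.append("cmdline must be a string")

    return problems
```

A check like this passes for the response above, which fits the later finding in this thread that the failure comes from the DHCP side rather than from the API response.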

danderson commented 7 years ago

Hmm, according to the internet, "Operation not supported" means that the iPXE build doesn't have HTTP support built in... Which is weird, because I check for exactly that feature when managing the boot.

What is the machine you're trying to boot? Does its BIOS/UEFI come with its own built-in version of iPXE (e.g. Oracle VirtualBox)? I suspect I might be checking the iPXE feature flags incorrectly.

Can you get me a packet capture of the machine attempting to boot? Run sudo pixiecore debug tcpdump <interface> issue17.pcap in a separate terminal, then try to boot the machine again with Pixiecore/waitron. This will capture the network traffic seen by Pixiecore, so I can see exactly what the machine is trying to do. Please attach the pcap file to this bug.

ryanmoon commented 7 years ago

Hmm, I'm running this all in docker containers, so I got into the container with docker exec -it pixiecore /bin/sh but when I run ./pixiecore debug tcpdump eno1 issue17.pcap I receive: Running tcpdump: exec: "tcpdump": executable file not found in $PATH

sethamclean commented 7 years ago

I'm experiencing the same issue running from the docker container.

sethamclean commented 7 years ago

It looks like the change to disable PXE in the primary DHCP server was not applied. This caused iPXE to try to download pxelinux.0 over TFTP. Running with an updated version of iPXE provided a clearer error message.

lae commented 7 years ago

I guess I've been having this issue as well, with the environment sethamclean describes. The DHCP server passes next-server and bootfile, but I'd like these to be ignored (as I don't have access to the DHCP server).

That environment works fine with a build of https://github.com/danderson/pixiecore/tree/256d6c7edd622dd18b4074c8895323cc4082c162 and I've been using this for months, but of course I lose out on the improvements made since then.

I have a pcap here: https://up.lae.is/u/existing-dhcp-pxe-op-not-supported.pcap

danderson commented 7 years ago

Sorry for being slow to respond on this. It's something I definitely care about, although at first glance, there's not much I can do to influence a client if it receives both a regular DHCP PXE response and a ProxyDHCP response. AFAICT, the client's reaction is completely implementation-defined in that case: it could obey the main DHCP server, or Pixiecore, or a weird combination of both.

I'm digging into the pcaps you've provided now to see if anything jumps out at me. If nothing else, I should at least log a big warning when Pixiecore sees another PXE response on the network.

danderson commented 7 years ago

@lae Looking at your pcap, I don't see Pixiecore's traffic in there. For each request your PXE client sends, I see two responses coming from 10.11.108.2 and 10.11.108.3. Both of those look like DHCP relays that are responding on behalf of 10.11.10.19, which I'm guessing is the primary DHCP server on your network.

Again, I'm going to keep investigating, but AFAICT, the only thing I can do in this situation is to try and detect the multiple PXE servers, and log a warning in Pixiecore's logs that the boot will probably fail. The PXE specification doesn't give me enough control to override the other PXE server on the network :(.
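For illustration, here is a rough sketch of what "detect the multiple PXE servers" could look like. This is not Pixiecore's code; the function names are hypothetical. It assumes the caller already has the raw DHCP reply payloads along with their source addresses, and it treats any reply carrying a boot filename (the BOOTP "file" field or option 67) as a PXE answer. It ignores option overload (option 52) for simplicity.

```python
import ipaddress

MAGIC_COOKIE = b"\x63\x82\x53\x63"  # marks the start of DHCP options
OPT_PAD, OPT_BOOTFILE, OPT_END = 0, 67, 255


def parse_options(payload):
    """Parse DHCP options (code, length, value triples) into a dict."""
    opts = {}
    i = 0
    while i < len(payload):
        code = payload[i]
        if code == OPT_END:
            break
        if code == OPT_PAD:
            i += 1
            continue
        if i + 1 >= len(payload):
            break  # truncated option header
        length = payload[i + 1]
        opts[code] = payload[i + 2:i + 2 + length]
        i += 2 + length
    return opts


def boot_info(packet):
    """Return (next_server, bootfile) if a DHCP reply carries boot info, else None."""
    if len(packet) < 240 or packet[236:240] != MAGIC_COOKIE:
        return None
    if packet[0] != 2:  # op field: 2 means BOOTREPLY
        return None
    next_server = str(ipaddress.IPv4Address(packet[20:24]))  # siaddr field
    file_field = packet[108:236].split(b"\x00", 1)[0]         # BOOTP "file" field
    bootfile = parse_options(packet[240:]).get(OPT_BOOTFILE, file_field)
    if not bootfile:
        return None
    return next_server, bootfile.decode("ascii", "replace")


def competing_boot_servers(replies, own_ip):
    """Given (source_ip, packet) pairs, list other senders that offered a bootfile."""
    others = set()
    for source_ip, packet in replies:
        if source_ip != own_ip and boot_info(packet) is not None:
            others.add(source_ip)
    return sorted(others)
```

If competing_boot_servers returns anything, a warning could be logged that the boot will probably fail. With relays like the two in the pcap above, grouping by source address would report each relay separately, so grouping by the server identifier (option 54) might be a better key.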