coreos / bugs

Issue tracker for CoreOS Container Linux
https://coreos.com/os/eol/

Unable to boot CoreOS using Vagrant #382

Closed bcwaldon closed 9 years ago

bcwaldon commented 9 years ago

Using the coreos-vagrant project, I attempt to bring up a single machine without a cloud-config:

% vagrant up
Bringing machine 'core-01' up with 'virtualbox' provider...
==> core-01: Importing base box 'coreos-alpha'...
==> core-01: Matching MAC address for NAT networking...
==> core-01: Checking if box 'coreos-alpha' is up to date...
==> core-01: Setting the name of the VM: coreos-vagrant_core-01_1434050350452_32856
==> core-01: Fixed port collision for 22 => 2222. Now on port 2200.
==> core-01: Clearing any previously set network interfaces...
==> core-01: Preparing network interfaces based on configuration...
    core-01: Adapter 1: nat
    core-01: Adapter 2: hostonly
==> core-01: Forwarding ports...
    core-01: 22 => 2200 (adapter 1)
==> core-01: Running 'pre-boot' VM customizations...
==> core-01: Booting VM...
==> core-01: Waiting for machine to boot. This may take a few minutes...
    core-01: SSH address: 127.0.0.1:2200
    core-01: SSH username: core
    core-01: SSH auth method: private key
    core-01: Warning: Connection timeout. Retrying...
==> core-01: Machine booted and ready!
==> core-01: Setting hostname...
The following SSH command responded with a non-zero exit status.
Vagrant assumes that this means the command failed!

systemctl start system-cloudinit@-var-tmp-hostname.yml.service

Stdout from the command:

Stderr from the command:

Job for system-cloudinit@-var-tmp-hostname.yml.service failed because a configured resource limit was exceeded. See "systemctl status system-cloudinit@-var-tmp-hostname.yml.service" and "journalctl -xe" for details.

Checking on the failed units:

$ journalctl -u "system-cloudinit@-var-tmp-hostname.yml.service" -l -o cat
[/usr/lib64/systemd/system/system-cloudinit@.service:2] Failed to resolve unit specifiers on Load cloud-config from %f, ignoring:
[/usr/lib64/systemd/system/system-cloudinit@.service:6] Failed to resolve specifiers, ignoring: %f
system-cloudinit@-var-tmp-hostname.yml.service: Failed to run 'start' task: Invalid argument
Failed to start system-cloudinit@-var-tmp-hostname.yml.service.
system-cloudinit@-var-tmp-hostname.yml.service: Unit entered failed state.
system-cloudinit@-var-tmp-hostname.yml.service: Failed with result 'resources'.
Starting system-cloudinit@-var-tmp-hostname.yml.service...

And the unit that failed:

$ systemctl cat system-cloudinit@.service
# /usr/lib64/systemd/system/system-cloudinit@.service
[Unit]
Description=Load cloud-config from %f
Requires=dbus.service
After=dbus.service
Before=system-config.target
ConditionFileNotEmpty=%f

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/coreos-cloudinit --from-file=%f

$ cat /var/tmp/hostname.yml
#cloud-config

hostname: core-01

crawford commented 9 years ago

Where does system-cloudinit@-var-tmp-hostname.yml.service come from? That looks malformed (the leading '-').

bcwaldon commented 9 years ago

@crawford https://github.com/coreos/coreos-overlay/blob/master/coreos-base/oem-vagrant/files/box/change_host_name.rb#L31

bcwaldon commented 9 years ago

@crawford the leading hyphen is the problem, but that file hasn't changed in over a year. This sounds like a regression in systemd 220...

bcwaldon commented 9 years ago

Test I ran:

$ systemctl cat tmp@.service
# /run/systemd/system/tmp@.service
[Unit]
Description=Test %f
ConditionFileNotEmpty=%f

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/echo decoded file is %f

$ touch /home/core/foo

$ sudo systemctl start tmp@-home-core-foo.service
Job for tmp@-home-core-foo.service failed because a configured resource limit was exceeded. See "systemctl status tmp@-home-core-foo.service" and "journalctl -xe" for details.

$ systemctl status -l "tmp@-home-core-foo.service"
● tmp@-home-core-foo.service
   Loaded: loaded (/run/systemd/system/tmp@.service; static; vendor preset: disabled)
   Active: failed (Result: resources)

Jun 11 19:30:18 localhost systemd[1]: [/run/systemd/system/tmp@.service:3] Failed to resolve specifiers, ignoring: %f
Jun 11 19:30:18 localhost systemd[1]: [/run/systemd/system/tmp@.service:2] Failed to resolve unit specifiers on Test %f, ignoring: Invalid argument
Jun 11 19:30:18 localhost systemd[1]: [/run/systemd/system/tmp@.service:3] Failed to resolve specifiers, ignoring: %f
Jun 11 19:30:21 localhost systemd[1]: [/run/systemd/system/tmp@.service:2] Failed to resolve unit specifiers on Test %f, ignoring: Invalid argument
Jun 11 19:30:21 localhost systemd[1]: [/run/systemd/system/tmp@.service:3] Failed to resolve specifiers, ignoring: %f
Jun 11 19:30:21 localhost systemd[1]: tmp@-home-core-foo.service: Failed to run 'start' task: Invalid argument
Jun 11 19:30:21 localhost systemd[1]: Failed to start tmp@-home-core-foo.service.
Jun 11 19:30:21 localhost systemd[1]: tmp@-home-core-foo.service: Unit entered failed state.
Jun 11 19:30:21 localhost systemd[1]: tmp@-home-core-foo.service: Failed with result 'resources'.
Jun 11 19:30:21 localhost systemd[1]: Starting tmp@-home-core-foo.service...

$ sudo systemctl start tmp@home-core-foo.service

$ systemctl status "tmp@home-core-foo.service"
● tmp@home-core-foo.service - Test /home/core/foo
   Loaded: loaded (/run/systemd/system/tmp@.service; static; vendor preset: disabled)
   Active: inactive (dead)

Jun 11 19:30:31 localhost systemd[1]: Started Test /home/core/foo.

crawford commented 9 years ago

https://github.com/coreos/coreos-overlay/pull/1280

bcwaldon commented 9 years ago

@crawford is there an upstream bug to file?

crawford commented 9 years ago

No, this is not a regression in systemd. This is well-documented behavior that we've been lucky to get away with ignoring.

The root directory "/" is encoded as single dash, while otherwise the initial and ending "/" are removed from all paths during transformation.

http://www.freedesktop.org/software/systemd/man/systemd.unit.html
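
To make the quoted rule concrete, here is a minimal Ruby sketch of the path-to-instance-name transformation and its inverse. The helper names `escape_unit_path` and `unescape_unit_path` are hypothetical, the sketch only handles ASCII paths, and the rejection of a decoded name that begins or ends with "/" is an assumption inferred from the "Invalid argument" errors above rather than a copy of systemd's code.

```ruby
# Hypothetical helpers illustrating systemd-style path escaping (ASCII only).

# "/var/tmp/hostname.yml" -> "var-tmp-hostname.yml"; "/" alone -> "-".
def escape_unit_path(path)
  # Collapse repeated slashes, then drop the leading and trailing slash.
  trimmed = path.squeeze("/").gsub(%r{\A/|/\z}, "")
  # The root directory is the only path that becomes a lone dash.
  return "-" if trimmed.empty?

  trimmed.each_char.with_index.map do |ch, i|
    if ch == "/"
      "-"
    elsif ch =~ /[A-Za-z0-9:_]/ || (ch == "." && i > 0)
      ch
    else
      # Everything else, including a literal "-", is written as \xNN.
      ch.bytes.map { |b| format("\\x%02x", b) }.join
    end
  end.join
end

# Inverse direction, roughly what %f has to do with the instance name.
def unescape_unit_path(name)
  raise ArgumentError, "empty instance name" if name.empty?
  return "/" if name == "-"

  decoded = name.gsub(/\\x(\h\h)|-/) { $1 ? $1.hex.chr : "/" }
  # Assumption: a decoded name with a leading or trailing slash is invalid,
  # which is what a leading "-" in the instance name produces.
  if decoded.start_with?("/") || decoded.end_with?("/")
    raise ArgumentError, "malformed escaped path: #{name}"
  end

  "/" + decoded
end

puts "system-cloudinit@#{escape_unit_path('/var/tmp/hostname.yml')}.service"
```

Under these assumptions the instance name for /var/tmp/hostname.yml has no leading dash, and passing "-var-tmp-hostname.yml" to `unescape_unit_path` raises instead of returning a path.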

bcwaldon commented 9 years ago

@crawford I don't quite understand how our vagrant box worked before this bug cropped up...

crawford commented 9 years ago

We got lucky. systemd used to accept our malformed path, and now it doesn't.

bcwaldon commented 9 years ago

Are you saying that the behavior we relied on was undocumented and that it was corrected in the recent release?

bcwaldon commented 9 years ago

got it.

crawford commented 9 years ago

@bcwaldon can you help me test this image? :pray: Vagrant doesn't run on my system.

https://users.developer.core-os.net/crawford/vagrant/coreos_production_vagrant.json
https://users.developer.core-os.net/crawford/vagrant/coreos_production_vagrant.box

Note: these are developer images whose 'core' password is "password".

bcwaldon commented 9 years ago

@crawford worked for me w/ the additional fix for /var/tmp/networks.yml :shipit:

crawford commented 9 years ago

Fixed by https://github.com/coreos/coreos-overlay/pull/1280.

pplanel commented 9 years ago

Any update on this? This issue is holding me back from learning! Is there a way to change the Vagrantfile to get an older image?

Maybe here?

config.vm.box = "coreos-%s" % $update_channel
config.vm.box_version = ">= 308.0.1"
config.vm.box_url = "http://%s.release.core-os.net/amd64-usr/current/coreos_production_vagrant.json" % $update_channel

[edit] Yep, right there; it needs some modification. Is there a reason that config.vm.box should take conditional operators? I've changed the box_url to take the version too, instead of current.

In case anyone needs this:

config.vm.box = "coreos-%s" % $update_channel
config.vm.box_version = "647.2.0"
config.vm.box_url = "http://%s.release.core-os.net/amd64-usr/%s/coreos_production_vagrant.json" % [$update_channel, config.vm.box_version]

tks

crawford commented 9 years ago

@pplanel this would be a great addition to https://github.com/coreos/coreos-vagrant. Would you like to submit a pull request?

pplanel commented 9 years ago

Sure

lopsch commented 9 years ago

@pplanel maybe externalize it in config.rb.sample, with current as the default value :-)?
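
One way this suggestion could look, as a rough sketch only: `$image_version`, `box_metadata_url`, and `box_version_constraint` are hypothetical names for illustration and are not taken from coreos-vagrant or from any pull request. The URL pattern and the ">= 308.0.1" constraint are the ones quoted earlier in this thread.

```ruby
# Hypothetical settings that would live in config.rb; "current" keeps the
# existing behaviour of tracking the latest image on the channel.
$update_channel = "alpha"
$image_version  = "current"

# Build the box metadata URL for a given channel and version.
def box_metadata_url(channel, version)
  "http://%s.release.core-os.net/amd64-usr/%s/coreos_production_vagrant.json" % [channel, version]
end

# Pin the box version only when a specific release was requested.
def box_version_constraint(version)
  version == "current" ? ">= 308.0.1" : version
end

puts box_metadata_url($update_channel, $image_version)
puts box_version_constraint($image_version)
```

In a Vagrantfile these helpers would feed config.vm.box_url and config.vm.box_version, so setting `$image_version = "647.2.0"` pins both values at once.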

pplanel commented 9 years ago

@lopsch Yes, check my PR: https://github.com/coreos/coreos-vagrant/pull/243

lopsch commented 9 years ago

Cool. Sorry, these are my first steps in the GitHub community ;-).

piclez commented 9 years ago

Great find, guys. Thanks for fixing this.

mattma commented 9 years ago

As of today, July 22, a fresh clone of the project still runs into this issue. After reading some of the related issues, I switched to beta@723.3.0, which is the latest.

Everything seems fine, except that I run into an iptables issue when I start up a container.

Cannot start container 42d87943d3149b559bd8535e0c610096a1ab2d084d18048b0ac8215d0a0b83fd: [8] System error: not a directory

Digging a little deeper, I found:

W0723 04:14:29.876018       1 server.go:86] Failed to start in resource-only container "/XXXXXX": mountpoint for cgroup not found
F0723 04:14:29.876171       1 server.go:101] Unable to create proxer: failed to initialize iptables: error creating chain "XXXX-PORTALS-CONTAINER": exec: "iptables": executable file not found in $PATH:

Any ideas? @crawford

crawford commented 9 years ago

@mattma something must be failing in the cloning process. What is the reported version of CoreOS in the image that is failing (cat /etc/os-release)? You'll probably need to use a console connection to the machine since SSH isn't going to work. You can set the coreos.autologin kernel parameter to get around the password prompt.

mattma commented 9 years ago

@crawford

I am able to start vagrant up and vagrant ssh.

 ➜ vagrant ssh
Last login: Thu Jul 23 04:13:47 2015 from 10.0.2.2
CoreOS alpha (745.1.0)
core@core-01 ~ $ cat /etc/os-release
NAME=CoreOS
ID=coreos
VERSION=745.1.0
VERSION_ID=745.1.0
BUILD_ID=
PRETTY_NAME="CoreOS 745.1.0"
ANSI_COLOR="1;32"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"