coreos / bugs

Issue tracker for CoreOS Container Linux
https://coreos.com/os/eol/

CoreOS encrypted RootFS kernel panics during rkt fetch #1908

Open jeremyd opened 7 years ago

jeremyd commented 7 years ago

Issue Report

Bug

We need to use an encrypted root FS for compliance. The AWS encrypted-copy tool is broken for CoreOS, so we build the encrypted root volume with a simple dd of the CoreOS snapshot instead.
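Roughly, the copy amounts to restoring a volume from the CoreOS snapshot, attaching it alongside a new encrypted EBS volume on a helper instance, and copying it block for block before snapshotting the encrypted volume and registering it as a new AMI. The device names below are illustrative:

# copy the unencrypted CoreOS root volume onto the encrypted volume (device names illustrative)
dd if=/dev/xvdf of=/dev/xvdg bs=1M status=progress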

When rkt fetches images on the initial boot of a kube-aws cluster, the rkt run command fails and the kernel appears to panic in the logs. The entire boot log is pasted here: https://gist.github.com/jeremyd/8f66bb0d508908a804edda510b70570f . The kernel panics start at line 2270. Line 2203 is also strange, as it seems something is shutting down during startup. Any help or pointers are appreciated if we are doing something wrong.

Container Linux Version

NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1298.7.0
VERSION_ID=1298.7.0
BUILD_ID=2017-03-31-0215
PRETTY_NAME="Container Linux by CoreOS 1298.7.0 (Ladybug)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"

Environment

AWS us-east-1 (HVM)

Expected Behavior

rkt should succeed in pulling all images and starting the containers. An encrypted root FS should work just like a regular AMI.

Actual Behavior

Unit files fail to start rkt after the timeout; a second run works. A regular AMI does not show these panics.

Reproduction Steps

1. Build an encrypted root AMI by dd'ing the CoreOS snapshot onto an encrypted EBS volume (as described above), then snapshot it and register the AMI.
2. Launch a kube-aws cluster from the encrypted AMI in AWS us-east-1 on an HVM instance type.
3. Watch the rkt fetches on the initial boot: the units time out and the kernel log shows the apparent panics.

bgilbert commented 7 years ago

The records starting at line 2203 are associated with a sudo command ending. The records starting at 2270 are not kernel panics; they're just reporting that some process (in this case, rkt) has been stuck in the kernel (in this case, waiting for I/O) for two minutes.

The logs show a lot of database connection timeouts, as well as some systemd unit timeouts caused by slow I/O. I didn't see evidence of other crashes. Do you reliably see this failure with the encrypted AMI? What instance type are you using?

jeremyd commented 7 years ago

I've tested using both c4.xlarge and m4.large. It seems to happen every time on the c4.xlarge and about 4 out of 5 times on the m4.large. I'm currently trying to see whether the filesystem is being resized during the rkt fetch; perhaps that would cause the I/O freeze. The additional logs are likely from kube-aws trying the normal startup sequence for etcd; when using the same image unencrypted, it boots successfully 100% of the time.

I also tried disabling on-disk hash verification with --insecure-options=ondisk. This helped a little, but it still fails about 1 in 3 times (which prevents my etcd cluster from coming up).
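For reference, that means fetching with something along these lines (the image name here is only illustrative):

# skip on-disk integrity verification when fetching (image name illustrative)
rkt fetch --insecure-options=ondisk quay.io/coreos/etcd:v3.1.5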

jeremyd commented 7 years ago

I modified etcdadm-reconfigure.service to have After=extend-filesystems.service and my cluster booted! It also worked twice in a row, so I'd say that's a good fix. I noticed the kernel hung-task warnings still happen, but rkt is able to finish its job (whereas before, it had 10 whole minutes before the unit timed out and still couldn't finish).
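Roughly, the change can be expressed as a systemd drop-in like this (a sketch only; exactly where the directive lives depends on how kube-aws renders the unit):

mkdir -p /etc/systemd/system/etcdadm-reconfigure.service.d
cat <<'EOF' >/etc/systemd/system/etcdadm-reconfigure.service.d/10-after-extend-filesystems.conf
[Unit]
After=extend-filesystems.service
EOF
systemctl daemon-reload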

jeremyd commented 7 years ago

Since this seems like more of a user configuration issue, I'm closing. It might be worth seeing if the etcd docs mention this, or if there's any way to help users who run rkt at boot, for any use case, when they hit this issue (it can be very hard to track down, and it's super annoying since it's intermittent). I know I've hit this problem (a lot less frequently) on unencrypted images as well, and now I know the cause!

jeremyd commented 7 years ago

Also worth noting: I had to leave --insecure-options=ondisk in place and keep the unit timeout at 600s for it to work. The previous (unencrypted) image is fine without these options and with a 120s timeout.

jeremyd commented 7 years ago

Actually, I take it back. I thought things were fixed by adding After=extend-filesystems.service, but today I'm seeing a higher failure rate than I hoped, so far about 50%. I noticed etcdadm-reconfigure calls /opt/bin/etcdadm, which uses rkt; I might try adding --insecure-options=ondisk to that helper script as well. This still seems like a bug with CoreOS plus an encrypted root FS, so I'm reopening this.

crawford commented 7 years ago

We probably want to add extend-filesystems.service to basic.target. That way, all services (by default) will run after that is complete.
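The idea is that extend-filesystems.service would be ordered before basic.target and pulled in by it, so every ordinary service (which is ordered after basic.target by default) only starts once the filesystem has been extended. As a local sketch only (the real fix would land in the unit we ship; DefaultDependencies=no is here to avoid the implicit After=basic.target ordering that would otherwise form a cycle):

# order extend-filesystems.service before basic.target via a drop-in
mkdir -p /etc/systemd/system/extend-filesystems.service.d
cat <<'EOF' >/etc/systemd/system/extend-filesystems.service.d/10-basic-target.conf
[Unit]
DefaultDependencies=no
Before=basic.target
EOF
# and have basic.target pull the unit in
systemctl add-wants basic.target extend-filesystems.service
systemctl daemon-reload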