coreos / bugs

Issue tracker for CoreOS Container Linux
https://coreos.com/os/eol/

CoreOS encrypted RootFS kernel panics during rkt fetch #1908

Open jeremyd opened 7 years ago

jeremyd commented 7 years ago

Issue Report

Bug

We need to use an encrypted root FS for compliance. The AWS encrypted-copy tool is broken for CoreOS, so we build the encrypted root volume with a simple dd of the CoreOS snapshot instead.
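Roughly, the copy amounts to restoring a volume from the CoreOS snapshot, attaching it alongside a new encrypted EBS volume on a helper instance, and copying it block for block before snapshotting the encrypted volume and registering it as a new AMI. The device names below are illustrative:

# copy the unencrypted CoreOS root volume onto the encrypted volume (device names illustrative)
dd if=/dev/xvdf of=/dev/xvdg bs=1M status=progress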

When rkt fetches images on the initial boot of a kube-aws cluster, the rkt run command fails and the kernel appears to panic in the logs. The entire boot log is pasted here: https://gist.github.com/jeremyd/8f66bb0d508908a804edda510b70570f . The kernel panics start at line 2270. Line 2203 is also strange, as it seems something is shutting down during startup. Any help or pointers are appreciated if we are doing something wrong.

Container Linux Version

NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1298.7.0
VERSION_ID=1298.7.0
BUILD_ID=2017-03-31-0215
PRETTY_NAME="Container Linux by CoreOS 1298.7.0 (Ladybug)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"

Environment

AWS us-east-1 (HVM)

Expected Behavior

rkt should succeed in pulling all images and starting the containers. An encrypted root FS should work just like a regular AMI.

Actual Behavior

Unit files fail to start rkt after the timeout; a second run works. A regular AMI does not show these panics.

Reproduction Steps

1. Build an encrypted root AMI by dd'ing the CoreOS snapshot onto an encrypted EBS volume (as described above), then snapshot it and register the AMI.
2. Launch a kube-aws cluster from the encrypted AMI in AWS us-east-1 on an HVM instance type.
3. Watch the rkt fetches on the initial boot: the units time out and the kernel log shows the apparent panics.

bgilbert commented 7 years ago

The records starting at line 2203 are associated with a sudo command ending. The records starting at 2270 are not kernel panics; they're just reporting that some process (in this case, rkt) has been stuck in the kernel (in this case, waiting for I/O) for two minutes.

The logs show a lot of database connection timeouts, as well as some systemd unit timeouts caused by slow I/O. I didn't see evidence of other crashes. Do you reliably see this failure with the encrypted AMI? What instance type are you using?

jeremyd commented 7 years ago

I've tested using both c4.xlarge and m4.large. It seems to happen every time on the c4.xlarge and about 4 out of 5 times on the m4.large. I'm currently trying to see whether the filesystem is being resized during the rkt fetch; perhaps that would cause the I/O freeze. The additional logs are likely from kube-aws trying the normal startup sequence for etcd; when using the same image unencrypted, it boots successfully 100% of the time.

I also tried disabling on-disk hash verification with --insecure-options=ondisk. This helped a little, but it still fails about 1 in 3 times (which prevents my etcd cluster from coming up).
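For reference, that means fetching with something along these lines (the image name here is only illustrative):

# skip on-disk integrity verification when fetching (image name illustrative)
rkt fetch --insecure-options=ondisk quay.io/coreos/etcd:v3.1.5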

jeremyd commented 7 years ago

I modified etcdadm-reconfigure.service to have After=extend-filesystems.service and my cluster booted! It also worked twice in a row, so I'd say that's a good fix. I noticed the kernel hung-task warnings still happen, but rkt is able to finish its job (whereas before, it had 10 whole minutes before the unit timed out and still couldn't finish).
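Roughly, the change can be expressed as a systemd drop-in like this (a sketch only; exactly where the directive lives depends on how kube-aws renders the unit):

mkdir -p /etc/systemd/system/etcdadm-reconfigure.service.d
cat <<'EOF' >/etc/systemd/system/etcdadm-reconfigure.service.d/10-after-extend-filesystems.conf
[Unit]
After=extend-filesystems.service
EOF
systemctl daemon-reload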

jeremyd commented 7 years ago

Since this seems like more of a user configuration issue, I'm closing. It might be worth seeing if the etcd docs mention this, or if there's any way to help users who run rkt at boot, for any use case, when they hit this issue (it can be very hard to track down, and it's super annoying since it's intermittent). I know I've hit this problem (a lot less frequently) on unencrypted images as well, and now I know the cause!

jeremyd commented 7 years ago

Also worth noting: I had to leave --insecure-options=ondisk in place and keep the unit timeout at 600s for it to work. The previous (unencrypted) image is fine without these options and with a 120s timeout.

jeremyd commented 7 years ago

Actually, I take it back. I thought things were fixed by adding After=extend-filesystems.service, but today I'm seeing a higher failure rate than I hoped, so far about 50%. I noticed etcdadm-reconfigure calls /opt/bin/etcdadm, which uses rkt; I might try adding --insecure-options=ondisk to that helper script as well. This still seems like a bug with CoreOS plus an encrypted root FS, so I'm reopening this.

crawford commented 7 years ago

We probably want to add extend-filesystems.service to basic.target. That way, all services (by default) will run after that is complete.
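The idea is that extend-filesystems.service would be ordered before basic.target and pulled in by it, so every ordinary service (which is ordered after basic.target by default) only starts once the filesystem has been extended. As a local sketch only (the real fix would land in the unit we ship; DefaultDependencies=no is here to avoid the implicit After=basic.target ordering that would otherwise form a cycle):

# order extend-filesystems.service before basic.target via a drop-in
mkdir -p /etc/systemd/system/extend-filesystems.service.d
cat <<'EOF' >/etc/systemd/system/extend-filesystems.service.d/10-basic-target.conf
[Unit]
DefaultDependencies=no
Before=basic.target
EOF
# and have basic.target pull the unit in
systemctl add-wants basic.target extend-filesystems.service
systemctl daemon-reload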