jeremyd opened this issue 7 years ago
The records starting at line 2203 are associated with a sudo command ending. The records starting at 2270 are not kernel panics; they're just reporting that some process (in this case, rkt) has been stuck in the kernel (in this case, waiting for I/O) for two minutes.
The logs show a lot of database connection timeouts, as well as some systemd unit timeouts caused by slow I/O. I didn't see evidence of other crashes. Do you reliably see this failure with the encrypted AMI? What instance type are you using?
I've tested using both c4.xlarge and m4.large. It seems to happen every time on the c4.xlarge and about 4 out of 5 times on the m4.large. I'm currently trying to see if it's resizing the filesystem during this rkt fetch... perhaps that would cause the I/O freeze. The additional logs are likely from kube-aws trying the normal startup sequence for etcd; when using the same image unencrypted, it boots successfully 100% of the time.
I also tried disabling on-disk hash verification with --insecure-options=ondisk. This helped a little, but it still fails about 1 in 3 times, which prevents my etcd cluster from coming up.
I modified etcdadm-reconfigure.service to have After=extend-filesystems.service and my cluster booted! It also worked twice in a row, so I'd say that's a good fix. I noticed the kernel hung-task timeouts still happen, but rkt is able to finish its job (whereas before, it had 10 whole minutes before the unit timed out and still couldn't finish).
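For reference, here's a minimal sketch of that ordering change as a systemd drop-in; the unit names come from this thread, but the drop-in path and file name are assumptions:

```sh
# Hypothetical drop-in: delay etcdadm-reconfigure.service until
# extend-filesystems.service has finished growing the root filesystem.
sudo mkdir -p /etc/systemd/system/etcdadm-reconfigure.service.d
sudo tee /etc/systemd/system/etcdadm-reconfigure.service.d/10-after-extend-filesystems.conf <<'EOF'
[Unit]
After=extend-filesystems.service
EOF
sudo systemctl daemon-reload
```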
Since this seems like more of a user configuration issue, I'm closing this. It might be worth seeing if the etcd docs mention this, or if there's any way to help users trying to run rkt on bootup for any use case when they hit this issue (it can be very hard to track down, and super annoying since it's intermittent). I know I've hit this problem (a lot less frequently) on unencrypted images as well, and now I know the cause!
Also worth noting: I had to leave --insecure-options=ondisk in place and keep the timeout at 600s for it to work. The previous (unencrypted) image is fine without these options and with a 120s timeout.
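A rough sketch of that workaround, assuming the same drop-in approach as above; the file names and the etcd image reference are illustrative, not taken from kube-aws:

```sh
# Hypothetical drop-in: allow the unit 600s to start instead of the default 120s.
sudo mkdir -p /etc/systemd/system/etcdadm-reconfigure.service.d
sudo tee /etc/systemd/system/etcdadm-reconfigure.service.d/20-timeout.conf <<'EOF'
[Service]
TimeoutStartSec=600
EOF
sudo systemctl daemon-reload

# Example fetch with on-disk hash verification disabled (image reference is illustrative).
rkt --insecure-options=ondisk fetch quay.io/coreos/etcd
```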
Actually, I take it back. I thought things were fixed by adding After=extend-filesystems.service, but today I'm seeing a higher failure rate than I hoped; so far it's about 50%. I noticed etcdadm-reconfigure calls /opt/bin/etcdadm, which uses rkt. I might try adding --insecure-options=ondisk to this helper script... This still seems like a bug with CoreOS + encrypted FS, so I'm reopening this.
We probably want to add extend-filesystems.service to basic.target. That way, all services (by default) will run after that is complete.
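A minimal sketch of that idea, assuming it can be done with a drop-in plus a Wants= link on basic.target (paths and file names are assumptions):

```sh
# Hypothetical wiring: order extend-filesystems.service before basic.target and have
# basic.target want it, so units with default dependencies wait for the resize.
sudo mkdir -p /etc/systemd/system/extend-filesystems.service.d
sudo tee /etc/systemd/system/extend-filesystems.service.d/10-before-basic.conf <<'EOF'
[Unit]
Before=basic.target
EOF
sudo systemctl add-wants basic.target extend-filesystems.service
sudo systemctl daemon-reload
```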
Issue Report
Bug
We need to use an encrypted root FS for compliance. The AWS copy tool is broken for CoreOS, so we use a simple dd of the CoreOS snapshot.
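For context, a very rough sketch of that dd approach, assuming a volume restored from the public CoreOS snapshot is attached as /dev/xvdf and a blank encrypted volume as /dev/xvdg (device names and the volume ID are placeholders):

```sh
# Copy the unencrypted CoreOS root volume block-for-block onto the encrypted volume.
sudo dd if=/dev/xvdf of=/dev/xvdg bs=1M status=progress
# Snapshot the encrypted volume; that snapshot is then registered as the root
# device of a new (encrypted) AMI.
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 --description "encrypted CoreOS root"
```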
When rkt fetches images on the initial boot of a kube-aws cluster, the rkt run command fails and the kernel shows a panic in the logs. The entire boot log is pasted here: https://gist.github.com/jeremyd/8f66bb0d508908a804edda510b70570f . The kernel panics start at line 2270. Line 2203 is also strange, as it seems something is shutting down during the startup... Any help or pointers are appreciated if we're doing something wrong.
Container Linux Version
Environment
AWS us-east-1 HVM
Expected Behavior
rkt should succeed in pulling all images and starting the containers. An encrypted root FS should work just like a regular AMI.
Actual Behavior
Unit files fail to start rkt after the timeout; a second run works. The regular AMI does not show these panics.
Reproduction Steps
The steps are as follows to reproduce: