coreos / docs

Documentation for CoreOS projects
http://coreos.com/docs
Apache License 2.0
Disable Update #1234

Closed hapnermw closed 6 years ago

hapnermw commented 6 years ago

The CoreOS Ignition docs say to disable updates by using Ignition to mask the update-engine and locksmithd services.

On EC2, I'm creating a CoreOS stable instance with the REBOOT_STRATEGY update strategy.

I use 'update_engine_client -update' to force an upgrade to the latest stable version.

After the instance reboots, I attempt to disable updates by masking the update-engine and locksmithd services. Since I've already created the instance, I'm using systemd to do this:

sudo systemctl mask update-engine locksmithd

After doing this, CoreOS fails to boot.

Is there a way to disable CoreOS update on an existing instance that was initially created with update enabled?

The docs for disabling updates should be clarified to note that using Ignition to mask update-engine and locksmithd is a special case that is not equivalent to masking them with systemd once an instance has been created.
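
For reference, the Ignition-time masking mentioned above is typically written as a Container Linux Config that is then transpiled to Ignition JSON. The following is a minimal sketch, assuming the `mask` field on `systemd.units` entries in that config format. The helper name `print_mask_config` is hypothetical and only exists so the fragment can be redirected into a file.

```shell
#!/bin/sh
# Sketch only. print_mask_config is a hypothetical helper, not a CoreOS tool.
# It prints a Container Linux Config fragment that masks both update services
# at first boot. The fragment still has to be transpiled to Ignition JSON
# before it can be passed to a new instance as user data.
print_mask_config() {
    cat <<'EOF'
systemd:
  units:
    - name: update-engine.service
      mask: true
    - name: locksmithd.service
      mask: true
EOF
}
```

For example, `print_mask_config > disable-updates.yaml` writes the fragment to a file (the file name is arbitrary) that can then be fed to the Config Transpiler.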

bgilbert commented 6 years ago

Masking update-engine and locksmithd is generally safe on a running instance. In what way does your system fail to boot? Are there error messages?

Note that update-engine is responsible for marking the current boot successful, which it does 45 seconds after it starts up. If you apply an update and then immediately disable update-engine, the update will never be marked good, and the second reboot will boot into the old OS version. You can work around that by running coreos-setgoodroot yourself after rebooting into the new OS version.
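
The ordering described above can be sketched as a small shell helper, meant to run on the instance only after it has rebooted into the updated OS version. This is an illustration under assumptions, not an official procedure: the function name `mark_good_then_mask` and the `RUN` variable are hypothetical, and `RUN=echo` merely prints the commands instead of executing them.

```shell
#!/bin/sh
# Sketch only: run this AFTER the instance has rebooted into the new OS version.
# RUN is a hypothetical hook (not part of CoreOS): with RUN=echo the commands
# are printed rather than executed; by default they run through sudo.
RUN="${RUN:-sudo}"

mark_good_then_mask() {
    # Mark the currently booted OS version as good by hand, instead of
    # waiting for update-engine to do it about 45 seconds after it starts.
    $RUN coreos-setgoodroot || return 1

    # Only then mask the update services so they are not started again.
    $RUN systemctl mask update-engine.service locksmithd.service
}
```

Calling `mark_good_then_mask` with `RUN=echo` set first gives a dry run that shows the two commands in order.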

hapnermw commented 6 years ago

Benjamin, thanks for the quick response and info.

I'll take note of the 45-second 'marked good' interval. In this case, several minutes passed between the update completing and the mask, so update-engine likely did mark the update as good.

There is no error message. The instance does not boot sufficiently to SSH to it.

The instance seemed to be fine prior to reboot.

EC2 says the instance is running, but one of the two EC2 liveness tests fails.

I haven't yet tried using Ignition to create the instance with updates disabled; I'll try that in a few hours.

bgilbert commented 6 years ago

Could you check the machine's console log (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-console.html) after it fails to reboot? Because of the way EC2 works, you may need to wait 5+ minutes after the reboot before retrieving the log.
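
One way to retrieve that console log from a workstation is the AWS CLI. The sketch below assumes the AWS CLI is installed and configured with credentials that can read the instance. The function name `fetch_console_log` and the instance ID in the usage note are hypothetical placeholders.

```shell
#!/bin/sh
# Sketch only. fetch_console_log is a hypothetical helper name.
# Assumes the AWS CLI is installed and configured for the right account/region.
fetch_console_log() {
    instance_id="$1"
    # Ask EC2 for the console output it has captured so far and print only
    # the log text. EC2 may lag behind, so recent boot messages can be missing
    # for a few minutes after a reboot.
    aws ec2 get-console-output --instance-id "$instance_id" --query Output --output text
}
```

For example, `fetch_console_log i-0123456789abcdef0 > console.log`, run a few minutes after the failed reboot, saves the output for inspection.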

hapnermw commented 6 years ago

I tried creating the instance and masking without an upgrade. It rebooted with no problem.

I'll try with an upgrade prior to masking.

hapnermw commented 6 years ago

For reboot, I've been using the EC2 console instead of waiting for the update reboot.

This time, after the upgrade I executed coreos-setgoodroot, exited, and rebooted from the EC2 console. The instance rebooted into the old CoreOS version.

I then did the upgrade, rebooted from the EC2 console, and only then did the mask. The masked instance rebooted without an issue.

Although I reproduced this problem several times prior to submitting the issue, I can't seem to reproduce it now.

I suspect it had something to do with doing the upgrade and the mask prior to rebooting to the updated version. I will wait to do the mask until after the update reboot and all will likely work.

Please close this issue. Thanks again for your help.

bgilbert commented 6 years ago

You should be able to mask at any time, so the problem you're having is mysterious to me. Glad you got it working, though. Note that you'll need to execute coreos-setgoodroot after the reboot and not before.

hapnermw commented 6 years ago

I think I've just run into a related issue:

Container Linux by CoreOS stable (1688.5.3)
Update Strategy: No Reboots
Failed Units: 1
oem-cloudinit.service

core@ip-10-0-6-181 ~ $ journalctl -f -u oem-cloudinit
-- Logs begin at Fri 2018-05-18 20:03:54 UTC. --
May 18 20:23:15 ip-10-0-6-181.ec2.internal coreos-cloudinit[672]: 2018/05/18 20:23:15 Writing file to "/etc/environment"
May 18 20:23:15 ip-10-0-6-181.ec2.internal coreos-cloudinit[672]: 2018/05/18 20:23:15 Wrote file to "/etc/environment"
May 18 20:23:15 ip-10-0-6-181.ec2.internal coreos-cloudinit[672]: 2018/05/18 20:23:15 Updated /etc/environment
May 18 20:23:15 ip-10-0-6-181.ec2.internal coreos-cloudinit[672]: 2018/05/18 20:23:15 Ensuring runtime unit file "etcd.service" is unmasked
May 18 20:23:15 ip-10-0-6-181.ec2.internal coreos-cloudinit[672]: 2018/05/18 20:23:15 Ensuring runtime unit file "etcd2.service" is unmasked
May 18 20:23:15 ip-10-0-6-181.ec2.internal coreos-cloudinit[672]: 2018/05/18 20:23:15 Ensuring runtime unit file "fleet.service" is unmasked
May 18 20:23:15 ip-10-0-6-181.ec2.internal coreos-cloudinit[672]: 2018/05/18 20:23:15 Ensuring runtime unit file "locksmithd.service" is unmasked
May 18 20:23:15 ip-10-0-6-181.ec2.internal systemd[1]: oem-cloudinit.service: Main process exited, code=exited, status=1/FAILURE
May 18 20:23:15 ip-10-0-6-181.ec2.internal systemd[1]: oem-cloudinit.service: Failed with result 'exit-code'.
May 18 20:23:15 ip-10-0-6-181.ec2.internal systemd[1]: Failed to start Cloudinit from platform metadata.

It seems that the EC2 oem-cloudinit.service doesn't like locksmithd.service being masked.
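
To see whether a masked unit coincides with the oem-cloudinit.service failure, the state of the update units and the list of failed units can be checked together. This is only a diagnostic sketch: the function name `report_update_units` is hypothetical, and it does not confirm the cause of the failure.

```shell
#!/bin/sh
# Sketch only. report_update_units is a hypothetical helper name.
report_update_units() {
    for unit in update-engine.service locksmithd.service; do
        # is-enabled prints a state such as "masked" or "enabled". It exits
        # non-zero for masked units, hence the "|| true".
        state="$(systemctl is-enabled "$unit" 2>/dev/null || true)"
        echo "$unit: ${state:-unknown}"
    done
    # List units currently in the failed state (for example oem-cloudinit.service).
    systemctl --failed --no-legend
}
```

Running `report_update_units` on the affected instance prints one line per update unit, followed by any failed units.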

hapnermw commented 6 years ago

The instance with the oem-cloudinit.service issue was created without an Ignition descriptor. When an instance is created with an Ignition descriptor, I see that oem-cloudinit.service isn't started. So this isn't an issue when Ignition-started instances mask the update services.
