canonical / cloud-init

Official upstream for the cloud-init: cloud instance initialization
https://cloud-init.io/

cloud-init sometimes fails on dpkg lock due to concurrent apt-daily.service execution #2908

Closed ubuntu-server-builder closed 1 year ago

ubuntu-server-builder commented 1 year ago

This bug was originally filed in Launchpad as LP: #1693361

Launchpad details
affected_projects = ['apt', 'apt (Ubuntu)', 'cloud-init (Ubuntu)', 'cloud-init (Ubuntu Xenial)', 'cloud-init (Ubuntu Yakkety)', 'cloud-init (Ubuntu Zesty)', 'cloud-init (Ubuntu Artful)']
assignee = None
assignee_name = None
date_closed = 2017-09-23T02:33:26.811971+00:00
date_created = 2017-05-24T21:10:37.007863+00:00
date_fix_committed = 2017-09-23T02:33:26.811971+00:00
date_fix_released = 2017-09-23T02:33:26.811971+00:00
id = 1693361
importance = medium
is_complete = True
lp_url = https://bugs.launchpad.net/cloud-init/+bug/1693361
milestone = None
owner = jbrowne
owner_name = Jim Browne
private = False
status = fix_released
submitter = jbrowne
submitter_name = Jim Browne
tags = ['verification-done-xenial', 'verification-done-yakkety', 'verification-done-zesty']
duplicates = [1686454, 1695033]

Launchpad user Jim Browne(jbrowne) wrote on 2017-05-24T21:10:37.007863+00:00

=== Begin SRU Template ===
[Impact]
A cloud-config that contains packages to install (see below) or 'package_upgrade' will run 'apt-get update'. That can sometimes fail because of contention with apt-daily.service, which updates the same information.

A cloud-config that shows the problem looks like this:

  $ cat my.yaml
  #cloud-config
  packages: ['hello']

[Test Case]
lxc-proposed-snapshot is
  https://git.launchpad.net/~smoser/cloud-init/+git/sru-info/tree/bin/lxc-proposed-snapshot
It publishes an image to lxd with proposed enabled and cloud-init upgraded.

a.) Launch an instance with the proposed version of cloud-init and some user-data.
   This is platform independent. The test case demonstrates lxd.
   $ printf "%s\n%s\n%s\n" "#cloud-config" "packages: ['hello']" \
       "package_upgrade: true" > config.yaml
   $ release=xenial
   $ ref=proposed-$release
   $ ./lxc-proposed-snapshot --proposed --publish $release $ref

b.) Start the instance.
   $ name=$release-1693361
   $ lxc launch $ref "--config=user.user-data=$(cat config.yaml)" $name
   $ sleep 1
   $ lxc exec $name -- tail -f /var/log/cloud-init.log /var/log/cloud-init-output.log
   # watch this boot.

c.) Look for evidence of systemd failure.
   $ journalctl -o short-precise | grep -i break
   $ journalctl -o short-precise | grep -i order

[Regression Potential]
The chance of regression here is low. It's possible that ordering loops could occur. When that happens, journalctl will mention it. Unfortunately, in such cases systemd somewhat randomly picks a service to kill, so the resulting behavior is somewhat undefined.

[Other Info]
Upstream commit at
  https://git.launchpad.net/cloud-init/commit/?id=11121fe4

=== End SRU Template ===

apt-daily is now a systemd service rather than being invoked by cron.daily. If one builds a custom AMI it is possible that the apt-daily.timer will fire during boot. This can fire at the same time cloud-init is running and if cloud-init loses the race the invocation of apt (e.g. use of "packages:" in the config) will fail.

There is a lot of discussion online about this change to apt-daily (e.g. unattended upgrades happening during business hours, delaying boot, etc.) and about potential systemd changes regarding timers firing during boot (cf. https://github.com/systemd/systemd/issues/5659).

While it would be better to solve this in apt itself, I suggest that cloud-init be defensive when calling apt and implement some retry mechanism.
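The retry idea can be sketched roughly as follows, assuming a POSIX shell. The helper name `retry_cmd`, the attempt count, and the delay are illustrative choices only; this is not how cloud-init itself calls apt.

```shell
#!/bin/sh
# retry_cmd ATTEMPTS DELAY COMMAND [ARGS...]
# Hypothetical helper (not part of cloud-init or apt): run COMMAND until it
# succeeds or ATTEMPTS tries have been made, sleeping DELAY seconds between
# tries.
retry_cmd() {
    attempts="$1"
    delay="$2"
    shift 2
    n=1
    while :; do
        if "$@"; then
            return 0
        fi
        if [ "$n" -ge "$attempts" ]; then
            echo "retry_cmd: giving up after $n attempts: $*" >&2
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
}

# Example (needs root, so shown as a comment only):
#   retry_cmd 10 6 apt-get install -y hello
```

A wrapper like this only papers over the race: it retries on any failure, not just lock contention, so a genuinely broken install would also be retried until the attempts run out.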

Various instances of people running into this issue:

https://github.com/chef/bento/issues/609 https://clusterhq.atlassian.net/browse/FLOC-4486 https://github.com/boxcutter/ubuntu/issues/73 https://unix.stackexchange.com/questions/315502/how-to-disable-apt-daily-service-on-ubuntu-cloud-vm-image

ubuntu-server-builder commented 1 year ago

Launchpad user Steve Langasek(vorlon) wrote on 2017-05-24T21:30:51+00:00

On Wed, May 24, 2017 at 09:10:37PM -0000, Jim Browne wrote:

> While it would be better to solve this in apt itself, I suggest that cloud-init be defensive when calling apt and implement some retry mechanism.

I would suggest instead that cloud-init should declare itself Before=apt-daily.service / apt-daily.timer, so that cloud-init takes precedence over apt-daily on first boot.

ubuntu-server-builder commented 1 year ago

Launchpad user Jim Browne(jbrowne) wrote on 2017-05-24T21:42:51.947736+00:00

My concern is another apt dependent task being added somewhere else in systemd that winds up triggering during boot. IMO it's better to be generically defensive about the use of apt, but others certainly have more context and information than I do.

ubuntu-server-builder commented 1 year ago

Launchpad user Scott Moser(smoser) wrote on 2017-05-25T18:38:43.913791+00:00

I suspect that Steve's suggestion should fix this mostly for cloud-init. Apt does of course have a general locking problem that really does need addressing.

We've all seen workarounds/retries at all sorts of levels to address the problems that a.) you basically have to run 'apt-get update' before you run 'apt-get install' (bug 1429285), which results in the over-usage of that fairly heavy resource.

b.) if another process is running 'apt-get install' or 'apt-get remove' when you attempt, you will fail with the lock contention.

These things should be solved in apt, not worked around in yet another process that uses it.

ubuntu-server-builder commented 1 year ago

Launchpad user Chris White(cwprogram) wrote on 2017-06-01T22:17:19.159801+00:00

Some research on this indicates:

ubuntu-server-builder commented 1 year ago

Launchpad user Julian Andres Klode(juliank) wrote on 2017-06-13T00:20:52.840976+00:00

We eventually want wait locking in apt, but I don't think it really solves all issues, especially in scripts with multiple apt invocations. Which is why apt-daily got an additional flock lock for the upcoming SRUs. (see artful).

Feel free to wait on the same lock and probably add some ordering against apt-daily and apt-daily-upgrade services.
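Waiting on a shared lock file could look roughly like this, using flock(1) from util-linux. The helper name `with_lock` is hypothetical, and the lock path is left as a parameter because the file apt-daily uses is not named in this thread.

```shell
#!/bin/sh
# with_lock LOCKFILE TIMEOUT COMMAND [ARGS...]
# Hypothetical helper: take an exclusive flock(1) lock on LOCKFILE, waiting up
# to TIMEOUT seconds, then run COMMAND while holding it. Fails (non-zero exit)
# if the lock cannot be acquired in time.
with_lock() {
    lockfile="$1"
    timeout="$2"
    shift 2
    flock -w "$timeout" "$lockfile" "$@"
}

# Example (LOCKFILE is a placeholder for whatever file the other job locks):
#   with_lock "$LOCKFILE" 600 apt-get install -y hello
```

This only serializes against processes that take the same lock file, so it helps with the apt-daily jobs only if they really do lock that file.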

ubuntu-server-builder commented 1 year ago

Launchpad user Launchpad Janitor(janitor) wrote on 2017-06-27T21:56:55.030674+00:00

This bug was fixed in the package cloud-init - 0.7.9-197-gebc9ecbc-0ubuntu1


cloud-init (0.7.9-197-gebc9ecbc-0ubuntu1) artful; urgency=medium

ubuntu-server-builder commented 1 year ago

Launchpad user Steve Langasek(vorlon) wrote on 2017-06-29T04:45:32.490088+00:00

Hello Jim, or anyone else affected,

Accepted cloud-init into zesty-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/cloud-init/0.7.9-153-g16a7302f-0ubuntu1~17.04.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed-zesty to verification-done-zesty. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-zesty. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

ubuntu-server-builder commented 1 year ago

Launchpad user Steve Langasek(vorlon) wrote on 2017-06-29T04:53:17.173086+00:00

Hello Jim, or anyone else affected,

Accepted cloud-init into yakkety-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/cloud-init/0.7.9-153-g16a7302f-0ubuntu1~16.10.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed-yakkety to verification-done-yakkety. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-yakkety. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

ubuntu-server-builder commented 1 year ago

Launchpad user Steve Langasek(vorlon) wrote on 2017-06-29T04:55:56.401797+00:00

Hello Jim, or anyone else affected,

Accepted cloud-init into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/cloud-init/0.7.9-153-g16a7302f-0ubuntu1~16.04.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed-xenial to verification-done-xenial. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-xenial. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

ubuntu-server-builder commented 1 year ago

Launchpad user Scott Moser(smoser) wrote on 2017-06-29T19:22:00.003970+00:00

$ for rel in xenial yakkety zesty; do lxc-proposed-snapshot --proposed $rel proposed-$rel --publish || break; done

$ for rel in xenial yakkety zesty; do lxc launch proposed-$rel "--config=user.user-data=$(cat config.yaml)" test-$rel || break; done

$ sleep 2m

$ for rel in xenial yakkety zesty; do mkdir $rel && ( cd $rel && lxc exec test-$rel -- journalctl -o short-precise > journal.log && lxc exec test-$rel -- dpkg-query --show cloud-init > cloud-init-dpkg.txt && lxc file pull test-$rel/var/log/cloud-init.log cloud-init.log && lxc file pull test-$rel/var/log/cloud-init-output.log cloud-init-output.log ) || break; done

$ for rel in xenial yakkety zesty; do tar -czf /tmp/1693361-$rel.tar.gz $rel; done

Launchpad attachments: xenial results

ubuntu-server-builder commented 1 year ago

Launchpad user Scott Moser(smoser) wrote on 2017-06-29T19:22:19.634597+00:00

Launchpad attachments: yakkety results

ubuntu-server-builder commented 1 year ago

Launchpad user Scott Moser(smoser) wrote on 2017-06-29T19:22:45.775471+00:00

Launchpad attachments: zesty results

ubuntu-server-builder commented 1 year ago

Launchpad user Launchpad Janitor(janitor) wrote on 2017-06-29T21:48:59.868249+00:00

This bug was fixed in the package cloud-init - 0.7.9-153-g16a7302f-0ubuntu1~16.04.2


cloud-init (0.7.9-153-g16a7302f-0ubuntu1~16.04.2) xenial-proposed; urgency=medium

ubuntu-server-builder commented 1 year ago

Launchpad user Steve Langasek(vorlon) wrote on 2017-06-29T21:49:11.119691+00:00

The verification of the Stable Release Update for cloud-init has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

ubuntu-server-builder commented 1 year ago

Launchpad user Launchpad Janitor(janitor) wrote on 2017-07-26T18:27:42.552325+00:00

This bug was fixed in the package cloud-init - 0.7.9-153-g16a7302f-0ubuntu1~17.04.2


cloud-init (0.7.9-153-g16a7302f-0ubuntu1~17.04.2) zesty-proposed; urgency=medium

ubuntu-server-builder commented 1 year ago

Launchpad user Scott Moser(smoser) wrote on 2017-09-23T02:33:29.402324+00:00

This bug is believed to be fixed in cloud-init in 17.1. If this is still a problem for you, please make a comment and set the state back to New.

Thank you.

ubuntu-server-builder commented 1 year ago

Launchpad user Julian Andres Klode(juliank) wrote on 2018-08-22T10:54:05.607823+00:00

Nothing actionable here for apt, so I'll close this. We should consider making frontend locking more flexible for scripts using apt, though, so scripts can hold the lock all the time and drive apt.

ubuntu-server-builder commented 1 year ago

Launchpad user David Reis(dryd) wrote on 2021-11-12T12:29:16.114889+00:00

This is not fixed. It just affected me on Ubuntu 20.04.3 LTS, and the subsequent server configuration failed completely because awscli and jq were missing.

Output:

Cloud-init v. 21.3-1-g6803368d-0ubuntu1~20.04.4 running 'modules:config' at Fri, 12 Nov 2021 11:05:29 +0000. Up 18.13 seconds.
Get:1 http://eu-central-1.ec2.archive.ubuntu.com/ubuntu focal InRelease [265 kB]
[... more Gets]
E: Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 2764 (unattended-upgr)
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?
Cloud-init v. 21.3-1-g6803368d-0ubuntu1~20.04.4 running 'modules:final' at Fri, 12 Nov 2021 11:05:30 +0000. Up 19.15 seconds.
2021-11-12 11:05:38,955 - util.py[WARNING]: Package upgrade failed
E: Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 2764 (unattended-upgr)
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?
2021-11-12 11:05:38,999 - util.py[WARNING]: Failed to install packages: ['awscli', 'nmap', 'tcpdump', 'bind9utils', 'curl', 'wget', 'vim', 'jq', 'htop', 'tmux', 'git', 'iotop', 'iftop', 'fail2ban']
2021-11-12 11:05:38,999 - cc_package_update_upgrade_install.py[WARNING]: 2 failed with exceptions, re-raising the last one
2021-11-12 11:05:39,000 - util.py[WARNING]: Running module package-update-upgrade-install (<module 'cloudinit.config.cc_package_update_upgrade_install' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_package_update_upgrade_install.py'>) failed

Note: Before=apt-daily.service is only set on cloud-final.service.

ubuntu-server-builder commented 1 year ago

Launchpad user Julian Andres Klode(juliank) wrote on 2021-11-12T12:49:23.152055+00:00

Arguably it should run before apt-daily-upgrade too. apt-daily-upgrade is the one locking dpkg; apt-daily locks apt lists (and cache) directory.

ubuntu-server-builder commented 1 year ago

Launchpad user David Reis(dryd) wrote on 2021-11-12T13:06:21.416123+00:00

Ah, thanks, I wasn't aware they're distinct. So would simply adding apt-daily-upgrade.service to the Before via cloud-init's bootcmd and then issuing a daemon-reload be a suitable workaround? There's a 30s window until the upgrade process starts if apt's history.log is to be trusted. That is probably enough to be somewhat reliable.
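For illustration, the workaround asked about here might look something like the sketch below. The helper name `write_ordering_dropin` and the drop-in file name `apt-ordering.conf` are made up for this example, and whether the ordering takes effect early enough in boot is exactly the open question in this comment.

```shell
#!/bin/sh
# write_ordering_dropin [ROOT]
# Hypothetical helper: write a systemd drop-in that orders cloud-final.service
# before both apt-daily.service and apt-daily-upgrade.service. ROOT defaults to
# the real filesystem root and exists so the function can be tried against a
# scratch directory first.
write_ordering_dropin() {
    root="${1:-}"
    dir="$root/etc/systemd/system/cloud-final.service.d"
    mkdir -p "$dir"
    cat > "$dir/apt-ordering.conf" <<'EOF'
[Unit]
Before=apt-daily.service apt-daily-upgrade.service
EOF
}

# On a real instance (as root), followed by the reload mentioned above:
#   write_ordering_dropin && systemctl daemon-reload
```

Running it against a scratch directory first (for example `write_ordering_dropin /tmp/scratch`) makes it easy to inspect the generated file before touching /etc.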

ubuntu-server-builder commented 1 year ago

Launchpad user Koen Serneels(eskubu) wrote on 2021-12-06T13:49:34.819971+00:00

From cloud-init's point of view the solution now implemented makes sense: run it before apt-daily-upgrade. However, I wanted to add that there are other use cases as well, such as SSM documents being executed on instances. These can be executed in batch at any time and may also require installation of packages, and thus interfere with these unattended upgrades.

The execution of documents is not linked directly to cloud-init and may be run after the instance has booted, so this falls into the other category: having some kind of queuing system, or at least a centralized way to obtain a lock, in order to use apt. At the moment there are dozens of different ways to get a mutex before executing apt, but we couldn't find a bulletproof one that works every time.

So maybe this does not really fit into this ticket, but I wanted to point out that this is only a partial fix to a bigger problem.

ubuntu-server-builder commented 1 year ago

Launchpad user James Falcon(falcojr) wrote on 2021-12-06T15:59:22.390185+00:00

Not sure if this helps, but we recently added behavior to wait for an apt lock when doing apt commands. This will be included in our next release: https://github.com/canonical/cloud-init/pull/1034

If there are still remaining issues, please open a new bug rather than commenting here. This bug won't be re-opened.

ubuntu-server-builder commented 1 year ago

Launchpad user Julian Andres Klode(juliank) wrote on 2021-12-06T16:56:33.360031+00:00

Since 20.04, apt can wait for a lock.

The apt(8) command automatically waits for a lock for 120 seconds (non-interactive) or infinitely.

The apt-get(8) command can be configured to wait as well by passing -o DPkg::Lock::Timeout=<timeout>, where <timeout> is a number of seconds and may also be -1 for infinite.

This avoids any races you'd get by doing the lock yourself and then invoking apt.
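A small sketch of passing that option from a script follows. The wrapper name `apt_get_wait` and the `APT_GET` override variable are assumptions made for this example; only the `-o DPkg::Lock::Timeout=` option itself comes from the comment above.

```shell
#!/bin/sh
# apt_get_wait TIMEOUT ARGS...
# Hypothetical wrapper: run apt-get with DPkg::Lock::Timeout set to TIMEOUT
# seconds (-1 waits indefinitely). APT_GET may be set to another command,
# e.g. echo, to preview the resulting command line without running apt-get.
apt_get_wait() {
    timeout="$1"
    shift
    "${APT_GET:-apt-get}" -o "DPkg::Lock::Timeout=$timeout" "$@"
}

# Preview without touching the system:
#   APT_GET=echo apt_get_wait 120 install -y awscli jq
# Real use (as root):
#   apt_get_wait 120 install -y awscli jq
```

Because apt itself does the waiting, the lock is taken and used by the same process, which is the race-free property described above.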

ubuntu-server-builder commented 1 year ago

Launchpad user Jesús Gómez(jgomo3) wrote on 2022-04-13T17:38:06.501520+00:00

In 2022 this still happens on AWS with Ubuntu 20.04.

In my case, though, it happens 100% of the time, not just sometimes.

This user-data:

#cloud-config

package_update: true
package_upgrade: true

packages:
  - awscli
  - jq

$ sudo cloud-init status
status: error

Logs collected and attached.

Launchpad attachments: cloud-init.tar.gz