canonical / cloud-init

Official upstream for the cloud-init: cloud instance initialization
https://cloud-init.io/

[SRU] CloudSigma DS causes hangs when serial console present #2451

Closed ubuntu-server-builder closed 1 year ago

ubuntu-server-builder commented 1 year ago

This bug was originally filed in Launchpad as LP: #1316475

Launchpad details
affected_projects = ['diskimage-builder', 'tripleo', 'cloud-init (Ubuntu)', 'cloud-init (Ubuntu Trusty)']
assignee = None
assignee_name = None
date_closed = 2014-10-10T15:35:57.814124+00:00
date_created = 2014-05-06T08:47:08.698235+00:00
date_fix_committed = 2014-06-03T20:46:58.355740+00:00
date_fix_released = 2014-10-10T15:35:57.814124+00:00
id = 1316475
importance = high
is_complete = True
lp_url = https://bugs.launchpad.net/cloud-init/+bug/1316475
milestone = None
owner = gandelman-a
owner_name = Adam Gandelman
private = False
status = fix_released
submitter = lifeless
submitter_name = Robert Collins
tags = ['patch', 'verification-done']
duplicates = [1322444]

Launchpad user Robert Collins(lifeless) wrote on 2014-05-06T08:47:08.698235+00:00

SRU Justification

Impact: The CloudSigma datasource reads from and writes to /dev/ttyS1 if present, and it has no timeout. On non-CloudSigma clouds or systems with /dev/ttyS1, cloud-init will block pending a response, which may never come. Further, it is dangerous for a default datasource to write blindly to a serial console, as other control plane software and clouds use /dev/ttyS1 for communication.

Fix: The patch queries the BIOS to see if the instance is running on CloudSigma before querying /dev/ttyS1.
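The idea behind the fix can be sketched roughly as follows. This is an illustration only, not the actual patch: it assumes the BIOS product name is readable from `/sys/class/dmi/id/product_name` on Linux and that it contains the string "cloudsigma" on CloudSigma hosts. The function names `is_running_in_cloudsigma` and `get_data` are hypothetical.

```python
from pathlib import Path

# Assumption: on Linux the kernel exposes the DMI/SMBIOS product name here.
DMI_PRODUCT_NAME = Path("/sys/class/dmi/id/product_name")


def is_running_in_cloudsigma(dmi_path=DMI_PRODUCT_NAME):
    """Return True only when the BIOS product name mentions CloudSigma."""
    try:
        product_name = Path(dmi_path).read_text(errors="replace")
    except OSError:
        # No readable DMI data (container, other platform): assume not CloudSigma.
        return False
    # Assumption: the marker string is "cloudsigma", compared case-insensitively.
    return "cloudsigma" in product_name.lower()


def get_data(read_serial, dmi_path=DMI_PRODUCT_NAME):
    """Touch the serial port only after the cheap platform check has passed."""
    if not is_running_in_cloudsigma(dmi_path):
        return None  # skip quickly; never read or write /dev/ttyS1
    return read_serial()
```

Here `read_serial` stands in for whatever callable performs the serial query, so on any other platform the port is never opened at all.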

Verification: On both a CloudSigma instance and a non-CloudSigma instance with /dev/ttyS1:

  1. Install new cloud-init
  2. Purge existing cloud-init data (rm -rf /var/lib/cloud)
  3. Run "cloud-init --debug init"
  4. Confirm that the CloudSigma instance is provisioned and that the CloudSigma datasource is skipped on the non-CloudSigma instance

Regression: The risk is low, as this change further restricts where the CloudSigma datasource can run.

[Original Report]

DHCPDISCOVER on eth2 to 255.255.255.255 port 67 interval 3 (xid=0x7e777c23)
DHCPREQUEST of 10.22.157.186 on eth2 to 255.255.255.255 port 67 (xid=0x7e777c23)
DHCPOFFER of 10.22.157.186 from 10.22.157.149
DHCPACK of 10.22.157.186 from 10.22.157.149
bound to 10.22.157.186 -- renewal in 39589 seconds.
 * Starting Mount network filesystems [ OK ]
 * Starting configure network device [ OK ]
 * Stopping Mount network filesystems [ OK ]
 * Stopping DHCP any connected, but unconfigured network interfaces [ OK ]
 * Starting configure network device [ OK ]
 * Stopping DHCP any connected, but unconfigured network interfaces [ OK ]
 * Starting configure network device [ OK ]

And it stops there.

I see this on about 10% of deploys.

ubuntu-server-builder commented 1 year ago

Launchpad user Robert Collins(lifeless) wrote on 2014-05-14T19:32:30.829703+00:00

Affects https://etherpad.openstack.org/p/tripleo-end-to-end-automatic-bm

ubuntu-server-builder commented 1 year ago

Launchpad user Robert Collins(lifeless) wrote on 2014-05-19T03:13:12.916581+00:00

ubuntu-server-builder commented 1 year ago

Launchpad user Robert Collins(lifeless) wrote on 2014-05-19T03:15:09.799928+00:00

the job things hang on:

description "configure network device"

emits net-device-up
emits net-device-down
emits static-network-up

start on net-device-added
stop on net-device-removed INTERFACE=$INTERFACE

instance $INTERFACE
export INTERFACE

pre-start script
    if [ "$INTERFACE" = lo ]; then
        # bring this up even if /etc/network/interfaces is broken
        ifconfig lo 127.0.0.1 up || true
        initctl emit -n net-device-up \
            IFACE=lo LOGICAL=lo ADDRFAM=inet METHOD=loopback || true
    fi
    mkdir -p /run/network
    exec ifup --allow auto $INTERFACE
end script

post-stop exec ifdown --force --allow auto $INTERFACE

ubuntu-server-builder commented 1 year ago

Launchpad user Adam Gandelman(gandelman-a) wrote on 2014-05-23T00:45:02.729966+00:00

FWIW, getting a shell on an instance stuck in this state shows that cloud-init is still running:

root 894 0.0 0.0 15260 636 ? S May22 0:00 upstart-socket-bridge --daemon
root 1051 0.0 0.0 86100 20936 ? Ss May22 0:00 /usr/bin/python /usr/bin/cloud-init init
root 1060 0.0 0.0 4444 648 ? S May22 0:00 /bin/sh -c tee -a /var/log/cloud-init-output.log
root 1061 0.0 0.0 4348 584 ? S May22 0:00 tee -a /var/log/cloud-init-output.log
root 1263 0.0 0.0 10224 2408 ? Ss May22 0:00 dhclient -1 -v -pf /run/dhclient.eth2.pid -lf /var/lib/dhcp/dhclient.eth2.leases eth2
ntp 1395 0.0 0.0 31444 2012 ? Ss May22 0:00 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 107:112
root 1417 0.0 0.0 25108 1056 ? S May22 0:00 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 107:112

At this point /etc/network/interfaces has the correct entry (eth2 in this case) and it has dhcp'd its address.

ubuntu-server-builder commented 1 year ago

Launchpad user Adam Gandelman(gandelman-a) wrote on 2014-05-23T02:15:17.446434+00:00

gdb'ing the stuck cloud-init process shows it stuck in a select() caused by http://bazaar.launchpad.net/~cloud-init-dev/cloud-init/trunk/view/head:/cloudinit/cs_utils.py#L81. It looks like a new datasource (DataSourceCloudSigma) was added to cloud-init since saucy. It attempts to read/write from /dev/ttyS0, hangs and blocks boot. Killing the process gets boot going (albeit incomplete WRT cloud-init). As a workaround, updating the image and simply deleting usr/lib/python2.7/dist-packages/cloudinit/sources/DataSourceCloudSigma.py fixes the issue.

ubuntu-server-builder commented 1 year ago

Launchpad user Gregory Haynes(greghaynes) wrote on 2014-05-23T06:10:45.918318+00:00

I was able to reproduce this in a VM reliably by simply adding a second serial device and booting a cloud image with no cloud-init datasources.

Epic find Adam!

ubuntu-server-builder commented 1 year ago

Launchpad user Ben Howard(darkmuggle-deactivatedaccount) wrote on 2014-05-26T20:03:43.633789+00:00

The culprit here is that there is no timeout on the serial console read/write.

From cloudinit/cs_utils.py:

    def __init__(self, request):
        self.request = request
        self.raw_result = self._execute()
        self.result = self._marshal(self.raw_result)

    def _execute(self):
        connection = serial.Serial(SERIAL_PORT)
        connection.write(self.request)
        return connection.readline().strip('\x04\n')

Further, since we are blocking on the serial port, I have to question whether or not this should be a default enabled source. The other serial terminal DS is SmartOS, which is disabled by default. There are a lot of good reasons why people attach serial consoles, but assuming that it is safe for cloud-init to read/write to a serial console seems like a great way to break infrastructure or control planes.

IMHO, I think that the fix should be twofold: 1) disable this ds by default; 2) enforce a reasonable timeout. I've attached a rough patch of what I am thinking here.

We should get CloudSigma to clarify what the timeout should be before we enforce the timeout.
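A bounded serial read along these lines might look like the sketch below. It is not the attached patch: the 5-second value is a placeholder (the right timeout is exactly what needs clarifying), the helper names `open_port` and `execute` are hypothetical, and the code is written for Python 3 bytes rather than the Python 2 strings in the snippet quoted above. It relies only on pyserial's `timeout` argument to `serial.Serial`, which makes `readline()` return whatever has arrived (possibly nothing) once the timeout expires.

```python
# Assumption: placeholder value; the real timeout should come from CloudSigma.
DEFAULT_TIMEOUT_SECONDS = 5.0


def open_port(port="/dev/ttyS1", timeout=DEFAULT_TIMEOUT_SECONDS):
    """Open the port with a read timeout so readline() cannot block forever."""
    import serial  # pyserial; imported lazily so execute() has no dependency

    return serial.Serial(port, timeout=timeout)


def execute(connection, request):
    """Send the request and return the reply, or None if nothing arrived in time."""
    connection.write(request)
    raw = connection.readline()  # empty bytes when the read timed out
    if not raw:
        return None
    # Strip the EOT marker and newline that terminate a reply.
    return raw.strip(b"\x04\n")
```

A caller would use `execute(open_port(), request)` and treat `None` as "no CloudSigma server on this port". Note that this only bounds the read; the write to a port that another program owns is still a side effect, which is why disabling the datasource by default is proposed as well.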

That said, I think that an SRU that disables the DS is warranted.
Launchpad attachments: Make CloudSigma datasource timeout on serial console

ubuntu-server-builder commented 1 year ago

Launchpad user Ben Howard(darkmuggle-deactivatedaccount) wrote on 2014-05-26T20:10:50.090274+00:00

ubuntu-server-builder commented 1 year ago

Launchpad user Robert Collins(lifeless) wrote on 2014-05-27T00:15:34.029607+00:00

+1 on disabling cloudsigma by default.

ubuntu-server-builder commented 1 year ago

Launchpad user Ben Howard(darkmuggle-deactivatedaccount) wrote on 2014-05-27T16:43:38.243739+00:00

Launchpad attachments: Debdiff for 14.04 LTS patch

ubuntu-server-builder commented 1 year ago

Launchpad user Scott Moser(smoser) wrote on 2014-05-27T19:44:54.289900+00:00

We're looking at this. The general rule in cloud-init should be "enabled by default if and only if there are no negative side effects". The one exception is the EC2 metadata service (it polls and has very annoying timeouts). However, it's generally configured to be last, so all others have failed at that point.

We'll see if there is some way we can determine that we're running on CloudSigma and if so, then block on ttyS1. If not, go on quickly.

ubuntu-server-builder commented 1 year ago

Launchpad user Viktor Petersson(vpetersson) wrote on 2014-05-30T06:07:35.619658+00:00

@scott We're looking at this internally now and hope to have a fix shortly that adds some unique variables, as suggested.

ubuntu-server-builder commented 1 year ago

Launchpad user Adam Gandelman(gandelman-a) wrote on 2014-06-03T19:09:36.640155+00:00

Proposed DIB fix here: https://review.openstack.org/95598

ubuntu-server-builder commented 1 year ago

Launchpad user Launchpad Janitor(janitor) wrote on 2014-06-03T21:05:50.794075+00:00

This bug was fixed in the package cloud-init - 0.7.6~bzr976-0ubuntu1


cloud-init (0.7.6~bzr976-0ubuntu1) utopic; urgency=medium

ubuntu-server-builder commented 1 year ago

Launchpad user OpenStack Infra(hudson-openstack) wrote on 2014-06-03T21:56:42.461317+00:00

Fix proposed to branch: master Review: https://review.openstack.org/97634

ubuntu-server-builder commented 1 year ago

Launchpad user OpenStack Infra(hudson-openstack) wrote on 2014-06-10T06:43:52.335905+00:00

Reviewed: https://review.openstack.org/95598
Committed: https://git.openstack.org/cgit/openstack/diskimage-builder/commit/?id=f645287ec45ef49eaee9a04f5d18e2a9c7d928db
Submitter: Jenkins
Branch: master

commit f645287ec45ef49eaee9a04f5d18e2a9c7d928db
Author: Adam Gandelman <adamg@ubuntu.com>
Date: Mon May 26 14:35:57 2014 -0700

Add new cloud-init-datasources element

This moves cloud-init data source configuration to a general purpose
cloud-init-datasources element that can be used to explicitly configure
the list of cloud-init sources that will be queried on first boot.

cloud-init-nocloud now depends on this new element to configure the
datasource_list while continuing to prep the image for a nocloud first boot.

Change-Id: Ibcc3b86d6ca567a23f89b7a1a36bc713e444ef68
Closes-bug: #1316475

ubuntu-server-builder commented 1 year ago

Launchpad user OpenStack Infra(hudson-openstack) wrote on 2014-06-10T23:06:29.936834+00:00

Reviewed: https://review.openstack.org/97634
Committed: https://git.openstack.org/cgit/openstack/diskimage-builder/commit/?id=f61c1acf81dc73aaa3ed80ff734dbe0a6817b284
Submitter: Jenkins
Branch: master

commit f61c1acf81dc73aaa3ed80ff734dbe0a6817b284
Author: Adam Gandelman <adamg@ubuntu.com>
Date: Tue Jun 3 14:54:22 2014 -0700

Only use Ec2 cloud-init data source for Ubuntu

Default to only having cloud-init query Ec2 on first boot for Ubuntu,
until cloud-init has been SRU'd to fix the CloudSigma data source issue
that causes Trusty boots to hang.

Change-Id: Icb3734d5ae78f4a0a6c0fae1af4a2ce3c809308c
Partial-bug: #1316475
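
For readers unfamiliar with how an image restricts cloud-init to specific datasources, the commit above amounts to writing a `datasource_list` setting into a cloud-init config fragment. The sketch below only illustrates that idea and is not taken from diskimage-builder: the helper names and the file path in the usage note are hypothetical.

```python
def render_datasource_config(datasources):
    """Render a config fragment restricting cloud-init to the given datasources."""
    if not datasources:
        raise ValueError("at least one datasource is required")
    return "datasource_list: [ {} ]\n".format(", ".join(datasources))


def write_datasource_config(path, datasources):
    """Write the fragment to path (for example under /etc/cloud/cloud.cfg.d/)."""
    with open(path, "w") as handle:
        handle.write(render_datasource_config(datasources))
```

For example, `write_datasource_config("/etc/cloud/cloud.cfg.d/90_datasources.cfg", ["Ec2"])` (hypothetical file name) would leave cloud-init querying only Ec2 on first boot, so the CloudSigma datasource never touches the serial port.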
ubuntu-server-builder commented 1 year ago

Launchpad user Ben Howard(darkmuggle-deactivatedaccount) wrote on 2014-06-18T22:01:45.915310+00:00

Proposing backported CloudSigma DS from 14.10 as fixing this issue for SRU.
Launchpad attachments: Debdiff of backported 14.10 DS

ubuntu-server-builder commented 1 year ago

Launchpad user Chris J Arges(arges) wrote on 2014-06-24T15:27:29.966641+00:00

Hello Robert, or anyone else affected,

Accepted cloud-init into trusty-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/cloud-init/0.7.5-0ubuntu1.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

ubuntu-server-builder commented 1 year ago

Launchpad user Adam Gandelman(gandelman-a) wrote on 2014-06-28T19:58:15.609878+00:00

Was able to test the proposed package and verify the issue is resolved. Test:

1)

2)

Thanks for the fix.

ubuntu-server-builder commented 1 year ago

Launchpad user Launchpad Janitor(janitor) wrote on 2014-07-22T15:10:50.337152+00:00

This bug was fixed in the package cloud-init - 0.7.5-0ubuntu1.1


cloud-init (0.7.5-0ubuntu1.1) trusty-proposed; urgency=medium

[ Ben Howard ]

ubuntu-server-builder commented 1 year ago

Launchpad user Adam Conrad(adconrad) wrote on 2014-07-22T15:11:10.735053+00:00

The verification of the Stable Release Update for cloud-init has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

ubuntu-server-builder commented 1 year ago

Launchpad user Scott Moser(smoser) wrote on 2014-10-10T15:35:55.430041+00:00

fixed in 0.7.6