cloudfoundry / bosh-linux-stemcell-builder

BOSH Ubuntu Linux stemcells
Apache License 2.0

Jammy Ephemeral disk error for bosh vms #261

Closed adam-jian-zhang closed 1 year ago

adam-jian-zhang commented 1 year ago

We recently updated to the jammy stemcell, and since then we have hit BOSH compilation VM timeouts quite often. The errors look like this:

Task 34 | 07:02:26 | Compiling packages: database-backup-restorer-postgres-13/1d320faa1b7bd9c5bfb57ac2bec6502fe2ffa5d03fdfea5b4896f9000c30b19f (00:06:27)
Task 34 | 07:02:26 | Compiling packages: database-backup-restorer-postgres-9.4/70d321821ff300fbaef47d64fb7f7b5d33ede23c2349cbf1950886c40f25c2e8
Task 34 | 07:03:56 | Compiling packages: database-backup-restorer-mysql-8.0/0c4daf59c89614acd0ea73295daf423b0df78ed5abfe289ea42e12b927ca6f65 (00:11:06)
L Error: Timed out pinging VM 'vm-a7d54522-a83e-4801-933f-1870a1b38559' with agent 'c06df9cb-2cdc-47fa-905e-3ef3e5a5bf27' after 600 seconds
Task 34 | 07:05:19 | Compiling packages: database-backup-restorer-postgres-10/86674dfed7d233cbf0d260280bfb840bc9ad16a38c938099373389a7559c3740 (00:05:50)
Task 34 | 07:07:33 | Compiling packages: database-backup-restorer-postgres-9.4/70d321821ff300fbaef47d64fb7f7b5d33ede23c2349cbf1950886c40f25c2e8 (00:05:07)
Task 34 | 07:07:46 | Compiling packages: database-backup-restorer-postgres-9.6/586f5a12f2215763da655d68ae7a07ad65d95062dfb71bc195f8e48e317343b7 (00:05:43)
Task 34 | 07:08:11 | Error: Timed out pinging VM 'vm-a7d54522-a83e-4801-933f-1870a1b38559' with agent 'c06df9cb-2cdc-47fa-905e-3ef3e5a5bf27' after 600 seconds

We later attached the disks of the failed compilation VM to another VM and found that the second (ephemeral) disk was not formatted. Logs from the first (OS) disk:

--
2022-11-16_06:55:17.51383 [linuxPlatform] 2022/11/16 06:55:17 INFO - Setting up ephemeral disk...
2022-11-16_06:55:17.51384 [File System] 2022/11/16 06:55:17 DEBUG - Glob '/var/vcap/data/*'
2022-11-16_06:55:17.51405 [File System] 2022/11/16 06:55:17 DEBUG - Making dir /var/vcap/data with perm 0750
2022-11-16_06:55:17.51406 [main] 2022/11/16 06:55:17 ERROR - App setup Running bootstrap: Setting up ephemeral disk: No ephemeral disk found, cannot use root partition as ephemeral disk
2022-11-16_06:55:17.51418 [main] 2022/11/16 06:55:17 ERROR - Agent exited with error: Running bootstrap: Setting up ephemeral disk: No ephemeral disk found, cannot use root partition as ephemeral disk
2022-11-16_06:55:17.55575 [main] 2022/11/16 06:55:17 DEBUG - Starting agent
2022-11-16_06:55:17.55579 [File System] 2022/11/16 06:55:17 DEBUG - Reading file /var/vcap/bosh/agent.json
2022-11-16_06:55:17.55580 [File System] 2022/11/16 06:55:17 DEBUG - Read content

2022-11-16_06:55:19.92779 [ConcreteUdevDevice] 2022/11/16 06:55:19 DEBUG - Kicking device, attempt 4 of 5
2022-11-16_06:55:19.92787 [ConcreteUdevDevice] 2022/11/16 06:55:19 DEBUG - readBytes from file: /dev/sr0
2022-11-16_06:55:20.51982 [ConcreteUdevDevice] 2022/11/16 06:55:20 DEBUG - readBytes from file: /dev/sr0
2022-11-16_06:55:20.61087 [ConcreteUdevDevice] 2022/11/16 06:55:20 ERROR - Failed to read byte from device: open /dev/sr0: no medium found
2022-11-16_06:55:20.61092 [ConcreteUdevDevice] 2022/11/16 06:55:20 DEBUG - Settling UdevDevice
2022-11-16_06:55:20.61094 [Cmd Runner] 2022/11/16 06:55:20 DEBUG - Running command 'udevadm settle'
2022-11-16_06:55:20.63129 [Cmd Runner] 2022/11/16 06:55:20 DEBUG - Stdout:

2022-11-16_06:55:22.99528 [ConcreteUdevDevice] 2022/11/16 06:55:22 DEBUG - readBytes from file: /dev/sr0
2022-11-16_06:55:23.07899 [ConcreteUdevDevice] 2022/11/16 06:55:23 DEBUG - Ignorable error from readByte: open /dev/sr0: no medium found
2022-11-16_06:55:23.58041 [ConcreteUdevDevice] 2022/11/16 06:55:23 DEBUG - readBytes from file: /dev/sr0
2022-11-16_06:55:23.66682 [settingsService] 2022/11/16 06:55:23 ERROR - Failed loading settings via fetcher: Getting settings from all sources: Reading files from CDROM: Waiting for CDROM to be ready: Reading udev device: open /dev/sr0: no medium found
2022-11-16_06:55:23.66694 [settingsService] 2022/11/16 06:55:23 DEBUG - Successfully read settings from file
2022-11-16_06:55:23.66746 [File System] 2022/11/16 06:55:23 DEBUG - Checking if file exists /var/vcap/bosh/update_settings.json
2022-11-16_06:55:23.66747 [File System] 2022/11/16 06:55:23 DEBUG - Stat '/var/vcap/bosh/update_settings.json'
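
When digging through agent logs like the excerpts above, it can help to pull out only the ERROR entries. The following is a minimal sketch, not part of the agent or of any BOSH tooling. It assumes one log entry per line in the layout seen in this issue (timestamp, bracketed tag, date and time, level, " - ", message); `LINE_RE` and `entries_at_level` are hypothetical names chosen for illustration.

```python
import re

# Assumed layout, taken from the excerpts in this issue:
# <timestamp> [<tag>] <YYYY/MM/DD> <HH:MM:SS> <LEVEL> - <message>
LINE_RE = re.compile(
    r"^\S+ \[(?P<tag>[^\]]+)\] "
    r"\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} "
    r"(?P<level>[A-Z]+) - (?P<message>.*)$"
)


def entries_at_level(lines, level="ERROR"):
    """Return (tag, message) pairs for log lines at the given level."""
    found = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group("level") == level:
            found.append((match.group("tag"), match.group("message")))
    return found


# Three lines copied from the log excerpt in this issue.
SAMPLE_LOG = [
    "2022-11-16_06:55:17.51383 [linuxPlatform] 2022/11/16 06:55:17 INFO - Setting up ephemeral disk...",
    "2022-11-16_06:55:17.51406 [main] 2022/11/16 06:55:17 ERROR - App setup Running bootstrap: "
    "Setting up ephemeral disk: No ephemeral disk found, cannot use root partition as ephemeral disk",
    "2022-11-16_06:55:17.55575 [main] 2022/11/16 06:55:17 DEBUG - Starting agent",
]

if __name__ == "__main__":
    for tag, message in entries_at_level(SAMPLE_LOG):
        print(f"[{tag}] {message}")
```

Lines that do not fit the assumed layout are skipped silently, so a filter like this is only suitable for a quick first pass over the log.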

It happens quite often on vSphere, and sometimes on AWS as well. We do not have the problem with the xenial stemcell on the same platform, so we suspect this is a stemcell issue.

adam-jian-zhang commented 1 year ago

Reproduced the problem and found that the disk order is wrong. The 5G disk is the boot disk and the 16G disk is the ephemeral disk. In this VM the boot disk is /dev/sdb, whereas in a normal setup it should be /dev/sda.

/:/var/log# fdisk -l

Disk /dev/sdb: 5 GiB, 5368709120 bytes, 10485760 sectors
Disk model: Virtual disk
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x47d4d47b

Device Boot Start End Sectors Size Id Type
/dev/sdb1 63 9998046 9997984 4.8G 83 Linux

Disk /dev/sda: 16 GiB, 17179869184 bytes, 33554432 sectors
Disk model: Virtual disk
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
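
The check done by eye on the listing above (which disk carries the boot partition, regardless of its device name) can be sketched in a few lines. This is only an illustration under stated assumptions, not the fix proposed for the agent: it assumes `fdisk -l` text in the layout shown here, and `parse_fdisk` and `partitioned_disks` are hypothetical helper names. The sample text is the listing from this comment, shortened to the lines the parser reads; the /dev/sda part ends where the pasted output ends.

```python
import re

# Per-disk header line, e.g.
# "Disk /dev/sdb: 5 GiB, 5368709120 bytes, 10485760 sectors"
DISK_RE = re.compile(r"^Disk (/dev/\S+): .*?, (\d+) bytes, \d+ sectors$")
# Partition row, e.g. "/dev/sdb1 63 9998046 9997984 4.8G 83 Linux"
PART_RE = re.compile(r"^(/dev/\S+?\d+)\s")


def parse_fdisk(text):
    """Map each disk to its size in bytes and the partitions listed under it."""
    disks = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        disk = DISK_RE.match(line)
        if disk:
            current = disk.group(1)
            disks[current] = {"bytes": int(disk.group(2)), "partitions": []}
            continue
        part = PART_RE.match(line)
        if part and current is not None and part.group(1).startswith(current):
            disks[current]["partitions"].append(part.group(1))
    return disks


def partitioned_disks(disks):
    """Return the disks that have at least one partition, sorted by name."""
    return sorted(name for name, info in disks.items() if info["partitions"])


SAMPLE_FDISK = """\
Disk /dev/sdb: 5 GiB, 5368709120 bytes, 10485760 sectors
Disk model: Virtual disk
Disklabel type: dos

Device Boot Start End Sectors Size Id Type
/dev/sdb1 63 9998046 9997984 4.8G 83 Linux

Disk /dev/sda: 16 GiB, 17179869184 bytes, 33554432 sectors
Disk model: Virtual disk
"""

if __name__ == "__main__":
    disks = parse_fdisk(SAMPLE_FDISK)
    for name, info in disks.items():
        print(name, info["bytes"], info["partitions"])
    print("disks with partitions:", partitioned_disks(disks))
```

On this sample the only disk with a partition is /dev/sdb, which matches the observation that the boot disk is not /dev/sda on the affected VM. The point of the sketch is that a lookup keyed on partitions or size does not depend on the kernel's device naming order.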

ramonskie commented 1 year ago

A fix has been proposed in https://github.com/cloudfoundry/bosh-agent/pull/298

rkoster commented 1 year ago

The above PR has been merged