balena-os / takeover

Migrate arbitrary devices to balena
12 stars 8 forks source link

Migration fails with `kill_procs failed` in stage2 #62

Closed lmbarros closed 5 months ago

lmbarros commented 7 months ago

Stage2 logs end like this:

2024-02-28 13:54:58 ERROR [takeover::stage2] kill_procs failed, error An error occurred in an upstream function, context: Failed to read link '/proc/472/fd/0'
  caused by: No such file or directory (os error 2)

And a little earlier we can see PID 472 refers to polkitd:

2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    472 name: polkitd   executable: /mnt/old_root/usr/lib/polkit-1/polkitd  ppid: 1
lmbarros commented 7 months ago

Details about this instance of the error:

Complete logs follow.

stage1.log:

2024-02-28 13:54:29 INFO  Detected OS Architecture is ARM64
2024-02-28 13:54:29 INFO  Disabling all device type-related checks due to no-dt-check option
2024-02-28 13:54:29 INFO  Detected device type: Dummy,
2024-02-28 13:54:29 INFO  Configured for application id: 2034527
2024-02-28 13:54:29 INFO  connection to api: api.balena-cloud.com:80 is ok
2024-02-28 13:54:29 INFO  connection to vpn: cloudlink.balena-cloud.com:443 is ok
2024-02-28 13:54:29 INFO  config.json is for device type raspberrypi3-64
2024-02-28 13:54:29 WARN  [takeover::stage1::image_retrieval] OS download is not supported for device type 'raspberrypi3-64', proceeding due to no-dt-check option
2024-02-28 13:54:29 DEBUG [takeover::stage1::api_calls] get_os_versions: request_url: 'https://api.balena-cloud.com/v6/release?$select=raw_version&$filter=(is_final%20eq%20true)%20and%20(is_passing_tests%20eq%20true)%20and%20(is_invalidated%20eq%20false)%20and%20(status%20eq%20%27success%27)%20and%20(belongs_to__application/any(bta:((bta/is_host%20eq%20true)%20and%20(bta/is_for__device_type/any(iodt:iodt/slug%20eq%20%27raspberrypi3-64%27)))%20and%20(not(bta/application_tag/any(at:at/tag_key%20eq%20%27release-policy%27))%20or%20(bta/application_tag/any(at:(at/tag_key%20eq%20%27release-policy%27)%20and%20(at/value%20eq%20%27default%27))))))&$orderby=created_at%20desc'
2024-02-28 13:54:30 DEBUG [reqwest::connect] starting new connection: https://api.balena-cloud.com/
2024-02-28 13:54:30 DEBUG [takeover::stage1::api_calls] Result = Response { url: Url { scheme: "https", cannot_be_a_base: false, username: "", password: None, host: Some(Domain("api.balena-cloud.com")), port: None, path: "/v6/release", query: Some("$select=raw_version&$filter=(is_final%20eq%20true)%20and%20(is_passing_tests%20eq%20true)%20and%20(is_invalidated%20eq%20false)%20and%20(status%20eq%20%27success%27)%20and%20(belongs_to__application/any(bta:((bta/is_host%20eq%20true)%20and%20(bta/is_for__device_type/any(iodt:iodt/slug%20eq%20%27raspberrypi3-64%27)))%20and%20(not(bta/application_tag/any(at:at/tag_key%20eq%20%27release-policy%27))%20or%20(bta/application_tag/any(at:(at/tag_key%20eq%20%27release-policy%27)%20and%20(at/value%20eq%20%27default%27))))))&$orderby=created_at%20desc"), fragment: None }, status: 200, headers: {"date": "Wed, 28 Feb 2024 13:54:30 GMT", "content-type": "application/json; charset=utf-8", "transfer-encoding": "chunked", "connection": "keep-alive", "cache-control": "no-cache", "etag": "W/\"5b7-3SqWAv4pwRF8/cK1Z98jFzBk1JA\"", "vary": "Accept-Encoding", "cf-cache-status": "DYNAMIC", "strict-transport-security": "max-age=15552000", "server": "cloudflare", "cf-ray": "85c928eabb70bcc3-POA", "alt-svc": "h3=\":443\"; ma=86400"} }
2024-02-28 13:54:30 DEBUG [takeover::stage1::image_retrieval] Looking at version 5.0.8
2024-02-28 13:54:30 INFO  Selected default version (5.0.8) for download
2024-02-28 13:54:30 INFO  Downloading Balena OS image, selected version is: '5.0.8'
2024-02-28 13:54:30 DEBUG [takeover::stage1::api_calls] get_os_image: request_url: 'https://api.balena-cloud.com/download'
2024-02-28 13:54:30 DEBUG [takeover::stage1::api_calls] get_os_image: data: 'ImageRequestData { device_type: "raspberrypi3-64", version: "5.0.8", file_type: ".gz" }'
2024-02-28 13:54:30 DEBUG [reqwest::connect] starting new connection: https://api.balena-cloud.com/
2024-02-28 13:54:32 DEBUG [takeover::stage1::api_calls] Result = Response { url: Url { scheme: "https", cannot_be_a_base: false, username: "", password: None, host: Some(Domain("api.balena-cloud.com")), port: None, path: "/download", query: None, fragment: None }, status: 200, headers: {"date": "Wed, 28 Feb 2024 13:54:32 GMT", "content-type": "application/octet-stream", "content-length": "180411412", "connection": "keep-alive", "vary": "X-HTTP-Method-Override", "access-control-allow-origin": "*", "access-control-allow-methods": "GET, PUT, POST, PATCH, DELETE, OPTIONS, HEAD", "access-control-allow-headers": "Content-Type, Authorization, Application-Record-Count, MaxDataServiceVersion, X-Requested-With", "access-control-allow-credentials": "true", "access-control-max-age": "86400", "content-disposition": "attachment; filename=\"raspberrypi3-64-5.0.8-v15.0.4.img.gz\"", "x-transfer-length": "180411412", "cache-control": "no-store, no-transform", "cf-cache-status": "DYNAMIC", "strict-transport-security": "max-age=15552000", "server": "cloudflare", "cf-ray": "85c928ef2d05bccf-POA", "alt-svc": "h3=\":443\"; ma=86400"} }
2024-02-28 13:54:32 DEBUG [takeover::stage1::image_retrieval] Downloading file '/home/lmb/balena-cloud-raspberrypi3-64-5.0.8.img.gz'
2024-02-28 13:54:44 INFO  132 MiB read in 11 seconds @12 MiB/sec 
2024-02-28 13:54:47 INFO  172 MiB read in 14 seconds @12 MiB/sec 
2024-02-28 13:54:47 INFO  The balena OS image was successfully written to '/home/lmb/balena-cloud-raspberrypi3-64-5.0.8.img.gz'
2024-02-28 13:54:47 DEBUG [takeover::stage1::migrate_info] image path: '/home/lmb/balena-cloud-raspberrypi3-64-5.0.8.img.gz'
2024-02-28 13:54:47 DEBUG [takeover::common] call: output: Output { status: ExitStatus(unix_wait_status(0)), stdout: "537\n", stderr: "" }
2024-02-28 13:54:47 WARN  [takeover::stage1::migrate_info] No Network manager files were found, the device might not be able to come online
2024-02-28 13:54:47 INFO  Writing hostname to config.json: 'raspberrypi'
2024-02-28 13:54:47 DEBUG [takeover::common] parse_file: line: 'PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"'
2024-02-28 13:54:52 INFO  Preparing for takeover..
2024-02-28 13:54:52 DEBUG [takeover::common] call: output: Output { status: ExitStatus(unix_wait_status(0)), stdout: "", stderr: "" }
2024-02-28 13:54:52 DEBUG [takeover::common] call: output: Output { status: ExitStatus(unix_wait_status(0)), stdout: "\tlinux-vdso.so.1 (0x0000007fba1f6000)\n\tlibc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000007fb9fd0000)\n\t/lib/ld-linux-aarch64.so.1 (0x0000007fba1b9000)\n", stderr: "" }
2024-02-28 13:54:52 DEBUG [takeover::stage1::exe_copy] get_libs: no path for linux-vdso.so.1
2024-02-28 13:54:52 DEBUG [takeover::common] call: output: Output { status: ExitStatus(unix_wait_status(256)), stdout: "", stderr: "\tnot a dynamic executable\n" }
2024-02-28 13:54:52 DEBUG [takeover::common] call: output: Output { status: ExitStatus(unix_wait_status(0)), stdout: "\tstatically linked\n", stderr: "" }
2024-02-28 13:54:52 DEBUG [takeover::stage1::exe_copy] get_libs: lib is statically linked: '/lib/ld-linux-aarch64.so.1'
2024-02-28 13:54:52 DEBUG [takeover::common] call: output: Output { status: ExitStatus(unix_wait_status(0)), stdout: "\t/lib/ld-linux-aarch64.so.1 (0x0000007fae965000)\n\tlinux-vdso.so.1 (0x0000007fae9a2000)\n", stderr: "" }
2024-02-28 13:54:52 DEBUG [takeover::stage1::exe_copy] get_libs: no path for linux-vdso.so.1
2024-02-28 13:54:52 DEBUG [takeover::stage1] Space required for commands: 11 MiB
2024-02-28 13:54:52 INFO  Found 908 MiB total, 232 MiB free memory
2024-02-28 13:54:52 DEBUG [takeover::common::system] mkdir: '/balena-takeover'
2024-02-28 13:54:52 INFO  Using '/balena-takeover' as takeover directory
2024-02-28 13:54:52 INFO  Mounted tmpfs file system on '/balena-takeover'
2024-02-28 13:54:52 INFO  Created mtab in  '/balena-takeover/etc/mtab'
2024-02-28 13:54:52 INFO  Mounted proc file system on '/balena-takeover/proc'
2024-02-28 13:54:52 INFO  Mounted tmpfs file system on '/balena-takeover/tmp'
2024-02-28 13:54:52 INFO  Mounted sys file system on '/balena-takeover/sys'
2024-02-28 13:54:52 INFO  Mounted dev file system on '/balena-takeover/dev'
2024-02-28 13:54:52 DEBUG [takeover::common::debug] After dev mount
2024-02-28 13:54:52 DEBUG [takeover::common::debug] check if /dev/loop-control exists in new /dev: true
2024-02-28 13:54:52 DEBUG [takeover::common::debug] open /dev/loop-control succeeded
2024-02-28 13:54:52 INFO  Mounted devpts file system on '/balena-takeover/dev/pts'
2024-02-28 13:54:52 INFO  Created directory '/balena-takeover/mnt/old_root'
2024-02-28 13:54:52 DEBUG [takeover::stage1::exe_copy] copy_file: copying '/lib/ld-linux-aarch64.so.1' to '/balena-takeover/lib/ld-linux-aarch64.so.1'
2024-02-28 13:54:52 DEBUG [takeover::stage1::exe_copy] copy_file: copying '/lib/aarch64-linux-gnu/libc.so.6' to '/balena-takeover/lib/aarch64-linux-gnu/libc.so.6'
2024-02-28 13:54:52 INFO  Copied '/bin/dd' to '/balena-takeover/bin/dd'
2024-02-28 13:54:53 INFO  Copied '/home/lmb/takeover' to '/balena-takeover/bin/takeover'
2024-02-28 13:54:53 INFO  Copied config.json to '/home/lmb/config.Kpj6ZR7Hdf.json'
2024-02-28 13:54:53 DEBUG [takeover::stage1::block_device_info::mount] from_mtab: processing mount Mount { mountpoint: "/", fs_type: "ext4" }
2024-02-28 13:54:53 DEBUG [takeover::stage1::block_device_info::mount] from_mtab: processing mount Mount { mountpoint: "/boot/firmware", fs_type: "vfat" }
2024-02-28 13:54:53 DEBUG [takeover::stage1::block_device_info::mount] from_mtab: processing mount Mount { mountpoint: "/mnt/pen", fs_type: "vfat" }
2024-02-28 13:54:53 DEBUG [takeover::stage1::block_device_info] new: Root device number is: 179:2
2024-02-28 13:54:53 DEBUG [takeover::common] call: output: Output { status: ExitStatus(unix_wait_status(0)), stdout: "/dev/mmcblk0p1: LABEL_FATBOOT=\"bootfs\" LABEL=\"bootfs\" UUID=\"5DF9-E225\" BLOCK_SIZE=\"512\" TYPE=\"vfat\" PARTUUID=\"d0eeea92-01\"\n", stderr: "" }
2024-02-28 13:54:53 DEBUG [takeover::stage1::block_device_info::partition] PartitionInfo::new: /dev/mmcblk0p1 got param name: LABEL_FATBOOT, value: bootfs
2024-02-28 13:54:53 WARN  [takeover::stage1::block_device_info::partition] unexpected parameter name found: 'LABEL_FATBOOT'
2024-02-28 13:54:53 DEBUG [takeover::stage1::block_device_info::partition] PartitionInfo::new: /dev/mmcblk0p1 got param name: LABEL, value: bootfs
2024-02-28 13:54:53 WARN  [takeover::stage1::block_device_info::partition] unexpected parameter name found: 'LABEL'
2024-02-28 13:54:53 DEBUG [takeover::stage1::block_device_info::partition] PartitionInfo::new: /dev/mmcblk0p1 got param name: UUID, value: 5DF9-E225
2024-02-28 13:54:53 DEBUG [takeover::stage1::block_device_info::partition] PartitionInfo::new: /dev/mmcblk0p1 got param name: BLOCK_SIZE, value: 512
2024-02-28 13:54:53 DEBUG [takeover::stage1::block_device_info::partition] PartitionInfo::new: /dev/mmcblk0p1 got param name: TYPE, value: vfat
2024-02-28 13:54:53 DEBUG [takeover::stage1::block_device_info::partition] PartitionInfo::new: /dev/mmcblk0p1 got param name: PARTUUID, value: d0eeea92-01
2024-02-28 13:54:53 DEBUG [takeover::stage1::block_device_info::partition] PartitionInfo::new: for /dev/mmcblk0p1 got PartitionInfo { uuid: Some("5DF9-E225"), block_size: Some(512), fs_type: Some("vfat"), label: None, part_uuid: Some("d0eeea92-01") }
2024-02-28 13:54:53 DEBUG [takeover::stage1::block_device_info] found  partition '"mmcblk0p1"' in '/sys/block/mmcblk0/mmcblk0p1'
2024-02-28 13:54:53 DEBUG [takeover::common] call: output: Output { status: ExitStatus(unix_wait_status(0)), stdout: "/dev/mmcblk0p2: LABEL=\"rootfs\" UUID=\"3b614a3f-4a65-4480-876a-8a998e01ac9b\" BLOCK_SIZE=\"4096\" TYPE=\"ext4\" PARTUUID=\"d0eeea92-02\"\n", stderr: "" }
2024-02-28 13:54:53 DEBUG [takeover::stage1::block_device_info::partition] PartitionInfo::new: /dev/mmcblk0p2 got param name: LABEL, value: rootfs
2024-02-28 13:54:53 WARN  [takeover::stage1::block_device_info::partition] unexpected parameter name found: 'LABEL'
2024-02-28 13:54:53 DEBUG [takeover::stage1::block_device_info::partition] PartitionInfo::new: /dev/mmcblk0p2 got param name: UUID, value: 3b614a3f-4a65-4480-876a-8a998e01ac9b
2024-02-28 13:54:53 DEBUG [takeover::stage1::block_device_info::partition] PartitionInfo::new: /dev/mmcblk0p2 got param name: BLOCK_SIZE, value: 4096
2024-02-28 13:54:53 DEBUG [takeover::stage1::block_device_info::partition] PartitionInfo::new: /dev/mmcblk0p2 got param name: TYPE, value: ext4
2024-02-28 13:54:53 DEBUG [takeover::stage1::block_device_info::partition] PartitionInfo::new: /dev/mmcblk0p2 got param name: PARTUUID, value: d0eeea92-02
2024-02-28 13:54:53 DEBUG [takeover::stage1::block_device_info::partition] PartitionInfo::new: for /dev/mmcblk0p2 got PartitionInfo { uuid: Some("3b614a3f-4a65-4480-876a-8a998e01ac9b"), block_size: Some(4096), fs_type: Some("ext4"), label: None, part_uuid: Some("d0eeea92-02") }
2024-02-28 13:54:53 DEBUG [takeover::stage1::block_device_info] found  partition '"mmcblk0p2"' in '/sys/block/mmcblk0/mmcblk0p2'
2024-02-28 13:54:53 DEBUG [takeover::stage1::block_device_info] new: got device: BlockDevice { name: "mmcblk0", device_num: DeviceNum { major: 179, minor: 0 }, mounted: None, parent: "None" }
2024-02-28 13:54:53 DEBUG [takeover::common] call: output: Output { status: ExitStatus(unix_wait_status(0)), stdout: "/dev/sda1: UUID=\"7607-9B6E\" BLOCK_SIZE=\"512\" TYPE=\"vfat\" PARTUUID=\"4e639091-01\"\n", stderr: "" }
2024-02-28 13:54:53 DEBUG [takeover::stage1::block_device_info::partition] PartitionInfo::new: /dev/sda1 got param name: UUID, value: 7607-9B6E
2024-02-28 13:54:53 DEBUG [takeover::stage1::block_device_info::partition] PartitionInfo::new: /dev/sda1 got param name: BLOCK_SIZE, value: 512
2024-02-28 13:54:53 DEBUG [takeover::stage1::block_device_info::partition] PartitionInfo::new: /dev/sda1 got param name: TYPE, value: vfat
2024-02-28 13:54:53 DEBUG [takeover::stage1::block_device_info::partition] PartitionInfo::new: /dev/sda1 got param name: PARTUUID, value: 4e639091-01
2024-02-28 13:54:53 DEBUG [takeover::stage1::block_device_info::partition] PartitionInfo::new: for /dev/sda1 got PartitionInfo { uuid: Some("7607-9B6E"), block_size: Some(512), fs_type: Some("vfat"), label: None, part_uuid: Some("4e639091-01") }
2024-02-28 13:54:53 DEBUG [takeover::stage1::block_device_info] found  partition '"sda1"' in '/sys/block/sda/sda1'
2024-02-28 13:54:53 DEBUG [takeover::stage1::block_device_info] new: got device: BlockDevice { name: "sda", device_num: DeviceNum { major: 8, minor: 0 }, mounted: None, parent: "None" }
2024-02-28 13:54:53 DEBUG [takeover::stage1] Stage 2 config: 
---
log_dev:
  dev_name: /dev/sda1
  fs_type: vfat
log_level: DEBUG
flash_dev: /dev/mmcblk0
pretend: false
umount_parts:
  - dev_name: /dev/mmcblk0p1
    mountpoint: /boot/firmware
    fs_type: vfat
  - dev_name: /dev/mmcblk0p2
    mountpoint: /
    fs_type: ext4
work_dir: /home/lmb
image_path: /home/lmb/balena-cloud-raspberrypi3-64-5.0.8.img.gz
config_path: /home/lmb/config.Kpj6ZR7Hdf.json
backup_path: ~
tty: /dev/pts/0

2024-02-28 13:54:53 INFO  Wrote stage2 config to '/balena-takeover/stage2-config.yml'
2024-02-28 13:54:53 DEBUG [takeover::stage1] Found telinit at '/usr/sbin/telinit'
2024-02-28 13:54:53 DEBUG [takeover::stage1] telinit is a symlink with canonical path '/usr/bin/systemctl'
2024-02-28 13:54:53 INFO  Copied '/usr/bin/systemctl' to '/balena-takeover/bin/telinit'
2024-02-28 13:54:53 INFO  Bind-mounted new init as '/balena-takeover/bin/takeover'
2024-02-28 13:54:53 DEBUG [takeover::stage1] calling '/balena-takeover/bin/telinit u'
2024-02-28 13:54:55 DEBUG [takeover::common] call: output: Output { status: ExitStatus(unix_wait_status(0)), stdout: "", stderr: "" }
2024-02-28 13:54:55 INFO  Restarted init
2024-02-28 13:54:55 INFO  Takeover initiated successfully, please wait for the device to be reflashed and reboot

stage2-init.log:

2024-02-28 13:54:55 INFO  [takeover::init] Init entered
2024-02-28 13:54:55 INFO  [takeover::init] Init check pid success!
2024-02-28 13:54:55 INFO  [takeover::init] Stage 2 config was read successfully
2024-02-28 13:54:55 INFO  [takeover::init] Stage 2 closed 44 fd's
2024-02-28 13:54:55 WARN  [takeover::init] Failed to unmount log device '/dev/sda1' from '/mnt/pen', error: EBUSY
2024-02-28 13:54:55 INFO  [takeover::init] Now logging to '/balena-takeover/mnt/log/stage2-init.log' on '/dev/sda1'
2024-02-28 13:54:55 INFO  [takeover::init] Stage 2 setup_log success!, ext_log: true
2024-02-28 13:54:57 DEBUG [takeover::common] call: output: Output { status: ExitStatus(unix_wait_status(0)), stdout: "", stderr: "" }
2024-02-28 13:54:57 DEBUG [takeover::common] call: output: Output { status: ExitStatus(unix_wait_status(0)), stdout: "", stderr: "" }
2024-02-28 13:54:57 INFO  [takeover::init] Stage 2 migrate worker spawned

stage2.log:

2024-02-28 13:54:57 INFO  [takeover::stage2] Stage 2 migrate_worker entered
2024-02-28 13:54:57 INFO  [takeover::stage2] Stage 2 config was read successfully
2024-02-28 13:54:57 INFO  [takeover::stage2] Set logfile to /mnt/log/stage2.log
2024-02-28 13:54:58 DEBUG [takeover::stage2] active processes:
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:      1 name: systemd   executable: /mnt/old_root/usr/lib/systemd/systemd   ppid: 0
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:      2 name: '[kthreadd]'  executable: -   ppid: 0
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:      3 name: '[rcu_gp]'    executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:      4 name: '[rcu_par_gp]'    executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:      5 name: '[slub_flushwq]'  executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:      6 name: '[netns]'     executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:      9 name: '[kworker/u8:0-flush-8:0]'    executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:     10 name: '[mm_percpu_wq]'  executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:     11 name: '[rcu_tasks_kthread]'     executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:     12 name: '[rcu_tasks_rude_kthread]'    executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:     13 name: '[rcu_tasks_trace_kthread]'   executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:     14 name: '[ksoftirqd/0]'   executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:     15 name: '[rcu_preempt]'   executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:     16 name: '[migration/0]'   executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:     17 name: '[cpuhp/0]'   executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:     18 name: '[cpuhp/1]'   executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:     19 name: '[migration/1]'   executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:     20 name: '[ksoftirqd/1]'   executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:     23 name: '[cpuhp/2]'   executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:     24 name: '[migration/2]'   executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:     25 name: '[ksoftirqd/2]'   executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:     28 name: '[cpuhp/3]'   executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:     29 name: '[migration/3]'   executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:     30 name: '[ksoftirqd/3]'   executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:     31 name: '[kworker/3:0-mm_percpu_wq]'  executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:     33 name: '[kdevtmpfs]'     executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:     34 name: '[inet_frag_wq]'  executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:     36 name: '[kauditd]'   executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:     37 name: '[kworker/0:2-mm_percpu_wq]'  executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:     38 name: '[khungtaskd]'    executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:     40 name: '[oom_reaper]'    executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:     41 name: '[writeback]'     executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:     42 name: '[kcompactd0]'    executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:     43 name: '[kintegrityd]'   executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:     44 name: '[kblockd]'   executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:     45 name: '[blkcg_punt_bio]'    executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:     46 name: '[watchdogd]'     executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:     47 name: '[kworker/u8:2-events_unbound]'   executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:     49 name: '[kworker/1:1H-mmc_complete]'     executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:     50 name: '[rpciod]'    executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:     51 name: '[xprtiod]'   executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:     53 name: '[kworker/2:1-events]'    executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:     55 name: '[kswapd0]'   executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:     56 name: '[nfsiod]'    executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:     57 name: '[kthrotld]'  executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:     62 name: '[hwrng]'     executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:     63 name: '[iscsi_conn_clea]'   executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:     64 name: '[nvme-wq]'   executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:     65 name: '[nvme-reset-wq]'     executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:     66 name: '[nvme-delete-wq]'    executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:     67 name: '[dwc_otg]'   executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:     68 name: '[DWC Notificatio]'   executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:     69 name: '[uas]'   executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:     70 name: '[vchiq-slot/0]'  executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:     71 name: '[vchiq-recy/0]'  executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:     72 name: '[vchiq-sync/0]'  executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:     79 name: '[zswap-shrink]'  executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    104 name: '[mmc_complete]'  executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    107 name: '[kworker/2:1H-kblockd]'  executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    127 name: '[kworker/3:1H-mmc_complete]'     executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    139 name: '[kworker/0:1H-kblockd]'  executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    140 name: '[scsi_eh_0]'     executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    141 name: '[scsi_tmf_0]'    executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    142 name: '[usb-storage]'   executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    154 name: '[kworker/1:2H]'  executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    173 name: '[kworker/2:2H]'  executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    174 name: '[kworker/1:3-events]'    executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    180 name: '[irq/184-usb-001:006:01]'    executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    183 name: '[kworker/0:2H-kblockd]'  executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    186 name: '[jbd2/mmcblk0p2-8]'  executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    187 name: '[ext4-rsv-conver]'   executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    202 name: '[mld]'   executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    203 name: '[ipv6_addrconf]'     executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    246 name: systemd-journal   executable: /mnt/old_root/usr/lib/systemd/systemd-journald  ppid: 1
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    252 name: '[ext4lazyinit]'  executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    253 name: '[kworker/2:3-events_freezable_power_]'   executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    268 name: systemd-udevd     executable: /usr/bin/udevadm    ppid: 1
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    309 name: '[vchiq-keep/0]'  executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    310 name: '[SMIO]'  executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    359 name: '[mmal-vchiq]'    executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    360 name: '[mmal-vchiq]'    executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    361 name: '[mmal-vchiq]'    executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    362 name: '[mmal-vchiq]'    executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    366 name: '[mmal-vchiq]'    executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    367 name: '[mmal-vchiq]'    executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    368 name: '[mmal-vchiq]'    executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    390 name: systemd-timesyn   executable: /usr/lib/systemd/systemd-timesyncd  ppid: 1
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    419 name: '[cfg80211]'  executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    436 name: '[brcmf_wq/mmc1:0]'   executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    437 name: '[brcmf_wdog/mmc1:0001:1]'    executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    443 name: '[kworker/u9:2-brcmf_wq/mmc1:0001:1]'     executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    454 name: avahi-daemon  executable: /mnt/old_root/usr/sbin/avahi-daemon     ppid: 1
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    455 name: cron  executable: /mnt/old_root/usr/sbin/cron     ppid: 1
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    456 name: dbus-daemon   executable: /mnt/old_root/usr/bin/dbus-daemon   ppid: 1
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    465 name: avahi-daemon  executable: /mnt/old_root/usr/sbin/avahi-daemon     ppid: 454
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    472 name: polkitd   executable: /mnt/old_root/usr/lib/polkit-1/polkitd  ppid: 1
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    495 name: '[cec-vc4-hdmi]'  executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    496 name: '[irq/82-vc4 hdmi cec]'   executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    501 name: systemd-logind    executable: /usr/lib/systemd/systemd-logind     ppid: 1
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    506 name: '[card0-crtc0]'   executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    507 name: thd   executable: /mnt/old_root/usr/sbin/thd  ppid: 1
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    508 name: '[card0-crtc1]'   executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    509 name: udisksd   executable: /mnt/old_root/usr/libexec/udisks2/udisksd   ppid: 1
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    510 name: '[card0-crtc2]'   executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    511 name: '[card0-crtc3]'   executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    535 name: ModemManager  executable: /usr/sbin/ModemManager  ppid: 1
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    537 name: NetworkManager    executable: /usr/sbin/NetworkManager    ppid: 1
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    538 name: wpa_supplicant    executable: /mnt/old_root/usr/sbin/wpa_supplicant   ppid: 1
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    661 name: '[kworker/1:4-mm_percpu_wq]'  executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    690 name: bluetoothd    executable: /usr/libexec/bluetooth/bluetoothd   ppid: 1
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:    736 name: '[kworker/u9:3-hci0]'     executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:   1081 name: '[kworker/0:1-events]'    executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:   1082 name: '[kworker/3:0H]'  executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:   1083 name: '[kworker/1:0-cgroup_destroy]'    executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:   1084 name: '[kworker/3:2-events]'    executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:   1085 name: '[kworker/0:0H-kblockd]'  executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:   1209 name: '[kworker/u8:1-events_unbound]'   executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:   1210 name: '[kworker/u8:3-ext4-rsv-conversion]'  executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:   1282 name: login     executable: /mnt/old_root/usr/bin/login     ppid: 1
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:   1335 name: '[kworker/1:1-mm_percpu_wq]'  executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:   1336 name: '[kworker/2:0-cgroup_destroy]'    executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:   1342 name: systemd   executable: /mnt/old_root/usr/lib/systemd/systemd   ppid: 1
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:   1343 name: '[kworker/2:2-events]'    executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:   1344 name: (sd-pam)  executable: /mnt/old_root/usr/lib/systemd/systemd   ppid: 1342
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:   1359 name: bash  executable: /mnt/old_root/usr/bin/bash  ppid: 1282
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:   1375 name: '[kworker/2:0H]'  executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:   1411 name: '[kworker/1:0H-kblockd]'  executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:   1416 name: sudo  executable: /mnt/old_root/usr/bin/sudo  ppid: 1359
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:   1417 name: sudo  executable: /mnt/old_root/usr/bin/sudo  ppid: 1416
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:   1418 name: bash  executable: /mnt/old_root/usr/bin/bash  ppid: 1417
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:   1420 name: takeover  executable: /mnt/old_root/home/lmb/takeover     ppid: 1418
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:   1421 name: tee   executable: /mnt/old_root/usr/bin/tee   ppid: 1420
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:   1426 name: '[kworker/3:1-events]'    executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:   1427 name: '[kworker/3:3-events]'    executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:   1429 name: '[kworker/3:2H]'  executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:   1459 name: '[kworker/0:0-events]'    executable: -   ppid: 2
2024-02-28 13:54:58 DEBUG [takeover::stage2] pid:   1463 name: takeover  executable: /bin/takeover   ppid: 1
2024-02-28 13:54:58 DEBUG [takeover::common::system] looking at pid: 1, fd 0, file: '/proc/1/fd/0' -> 'pipe:[16241]'
2024-02-28 13:54:58 DEBUG [takeover::common::system] looking at pid: 1, fd 1, file: '/proc/1/fd/1' -> '/stdout.log'
2024-02-28 13:54:58 DEBUG [takeover::common::system] looking at pid: 1, fd 2, file: '/proc/1/fd/2' -> '/stderr.log'
2024-02-28 13:54:58 DEBUG [takeover::common::system] looking at pid: 1, fd 3, file: '/proc/1/fd/3' -> '/mnt/log/stage2-init.log'
2024-02-28 13:54:58 DEBUG [takeover::common::system] looking at pid: 1, fd 4, file: '/proc/1/fd/4' -> 'pipe:[16241]'
2024-02-28 13:54:58 DEBUG [takeover::common::system] looking at pid: 246, fd 0, file: '/proc/246/fd/0' -> '/mnt/old_root/dev/null'
2024-02-28 13:54:58 DEBUG [takeover::common::system] sending signal 15 to 246
2024-02-28 13:54:58 DEBUG [takeover::common::system] looking at pid: 268, fd 0, file: '/proc/268/fd/0' -> '/mnt/old_root/dev/null'
2024-02-28 13:54:58 DEBUG [takeover::common::system] sending signal 15 to 268
2024-02-28 13:54:58 DEBUG [takeover::common::system] looking at pid: 390, fd 0, file: '/proc/390/fd/0' -> '/mnt/old_root/dev/null'
2024-02-28 13:54:58 DEBUG [takeover::common::system] sending signal 15 to 390
2024-02-28 13:54:58 DEBUG [takeover::common::system] looking at pid: 454, fd 0, file: '/proc/454/fd/0' -> '/mnt/old_root/dev/null'
2024-02-28 13:54:58 DEBUG [takeover::common::system] sending signal 15 to 454
2024-02-28 13:54:58 DEBUG [takeover::common::system] looking at pid: 455, fd 0, file: '/proc/455/fd/0' -> '/mnt/old_root/dev/null'
2024-02-28 13:54:58 DEBUG [takeover::common::system] sending signal 15 to 455
2024-02-28 13:54:58 DEBUG [takeover::common::system] looking at pid: 456, fd 0, file: '/proc/456/fd/0' -> '/mnt/old_root/dev/null'
2024-02-28 13:54:58 DEBUG [takeover::common::system] sending signal 15 to 456
2024-02-28 13:54:58 DEBUG [takeover::common::system] looking at pid: 465, fd 0, file: '/proc/465/fd/0' -> '/mnt/old_root/dev/null'
2024-02-28 13:54:58 DEBUG [takeover::common::system] sending signal 15 to 465
2024-02-28 13:54:58 ERROR [takeover::stage2] kill_procs failed, error An error occurred in an upstream function, context: Failed to read link '/proc/472/fd/0'
  caused by: No such file or directory (os error 2)
lmbarros commented 7 months ago

Maybe a race condition? The process in question had the file descriptor (FD) open when we enumerated the list of open FDs, but it's no longer open when we try to access or use it?

If so, I believe this would be harmless (in the sense that it doesn't damage the system in any way, and a future migration attempt will probably work fine). But, yeah, annoying.

acostach commented 7 months ago

I encountered this today as well on a balenaOS to balenaOS test migration


2024-03-21 12:59:03 INFO  [takeover::stage2] Stage 2 migrate_worker entered
2024-03-21 12:59:03 INFO  [takeover::stage2] Stage 2 config was read successfully
2024-03-21 12:59:03 INFO  [takeover::stage2] Set logfile to /mnt/log/stage2.log
2024-03-21 12:59:03 TRACE [takeover::stage2] kill_procs: entered
2024-03-21 12:59:03 DEBUG [takeover::stage2] active processes:
2024-03-21 12:59:03 ERROR [takeover::stage2] kill_procs failed, error An error occurred in an upstream function, context: Failed to read link to executable: '/proc/7067/exe'
  caused by: No such file or directory (os error 2)
2024-03-21 12:59:03 TRACE [takeover::stage2] reboot entered
lmbarros commented 6 months ago

I have been looking into this, and unfortunately there are several race conditions around the fuser() function, all of which can trigger errors at different points. Here are the ones I identified:

  1. When we call read_dir(&fd_dir) to get the list of files opened by a given process, the process exists. But when we call lstat(curr_path.as_path()) this process has already died. This will trigger a "proper" lstat() error, which may be what Ken originally saw. This is currently handled by our code that ignores lstat errors (continue after lstat error).
  2. Similar to above: when we call read_dir(&fd_dir), the process has a given file open, but when we call lstat(curr_path.as_path()), the process has closed that file. That would be another possible cause for a "proper" lstat error, like the one Ken originally saw.
  3. Similar to above, but the file was still open when we called lstat. But then, the file is closed (or the whole process is killed) just after this lstat call, which then causes the read_link call to fail. This seems to be what happened to me when I first opened this issue.

And now for something completely different. The function get_process_infos (indirectly called just before we call fuser also has a race:

  1. It calls ProcessIterator::new() (which boils down to a read_dir on /proc), then iterates over the results calling get_process_info_for for each entry. If some process is killed before its entry is processed, get_process_info_for will call read_link on a path that doesn't exist anymore and fail. This appears to be what Alexandru saw.
lmbarros commented 6 months ago

For future reference, here's a patch I applied on top of current master (1ac8dd08c12369a184c953e34ac5e0883f634652), to reproduce the case 1 from my list above.

```patch diff --git a/src/common/system.rs b/src/common/system.rs index c55d624..a433328 100644 --- a/src/common/system.rs +++ b/src/common/system.rs @@ -1,12 +1,12 @@ use libc::{ - self, ino_t, mode_t, utsname, EACCES, EEXIST, EIO, ENOENT, ENXIO, EPERM, O_RDONLY, S_IFBLK, - S_IFCHR, S_IFDIR, S_IFIFO, S_IFLNK, S_IFMT, S_IFREG, S_IFSOCK, + self, ino_t, mode_t, utsname, EACCES, EEXIST, EIO, ENOENT, ENXIO, EPERM, O_RDONLY, SIGKILL, + S_IFBLK, S_IFCHR, S_IFDIR, S_IFIFO, S_IFLNK, S_IFMT, S_IFREG, S_IFSOCK, }; use std::collections::HashMap; -use std::fs::{copy, read_to_string, ReadDir}; +use std::fs::{copy, read_to_string, File, ReadDir}; use std::fs::{read_dir, read_link}; -use std::io; +use std::io::{self, Read}; use std::mem::MaybeUninit; use std::path::{Path, PathBuf}; @@ -291,12 +291,27 @@ pub(crate) fn fuser>( static ref DIR_REGEX: Regex = Regex::new(r"^.*/(\d+)$").unwrap(); } + // REPRO + let mut pid_file = + File::open("/mnt/old_root/repro.pid").upstream_with_context("Failed to open pid file")?; + let mut str_buffer = String::new(); + pid_file + .read_to_string(&mut str_buffer) + .upstream_with_context("Failed to read pid file")?; + let tail_pid = str_buffer + .trim() + .parse::() + .upstream_with_context("Failed to parse pid for repro")?; + + debug!("---> tail_pid: {}", tail_pid); + let mut sent_signals: Vec = Vec::new(); for proc_info in ProcessIterator::new()? { match proc_info { Ok((curr_pid, directory)) => { let fd_dir = path_append(&directory, "fd"); + // trace!("looking at pid={}, fd_dir={}", curr_pid, fd_dir.display()); for dir_entry in match read_dir(&fd_dir) { Ok(read_dir) => read_dir, Err(why) => { @@ -310,6 +325,18 @@ pub(crate) fn fuser>( } } } { + // REPRO: Kill the tail process before lstating its open files. + if curr_pid == tail_pid { + debug!("Killing tail process: {}", tail_pid); + let rcrcrc = unsafe { libc::kill(tail_pid, SIGKILL) }; + if rcrcrc != 0 { + warn!( + "Failed to kill tail process: {}, error: {}", + tail_pid, rcrcrc + ); + } + } + match dir_entry { Ok(dir_entry) => { let curr_path = dir_entry.path(); @@ -365,8 +392,8 @@ pub(crate) fn fuser>( )); } } - Err(_) => { - debug!("continue after lstat error"); + Err(err) => { + debug!("continue after lstat error: {}", err); } } } diff --git a/src/stage1.rs b/src/stage1.rs index 6f173bd..20f31e1 100644 --- a/src/stage1.rs +++ b/src/stage1.rs @@ -2,7 +2,7 @@ mod backup; use std::env::set_current_dir; use std::fs::{ - copy, create_dir, create_dir_all, read_dir, read_link, remove_dir_all, symlink_metadata, + copy, create_dir, create_dir_all, read_dir, read_link, remove_dir_all, symlink_metadata, File, OpenOptions, }; use std::io::Write; @@ -549,6 +549,14 @@ pub fn stage1(opts: &Options) -> Result<()> { Logger::set_brief_info(true); Logger::set_color(true); + let tail_for_repro = std::process::Command::new("tail") + .arg("-f") + .arg("/etc/os-release") + .spawn()?; + + let mut pid_file = File::create("/repro.pid")?; + write!(pid_file, "{}", tail_for_repro.id())?; + /* if opts.config().is_none() { let mut clap = Options::clap(); ``` When running this I see `fstat` failing and being ignored in the stage 2 logs: ``` 2024-04-23 18:10:59 TRACE [takeover::common::system] fuser: entered with '/mnt/old_root', 15 2024-04-23 18:10:59 DEBUG [takeover::common::system] ---> tail_pid: 794 [...] 024-04-23 18:10:59 DEBUG [takeover::common::system] Killing tail process: 794 2024-04-23 18:10:59 DEBUG [takeover::common::system] continue after lstat error: A file could not be found, context: libc::lstat failed for path: '/proc/794/fd/0' caused by: No such file or directory (os error 2) 2024-04-23 18:10:59 DEBUG [takeover::common::system] Killing tail process: 794 2024-04-23 18:10:59 WARN [takeover::common::system] Failed to kill tail process: 794, error: -1 2024-04-23 18:10:59 DEBUG [takeover::common::system] continue after lstat error: A file could not be found, context: libc::lstat failed for path: '/proc/794/fd/1' caused by: No such file or directory (os error 2) 2024-04-23 18:10:59 DEBUG [takeover::common::system] Killing tail process: 794 2024-04-23 18:10:59 WARN [takeover::common::system] Failed to kill tail process: 794, error: -1 2024-04-23 18:10:59 DEBUG [takeover::common::system] continue after lstat error: A file could not be found, context: libc::lstat failed for path: '/proc/794/fd/2' caused by: No such file or directory (os error 2) 2024-04-23 18:10:59 DEBUG [takeover::common::system] Killing tail process: 794 2024-04-23 18:10:59 WARN [takeover::common::system] Failed to kill tail process: 794, error: -1 2024-04-23 18:10:59 DEBUG [takeover::common::system] continue after lstat error: A file could not be found, context: libc::lstat failed for path: '/proc/794/fd/3' caused by: No such file or directory (os error 2) ```
lmbarros commented 6 months ago

Now, what do we do about all that? Need to think a bit more, but currently what I have in mind is this:

lmbarros commented 6 months ago

...which makes me think of the opposite kind of races that go on undetected: what if processes are created or existing processes open files between the critical points?

As far as I understand, all these racy process iteration we do is to put the system on a state that is as idle as possible (no unnecessary processes running, no unnecessary files open) in order to minimize the chances of flashing errors. But generally speaking, I don't think that a surviving process would cause the migration to fail, so I want to believe we can live with that.

(Still, I said "generally speaking". I can imagine that a surviving process that is allocating huge amounts of memory, or maybe also trying to overwrite the storage device could cause problems. We never saw anything like that in testing, and as far as I know we never had reports of this kind. I'd expect these cases to be beyond rare.)

lmbarros commented 5 months ago

Case 4 (the one reported by Alexandru) is actually puzzling me. The error message he saw was

2024-03-21 12:59:03 ERROR [takeover::stage2] kill_procs failed, error An error occurred in an upstream function, context: Failed to read link to executable: '/proc/7067/exe'
  caused by: No such file or directory (os error 2)

AFAICT, the only place this can possible originate from is here, and because read_link() returned ENOENT (AKA "os error 2"). However, if look at the code, this case seems to be handled already (for the /proc/PID/exe case, not for /proc/PID/root).

The relevant code hasn't changed in a while, so it seems I am looking at the same code that Alexandru ran. I don't understand how he he could possibly see that error message.

[UPDATE] 23 hours later...

Now I understand. The original code had copy-and-pasted ("copied-and-pasted"?) the error handling for the "exe" case into the "root" case. So, the error reporting was wrong and confused me. It was the "root" case failing for Alexandru all the time.

Knots being disentangled!

lmbarros commented 5 months ago

And I found another race condition:

  1. get_process_infos() will use ProcessIterator::new() to get the list of running processes, and then (via parse_status()) call get_process_info_for() for each entry in. This will in turn read /proc/PID/status, which may be vanished by then.
rahul-thakoor commented 5 months ago

hey @lmbarros this was handled by #75 right?

lmbarros commented 5 months ago

Solved by #75