canonical / mir-kiosk

Definition of the mir-kiosk snap
19 stars 7 forks source link

Updating device causes mir-kiosk to crash #60

Closed daleknauss closed 3 years ago

daleknauss commented 3 years ago

Hello,

As part of our system, we apply firmware updates to a number of attached devices. Since around the time of the 2.4.0-snap124 snap release, we have started to see mir-kiosk crash when applying an update. Here are the relevant logs:

Jun 25 00:17:26 FMTRMRD29000003 mir-kiosk.daemon[5160]: ERROR: /build/mir-iJR6Nn/mir-2.4.0/src/platform/udev/udev_wrapper.cpp(60): Throw in function {anonymous}::DeviceImpl::DeviceImpl(udev_device*)
Jun 25 00:17:26 FMTRMRD29000003 mir-kiosk.daemon[5160]: Dynamic exception type: boost::wrapexcept<std::runtime_error>
Jun 25 00:17:26 FMTRMRD29000003 mir-kiosk.daemon[5160]: std::exception::what: Udev device does not exist

Any ideas?

RAOF commented 3 years ago

Hm. That looks like Mir querying details of a udev device that has gone away. But I can't think of any relevant code that has changed. Perhaps unrelated changes have changed the timing and exposed a race?

daleknauss commented 3 years ago

@RAOF as part of the update, the device would go away and reappear so I think your assessment is correct. Please let me know if there are any other details I can provide that might help with this issue.

AlanGriffiths commented 3 years ago

I concur with @RAOF: there are no recent changes in this code, so I suspect the change is unrelated to the 2.4 release.

The root cause of the crash would appear to be that the exception shown is not being handled. Unfortunately, the above log snippet doesn't indicate the context in which the "Udev device does not exist". Is there anything that might be interesting in the log immediately before or after this?

daleknauss commented 3 years ago

Here are the rest of the logs I could pull out starting from when the update happens. Those unbind and bind failures are fairly common even before this issue.

A few other thoughts: I've reverted to a snap version that was heavily tested and working fine before the latest mir-kisok snap was released. However, I haven't yet done the work to check if one of our Canonical maintained snaps has been updated in the same window. It's possible that a change to another snap might be the root cause. I will be exploring that further.

Jun 23 00:48:52 FMTRMRD2A000045 kernel: uvcvideo: Found UVC 1.10 device (3301:1000)
Jun 23 00:48:52 FMTRMRD2A000045 kernel: input: Device: Device as /devices/pci0000:00/0000:00:14.0/usb1/1-6/1-6:1.0/input/input20
Jun 23 00:48:52 FMTRMRD2A000045 systemd-udevd[6688]: Process '/usr/lib/snapd/snap-device-helper unbind snap-1 /devices/pci0000:00/0000:00:14.0/usb1/1-6/1-6:1.1 0:0' failed with exit code 1.
Jun 23 00:48:52 FMTRMRD2A000045 systemd-udevd[6688]: Process '/usr/lib/snapd/snap-device-helper unbind snap-2 /devices/pci0000:00/0000:00:14.0/usb1/1-6/1-6:1.1 0:0' failed with exit code 1.
Jun 23 00:48:52 FMTRMRD2A000045 systemd-udevd[6688]: Process '/usr/lib/snapd/snap-device-helper unbind snap-3 /devices/pci0000:00/0000:00:14.0/usb1/1-6/1-6:1.1 0:0' failed with exit code 1.
Jun 23 00:48:52 FMTRMRD2A000045 systemd-udevd[6688]: Process '/usr/lib/snapd/snap-device-helper bind snap-1 /devices/pci0000:00/0000:00:14.0/usb1/1-6/1-6:1.1 0:0' failed with exit code 1.
Jun 23 00:48:52 FMTRMRD2A000045 systemd-udevd[6688]: Process '/usr/lib/snapd/snap-device-helper bind snap-2 /devices/pci0000:00/0000:00:14.0/usb1/1-6/1-6:1.1 0:0' failed with exit code 1.
Jun 23 00:48:52 FMTRMRD2A000045 systemd-udevd[6688]: Process '/usr/lib/snapd/snap-device-helper bind snap-3 /devices/pci0000:00/0000:00:14.0/usb1/1-6/1-6:1.1 0:0' failed with exit code 1.
Jun 23 00:48:52 FMTRMRD2A000045 systemd-udevd[6686]: Process '/usr/lib/snapd/snap-device-helper unbind snap-1 /devices/pci0000:00/0000:00:14.0/usb1/1-6/1-6:1.0 0:0' failed with exit code 1.
Jun 23 00:48:52 FMTRMRD2A000045 systemd-udevd[6686]: Process '/usr/lib/snapd/snap-device-helper unbind snap-2 /devices/pci0000:00/0000:00:14.0/usb1/1-6/1-6:1.0 0:0' failed with exit code 1.
Jun 23 00:48:52 FMTRMRD2A000045 systemd-udevd[6686]: Process '/usr/lib/snapd/snap-device-helper unbind snap-3 /devices/pci0000:00/0000:00:14.0/usb1/1-6/1-6:1.0 0:0' failed with exit code 1.
Jun 23 00:48:52 FMTRMRD2A000045 kernel: usb 1-6: USB disconnect, device number 3
Jun 23 00:48:52 FMTRMRD2A000045 systemd-udevd[6686]: Process '/usr/lib/snapd/snap-device-helper bind snap-1 /devices/pci0000:00/0000:00:14.0/usb1/1-6/1-6:1.0 0:0' failed with exit code 1.
Jun 23 00:48:52 FMTRMRD2A000045 systemd-udevd[6686]: Process '/usr/lib/snapd/snap-device-helper bind snap-2 /devices/pci0000:00/0000:00:14.0/usb1/1-6/1-6:1.0 0:0' failed with exit code 1.
Jun 23 00:48:52 FMTRMRD2A000045 systemd-udevd[6686]: Process '/usr/lib/snapd/snap-device-helper bind snap-3 /devices/pci0000:00/0000:00:14.0/usb1/1-6/1-6:1.0 0:0' failed with exit code 1.
Jun 23 00:48:52 FMTRMRD2A000045 systemd-logind[1902]: Failed to open /dev/input/event8: No such device
Jun 23 00:48:52 FMTRMRD2A000045 systemd-udevd[6732]: Process '/usr/lib/snapd/snap-device-helper unbind snap-1 /devices/pci0000:00/0000:00:14.0/usb1/1-6/1-6:1.1 0:0' failed with exit code 1.
Jun 23 00:48:52 FMTRMRD2A000045 systemd-udevd[6732]: Process '/usr/lib/snapd/snap-device-helper unbind snap-2 /devices/pci0000:00/0000:00:14.0/usb1/1-6/1-6:1.1 0:0' failed with exit code 1.
Jun 23 00:48:52 FMTRMRD2A000045 systemd-udevd[6730]: Process '/usr/lib/snapd/snap-device-helper unbind snap-1 /devices/pci0000:00/0000:00:14.0/usb1/1-6/1-6:1.0 0:0' failed with exit code 1.
Jun 23 00:48:52 FMTRMRD2A000045 systemd-udevd[6732]: Process '/usr/lib/snapd/snap-device-helper unbind snap-3 /devices/pci0000:00/0000:00:14.0/usb1/1-6/1-6:1.1 0:0' failed with exit code 1.
Jun 23 00:48:52 FMTRMRD2A000045 systemd-udevd[6730]: Process '/usr/lib/snapd/snap-device-helper unbind snap-2 /devices/pci0000:00/0000:00:14.0/usb1/1-6/1-6:1.0 0:0' failed with exit code 1.
Jun 23 00:48:52 FMTRMRD2A000045 systemd-udevd[6730]: Process '/usr/lib/snapd/snap-device-helper unbind snap-3 /devices/pci0000:00/0000:00:14.0/usb1/1-6/1-6:1.0 0:0' failed with exit code 1.
Jun 23 00:48:52 FMTRMRD2A000045 systemd-udevd[6730]: Process '/usr/lib/snapd/snap-device-helper unbind snap-1 /devices/pci0000:00/0000:00:14.0/usb1/1-6 189:2' failed with exit code 1.
Jun 23 00:48:52 FMTRMRD2A000045 systemd-udevd[6730]: Process '/usr/lib/snapd/snap-device-helper unbind snap-2 /devices/pci0000:00/0000:00:14.0/usb1/1-6 189:2' failed with exit code 1.
Jun 23 00:48:52 FMTRMRD2A000045 systemd-udevd[6730]: Process '/usr/lib/snapd/snap-device-helper unbind snap-3 /devices/pci0000:00/0000:00:14.0/usb1/1-6 189:2' failed with exit code 1.
Jun 23 00:48:53 FMTRMRD2A000045 mir-kiosk.daemon[2132]: [2021-06-23 00:48:53.170991] < - debug - > mirserver: Handling Terminated from pid=2132
Jun 23 00:48:53 FMTRMRD2A000045 mir-kiosk.daemon[2132]: ERROR: /build/mir-iJR6Nn/mir-2.4.0/src/platform/udev/udev_wrapper.cpp(60): Throw in function {anonymous}::DeviceImpl::DeviceImpl(udev_device*)
Jun 23 00:48:53 FMTRMRD2A000045 mir-kiosk.daemon[2132]: Dynamic exception type: boost::wrapexcept<std::runtime_error>
Jun 23 00:48:53 FMTRMRD2A000045 mir-kiosk.daemon[2132]: std::exception::what: Udev device does not exist
Jun 23 00:48:53 FMTRMRD2A000045 systemd[1]: snap.mir-kiosk.daemon.service: Main process exited, code=exited, status=1/FAILURE
Jun 23 00:48:53 FMTRMRD2A000045 systemd[1]: snap.mir-kiosk.daemon.service: Failed with result 'exit-code'.
Jun 23 00:48:54 FMTRMRD2A000045 kernel: [drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.
AlanGriffiths commented 3 years ago

@daleknauss @RAOF came up with a likely fix that I've incorporated into a new "2.4.1~rc2-snap124" build on the beta channel.

If you have got an environment where you can reproduce the problem as it happens for you, please give this version a try.

It's not yet available for all architectures, but it should be there soon. I'm reaching EOD so letting you know now.

daleknauss commented 3 years ago

@AlanGriffiths I'm still seeing that crash on 2.4.1~rc2-snap124 (7239):

2021-06-29T18:50:05Z mir-kiosk.daemon[7811]: [2021-06-29 18:50:05.313501] < -warning- > evdev-input: Attempted to remove unmanaged fd 43
2021-06-29T18:50:09Z mir-kiosk.daemon[7811]: [2021-06-29 18:50:09.377665] < - debug - > mirserver: Handling Terminated from pid=7811
2021-06-29T18:50:09Z mir-kiosk.daemon[7811]: ERROR: /build/mir-ES5QiO/mir-2.4.1~rc2-g1b8f979fcb/src/platform/udev/udev_wrapper.cpp(60): Throw in function {anonymous}::DeviceImpl::DeviceImpl(udev_device*)
2021-06-29T18:50:09Z mir-kiosk.daemon[7811]: Dynamic exception type: boost::wrapexcept<std::runtime_error>
2021-06-29T18:50:09Z mir-kiosk.daemon[7811]: std::exception::what: Udev device does not exist
2021-06-29T18:50:09Z systemd[1]: snap.mir-kiosk.daemon.service: Main process exited, code=exited, status=1/FAILURE
AlanGriffiths commented 3 years ago

@AlanGriffiths I'm still seeing that crash on 2.4.1~rc2-snap124

Thanks. I'll take another look armed with that additional information. (I'm pretty sure we fixed a problem, a shame it wasn't the problem you're seeing.)

AlanGriffiths commented 3 years ago

Oops: Overzealous automated close!

AlanGriffiths commented 3 years ago

@daleknauss there's now an ~rc3 release with an additional fix.

I'm still failing to reproduce the actual problem you're seeing. Any hints you might have on how to reproduce would be welcome.

daleknauss commented 3 years ago

@AlanGriffiths 2.4.1~rc3-snap124 appears to have resolved the issue. Thank you for resolving this so quickly.

Roughly how long should we expect these fixes to make it onto stable?

AlanGriffiths commented 3 years ago

@AlanGriffiths 2.4.1~rc3-snap124 appears to have resolved the issue.

Thanks @daleknauss, I need to complete some tests and then I'll tag for release. The time to stable then depends on schedules I don't control, but should be of the order of two weeks.

Meanwhile, I'd like to confirm our understanding of what went wrong. You should be getting some "warning" messages in the log (instead of a crash) with the text "Failed to handle UDev". Can you confirm that and share the log for a few lines before and after?

AlanGriffiths commented 3 years ago

@daleknauss one further thought: were you aware of the "Call for testing"?

AlanGriffiths commented 3 years ago

@daleknauss FYI I have reproduced. As a result, I found the log warning message leaves something to be desired (https://github.com/MirServer/mir/pull/2095), but fixing that can wait for the next full release.

daleknauss commented 3 years ago

Thanks Alan. Definitely aware of calls for testing but don't always catch them when they're happening. Thank you. I'll try to see if I can get any more details on our end.