balena-os / meta-balena

A collection of Yocto layers used to build balenaOS images
https://www.balena.io/os
968 stars 115 forks source link

Huawei USB 4G/LTE Modems is no longer switched to Network mode after update to 2.80.3+rev1 #2526

Open bernhardkaindl opened 2 years ago

bernhardkaindl commented 2 years ago

@pdcastro Do you know who could look at the compiled tcl script which segfaults and maybe update the compiler for it?

usb_modeswitch works fine on the default BalenaOS 2.58.3 for the Revolution Pi Connect (revpi-connect).

But with the currently new version 2.80.3+rev1, installed by update thru the balena-cloud dashboard, on one device (not on another), usb_modeswitch_displatcher is aborted due to a malloc bug. The location is inside a systemd service which is triggered by the udev rule after inserting an USB device which should be switched to another USB mode.

The abort is reproduced by starting the systemd service (even without HW) and the usb_modeswitch_displatcher works fine when it is called from a interactive ssh shell instead.

This is the call from the udev rule: systemctl --no-block start usb_modeswitch@1-1.2:1.0

This newer BalenaOS uses glibc-2.31 instead of glibc-2.29, not sure if that is related. And the program is orignally a TCL script which is compiled to an ELF executable (likely to not need a full tcl enviroment). On Ubuntu, the TCL script is installed instead, so the issue might be with the TCL-compilation.

The error message is printed by the malloc function of glibc: malloc(): invalid next size (unsorted)

usb_modeswitch_dispatcher[29588]: malloc(): invalid next size (unsorted)
systemd[1]: usb_modeswitch@1-1.2:1.0.service: Main process exited, code=killed, status=6/ABRT
systemd[1]: usb_modeswitch@1-1.2:1.0.service: Failed with result 'signal'.
systemd[1]: Failed to start USB_ModeSwitch_1-1.2:1.0.

I enabled EnableLogging=1 in /etc/usb_modeswitch.conf:

cat /var/log/usb_modeswitch_1-1.2

USB_ModeSwitch log from Wed Mar  02 20:08:18 UTC 2022

Use global config file: /etc/usb_modeswitch.conf
Raw parameters: {--switch-mode} {1-1.2:1.0} 
Use top device dir /sys/bus/usb/devices/1-1.2
Check class of first interface ...
 Interface 0 class is 08.

----------------
USB values from sysfs:
  manufacturer  HUAWEI_MOBILE
  product       HUAWEI_MOBILE
  serial        0123456789ABCDEF
----------------
bNumConfigurations is 1 - don't check for active configuration

ConfigList: /usr/share/usb_modeswitch/12d1:1f01
SCSI attributes not needed, move on
Check config: /usr/share/usb_modeswitch/12d1:1f01
! matched. Read config data

An strace log of this aborted run, captured from inside the systemd service, is attached here (the issue does not trigger ouside of this systemd service yet): strace.txt

ssh rev cat /etc/os-release 
ID="balena-os"
NAME="balenaOS"
VERSION="2.80.3+rev1"
VERSION_ID="2.80.3+rev1"
PRETTY_NAME="balenaOS 2.80.3+rev1"
MACHINE="revpi-connect"
VARIANT="Production"
VARIANT_ID=prod
META_BALENA_VERSION="2.80.3"
BALENA_BOARD_REV="1a078c7"
META_BALENA_REV="914c2494"
SLUG="revpi-connect"

Again, this does not happen on another revpi-connect, but it is reproducible.

bernhardkaindl commented 2 years ago

These are the last words from strace -f, the subprocesses are calls to /bin/echo, the program is an automatic conversion of the original TCL(sigh) script into an ELF exectuable to not need a full TCL interpreter:

[pid 18158] write(1, "! matched. Read config data\n", 28) = 28
[pid 18158] close(1)                    = 0            
[pid 18158] close(2)                    = 0            
[pid 18158] exit_group(0)               = ?            
[pid 18158] +++ exited with 0 +++
<... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 18158
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=18158, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
_llseek(6, 0, [0], SEEK_SET)            = 0            
fcntl64(6, F_GETFL)                     = 0x2 (flags O_RDWR) 
fstat64(6, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
writev(2, [{"malloc(): invalid next size (unsorted)", 38}, {"\n", 1}], 2
writev(2, [{"malloc(): invalid next siz
writev(2, [{"malloc(): invalid next siz
 = 39
mmap2(NULL, 4096, PROT_READ|PROT_) = 0x76f08000
rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0
getpid()                                = 18139        
gettid()                                = 18139        
tgkill(18139, 18139, SIGABRT)           = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
--- SIGABRT {si_signo=SIGABRT, si_code=SI_TKILL, si_pid=18139, si_uid=0} ---

The correspoding lines in the real tcl script are:

# Now check for a matching config file. Matching is done
# by MatchDevice

set report ""
foreach mconfig $configList {

  # skipping installer leftovers like "*.rpmnew"
  if [regexp {\.(dpkg|rpm)} $mconfig] {continue}

  Log "Check config: $mconfig"
  if [MatchDevice $mconfig] {
    Log "! matched. Read config data"
    set flags(config) [ConfigGet conffile $mconfig]
    break
  } else {
    Log "* no match, don't use this config"
  }
}
if {$flags(config) == ""} {
  Log "No matching config file found. Exit"
  SafeExit
}

ParseDeviceConfig $flags(config)

The content of the config file which matches (logged before, not shown here) on the device is:

cat /usr/share/usb_modeswitch/12d1:1f01
# Huawei E353 (3.se) and others
TargetVendor=0x12d1
TargetProductList="14db,14dc"
HuaweiNewMode=1

It is correct and usb_modeswitch_dispatcher works even on the same device when called from the shell and not from /lib/systemd/system/usb_modeswitch@.service (/tmp/strace -s99 -f was only added the get the logs for this comment):

[Unit]
Description=USB_ModeSwitch_%i

[Service]
Type=oneshot
ExecStart=/tmp/strace -s99 -f /usr/sbin/usb_modeswitch_dispatcher --switch-mode %i
#ExecStart=/bin/echo %i
splitice commented 1 year ago

Version 2.6.0, (from 2019/11/28) does away with the bundled libjim, perhaps this will help

splitice commented 1 year ago

Testing 2.6.1 over our fleet with jimtcl 0.81, so far so good.