dankamongmen / growlight

notcurses block device manager / system installation tool
https://nick-black.com/dankwiki/index.php/Growlight
GNU General Public License v3.0
85 stars 12 forks source link

Assertion in close_safe() down in libatasmart -> libudev #12

Closed dankamongmen closed 4 years ago

dankamongmen commented 5 years ago

If I sit there and rapidly start and kill sudo ./growlight-curses, about one out of ten times I'll get this:

╭──────[nvme-0 [0] (32Gbps to Southbridge)]────────────────────────────────[-]─╮
│    nvme0n1┌⇗⇨⇨⇨empty space──────────────────────────────────────────────────┐│
│ solidstate│eeeeeeeeeeeeeeeeeeee 18446.7 empty space eeeeeeeeeeeeeeeeeeee123m││
│38╭─────────────────╮X0C-00SJ  n/a   1.00T  512B gpt   1908E1805012     NVMe├┘│
╰──│                 │ice 0000:04.00.0 (x4, gen 3.0)]──────────────────────────╯
   │ Initializing... │
   │Assertion 'close_noAborted) != -EBADF' failed at ../src/basic/fd-util.c:67, [schwarzgerat](134) $  Aborting.

that src/basic/fd-util.c business is from system-fucking-d, which how the hell is it getting involved?

dankamongmen commented 5 years ago

Can also happen using readline:

[schwarzgerat](0) $ sudo ./growlight-readline 
No ZFS support in this build.
Short read 0/512 from sdm
Assertion 'close_nointr(fd) != -EBADF' failed at ../src/basic/fd-util.c:67, function safe_close(). Aborting.
Aborted
[schwarzgerat](134) $ 
dankamongmen commented 5 years ago

Running with -v shows this to be a bit late into the startup procedure:

[schwarzgerat](1) $ sudo ./growlight-readline -v
growlight 1.0.6.1
libblkid 2.34.0, libpci 0x30602, libdm 1.02.155 (2018-12-18), glibc 2.28 stable
No ZFS support in this build.
Watching /sys/class/block/ on fd 1
scanning /sys/class/block/ on 8...
sdf -> ../../devices/pci0000:00/0000:00:01.0/0000:05:00.0/host10/port-10:1/end_device-10:1/target10:0:1/10:0:1:0/block/sdf
sdk9 -> ../../devices/pci0000:00/0000:00:01.0/0000:05:00.0/host10/port-10:6/end_device-10:6/target10:0:6/10:0:6:0/block/sdk/sdk9
sdh9 -> ../../devices/pci0000:00/0000:00:01.0/0000:05:00.0/host10/port-10:3/end_device-10:3/target10:0:3/10:0:3:0/block/sdh/sdh9
sdj9 -> ../../devices/pci0000:00/0000:00:01.0/0000:05:00.0/host10/port-10:5/end_device-10:5/target10:0:5/10:0:5:0/block/sdj/sdj9
sdo -> ../../devices/pci0000:00/0000:00:14.0/usb1/1-8/1-8:1.0/host11/target11:0:0/11:0:0:3/block/sdo
sdm -> ../../devices/pci0000:00/0000:00:14.0/usb1/1-8/1-8:1.0/host11/target11:0:0/11:0:0:1/block/sdm
sdb -> ../../devices/pci0000:00/0000:00:1f.2/ata5/host4/target4:0:0/4:0:0:0/block/sdb
Numa node 0 (/sys/devices/pci0000:00/0000:00:14.0//numa_node)
/sys/class/block/ blocks on 48 devices for up to 1s
    Controller: Intel Corporation C610/X99 series chipset USB xHCI Host Controller
sdd -> ../../devices/pci0000:00/0000:00:1f.2/ata8/host7/target7:0:0/7:0:0:0/block/sdd
    Model: SD/MMC revision 1.00 S/N n/a
sdg9 -> ../../devices/pci0000:00/0000:00:01.0/0000:05:00.0/host10/port-10:2/end_device-10:2/target10:0:2/10:0:2:0/block/sdg/sdg9
Numa node 0 (/sys/devices/pci0000:00/0000:00:01.0/0000:05:00.0//numa_node)
Block R/O flag: 0 (sdm)
    sdm read-write-verify: Not present
Couldn't probe sdm SMART
Short read 0/512 from sdm
Couldn't read MBR for sdm
Couldn't get blkid probe for /dev/sdm (No medium found)
    Device is unloaded/inaccessible
    Logical sector size: 512B Physical sector size: 512B
    Controller: Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon]
Not recursing on partition sdk9
md127 -> ../../devices/virtual/block/md127
nvme1n1p2 -> ../../devices/pci0000:00/0000:00:03.0/0000:02:00.0/nvme/nvme1/nvme1n1/nvme1n1p2
    Controller: Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon]
Not recursing on partition sdh9
sdf9 -> ../../devices/pci0000:00/0000:00:01.0/0000:05:00.0/host10/port-10:1/end_device-10:1/target10:0:1/10:0:1:0/block/sdf/sdf9
    Controller: Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon]
Not recursing on partition sdj9
sdi9 -> ../../devices/pci0000:00/0000:00:01.0/0000:05:00.0/host10/port-10:4/end_device-10:4/target10:0:4/10:0:4:0/block/sdi/sdi9
sde9 -> ../../devices/pci0000:00/0000:00:1f.2/ata10/host9/target9:0:0/9:0:0:0/block/sde/sde9
    Controller: Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon]
Not recursing on partition sdi9
sdd9 -> ../../devices/pci0000:00/0000:00:1f.2/ata8/host7/target7:0:0/7:0:0:0/block/sdd/sdd9
sdc9 -> ../../devices/pci0000:00/0000:00:1f.2/ata6/host5/target5:0:0/5:0:0:0/block/sdc/sdc9
sdb9 -> ../../devices/pci0000:00/0000:00:1f.2/ata5/host4/target4:0:0/4:0:0:0/block/sdb/sdb9
Numa node 0 (/sys/devices/pci0000:00/0000:00:1f.2//numa_node)
    Controller: Intel Corporation C610/X99 series chipset 6-Port SATA Controller [AHCI mode]
Not recursing on partition sdc9
sdk -> ../../devices/pci0000:00/0000:00:01.0/0000:05:00.0/host10/port-10:6/end_device-10:6/target10:0:6/10:0:6:0/block/sdk
sda9 -> ../../devices/pci0000:00/0000:00:11.4/ata2/host1/target1:0:0/1:0:0:0/block/sda/sda9
    Controller: Intel Corporation C610/X99 series chipset 6-Port SATA Controller [AHCI mode]
pktcdvd0 -> ../../devices/virtual/block/pktcdvd0
    Controller: Virtual devices
    Model: ST12000NM0007-2A revision SN02 S/N n/a
nvme0n1p3 -> ../../devices/pci0000:00/0000:00:02.3/0000:04:00.0/nvme/nvme0/nvme0n1/nvme0n1p3
    Logical sector size: 2048B Physical sector size: 2048B
    Partition 9 at sdb9
sdi -> ../../devices/pci0000:00/0000:00:01.0/0000:05:00.0/host10/port-10:4/end_device-10:4/target10:0:4/10:0:4:0/block/sdi
sdk1 -> ../../devices/pci0000:00/0000:00:01.0/0000:05:00.0/host10/port-10:6/end_device-10:6/target10:0:6/10:0:6:0/block/sdk/sdk1
nvme0n1p1 -> ../../devices/pci0000:00/0000:00:02.3/0000:04:00.0/nvme/nvme0/nvme0n1/nvme0n1p1
    Partition 1 at sdb1
sdj1 -> ../../devices/pci0000:00/0000:00:01.0/0000:05:00.0/host10/port-10:5/end_device-10:5/target10:0:5/10:0:5:0/block/sdj/sdj1
sr0 -> ../../devices/pci0000:00/0000:00:01.0/0000:05:00.0/host10/port-10:0/end_device-10:0/target10:0:0/10:0:0:0/block/sr0
Block R/O flag: 0 (sdb)
Numa node 0 (/sys/devices/pci0000:00/0000:00:02.3/0000:04:00.0//numa_node)
    sdb write-cache: Enabled
    sdb read-write-verify: Disabled
    Controller: Sandisk Corp WD Black 2018/PC SN720 NVMe SSD
Not recursing on partition nvme0n1p1
sdg -> ../../devices/pci0000:00/0000:00:01.0/0000:05:00.0/host10/port-10:2/end_device-10:2/target10:0:2/10:0:2:0/block/sdg
sdh1 -> ../../devices/pci0000:00/0000:00:01.0/0000:05:00.0/host10/port-10:3/end_device-10:3/target10:0:3/10:0:3:0/block/sdh/sdh1
    Controller: Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon]
sde -> ../../devices/pci0000:00/0000:00:1f.2/ata10/host9/target9:0:0/9:0:0:0/block/sde
sdg1 -> ../../devices/pci0000:00/0000:00:01.0/0000:05:00.0/host10/port-10:2/end_device-10:2/target10:0:2/10:0:2:0/block/sdg/sdg1
    Model: ST12000NM0007-2A revision SN02 S/N n/a
    Controller: Intel Corporation C610/X99 series chipset 6-Port SATA Controller [AHCI mode]
    Partition 9 at sdg9
nvme1n1p3 -> ../../devices/pci0000:00/0000:00:03.0/0000:02:00.0/nvme/nvme1/nvme1n1/nvme1n1p3
sdd1 -> ../../devices/pci0000:00/0000:00:1f.2/ata8/host7/target7:0:0/7:0:0:0/block/sdd/sdd1
    Model: ST12000NM0007-2A revision SN02 S/N n/a
    Partition 1 at sdg1
    Controller: Intel Corporation C610/X99 series chipset 6-Port SATA Controller [AHCI mode]
Not recursing on partition sdd1
    Partition 9 at sde9
Block R/O flag: 0 (sdg)
sdn -> ../../devices/pci0000:00/0000:00:14.0/usb1/1-8/1-8:1.0/host11/target11:0:0/11:0:0:2/block/sdn
    Partition 1 at sde1
Block R/O flag: 0 (sde)
sdb1 -> ../../devices/pci0000:00/0000:00:1f.2/ata5/host4/target4:0:0/4:0:0:0/block/sdb/sdb1
sdc -> ../../devices/pci0000:00/0000:00:1f.2/ata6/host5/target5:0:0/5:0:0:0/block/sdc
sda1 -> ../../devices/pci0000:00/0000:00:11.4/ata2/host1/target1:0:0/1:0:0:0/block/sda/sda1
    Controller: Intel Corporation C610/X99 series chipset 6-Port SATA Controller [AHCI mode]
Not recursing on partition sdb1
nvme1n1 -> ../../devices/pci0000:00/0000:00:03.0/0000:02:00.0/nvme/nvme1/nvme1n1
sdl -> ../../devices/pci0000:00/0000:00:14.0/usb1/1-8/1-8:1.0/host11/target11:0:0/11:0:0:0/block/sdl
sda -> ../../devices/pci0000:00/0000:00:11.4/ata2/host1/target1:0:0/1:0:0:0/block/sda
sdj -> ../../devices/pci0000:00/0000:00:01.0/0000:05:00.0/host10/port-10:5/end_device-10:5/target10:0:5/10:0:5:0/block/sdj
sdh -> ../../devices/pci0000:00/0000:00:01.0/0000:05:00.0/host10/port-10:3/end_device-10:3/target10:0:3/10:0:3:0/block/sdh
nvme0n1p2 -> ../../devices/pci0000:00/0000:00:02.3/0000:04:00.0/nvme/nvme0/nvme0n1/nvme0n1p2
    Controller: Intel Corporation C610/X99 series chipset 6-Port SATA Controller [AHCI mode]
    Model: ST12000NM0007-2A revision SN02 S/N n/a
    Controller: Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon]
    Controller: Virtual devices
Not recursing on partition sdg9
    Partition 9 at sdd9
Numa node 0 (/sys/devices/pci0000:00/0000:00:03.0/0000:02:00.0//numa_node)
    Partition 1 at sdd1
Block R/O flag: 0 (sdd)
    Controller: Sandisk Corp WD Black 2018/PC SN720 NVMe SSD
Not recursing on partition nvme1n1p2
    Controller: Intel Corporation C610/X99 series chipset USB xHCI Host Controller
    Model: xD-Picture revision 1.02 S/N n/a
nvme0n1 -> ../../devices/pci0000:00/0000:00:02.3/0000:04:00.0/nvme/nvme0/nvme0n1
    Controller: Intel Corporation C610/X99 series chipset 6-Port SATA Controller [AHCI mode]
Not recursing on partition sde9
    Controller: Intel Corporation C610/X99 series chipset 6-Port SATA Controller [AHCI mode]
Not recursing on partition sdd9
    Controller: Intel Corporation C610/X99 series chipset 6-Port SATA Controller [AHCI mode]
Not recursing on partition sdb9
    Controller: Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon]
Block R/O flag: 0 (sdo)
    Model: ST12000NM0007-2A revision SN03 S/N n/a
Numa node 0 (/sys/devices/pci0000:00/0000:00:11.4//numa_node)
    Partition 9 at sdk9
    Partition 1 at sdk1
    Controller: Intel Corporation C610/X99 series chipset sSATA Controller [AHCI mode]
Not recursing on partition sda9
Block R/O flag: 0 (sdk)
    Controller: Sandisk Corp WD Black 2018/PC SN720 NVMe SSD
Not recursing on partition nvme0n1p3
    sde write-cache: Enabled
    sdg write-cache: Enabled
    sde read-write-verify: Disabled
    sdo read-write-verify: Not present
    sdg read-write-verify: Disabled
    Controller: Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon]
    Controller: Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon]
    Model: ST12000NM0007-2A revision SN02 S/N n/a
sdi1 -> ../../devices/pci0000:00/0000:00:01.0/0000:05:00.0/host10/port-10:4/end_device-10:4/target10:0:4/10:0:4:0/block/sdi/sdi1
Not recursing on partition sdk1
    Partition 9 at sdi9
    Controller: Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon]
    Partition 1 at sdi1
Not recursing on partition sdj1
Block R/O flag: 0 (sdi)
    Controller: Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon]
Couldn't probe sdo SMART
Short read 0/512 from sdo
Couldn't read MBR for sdo
sdf1 -> ../../devices/pci0000:00/0000:00:01.0/0000:05:00.0/host10/port-10:1/end_device-10:1/target10:0:1/10:0:1:0/block/sdf/sdf1
    Model: iHBS112   2 revision CL0F S/N n/a
    Controller: Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon]
sde1 -> ../../devices/pci0000:00/0000:00:1f.2/ata10/host9/target9:0:0/9:0:0:0/block/sde/sde1
Not recursing on partition sdh1
    Controller: Sandisk Corp WD Black 2018/PC SN720 NVMe SSD
Not recursing on partition nvme1n1p3
sdc1 -> ../../devices/pci0000:00/0000:00:1f.2/ata6/host5/target5:0:0/5:0:0:0/block/sdc/sdc1
    Controller: Intel Corporation C610/X99 series chipset USB xHCI Host Controller
    sdd write-cache: Enabled
nvme1n1p1 -> ../../devices/pci0000:00/0000:00:03.0/0000:02:00.0/nvme/nvme1/nvme1n1/nvme1n1p1
    sdd read-write-verify: Disabled
    Controller: Intel Corporation C610/X99 series chipset 6-Port SATA Controller [AHCI mode]
    Model: MS/MS-PRO revision 1.03 S/N n/a
    Model: ST12000NM0007-2A revision SN02 S/N n/a
    Controller: Intel Corporation C610/X99 series chipset sSATA Controller [AHCI mode]
Couldn't get blkid probe for /dev/sdo (No medium found)
    Device is unloaded/inaccessible
    Logical sector size: 512B Physical sector size: 512B
Not recursing on partition sda1
    Controller: Sandisk Corp WD Black 2018/PC SN720 NVMe SSD
    Partition 9 at sdc9
Couldn't get a revision for nvme1n1 (No such file or directory)
    Model: WDS100T3X0C-00SJG0 revision n/a S/N n/a
    Controller: Intel Corporation C610/X99 series chipset USB xHCI Host Controller
    Partition 2 at nvme1n1p2
    Partition 1 at sdc1
    Partition 3 at nvme1n1p3
    Model: Compact Flash revision 1.01 S/N n/a
    Controller: Intel Corporation C610/X99 series chipset sSATA Controller [AHCI mode]
Block R/O flag: 0 (sdc)
    Partition 1 at nvme1n1p1
Block R/O flag: 0 (nvme1n1)
    Controller: Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon]
    Model: ST12000NM0007-2A revision SN02 S/N n/a
    Model: ST12000NM0007-2A revision SN02 S/N n/a
    Partition 9 at sdj9
    Controller: Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon]
    Partition 1 at sdj1
Block R/O flag: 0 (sdn)
    Partition 9 at sda9
    Controller: Sandisk Corp WD Black 2018/PC SN720 NVMe SSD
Block R/O flag: 0 (sdj)
    Model: ST12000NM0007-2A revision SN02 S/N n/a
Not recursing on partition nvme0n1p2
    Partition 1 at sda1
    Controller: Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon]
    Partition 9 at sdh9
Block R/O flag: 0 (sda)
    sdk write-cache: Enabled
Not recursing on partition sdf9
    sdk read-write-verify: Disabled
Block R/O flag: 0 (sdl)
    Controller: Sandisk Corp WD Black 2018/PC SN720 NVMe SSD
    Partition 1 at sdh1
Couldn't get a revision for nvme0n1 (No such file or directory)
    Model: WDS100T3X0C-00SJG0 revision n/a S/N n/a
Block R/O flag: 0 (sdh)
    Controller: Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon]
    Partition 3 at nvme0n1p3
    Model: ST12000NM0007-2A revision SN02 S/N n/a
    Controller: Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon]
    Partition 1 at nvme0n1p1
    sdi write-cache: Enabled
Not recursing on partition sdi1
    sdi read-write-verify: Disabled
    sdn read-write-verify: Not present
    Partition 9 at sdf9
    Controller: Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon]
Assertion 'close_nointr(fd) != -EBADF' failed at ../src/basic/fd-util.c:67, function safe_close(). Aborting.
Not recursing on partition sdf1
Aborted
[schwarzgerat](134) $ 
dankamongmen commented 4 years ago

OK, this is starting to make some sense. We're getting the signal while various threads are calling into libatasmart. The thread receiving the signal is deep in libudev via libatasmart, and libudev is built from...[drum roll] systemd. So it's not some kind of voodoo parent root process craziness, hopefully, but just a problem down in a linear call chain.

(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ffff7439535 in __GI_abort () at abort.c:79
#2  0x00007ffff7aa880a in ?? () from /lib/x86_64-linux-gnu/libudev.so.1
#3  0x00007ffff7aa96c2 in ?? () from /lib/x86_64-linux-gnu/libudev.so.1
#4  0x00007ffff7ab0ea8 in ?? () from /lib/x86_64-linux-gnu/libudev.so.1
#5  0x00007ffff7ab4be0 in ?? () from /lib/x86_64-linux-gnu/libudev.so.1
#6  0x00007ffff7ab52f4 in ?? () from /lib/x86_64-linux-gnu/libudev.so.1
#7  0x00007ffff7ab8ddc in ?? () from /lib/x86_64-linux-gnu/libudev.so.1
#8  0x00007ffff7ac065b in udev_device_get_parent_with_subsystem_devtype ()
   from /lib/x86_64-linux-gnu/libudev.so.1
#9  0x00007ffff7ad4205 in sk_disk_open ()
   from /usr/lib/x86_64-linux-gnu/libatasmart.so.4
#10 0x0000555555569518 in probe_smart (d=d@entry=0x7fffb0000b60)
    at src/smart.c:29
#11 0x000055555555d5b0 in rescan (d=0x7fffb0000b60, name=0x5555555aa270 "sdb")
    at src/growlight.c:992
#12 rescan (name=<optimized out>, d=<optimized out>) at src/growlight.c:908
#13 0x000055555555e3cc in create_new_device_inner (name=0x5555555aa270 "sdb")
    at src/growlight.c:1144
#14 create_new_device (name=name@entry=0x5555555aa270 "sdb")
    at src/growlight.c:1186
#15 0x000055555555e615 in lookup_device (name=0x5555555aa270 "sdb")
    at src/growlight.c:1256
#16 lookup_device (name=<optimized out>) at src/growlight.c:1208
#17 0x000055555555febf in scan_device (name=0x5555555aa270)
    at src/growlight.c:1388
#18 0x00007ffff75defb7 in start_thread (arg=<optimized out>)
    at pthread_create.c:486
#19 0x00007ffff750e49f in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb)
dankamongmen commented 4 years ago

Expanded backtrace with complete libudev trace:

Short read 0/512 from sdk
[Thread 0x7fffc27fc700 (LWP 31428) exited]
[Thread 0x7fffc2ffd700 (LWP 31427) exited]
Assertion 'close_nointr(fd) != -EBADF' failed at src/basic/fd-util.c:71, function safe_close(). Aborting.
[Thread 0x7fffeeffd700 (LWP 31387) exited]
Short read 0/512 from sdj
Short read 0/512 from sdl
[Thread 0x7fffdb7e6700 (LWP 31417) exited]
[Thread 0x7ffff5e5a700 (LWP 31382) exited]
[Thread 0x7fffed7fa700 (LWP 31390) exited]
[Thread 0x7fffe57fa700 (LWP 31397) exited]
[Thread 0x7fffd8fe1700 (LWP 31423) exited]

Thread 14 "growlight-readl" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffe7fff700 (LWP 31392)]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50  ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ffff741f535 in __GI_abort () at abort.c:79
#2  0x00007ffff7ab388e in log_assert_failed_realm ()
   from /home/dank/src/systemd-243/build/src/udev/libudev.so.1
#3  0x00007ffff7aa9d1a in safe_close ()
   from /home/dank/src/systemd-243/build/src/udev/libudev.so.1
#4  0x00007ffff7aabc6f in closep ()
   from /home/dank/src/systemd-243/build/src/udev/libudev.so.1
#5  0x00007ffff7aad2a1 in chase_symlinks ()
   from /home/dank/src/systemd-243/build/src/udev/libudev.so.1
#6  0x00007ffff7a9d445 in device_set_syspath ()
   from /home/dank/src/systemd-243/build/src/udev/libudev.so.1
#7  0x00007ffff7a9dc0b in sd_device_new_from_syspath ()
   from /home/dank/src/systemd-243/build/src/udev/libudev.so.1
#8  0x00007ffff7a9df8d in sd_device_new_from_devnum ()
   from /home/dank/src/systemd-243/build/src/udev/libudev.so.1
#9  0x00007ffff7a8f5e6 in udev_device_new_from_devnum ()
   from /home/dank/src/systemd-243/build/src/udev/libudev.so.1
#10 0x00007ffff7ad40c9 in sk_disk_open ()
   from /usr/lib/x86_64-linux-gnu/libatasmart.so.4
#11 0x0000555555569518 in probe_smart (d=d@entry=0x7fffc40008d0)
    at src/smart.c:29
#12 0x000055555555d5b0 in rescan (d=0x7fffc40008d0, name=0x5555555aa270 "sdb")
    at src/growlight.c:992
#13 rescan (name=<optimized out>, d=<optimized out>) at src/growlight.c:908
#14 0x000055555555e3cc in create_new_device_inner (name=0x5555555aa270 "sdb")
    at src/growlight.c:1144
#15 create_new_device (name=name@entry=0x5555555aa270 "sdb")
    at src/growlight.c:1186
#16 0x000055555555e615 in lookup_device (name=0x5555555aa270 "sdb")
    at src/growlight.c:1256
#17 lookup_device (name=<optimized out>) at src/growlight.c:1208
#18 0x000055555555febf in scan_device (name=0x5555555aa270)
    at src/growlight.c:1388
#19 0x00007ffff75c4fb7 in start_thread (arg=<optimized out>)
    at pthread_create.c:486
#20 0x00007ffff74f449f in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) 
dankamongmen commented 4 years ago

strace reveals us getting an EBADF on both an fstat() and some other stuff in threead 32158:

[pid 32168] <... ioctl resumed> , [12000138625024]) = 0
[pid 32158] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid 32174] <... access resumed> )      = 0
[pid 32168] openat(AT_FDCWD, "/", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH <unfinished ...>
[pid 32174] gettid( <unfinished ...>
[pid 32168] <... openat resumed> )      = 31
[pid 32174] <... gettid resumed> )      = 32174
[pid 32168] openat(31, "sys", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH <unfinished ...>
[pid 32158] write(2, "Error walking sysfs:sdh1 (Bad fi"..., 47 <unfinished ...>
Error walking sysfs:sdh1 (Bad file descriptor)
[pid 32168] <... openat resumed> )      = 37
[pid 32174] openat(AT_FDCWD, "/sys/devices/pci0000:00/0000:00:01.0/0000:05:00.0/host0/port-0:1/end_device-0:1/target0:0:1/0:0:1:0/block/sda/uevent", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid 32158] <... write resumed> )       = 47
[pid 32168] futex(0x7f0f64571570, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 32174] <... openat resumed> )      = 38
[pid 32168] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid 32174] fstat(38,  <unfinished ...>
[pid 32168] write(2, "CHILD: 37\n", 10CHILD: 37
 <unfinished ...>
[pid 32174] <... fstat resumed> {st_mode=S_IFREG|0644, st_size=4096, ...}) = 0
[pid 32168] <... write resumed> )       = 10
[pid 32158] futex(0x7f0f64571570, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 32174] fstat(38,  <unfinished ...>
[pid 32168] futex(0x7f0f64571570, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 32174] <... fstat resumed> {st_mode=S_IFREG|0644, st_size=4096, ...}) = 0
[pid 32168] <... futex resumed> )       = 0
[pid 32158] <... futex resumed> )       = 0
[pid 32174] read(38,  <unfinished ...>
[pid 32168] fstat(37,  <unfinished ...>
[pid 32174] <... read resumed> "MAJOR=8\nMINOR=0\nDEVNAME=sda\nDEVT"..., 4096) = 41
[pid 32168] <... fstat resumed> {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
[pid 32174] read(38,  <unfinished ...>
[pid 32168] close(31 <unfinished ...>
[pid 32174] <... read resumed> "", 4096) = 0
[pid 32168] <... close resumed> )       = 0
[pid 32174] close(38 <unfinished ...>
[pid 32168] write(2, "CLOSEP: 0x7f0f3ffe263c -1\n", 26 <unfinished ...>
CLOSEP: 0x7f0f3ffe263c -1
[pid 32174] <... close resumed> )       = 0
[pid 32168] <... write resumed> )       = 26
[pid 32158] rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1],  <unfinished ...>
[pid 32168] openat(37, "dev", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH <unfinished ...>
[pid 32158] <... rt_sigprocmask resumed> [], 8) = 0
[pid 32168] <... openat resumed> )      = 31
[pid 32158] getpid( <unfinished ...>
[pid 32168] write(2, "CHILD: 31\n", 10 <unfinished ...>
CHILD: 31
[pid 32174] readlinkat(AT_FDCWD, "/sys/devices/pci0000:00/0000:00:01.0/0000:05:00.0/host0/port-0:1/end_device-0:1/target0:0:1/0:0:1:0/block/sda/subsystem",  <unfinished ...>
[pid 32168] <... write resumed> )       = 10
[pid 32158] <... getpid resumed> )      = 32129
[pid 32174] <... readlinkat resumed> "../../../../../../../../../../.."..., 4096) = 44
[pid 32168] fstat(31,  <unfinished ...>
[pid 32158] gettid( <unfinished ...>
[pid 32168] <... fstat resumed> {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
[pid 32174] openat(AT_FDCWD, "/run/udev/data/b8:0", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid 32168] close(37 <unfinished ...>
[pid 32158] <... gettid resumed> )      = 32158
[pid 32174] <... openat resumed> )      = 38
[pid 32168] <... close resumed> )       = 0
[pid 32174] fstat(38,  <unfinished ...>
[pid 32168] write(2, "CLOSEP: 0x7f0f3ffe263c -1\n", 26 <unfinished ...>
CLOSEP: 0x7f0f3ffe263c -1
[pid 32174] <... fstat resumed> {st_mode=S_IFREG|0644, st_size=1243, ...}) = 0
[pid 32168] <... write resumed> )       = 26
[pid 32174] fstat(38,  <unfinished ...>
[pid 32168] openat(31, "block", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH <unfinished ...>
[pid 32174] <... fstat resumed> {st_mode=S_IFREG|0644, st_size=1243, ...}) = 0
[pid 32168] <... openat resumed> )      = 37
[pid 32174] read(38,  <unfinished ...>
[pid 32158] tgkill(32129, 32158, SIGWINCH <unfinished ...>
[pid 32174] <... read resumed> "S:disk/by-id/ata-ST12000NM0007-2"..., 4096) = 1243
[pid 32168] write(2, "CHILD: 37\n", 10 <unfinished ...>
CHILD: 37
[pid 32174] read(38,  <unfinished ...>
[pid 32168] <... write resumed> )       = 10
[pid 32174] <... read resumed> "", 4096) = 0
[pid 32168] fstat(37,  <unfinished ...>
[pid 32174] close(38 <unfinished ...>
[pid 32168] <... fstat resumed> {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
[pid 32174] <... close resumed> )       = 0
[pid 32168] close(31 <unfinished ...>
[pid 32158] <... tgkill resumed> )      = 0
[pid 32168] <... close resumed> )       = 0
[pid 32158] rt_sigprocmask(SIG_SETMASK, [],  <unfinished ...>
[pid 32168] write(2, "CLOSEP: 0x7f0f3ffe263c -1\n", 26 <unfinished ...>
CLOSEP: 0x7f0f3ffe263c -1
[pid 32158] <... rt_sigprocmask resumed> NULL, 8) = 0
[pid 32168] <... write resumed> )       = 26
[pid 32158] --- SIGWINCH {si_signo=SIGWINCH, si_code=SI_TKILL, si_pid=32129, si_uid=0} ---
[pid 32168] openat(37, "8:32", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH <unfinished ...>
[pid 32158] close(31 <unfinished ...>
[pid 32168] <... openat resumed> )      = 31
[pid 32158] <... close resumed> )       = 0
[pid 32168] write(2, "CHILD: 31\n", 10 <unfinished ...>
CHILD: 31
[pid 32156] openat(34, "partition", O_RDONLY|O_NONBLOCK|O_CLOEXEC <unfinished ...>
[pid 32168] <... write resumed> )       = 10
[pid 32158] madvise(0x7f0f447f1000, 8368128, MADV_DONTNEED <unfinished ...>
[pid 32156] <... openat resumed> )      = -1 ENOENT (No such file or directory)
[pid 32168] fstat(31,  <unfinished ...>
[pid 32155] <... openat resumed> )      = 32
[pid 32168] <... fstat resumed> 0x7f0f3ffe2780) = -1 EBADF (Bad file descriptor)
[pid 32158] <... madvise resumed> )     = 0
[pid 32156] close(34 <unfinished ...>
[pid 32168] write(2, "CLOSEP: 0x7f0f3ffe263c 31\n", 26 <unfinished ...>
CLOSEP: 0x7f0f3ffe263c 31
[pid 32158] munmap(0x7f0f417eb000, 8392704 <unfinished ...>
[pid 32168] <... write resumed> )       = 26
[pid 32156] <... close resumed> )       = 0
[pid 32168] close(31 <unfinished ...>
[pid 32155] fstat(32,  <unfinished ...>
[pid 32168] <... close resumed> )       = -1 EBADF (Bad file descriptor)
[pid 32158] <... munmap resumed> )      = 0
[pid 32156] openat(30, "queue/physical_block_size", O_RDONLY|O_NONBLOCK|O_CLOEXEC <unfinished ...>
[pid 32168] writev(2, [{iov_base="Assertion 'close_nointr(fd) != -"..., iov_len=105}, {iov_base="\n", iov_len=1}], 2 <unfinished ...>
Assertion 'close_nointr(fd) != -EBADF' failed at src/basic/fd-util.c:71, function safe_close(). Aborting.[pid 32158] exit(0
 <unfinished ...>
[pid 32155] <... fstat resumed> {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
[pid 32168] <... writev resumed> )      = 106
[pid 32158] <... exit resumed>)         = ?
[pid 32168] rt_sigprocmask(SIG_UNBLOCK, [ABRT],  <unfinished ...>
[pid 32156] <... openat resumed> )      = 31
[pid 32168] <... rt_sigprocmask resumed> NULL, 8) = 0
[pid 32158] +++ exited with 0 +++
[pid 32168] rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1],  <unfinished ...>
[pid 32156] read(31,  <unfinished ...>
[pid 32168] <... rt_sigprocmask resumed> [], 8) = 0
[pid 32156] <... read resumed> "512\n", 512) = 4
[pid 32168] getpid( <unfinished ...>
[pid 32156] close(31 <unfinished ...>
[pid 32168] <... getpid resumed> )      = 32129
[pid 32156] <... close resumed> )       = 0
[pid 32168] gettid( <unfinished ...>
[pid 32156] openat(30, "queue/logical_block_size", O_RDONLY|O_NONBLOCK|O_CLOEXEC <unfinished ...>
[pid 32168] <... gettid resumed> )      = 32168
[pid 32162] mprotect(0x7f0ecc068000, 4096, PROT_READ|PROT_WRITE <unfinished ...>
[pid 32156] <... openat resumed> )      = 31
[pid 32168] tgkill(32129, 32168, SIGABRT <unfinished ...>
[pid 32162] <... mprotect resumed> )    = 0
[pid 32168] <... tgkill resumed> )      = 0
[pid 32156] read(31,  <unfinished ...>
[pid 32168] rt_sigprocmask(SIG_SETMASK, [],  <unfinished ...>
[pid 32162] mprotect(0x7f0ecc069000, 4096, PROT_READ|PROT_WRITE <unfinished ...>
[pid 32168] <... rt_sigprocmask resumed> NULL, 8) = 0
[pid 32162] <... mprotect resumed> )    = 0
[pid 32168] --- SIGABRT {si_signo=SIGABRT, si_code=SI_TKILL, si_pid=32129, si_uid=0} ---
[pid 32156] <... read resumed> "512\n", 512) = 4

...in fact, 32158 pretty much had nothing but a bad day. way earlier:

[pid 32158] openat(31, "queue/scheduler", O_RDONLY|O_NONBLOCK|O_CLOEXEC <unfinished ...>
[pid 32174] openat(33, "sda", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH <unfinished ...>
[pid 32158] <... openat resumed> )      = -1 EBADF (Bad file descriptor)
[pid 32174] <... openat resumed> )      = 31
[pid 32157] <... ioctl resumed> , status=0x2, masked_status=0x1, msg_status=0, sb_len_wr=22, sbp="\x72\x01\x00\x1d\x00\x00\x00\x0e\x09\x0c\x00\x00\x00\x00\x00\x00\x00\x4f\x00\xc2\x40\x50", host_status=0, driver_status=0x8, resid=0, duration=8, info=SG_INFO_CHECK}) = 0
[pid 32174] futex(0x7f0f64571570, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 32158] write(2, "Couldn't determine scheduler for"..., 60 <unfinished ...>
Couldn't determine scheduler for sdh1 (Bad file descriptor)
[pid 32157] ioctl(13, SG_IO, {interface_id='S', dxfer_direction=SG_DXFER_NONE, cmd_len=16, cmdp="\x85\x06\x20\x00\xda\x00\x00\x00\x00\x00\x4f\x00\xc2\x00\xb0\x00", mx_sb_len=32, iovec_count=0, dxfer_len=0, timeout=2000, flags=0 <unfinished ...>
[pid 32158] <... write resumed> )       = 60
[pid 32156] openat(30, "mq", O_RDONLY|O_CLOEXEC|O_DIRECTORY <unfinished ...>
[pid 32158] futex(0x7f0f64571570, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>

and no wonder. thread 32174 appears to have stomped its dirfd earlier (see attached log). 32156 close()s 31 shortly before the first EBADF:

[pid 32158] fcntl(31, F_GETFL <unfinished ...>
[pid 32160] <... lstat resumed> {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
[pid 32158] <... fcntl resumed> )       = 0x8000 (flags O_RDONLY|O_LARGEFILE)
[pid 32160] lstat("/sys/devices/pci0000:00/0000:00:11.4",  <unfinished ...>
[pid 32158] fcntl(31, F_SETFD, FD_CLOEXEC <unfinished ...>
[pid 32160] <... lstat resumed> {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
[pid 32158] <... fcntl resumed> )       = 0
[pid 32160] lstat("/sys/devices/pci0000:00/0000:00:11.4/ata2",  <unfinished ...>
[pid 32156] close(31 <unfinished ...>
[pid 32160] <... lstat resumed> {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
[pid 32158] openat(31, "partition", O_RDONLY|O_NONBLOCK|O_CLOEXEC <unfinished ..
[pid 32160] lstat("/sys/devices/pci0000:00/0000:00:11.4/ata2/host2",  <unfinishe
[pid 32158] <... openat resumed> )      = -1 EBADF (Bad file descriptor)

32156 had previously had 31, and closed it already;

[pid 32156] close(31 <unfinished ...>  *********************************
[pid 32160] <... futex resumed> )       = 0
[pid 32158] <... futex resumed> )       = 1
[pid 32156] <... close resumed> )       = 0 ****************************
[pid 32160] getcwd( <unfinished ...>
[pid 32158] openat(5, "../../devices/pci0000:00/0000:00:1f.2/ata8/host8/target8:
[pid 32160] <... getcwd resumed> "/sys/class/block", 4096) = 17
[pid 32158] <... openat resumed> )      = 31
[pid 32160] lstat("/sys/devices",  <unfinished ...>
[pid 32158] fstat(31,  <unfinished ...>
[pid 32160] <... lstat resumed> {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
[pid 32158] <... fstat resumed> {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
[pid 32160] lstat("/sys/devices/pci0000:00",  <unfinished ...>
[pid 32158] fcntl(31, F_GETFL <unfinished ...>
[pid 32160] <... lstat resumed> {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
[pid 32158] <... fcntl resumed> )       = 0x8000 (flags O_RDONLY|O_LARGEFILE)
[pid 32160] lstat("/sys/devices/pci0000:00/0000:00:11.4",  <unfinished ...>
[pid 32158] fcntl(31, F_SETFD, FD_CLOEXEC <unfinished ...>
[pid 32160] <... lstat resumed> {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
[pid 32158] <... fcntl resumed> )       = 0
[pid 32160] lstat("/sys/devices/pci0000:00/0000:00:11.4/ata2",  <unfinished ...>
[pid 32156] close(31 <unfinished ...> *************************
[pid 32160] <... lstat resumed> {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
[pid 32158] openat(31, "partition", O_RDONLY|O_NONBLOCK|O_CLOEXEC <unfinished ..
[pid 32160] lstat("/sys/devices/pci0000:00/0000:00:11.4/ata2/host2",  <unfinishe
[pid 32158] <... openat resumed> )      = -1 EBADF (Bad file descriptor)
[pid 32160] <... lstat resumed> {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
[pid 32158] openat(31, "loop", O_RDONLY|O_NONBLOCK|O_CLOEXEC <unfinished ...>
[pid 32160] lstat("/sys/devices/pci0000:00/0000:00:11.4/ata2/host2/target2:0:0",
[pid 32158] <... openat resumed> )      = -1 EBADF (Bad file descriptor)
[pid 32160] <... lstat resumed> {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
[pid 32158] openat(31, "removable", O_RDONLY|O_NONBLOCK|O_CLOEXEC <unfinished ..
[pid 32160] lstat("/sys/devices/pci0000:00/0000:00:11.4/ata2/host2/target2:0:0/2
[pid 32158] <... openat resumed> )      = -1 EBADF (Bad file descriptor)
[pid 32160] <... lstat resumed> {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
[pid 32156] <... close resumed> )       = 0 **************************************
[pid 32160] lstat("/sys/devices/pci0000:00/0000:00:11.4/ata2/host2/target2:0:0/2
[pid 32158] write(2, "Couldn't determine removability "..., 63Couldn't determine
 <unfinished ...>

so 32156 gets a happy successful close(), and 32174 eats it. 32174 then finally goes to close its own 31, and blows away that of 32168, which assert()s out. but 32174 was the original victim, and 32156 the offender.

e.gz

dankamongmen commented 4 years ago

thread 32158 there is going through explore_sysfs_node_inner() in our context, as can be determined by the series of openat() parameters.

dankamongmen commented 4 years ago

offender thread 32156 looks like it's been doing some readdir(3)s by the presence of getdents64()...hrmm...

[pid 32156] openat(30, "holders", O_RDONLY|O_CLOEXEC|O_DIRECTORY <unfinished ...>
[pid 32156] <... openat resumed> )      = 31
[pid 32156] fstat(31,  <unfinished ...>
[pid 32156] <... fstat resumed> {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
[pid 32156] fcntl(31, F_GETFL <unfinished ...>
[pid 32156] <... fcntl resumed> )       = 0x18000 (flags O_RDONLY|O_LARGEFILE|O_DIRECTORY)
[pid 32156] fcntl(31, F_SETFD, FD_CLOEXEC <unfinished ...>
[pid 32156] <... fcntl resumed> )       = 0
[pid 32156] getdents64(31 <unfinished ...>
[pid 32156] <... getdents64 resumed> , /* 2 entries */, 32768) = 48
[pid 32156] getdents64(31 <unfinished ...>
[pid 32156] <... getdents64 resumed> , /* 0 entries */, 32768) = 0
[pid 32156] close(31 <unfinished ...>
[pid 32156] <... close resumed> )       = 0
[pid 32156] close(31 <unfinished ...>
[pid 32156] <... close resumed> )       = 0

"holders", huh?

dankamongmen commented 4 years ago

looking like it's probably us:

                                        if((hfd = openat(subfd,"holders",O_RDONLY|O_CLOEXEC|O_DIRECTORY)) >= 0){
                                                // check_slavery() closes hfd on success
                                                if(check_slavery(p,hfd)){
                                                        close(hfd);
                                                        close(subfd);
                                                        return -1;
                                                }
                                        }
dankamongmen commented 4 years ago

Yep. check_slavery() takes ownership of the fd in all cases. We shouldn't have had this close() here, and we needed to continue over on success, lest the close(subfd) finalizer bite us in the ass.

I can no longer reproduce this problem. w00t!