akhilerm commented 5 years ago

We have a machine with NVMe device attached in the following configuration

admin@ip-10-1-38-168:~$ lsblk
NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
nvme0n1     259:1    0  128G  0 disk 
└─nvme0n1p1 259:2    0    8G  0 part /
nvme1n1     259:0    0 69.9G  0 disk

We are using openseachest library to get disk information in openebs/node-disk-manager. The code is written in golang and cgo bindings are used to interact with seachest library. make release is used to generate the static libs of openseachest. The binary is being run inside a docker container on the node and this container is a part of a larger kubernetes cluster. We are hitting a crash with the below logs.

I0716 10:22:44.142004       7 controller.go:217] Disk CRD is available
I0716 10:22:44.144204       7 controller.go:233] BlockDevice CRD is available
I0716 10:22:44.144226       7 filter.go:58] registering filters
I0716 10:22:44.144248       7 filter.go:66] configured os disk exclude filter : state enable
I0716 10:22:44.144420       7 filter.go:66] configured vendor filter : state enable
I0716 10:22:44.144430       7 filter.go:66] configured path filter : state enable
I0716 10:22:44.144436       7 probe.go:62] registering probes
I0716 10:22:44.144445       7 probe.go:83] configured seachest probe : state enable
I0716 10:22:44.144462       7 probe.go:83] configured smart probe : state enable
I0716 10:22:44.144470       7 probe.go:83] configured mount probe : state enable
I0716 10:22:44.144508       7 probe.go:83] configured udev probe : state enable
I0716 10:22:44.144587       7 udevprobe.go:210] starting udev probe listener
I0716 10:22:44.146152       7 probe.go:83] configured capacity probe : state enable
I0716 10:22:44.146176       7 sparsefilegenerator.go:171] Sparse file already exists: 0-ndm-sparse.img
I0716 10:22:44.146209       7 sparsefilegenerator.go:223] Updating the BlockDevice CR for Sparse file: sparse-a235dcc97377568f64273332683c8cdf
I0716 10:22:44.148759       7 eventhandler.go:49] Processing details for disk-b33d9997949d356995b047615e5558b9
I0716 10:22:44.148903       7 probe.go:107] details filled by udev probe
fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x119f07d]

runtime stack:
runtime.throw(0x144ff90, 0x2a)
    /home/travis/.gimme/versions/go1.12.7.linux.amd64/src/runtime/panic.go:617 +0x72 fp=0x7ffc2d7a6c20 sp=0x7ffc2d7a6bf0 pc=0x43ddd2
runtime.sigpanic()
    /home/travis/.gimme/versions/go1.12.7.linux.amd64/src/runtime/signal_unix.go:374 +0x4a9 fp=0x7ffc2d7a6c50 sp=0x7ffc2d7a6c20 pc=0x4528c9

goroutine 157 [syscall]:
runtime.cgocall(0x1115b20, 0xc000519008, 0x18)
    /home/travis/.gimme/versions/go1.12.7.linux.amd64/src/runtime/cgocall.go:128 +0x5b fp=0xc000518fd8 sp=0xc000518fa0 pc=0x41461b
github.com/openebs/node-disk-manager/pkg/seachest._Cfunc_get_SCSI_Drive_Information(0xc00028a000, 0xc00035c000, 0xc000000000)
    _cgo_gotypes.go:986 +0x4d fp=0xc000519008 sp=0xc000518fd8 pc=0x11049ad
github.com/openebs/node-disk-manager/pkg/seachest.(*Identifier).SeachestBasicDiskInfo(0xc0005193f0, 0x0, 0x0)
    /home/travis/gopath/src/github.com/openebs/node-disk-manager/pkg/seachest/seachest.go:89 +0x292 fp=0xc000519378 sp=0xc000519008 pc=0x1104d12
github.com/openebs/node-disk-manager/cmd/ndm_daemonset/probe.(*seachestProbe).FillDiskDetails(0xc0003c5310, 0xc0000b7ba0)
    /home/travis/gopath/src/github.com/openebs/node-disk-manager/cmd/ndm_daemonset/probe/seachestprobe.go:98 +0x82 fp=0xc0005196d0 sp=0xc000519378 pc=0x110fec2
github.com/openebs/node-disk-manager/cmd/ndm_daemonset/controller.(*Probe).FillDiskDetails(...)
    /home/travis/gopath/src/github.com/openebs/node-disk-manager/cmd/ndm_daemonset/controller/probe.go:47
github.com/openebs/node-disk-manager/cmd/ndm_daemonset/controller.(*Controller).FillDiskDetails(0xc0003162d0, 0xc0000b7ba0)
    /home/travis/gopath/src/github.com/openebs/node-disk-manager/cmd/ndm_daemonset/controller/probe.go:106 +0xe1 fp=0xc000519750 sp=0xc0005196d0 pc=0x10fef11
github.com/openebs/node-disk-manager/cmd/ndm_daemonset/probe.(*ProbeEvent).addDiskEvent(0xc0000b2000, 0x1426869, 0x3, 0xc00048ffa0, 0x3, 0x4)
    /home/travis/gopath/src/github.com/openebs/node-disk-manager/cmd/ndm_daemonset/probe/eventhandler.go:50 +0x225 fp=0xc000519f20 sp=0xc000519750 pc=0x110eb05
github.com/openebs/node-disk-manager/cmd/ndm_daemonset/probe.(*udevProbe).listen(0xc00048fec0)
    /home/travis/gopath/src/github.com/openebs/node-disk-manager/cmd/ndm_daemonset/probe/udevprobe.go:215 +0x1d4 fp=0xc000519fd8 sp=0xc000519f20 pc=0x1112bd4
runtime.goexit()
    /home/travis/.gimme/versions/go1.12.7.linux.amd64/src/runtime/asm_amd64.s:1337 +0x1 fp=0xc000519fe0 sp=0xc000519fd8 pc=0x46a6b1
created by github.com/openebs/node-disk-manager/cmd/ndm_daemonset/probe.(*udevProbe).Start
    /home/travis/gopath/src/github.com/openebs/node-disk-manager/cmd/ndm_daemonset/probe/udevprobe.go:115 +0x3f

The complete logs are available here

The following is the backtrace generated when we evalauted the core dump using gdb

#0  runtime.raise () at /home/travis/.gimme/versions/go1.12.7.linux.amd64/src/runtime/sys_linux_amd64.s:150
#1  0x000000000045291b in runtime.dieFromSignal (sig=6) at /home/travis/.gimme/versions/go1.12.7.linux.amd64/src/runtime/signal_unix.go:424
#2  0x0000000000452e9d in runtime.sigfwdgo (sig=6, info=0xc0003174b0, ctx=0xc000317380, ~r3=<optimized out>)
    at /home/travis/.gimme/versions/go1.12.7.linux.amd64/src/runtime/signal_unix.go:629
#3  0x0000000000451fc0 in runtime.sigtrampgo (sig=<optimized out>, info=0xc0003174b0, ctx=0xc000317380)
    at /home/travis/.gimme/versions/go1.12.7.linux.amd64/src/runtime/signal_unix.go:289
#4  0x000000000046c2a3 in runtime.sigtramp () at /home/travis/.gimme/versions/go1.12.7.linux.amd64/src/runtime/sys_linux_amd64.s:357
#5  <signal handler called>
#6  runtime.raise () at /home/travis/.gimme/versions/go1.12.7.linux.amd64/src/runtime/sys_linux_amd64.s:150
#7  0x000000000045291b in runtime.dieFromSignal (sig=6) at /home/travis/.gimme/versions/go1.12.7.linux.amd64/src/runtime/signal_unix.go:424
#8  0x0000000000452aba in runtime.crash () at /home/travis/.gimme/versions/go1.12.7.linux.amd64/src/runtime/signal_unix.go:518
#9  0x0000000000466f21 in runtime.fatalthrow.func1 () at /home/travis/.gimme/versions/go1.12.7.linux.amd64/src/runtime/panic.go:676
#10 0x000000000043dfa7 in runtime.fatalthrow () at /home/travis/.gimme/versions/go1.12.7.linux.amd64/src/runtime/panic.go:669
#11 0x000000000043ddd2 in runtime.throw (s=...) at /home/travis/.gimme/versions/go1.12.7.linux.amd64/src/runtime/panic.go:617
#12 0x00000000004528c9 in runtime.sigpanic () at /home/travis/.gimme/versions/go1.12.7.linux.amd64/src/runtime/signal_unix.go:374
#13 0x000000000119f07d in sntl_Check_Operation_Code_and_Service_Action ()
#14 0x00000000011a0bf2 in sntl_Translate_SCSI_Report_Supported_Operation_Codes_Command ()
#15 0x00000000011a10df in sntl_Translate_SCSI_Command ()
#16 0x000000000116fce8 in scsi_Send_Cdb ()
#17 0x00000000011702d8 in scsi_Report_Supported_Operation_Codes ()
#18 0x000000000112854f in get_Supported_FWDL_Modes ()
#19 0x000000000111ccea in get_SCSI_Drive_Information ()
#20 0x0000000001115b3b in _cgo_8a300aa9fb36_Cfunc_get_SCSI_Drive_Information (v=0xc0004a3008) at cgo-gcc-prolog:104
#21 0x0000000000469e30 in runtime.asmcgocall () at /home/travis/.gimme/versions/go1.12.7.linux.amd64/src/runtime/asm_amd64.s:635
#22 0x0000000002403d80 in runtime.timers ()
#23 0x0000000000000003 in ?? ()
#24 0x00000000015c0059 in func.* ()
#25 0x00007ff0cd4e7f90 in ?? ()
#26 0x000000c0004a2df8 in ?? ()
#27 0x0000000000001068 in ?? ()
#28 0x000000c00031d380 in ?? ()
#29 0x0000000000442380 in ?? () at /home/travis/.gimme/versions/go1.12.7.linux.amd64/src/runtime/proc.go:1082
---Type <return> to continue, or q <return> to quit---
#30 0x00007ff0ca675ae0 in ?? ()
#31 0x0000000000800000 in internal/x/net/http/httpguts.init.ializers ()
    at /home/travis/.gimme/versions/go1.12.7.linux.amd64/src/internal/x/net/http/httpguts/guts.go:31
#32 0x000000c0000b5080 in ?? ()
#33 0x0000000000442380 in ?? () at /home/travis/.gimme/versions/go1.12.7.linux.amd64/src/runtime/proc.go:1082
#34 0x000000000111602b in threadentry (v=<optimized out>) at gcc_linux_amd64.c:102
#35 0x00007ff0cce616ba in start_thread (arg=0x7ff0c248a700) at pthread_create.c:333
#36 0x00007ff0ccb9741d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Additional Info:

openSeachest : Tag `Hotfix-Makefile_fix`

Machine in which openseachest and binary was compiled
============================================
OS : Ubuntu 16.04.6 LTS
Kernel Version : 4.15.0-1028-gcp, Systemd 229

Machine which reported the crash
=========================
AWS EC2 instance. (m5d.large)
admin@ip-10-1-38-168:~$ uname -a
Linux ip-10-1-38-168 4.4.148-k8s #1 SMP Thu Aug 16 19:29:46 UTC 2018 x86_64 GNU/Linux

admin@ip-10-1-38-168:~$ cat /etc/os-release 
PRETTY_NAME="Debian GNU/Linux 8 (jessie)"
NAME="Debian GNU/Linux"
VERSION_ID="8"
VERSION="8 (jessie)"
ID=debian

admin@ip-10-1-38-168:~$ docker version
Client:
 Version:      17.03.2-ce
 API version:  1.27
 Go version:   go1.7.5
 Git commit:   f5ec1e2
 Built:        Tue Jun 27 02:09:56 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.03.2-ce
 API version:  1.27 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   f5ec1e2
 Built:        Tue Jun 27 02:09:56 2017
 OS/Arch:      linux/amd64
 Experimental: false

admin@ip-10-1-38-168:~$ sudo udevadm info /dev/nvme0n1
P: /devices/pci0000:00/0000:00:04.0/nvme/nvme0/nvme0n1
N: nvme0n1
E: DEVNAME=/dev/nvme0n1
E: DEVPATH=/devices/pci0000:00/0000:00:04.0/nvme/nvme0/nvme0n1
E: DEVTYPE=disk
E: ID_PART_TABLE_TYPE=dos
E: ID_PART_TABLE_UUID=66d7a366
E: MAJOR=259
E: MINOR=1
E: SUBSYSTEM=block
E: TAGS=:systemd:
E: USEC_INITIALIZED=4147

admin@ip-10-1-38-168:~$ sudo udevadm info /dev/nvme1n1
P: /devices/pci0000:00/0000:00:1f.0/nvme/nvme1/nvme1n1
N: nvme1n1
E: DEVNAME=/dev/nvme1n1
E: DEVPATH=/devices/pci0000:00/0000:00:1f.0/nvme/nvme1/nvme1n1
E: DEVTYPE=disk
E: MAJOR=259
E: MINOR=0
E: SUBSYSTEM=block
E: TAGS=:systemd:
E: USEC_INITIALIZED=54026

xahmad commented 5 years ago

Hi @akhilerm

A few questions:

Does the container have root privilege? I assume it does but just doing some sanity checking.
Does it work on SCSI/ATA devices and fail only for NVMe?
Does a compiled binary of openSeaChest work with the -i command line option on /dev/nvme1n1? Trying to figure out if this is a cgo binding issue.

akhilerm commented 5 years ago

@xahmad

Yes the container is running in privileged mode
Yes, it works for SCSI/ATA devices and fail only for NVMe. That too for NVMe on AWS. On another machine with NVMe device from intel it is working as expected

It doesn't crash, but I think the info given is only partial


admin@ip-10-1-38-168:~$ sudo ./openSeaChest_NVMe -d /dev/nvme1n1 -i
==========================================================================================
openSeaChest_NVMe - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2019 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_NVMe Version: 1.0.0-1_19_23 X86_64
Build Date: Jul 17 2019
Today: Wed Jul 17 15:40:46 2019
==========================================================================================

/dev/nvme1n1 - Amazon EC2 NVMe Instance Storage - AWS206F4D15810744582 - NVMe admin@ip-10-1-38-168:~$ sudo ./openSeaChest_NVMe -d /dev/nvme0n1 -i

openSeaChest_NVMe - openSeaChest drive utilities - NVMe Enabled Copyright (c) 2014-2019 Seagate Technology LLC and/or its Affiliates, All Rights Reserved openSeaChest_NVMe Version: 1.0.0-1_19_23 X86_64 Build Date: Jul 17 2019 Today: Wed Jul 17 15:41:00 2019

/dev/nvme0n1 - Amazon Elastic Block Store - vol0502987d85b866574 - NVMe

akhilerm commented 5 years ago

Also @xahmad , when I built it on my local system running Ubuntu 18.04 LTS, it was working as expected. But using Ubuntu 18 in our CI build did not solve the problem.

Are there any packages that are linked to seachest which are os speciifc. ?

xahmad commented 5 years ago

The "/dev/nvme0n1 - Amazon Elastic Block Store - vol0502987d85b866574 - NVMe" is a virtual NVMe device that the EC2 instance is providing you, not an actual physical one, like the one you have in your Intel example.

The fact that when you run openSeaChest from the command line in the EC2 instance, it only show the banner + some initial stuff...means that the utility is actually crashing the same way the library is crashing within the container. You can probably test my theory by running a "strace -f" and the trace will likely show a seg fault.

From the logs the crash happens when we are in scsi_Report_Supported_Operation_Codes function, which is a SCSI operation. In most cases, the get_Device function assumes the device to be SCSI before figuring out if it is actually ATA or NVMe. This part of the code can certainly be improved.

I believe the crash is happening because the logical NVMe device, emulating the NVMe command set, can't seem to handle one of the SCSI op-codes. A physical one (like Intel's NVMe) does.

Let me see if we can find a way to recreate this on our end & debug it a little further. It might be a while before get to this...just setting the expectations here.

akhilerm commented 5 years ago

@xahmad . I did a make release on a different machine

akhil@MayaData:~$ uname -a
Linux MayaData 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 10:55:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
akhil@MayaData:~$ cat /etc/os-release 
NAME="Ubuntu"
VERSION="18.04.2 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.2 LTS"
VERSION_ID="18.04"

and executed it on the AWS machine and it is working as expected.

/dev/nvme0n1 - Amazon Elastic Block Store - vol0502987d85b866574 - NVMe
NVMe Controller Information:
    Model Number: Amazon Elastic Block Store
    Serial Number: vol0502987d85b866574
    Firmware Revision: 1.0
    IEEE OUI: DC02A0
    PCI Vendor ID: 1D0F
    PCI Subsystem Vendor ID: 1D0F
    Controller ID: Not Supported
    NVMe Version: Not reported (NVMe 1.1 or older)
    FGUID: Not Supported
    Write Cache: Disabled
    Maximum Number Of Namespaces: 1
    Read-Only Medium: False
    SMART Status: Good
    Composite Temperature (K): 273
    Percent Used (%): 0
    Available Spare (%): 100
    Power On Time:  
    Power On Hours (hours): 0
    Last DST information:
        Not supported
    Long Drive Self Test Time: Not Supported
    Annualized Workload Rate (TB/yr): inf
    Total Bytes Read (MB): 250.88
    Total Bytes Written (GB): 5.28
    Encryption Support: Not Supported
    Number of Firmware Slots: 1
    Controller Features:

NVMe Namespace Information:
    Namespace Size (GB/GiB): 137.44/128.00
    Namespace Size (LBAs): 268435455
    Namespace Capacity (GB/GiB): 137.44/128.00
    Namespace Capacity (LBAs): 268435456
    Namespace Utilization (B/B): 0.00/0.00
    Namespace Utilization (LBAs): 0
    Logical Block Size (B): 512
    Logical Block Size Relative Performance: Best Performance
    NGUID: Not Supported
    EUI64: Not Supported
    Namespace Features:

I was able to get all the details, which means the possibility of missing op-code is less.

I tried again building on the first(our CI setup, a travis VM) machine, the binary is crashing. Are there any os specific library or header files that seachest uses while compilation?

akhilerm commented 5 years ago

The issue happened in device with write cache support. Have raised a PR in opensea-transport

vonericsen commented 5 years ago

Hi @akhilerm, Thanks for the PR! I have merged it into opensea-transport master and develop branches. I also updated openSeaChest and added a new tag called Release-19.06.02 that has the fix in place. I reviewed other sections of the SNTL code and made one other change to help ensure there aren't other crashes in the report supported operation codes translation going forward either.

Seagate / openSeaChest

openSeachest is crashing while trying to fetch details of NVMe device #21

/dev/nvme1n1 - Amazon EC2 NVMe Instance Storage - AWS206F4D15810744582 - NVMe admin@ip-10-1-38-168:~$ sudo ./openSeaChest_NVMe -d /dev/nvme0n1 -i

openSeaChest_NVMe - openSeaChest drive utilities - NVMe Enabled Copyright (c) 2014-2019 Seagate Technology LLC and/or its Affiliates, All Rights Reserved openSeaChest_NVMe Version: 1.0.0-1_19_23 X86_64 Build Date: Jul 17 2019 Today: Wed Jul 17 15:41:00 2019