Closed: akhilerm closed this issue 5 years ago
Hi @akhilerm
A few questions:
Does the container have root privilege? I assume it does but just doing some sanity checking.
Does it work on SCSI/ATA devices and fail only for NVMe?
Does a compiled binary of openSeaChest work with the -i command line option on /dev/nvme1n1? Trying to figure out if this is a cgo binding issue.
@xahmad
admin@ip-10-1-38-168:~$ sudo ./openSeaChest_NVMe -d /dev/nvme1n1 -i
==========================================================================================
openSeaChest_NVMe - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2019 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_NVMe Version: 1.0.0-1_19_23 X86_64
Build Date: Jul 17 2019
Today: Wed Jul 17 15:40:46 2019
==========================================================================================
/dev/nvme0n1 - Amazon Elastic Block Store - vol0502987d85b866574 - NVMe
Also @xahmad, when I built it on my local system running Ubuntu 18.04 LTS, it worked as expected. But using Ubuntu 18 in our CI build did not solve the problem.
Are there any packages linked to SeaChest that are OS-specific?
The "/dev/nvme0n1 - Amazon Elastic Block Store - vol0502987d85b866574 - NVMe" is a virtual NVMe device that the EC2 instance is providing you, not an actual physical one, like the one you have in your Intel example.
The fact that running openSeaChest from the command line in the EC2 instance only shows the banner plus some initial output means that the utility is crashing the same way the library is crashing within the container. You can probably test my theory by running it under "strace -f"; the trace will likely show a segfault.
From the logs, the crash happens in the scsi_Report_Supported_Operation_Codes function, which is a SCSI operation. In most cases, the get_Device function assumes the device is SCSI before figuring out whether it is actually ATA or NVMe. This part of the code can certainly be improved.
I believe the crash is happening because the logical NVMe device, emulating the NVMe command set, can't seem to handle one of the SCSI op-codes. A physical one (like Intel's NVMe) does.
Let me see if we can find a way to recreate this on our end and debug it a little further. It might be a while before we get to this... just setting expectations here.
@xahmad, I did a "make release" on a different machine:
akhil@MayaData:~$ uname -a
Linux MayaData 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 10:55:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
akhil@MayaData:~$ cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.2 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.2 LTS"
VERSION_ID="18.04"
and executed it on the AWS machine and it is working as expected.
/dev/nvme0n1 - Amazon Elastic Block Store - vol0502987d85b866574 - NVMe
NVMe Controller Information:
Model Number: Amazon Elastic Block Store
Serial Number: vol0502987d85b866574
Firmware Revision: 1.0
IEEE OUI: DC02A0
PCI Vendor ID: 1D0F
PCI Subsystem Vendor ID: 1D0F
Controller ID: Not Supported
NVMe Version: Not reported (NVMe 1.1 or older)
FGUID: Not Supported
Write Cache: Disabled
Maximum Number Of Namespaces: 1
Read-Only Medium: False
SMART Status: Good
Composite Temperature (K): 273
Percent Used (%): 0
Available Spare (%): 100
Power On Time:
Power On Hours (hours): 0
Last DST information:
Not supported
Long Drive Self Test Time: Not Supported
Annualized Workload Rate (TB/yr): inf
Total Bytes Read (MB): 250.88
Total Bytes Written (GB): 5.28
Encryption Support: Not Supported
Number of Firmware Slots: 1
Controller Features:
NVMe Namespace Information:
Namespace Size (GB/GiB): 137.44/128.00
Namespace Size (LBAs): 268435455
Namespace Capacity (GB/GiB): 137.44/128.00
Namespace Capacity (LBAs): 268435456
Namespace Utilization (B/B): 0.00/0.00
Namespace Utilization (LBAs): 0
Logical Block Size (B): 512
Logical Block Size Relative Performance: Best Performance
NGUID: Not Supported
EUI64: Not Supported
Namespace Features:
I was able to get all the details, which makes a missing op-code less likely.
I tried building again on the first machine (our CI setup, a Travis VM), and the binary is crashing. Are there any OS-specific libraries or header files that SeaChest uses during compilation?
The issue happened on a device with write cache support. I have raised a PR in opensea-transport.
Hi @akhilerm,
Thanks for the PR! I have merged it into opensea-transport master and develop branches.
I also updated openSeaChest and added a new tag, Release-19.06.02, that has the fix in place. I reviewed other sections of the SNTL code and made one additional change to help ensure there are no other crashes in the report supported operation codes translation going forward.
We have a machine with an NVMe device attached in the following configuration
We are using the openSeaChest library to get disk information in openebs/node-disk-manager. The code is written in Go, and cgo bindings are used to interact with the SeaChest library. "make release" is used to generate the static libs of openSeaChest. The binary is run inside a Docker container on the node, and this container is part of a larger Kubernetes cluster. We are hitting a crash with the below logs. The complete logs are available here.
The following is the backtrace generated when we evaluated the core dump using gdb.
Additional Info: