Open-CAS / open-cas-linux

Open CAS Linux
https://open-cas.com
BSD 3-Clause "New" or "Revised" License
224 stars 82 forks source link

CAS load failed after machine hard reset(12 tests, one occurrence) #469

Closed yinlei-github closed 4 years ago

yinlei-github commented 4 years ago

CAS load failed after machine hard reset. 12 tests, one occurrence.

load command: [root@local ~]# casadm -S -i 1 -d /dev/nvme0n1p1 --load Error inserting cache 1 Failed to insert cache [root@local ~]#

system version: Centos8.0 4.18.0-80.11.2.el8_0.x86_64

CAS version: +=========================+===================+ | Name | Version | +=========================+===================+ | CAS Cache Kernel Module | 20.06.00.00000709 | | CAS Disk Kernel Module | 20.06.00.00000709 | | CAS CLI Utility | 20.06.00.00000709 |

config before reset: type id disk status write policy device cache 1 /dev/nvme0n1p1 Running wo - ├core 1 /dev/sdb Active - /dev/cas1-1 ├core 2 /dev/sdc Active - /dev/cas1-2 ├core 3 /dev/sdd Active - /dev/cas1-3 ├core 4 /dev/sde Active - /dev/cas1-4 ├core 5 /dev/sdf Active - /dev/cas1-5 ├core 6 /dev/sdg Active - /dev/cas1-6 ├core 7 /dev/sdh Active - /dev/cas1-7 ├core 8 /dev/sdi Active - /dev/cas1-8 ├core 9 /dev/sdj Active - /dev/cas1-9 ├core 10 /dev/sdk Active - /dev/cas1-10 ├core 11 /dev/sdl Active - /dev/cas1-11 └core 12 /dev/sdm Active - /dev/cas1-12

dmesg info: [Tue Jul 21 11:27:02 2020] Inserting cache cache1 [Tue Jul 21 11:27:02 2020] Thread cas_mu_cache1 started [Tue Jul 21 11:27:02 2020] cache1: Metadata initialized [Tue Jul 21 11:27:02 2020] cache1: Successfully added [Tue Jul 21 11:27:02 2020] cache1: Cache mode : wt [Tue Jul 21 11:27:02 2020] Thread cas_io_1_0 started [Tue Jul 21 11:27:02 2020] Thread cas_io_1_1 started [Tue Jul 21 11:27:02 2020] Thread cas_io_1_2 started [Tue Jul 21 11:27:02 2020] Thread cas_io_1_3 started [Tue Jul 21 11:27:02 2020] Thread cas_io_1_4 started [Tue Jul 21 11:27:02 2020] Thread cas_io_1_5 started [Tue Jul 21 11:27:02 2020] Thread cas_io_1_6 started [Tue Jul 21 11:27:02 2020] Thread cas_io_1_7 started [Tue Jul 21 11:27:02 2020] Thread cas_io_1_8 started [Tue Jul 21 11:27:02 2020] Thread cas_io_1_9 started [Tue Jul 21 11:27:02 2020] Thread cas_io_1_10 started [Tue Jul 21 11:27:02 2020] Thread cas_io_1_11 started [Tue Jul 21 11:27:02 2020] Thread cas_io_1_12 started [Tue Jul 21 11:27:02 2020] Thread cas_io_1_13 started [Tue Jul 21 11:27:02 2020] Thread cas_io_1_14 started [Tue Jul 21 11:27:02 2020] Thread cas_io_1_15 started [Tue Jul 21 11:27:02 2020] Thread cas_io_1_16 started [Tue Jul 21 11:27:02 2020] Thread cas_io_1_17 started [Tue Jul 21 11:27:02 2020] Thread cas_io_1_18 started [Tue Jul 21 11:27:02 2020] Thread cas_io_1_19 started [Tue Jul 21 11:27:02 2020] Thread cas_io_1_20 started [Tue Jul 21 11:27:02 2020] Thread cas_io_1_21 started [Tue Jul 21 11:27:02 2020] Thread cas_io_1_22 started [Tue Jul 21 11:27:02 2020] Thread cas_io_1_23 started [Tue Jul 21 11:27:02 2020] Thread cas_io_1_24 started [Tue Jul 21 11:27:02 2020] Thread cas_io_1_25 started [Tue Jul 21 11:27:02 2020] Thread cas_io_1_26 started [Tue Jul 21 11:27:02 2020] Thread cas_io_1_27 started [Tue Jul 21 11:27:02 2020] Thread cas_io_1_28 started [Tue Jul 21 11:27:02 2020] Thread cas_io_1_29 started [Tue Jul 21 11:27:02 2020] Thread cas_io_1_30 started [Tue Jul 21 11:27:02 2020] Thread cas_io_1_31 started [Tue Jul 21 11:27:02 2020] Thread cas_io_1_32 started [Tue Jul 21 11:27:02 2020] Thread cas_io_1_33 started [Tue Jul 21 11:27:02 2020] Thread cas_io_1_34 started [Tue Jul 21 11:27:02 2020] Thread cas_io_1_35 started [Tue Jul 21 11:27:02 2020] Thread cas_io_1_36 started [Tue Jul 21 11:27:02 2020] Thread cas_io_1_37 started [Tue Jul 21 11:27:02 2020] Thread cas_io_1_38 started [Tue Jul 21 11:27:02 2020] Thread cas_io_1_39 started [Tue Jul 21 11:27:02 2020] Thread cas_mngt_1 started [Tue Jul 21 11:27:06 2020] cache1: Super block config offset : 0 kiB [Tue Jul 21 11:27:06 2020] cache1: Super block config size : 2196 B [Tue Jul 21 11:27:06 2020] cache1: Super block runtime offset : 128 kiB [Tue Jul 21 11:27:06 2020] cache1: Super block runtime size : 4 B [Tue Jul 21 11:27:06 2020] cache1: Reserved offset : 256 kiB [Tue Jul 21 11:27:06 2020] cache1: Reserved size : 128 kiB [Tue Jul 21 11:27:06 2020] cache1: Part config offset : 384 kiB [Tue Jul 21 11:27:06 2020] cache1: Part config size : 48 kiB [Tue Jul 21 11:27:06 2020] cache1: Part runtime offset : 512 kiB [Tue Jul 21 11:27:06 2020] cache1: Part runtime size : 4 kiB [Tue Jul 21 11:27:06 2020] cache1: Core config offset : 640 kiB [Tue Jul 21 11:27:06 2020] cache1: Core config size : 484 kiB [Tue Jul 21 11:27:06 2020] cache1: Core runtime offset : 1152 kiB [Tue Jul 21 11:27:06 2020] cache1: Core runtime size : 1172 kiB [Tue Jul 21 11:27:06 2020] cache1: Core UUID offset : 2432 kiB [Tue Jul 21 11:27:06 2020] cache1: Core UUID size : 16384 kiB [Tue Jul 21 11:27:06 2020] cache1: Cleaning offset : 18816 kiB [Tue Jul 21 11:27:06 2020] cache1: Cleaning size : 5659512 kiB [Tue Jul 21 11:27:06 2020] cache1: Eviction offset : 5678336 kiB [Tue Jul 21 11:27:06 2020] cache1: Eviction size : 3769324 kiB [Tue Jul 21 11:27:06 2020] cache1: Collision offset : 9447680 kiB [Tue Jul 21 11:27:06 2020] cache1: Collision size : 5659512 kiB [Tue Jul 21 11:27:06 2020] cache1: List info offset : 15107200 kiB [Tue Jul 21 11:27:06 2020] cache1: List info size : 8041220 kiB [Tue Jul 21 11:27:06 2020] cache1: Hash offset : 23148544 kiB [Tue Jul 21 11:27:06 2020] cache1: Hash size : 471168 kiB [Tue Jul 21 11:27:06 2020] cache1: Cache line size: 4 kiB [Tue Jul 21 11:27:06 2020] cache1: Metadata capacity: 24905 MiB [Tue Jul 21 11:27:19 2020] cache1: OCF metadata self-test PASSED [Tue Jul 21 11:27:22 2020] cache1: Loading cache state... [Tue Jul 21 11:27:22 2020] cache1: Loading Part runtime ERROR, invalid checksum [Tue Jul 21 11:27:22 2020] cache1: Metadata read FAILURE [Tue Jul 21 11:27:22 2020] cache1: Metadata Error [Tue Jul 21 11:27:22 2020] cache1: ERROR: Cannot load cache state [Tue Jul 21 11:27:23 2020] Thread cas_io_1_39 stopped [Tue Jul 21 11:27:23 2020] Thread cas_io_1_38 stopped [Tue Jul 21 11:27:23 2020] Thread cas_io_1_37 stopped [Tue Jul 21 11:27:23 2020] Thread cas_io_1_36 stopped [Tue Jul 21 11:27:23 2020] Thread cas_io_1_35 stopped [Tue Jul 21 11:27:23 2020] Thread cas_io_1_34 stopped [Tue Jul 21 11:27:23 2020] Thread cas_io_1_33 stopped [Tue Jul 21 11:27:23 2020] Thread cas_io_1_32 stopped [Tue Jul 21 11:27:23 2020] Thread cas_io_1_31 stopped [Tue Jul 21 11:27:23 2020] Thread cas_io_1_30 stopped [Tue Jul 21 11:27:23 2020] Thread cas_io_1_29 stopped [Tue Jul 21 11:27:23 2020] Thread cas_io_1_28 stopped [Tue Jul 21 11:27:23 2020] Thread cas_io_1_27 stopped [Tue Jul 21 11:27:23 2020] Thread cas_io_1_26 stopped [Tue Jul 21 11:27:23 2020] Thread cas_io_1_25 stopped [Tue Jul 21 11:27:23 2020] Thread cas_io_1_24 stopped [Tue Jul 21 11:27:23 2020] Thread cas_io_1_23 stopped [Tue Jul 21 11:27:23 2020] Thread cas_io_1_22 stopped [Tue Jul 21 11:27:23 2020] Thread cas_io_1_21 stopped [Tue Jul 21 11:27:23 2020] Thread cas_io_1_20 stopped [Tue Jul 21 11:27:23 2020] Thread cas_io_1_19 stopped [Tue Jul 21 11:27:23 2020] Thread cas_io_1_18 stopped [Tue Jul 21 11:27:23 2020] Thread cas_io_1_17 stopped [Tue Jul 21 11:27:23 2020] Thread cas_io_1_16 stopped [Tue Jul 21 11:27:23 2020] Thread cas_io_1_15 stopped [Tue Jul 21 11:27:23 2020] Thread cas_io_1_14 stopped [Tue Jul 21 11:27:23 2020] Thread cas_io_1_13 stopped [Tue Jul 21 11:27:23 2020] Thread cas_io_1_12 stopped [Tue Jul 21 11:27:23 2020] Thread cas_io_1_11 stopped [Tue Jul 21 11:27:23 2020] Thread cas_io_1_10 stopped [Tue Jul 21 11:27:23 2020] Thread cas_io_1_9 stopped [Tue Jul 21 11:27:23 2020] Thread cas_io_1_8 stopped [Tue Jul 21 11:27:23 2020] Thread cas_io_1_7 stopped [Tue Jul 21 11:27:23 2020] Thread cas_io_1_6 stopped [Tue Jul 21 11:27:23 2020] Thread cas_io_1_5 stopped [Tue Jul 21 11:27:23 2020] Thread cas_io_1_4 stopped [Tue Jul 21 11:27:23 2020] Thread cas_io_1_3 stopped [Tue Jul 21 11:27:23 2020] Thread cas_io_1_2 stopped [Tue Jul 21 11:27:23 2020] Thread cas_io_1_1 stopped [Tue Jul 21 11:27:23 2020] Thread cas_io_1_0 stopped [Tue Jul 21 11:27:23 2020] cache1: Cache cache1 successfully stopped [Tue Jul 21 11:27:23 2020] Thread cas_mngt_1 stopped [Tue Jul 21 11:27:23 2020] Thread cas_mu_cache1 stopped

gwnet commented 4 years ago

@yinlei-jinan buddy, can you run the nvme-cli tool to show us the nvme model you are using? install nvme-cli, then run sudo nvme list

gwnet commented 4 years ago

@yinlei-jinan buddy, I am working with our architect for this issue. we take it very seriously. can you collect this information to us? for intel ssd, share the all the VWC capability for all your drive, that is volatile write cache that in NVMe spec. you can use the isdct tool to dump it out for any intel ssd. example command is

sudo isdct show -identify -nvmecontroller -intelssd 1

-intelssd, need give the intel ssd number from 0-n

for non-intel ssd, please consult the vendor how to dump VWC capability for their drive.

yinlei-github commented 4 years ago

[root@root ~]# nvme list Node SN Model Namespace Usage Format FW Rev


/dev/nvme0n1 BTLJ004307Z32P0BGN INTEL SSDPE2KX020T8 1 2.00 TB / 2.00 TB 512 B + 0 B VDV10131 [root@root ~]#

yinlei-github commented 4 years ago

[root@root ~]# isdct show -identify -nvmecontroller -intelssd 0

PCI Vendor ID (VID) : 8086

PCI Subsystem Vendor ID (SSVID) : 8086

Serial Number : BTLJ004307Z32P0BGN

Model Number : INTEL SSDPE2KX020T8

Firmware Revision : VDV10131

Recommended Arbitration Burst (RAB) : 00

IEEE Organization Unique Identifier : E4D25C000500

Multi-Interface Capabilities : 0

Maximum Data Transfer Size : 5

Controller ID : 0

Version : 10200 Bit 31:24 - Tertiary Version : 0 Bit 23:16 - Minor Version : 2 Bit 15:0 - Major Version : 1

RTD3 Resume Latency : 2000000

RTD3 Entry Latency : 3000000

Optional Asynchronous Events Supported : 0 Bit 8 - Namespace Attribute Changed Support : 0

Reserved :

Optional Admin Command Support : E Bit 3 - Support Namespace Management and Attachment : 1 Bit 2 - Support Firmware Activate Download : 1 Bit 1 - Support Format NVM : 1 Bit 0 - Support Security Send Receive : 0

Abort Command Limit : 3

Asynchronous Event Request Limit : 3

Firmware Updates : 2 Bit 4 - Firmware Activation with No Reset : 0 Bit 3:1 - Number FW Slots Support : 1 Bit 0 - Is FW Slot Read Only : 0

Log Page Attributes : 2 Bit 0 - Support SMART Per Namespace : 0

Error Log Page Entries : 3F

Number of Power States Support : 0

Admin Vendor Specific Cmd Configuration : 0 Bit 0 - Admin Vendor Specific CMD Format : 0

Autonomous Power State Transition Attributes : 0 Bit 0 - Autonomous Power State Transitions Support : 0

Warning Composite Temperature Threshold : 343

Critical Composite Temperature Threshold : 353

Maximum Time Firmware Activation : 0

Host Memory Buffer Preferred Size : 0

Host Memory Buffer Minimum Size : 0

Total NVM Capacity : 0000000000000000000001D1C1116000

Unallocated NVM Capacity : 00000000000000000000000000000000

Replay Protected Memory Block Support : 0 Bit 31:24 - Access Size : 0 Bit 23:16 - Total Size : 0 Bit 5:3 - Authentication Method : 0 Bit 2:0 - Number of RPMB Units : 0

Reserved :

Submission Queue Entry Size : 66 Bit 7:4 - Maximum Entry Size : 6 Bit 3:0 - Minimum Entry Size : 6

Completion Queue Entry Size : 44 Bit 7:4 - Maximum Entry Size : 4 Bit 3:0 - Minimum Entry Size : 4

Reserved :

Number of Namespaces : 1

Optional NVM Command Support : 6 Bit 5 - Reservations Support : 0 Bit 4 - Save Field Support : 0 Bit 3 - Write Zeroes Support : 0 Bit 2 - Dataset Management Command Support : 1 Bit 1 - Write Uncorrectable Command Support : 1 Bit 0 - Compare Command Support : 0

Fused Operation Support : 0 Bit 0 - Compare and Write Fused Operation Support : 0

Format NVM Attributes : 4 Bit 2 - cryptographic Erase Support : 1 Bit 1 - secure Erase All Namespaces Support : 0 Bit 0 - format All Namespaces Support : 0

Volatile Write Cache : 0 Bit 0 - NVM Vendor Specific CMD Format : 0

Atomic Write Unit Normal : 0

Atomic Write Unit Power Fail : 0

NVM Vendor Specific Command Configuration : 0 Bit 0 - NVM Vendor Specific CMD Format : 0

Reserved :

Atomic Compare and Write Unit : 0

Reserved :

SGL Support : 0 Bit 18 - SGL Data Extended Length Support : 0 Bit 17 - Byte Aligned Contiguous Buffer Support : 0 Bit 16 - SGL Bit Bucket Descriptor Support : 0 Bit 0:1 - SGL Support for NVM Command Set : 0

Reserved :

Power State 0 Descriptor : Bit 183:182 - Active Power Scale : 0 Bit 178:176 - Active Power Workload : 0 Bit 175:160 - Active Power : 0 Bit 151:150 - Idle Power Scale : 0 Bit 143:128 - Idle Power : 0 Bit 124:120 - Relative Write Latency : 0 Bit 116:112 - Relative Write Throughput : 0 Bit 108:104 - Relative Read Latency : 0 Bit 100:96 - Relative Read Throughput : 0 Bit 95:64 - Exit Latency : 0 Bit 63:32 - Entry Latency : 0 Bit 25 - Non-Operational State : 0 Bit 24 - Maximum Power Scale : 0 Bit 15:0 - Maximum Power : 9C4

Stripe Size : 5 (128 KiB of user data)

Current PCIe Link Speed : 3

Negotiated PCIe Link Width : 4

Security Capabilities : 0

Security Status : 0

Bootloader Revision : 0203

[root@root ~]#

yinlei-github commented 4 years ago

There are the same problems in release v20.3(installed by open-cas-linux-v20.03.0.0286.tar.gz)

[root@root ~]# casadm -V ╔═════════════════════════╤═══════════════════╗ ║ Name │ Version ║ ╠═════════════════════════╪═══════════════════╣ ║ CAS Cache Kernel Module │ 20.03.00.00000721 ║ ║ CAS Disk Kernel Module │ 20.03.00.00000721 ║ ║ CAS CLI Utility │ 20.03.00.00000721 ║ ╚═════════════════════════╧═══════════════════╝ [root@root ~]#

gwnet commented 4 years ago

@yinlei-jinan thank you so much. our architect find the highly possible root cause, he is actively fix the issue. I will discuss with team to release one patch for you ASAP. I suggest that you can test other senario next week. when our patch is released, then we verify it again. sounds good?

gwnet commented 4 years ago

@yinlei-jinan we have released v20.3.1 https://github.com/Open-CAS/open-cas-linux/releases/tag/v20.3.1 could you please try this one? BTW, could you please let us know when you reproduce the issue, how and when you do the hard reset from your environment? can you be as much specific as you can? if you use any scripts to do that, can you share us?

yinlei-github commented 4 years ago

@gwnet we reset environment through IPMI hard reset API on web GUI.

gwnet commented 4 years ago

@yinlei-jinan hello buddy, do you have a chance to verify our 20.3.1 released software? if not, can we put this as low priority or close for now. if you find issue or need our further help, you can reopen then we can continue follow up. is this OK?

gwnet commented 4 years ago

yinlei-jinan currently do not have time to very this from his side.

gwnet commented 4 years ago

currently cannot reproduce this from internal lab. the issue should be fixed based on code review. go head close it for now. if anyone found similar issue with 20.3.1, we will check again.

gwnet commented 4 years ago

@yinlei-jinan can you close it now?