NVMe drives started to disconnect randomly

marboroman commented 3 months ago

Install ENV:

CPU: i3-9100T
NIC: (pid & vid) 8125r

RR version:

RR: 24.6.0
modules:
misc: ""
acpid: ""
reboottoloader: ""
hdddb: ""
storagepanel: TOWER_5_Bay 1X2
lkms:
prod

DSM:

model: SA6400
version: DSM 7.2.1-69057 Update 5

Issue:

Everything was fine for a week: stable operation without reboots. The bare-metal installation has a storage pool consisting of 2 NVMe disks. Suddenly, tonight one NVMe disk dropped out. The pool kept running on the second disk in degraded status. I rebooted and the first disk appeared again, but the pool needed repair. After the repair, everything worked fine for 3 hours, and then both NVMe disks dropped out at once. After the reboot, one of them appeared and the pool is still operating in degraded status. This is some kind of disaster. Everything worked fine for a week!

logs: please let me know which log files are needed to investigate.

wjz304 commented 3 months ago

/var/log/messages /var/log/kern.log /var/log/disk.log /var/log/synobootup.log

marboroman commented 3 months ago

logs.zip

Please note: I used the nvmesystem plugin for this install. Maybe that plugin is causing these problems.

marboroman commented 2 months ago

wjz304, did you check my log files? Please help to find out what happened.

wjz304 commented 2 months ago

Did you execute some other scripts? (For example, https://github.com/007revad/Synology_HDD_db)

wjz304 commented 2 months ago

Is there an exact time when the disk was lost? There are a lot of logs, and I can't be sure which ones are the key information.

marboroman commented 2 months ago

Is there an exact time when the disk was lost? There are a lot of logs, and I can't be sure which ones are the key information.

13.06.2024 13:12 GMT+3

other scripts

misc: "" acpid: "" reboottoloader: "" hdddb: "" storagepanel: TOWER_5_Bay 1X2

hdddb was always installed, but I didn't run it manually.
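
With a reported time in hand, the log files can be narrowed to the entries around it. The following is a minimal Python sketch, assuming each line starts with an ISO 8601 timestamp that carries a UTC offset, as in the DSM log excerpts in this thread. The function name `lines_near` and its parameters are hypothetical, and the two sample lines are copied from the logs quoted in this issue.

```python
from datetime import datetime, timedelta


def lines_near(lines, when, minutes=30, keywords=("nvme",)):
    """Return log lines within +/- `minutes` of `when` that mention a keyword.

    Assumes each line starts with an ISO 8601 timestamp with a UTC offset.
    """
    window = timedelta(minutes=minutes)
    matches = []
    for line in lines:
        stamp, _, rest = line.partition(" ")
        try:
            ts = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # not a timestamped line
        if ts.tzinfo is None:
            continue  # naive and offset-aware timestamps cannot be compared
        if abs(ts - when) <= window and any(k in rest for k in keywords):
            matches.append(line)
    return matches


# 13.06.2024 13:12 GMT+3, the time reported above
reported = datetime.fromisoformat("2024-06-13T13:12:00+03:00")

sample = [
    "2024-06-13T12:42:00+03:00 ASUNAS kernel: [ 1141.529751] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10",
    "2024-06-06T01:55:22+03:00 ASUNAS syno_disk_data_collector[13267]: smartctl/smartctl_nvme_smart_info_get.c:232 Failed to load attribute DB of disk /dev/nvme0n1",
]

for hit in lines_near(sample, reported, minutes=45):
    print(hit)
```

Only the kernel line from 13 June falls inside the 45-minute window; the entry from 6 June is skipped even though it mentions an NVMe device.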

wjz304 commented 2 months ago

# messages
2024-06-13T12:40:59+03:00 ASUNAS synostgd-disk[11390]: disk_monitor.c:289 The temperature[51] of /dev/nvme0n1 >= T2. (T1: -273, T2: -273)
2024-06-13T12:40:59+03:00 ASUNAS synostgd-disk[11392]: disk_monitor.c:289 The temperature[51] of /dev/nvme1n1 >= T2. (T1: -273, T2: -273)
2024-06-13T12:42:00+03:00 ASUNAS kernel: [ 1141.529751] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
2024-06-13T12:42:00+03:00 ASUNAS kernel: [ 1141.541278] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
2024-06-13T12:42:00+03:00 ASUNAS kernel: [ 1141.575029] nvme nvme0: Removing after probe failure status: -19
2024-06-13T12:42:00+03:00 ASUNAS kernel: [ 1141.591977] nvme nvme1: Removing after probe failure status: -19

The preset values of the hard disk temperature are T1: -273, T2: -273, which causes the temperature 51 to be considered to exceed the upper limit.
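
A value of -273 is what 0 kelvin becomes after a kelvin-to-Celsius conversion, so one plausible reading (an assumption, not something the logs confirm) is that no thresholds are reported at all and the monitor compares against the converted zero. A minimal Python sketch of that arithmetic follows; the function names are hypothetical and are not taken from DSM.

```python
# Hypothetical illustration; DSM's actual disk_monitor.c logic is not shown in the logs.

def kelvin_to_celsius(kelvin: int) -> int:
    """Integer conversion commonly applied to NVMe temperature fields."""
    return kelvin - 273


def exceeds_threshold(temp_c: int, threshold_c: int) -> bool:
    """Naive check matching the log wording 'temperature >= T2'."""
    return temp_c >= threshold_c


# An unset threshold stored as 0 K converts to -273 C.
unset_threshold_c = kelvin_to_celsius(0)
print(unset_threshold_c)                               # -273
print(exceeds_threshold(51, unset_threshold_c))        # True: any real reading "exceeds" it
print(exceeds_threshold(51, kelvin_to_celsius(353)))   # False against an 80 C threshold
```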

marboroman commented 2 months ago

The preset values of the hard disk temperature are T1: -273, T2: -273, which causes the temperature 51 to be considered to exceed the upper limit.

Is it set by hdddb? How do I fix it?

wjz304 commented 2 months ago

Run nvme get-feature /dev/nvme0 and see what information is returned.

marboroman commented 2 months ago

I also checked the temperatures of those NVMe drives earlier. They were about 51 C for the whole week, and everything was stable.

wjz304 commented 2 months ago

I don't know; I haven't encountered this situation. In theory, these values are related to the disk firmware.

# get
nvme get-feature /dev/nvme0

# set
nvme set-feature /dev/nvme0 -f 0x04 -v 343 -s 353

# temp
nvme smart-log /dev/nvme0 | grep temperature

Let's try it first

marboroman commented 2 months ago

nvme get-feature /dev/nvme0

feature-id required param

marboroman commented 2 months ago

nvme set-feature /dev/nvme0 -f 0x04 -v 343 -s 353

NVMe Status:FEATURE_NOT_SAVEABLE(410d)
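
The status value can be decoded by hand. The sketch below assumes nvme-cli prints the 15-bit completion status field with the layout from the NVMe base specification: status code in bits 7:0, status code type in bits 10:8, and Do Not Retry in bit 14. In nvme-cli, `-s` is the short form of `--save`, so the command above asked the controller to persist the value, which this drive appears to refuse. The helper name is hypothetical.

```python
def decode_nvme_status(status: int) -> dict:
    """Split a 15-bit NVMe completion status into its main fields."""
    return {
        "sc": status & 0xFF,            # Status Code
        "sct": (status >> 8) & 0x7,     # Status Code Type (1 = command specific)
        "dnr": bool(status & 0x4000),   # Do Not Retry
    }


fields = decode_nvme_status(0x410D)
print(fields)  # {'sc': 13, 'sct': 1, 'dnr': True}
```

Status code type 1 with status code 0x0d is "Feature Identifier Not Saveable", which matches the name printed by the tool. Running the command without `-s`, as in the test later in this thread, avoids the save request.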

marboroman commented 2 months ago

nvme smart-log /dev/nvme0 | grep temperature

temperature : 35 C

marboroman commented 2 months ago

I just turned it on, which is why it's at 35 for now.

wjz304 commented 2 months ago

There is another problem: macs is empty. Did you delete it from the log, or is it empty by default?

marboroman commented 2 months ago

I deleted sn and macs.

wjz304 commented 2 months ago

I deleted sn and macs.

OK, nothing, just some "Failed to connect server" errors.

wjz304 commented 2 months ago

2024-06-06T01:55:22+03:00 ASUNAS synostgdisk[12194]: space_pool_disk_compat.c:176 Failed to copy '/var/lib/space/pool_compatibility' to '/run/space/pool_compatibility'
2024-06-06T01:55:22+03:00 ASUNAS synostgdisk[12194]: space_pool_disk_compat.c:179 Failed to chmod : [/run/space/pool_compatibility] , errno=No such file or directory
2024-06-06T01:55:22+03:00 ASUNAS syno_disk_data_collector[13267]: smartctl/smartctl_nvme_smart_info_get.c:232 Failed to load attribute DB of disk /dev/nvme0n1
2024-06-06T01:55:22+03:00 ASUNAS syno_disk_data_collector[13424]: smartctl/smartctl_nvme_smart_info_get.c:232 Failed to load attribute DB of disk /dev/nvme0n1
2024-06-06T01:55:22+03:00 ASUNAS syno_disk_data_collector[13440]: smartctl/smartctl_nvme_smart_info_get.c:232 Failed to load attribute DB of disk /dev/nvme0n1
2024-06-06T01:55:22+03:00 ASUNAS syno_disk_data_collector[13267]: smartctl/smartctl_nvme_smart_info_get.c:232 Failed to load attribute DB of disk /dev/nvme1n1
2024-06-06T01:55:22+03:00 ASUNAS syno_disk_data_collector[13479]: smartctl/smartctl_nvme_smart_info_get.c:232 Failed to load attribute DB of disk /dev/nvme1n1
2024-06-06T01:55:22+03:00 ASUNAS syno_disk_data_collector[13486]: smartctl/smartctl_nvme_smart_info_get.c:232 Failed to load attribute DB of disk /dev/nvme1n1
2024-06-06T01:55:23+03:00 ASUNAS synostgdisk[13693]: disk/disk_sb_firm_status_get.cpp:141 Invalid json format: {"compatibility_interval":[{"barebone_installable":true,"compatibility":"support","fw_dsm_update_status_notify":false,"not_yet_rolling_status":"support","smart_attr_ignore":false,"smart_test_ignore":false}]}
2024-06-06T01:55:23+03:00 ASUNAS synostgdisk[13693]: disk/disk_sb_firm_status_get.cpp:349 Fail to get tmp firmware node
2024-06-06T01:55:23+03:00 ASUNAS synostgdisk[13693]: space_pool_disk_compat.c:176 Failed to copy '/var/lib/space/pool_compatibility' to '/run/space/pool_compatibility'
2024-06-06T01:55:23+03:00 ASUNAS synostgdisk[13693]: space_pool_disk_compat.c:179 Failed to chmod : [/run/space/pool_compatibility] , errno=No such file or directory

wjz304 commented 2 months ago

Try updating DB

marboroman commented 2 months ago

It didn't download automatically, so I did a manual update with the file SynoOfflinePack-sa6400-787.sa. It reported that the database was updated, but the date didn't change; it is still 06.06.2024.

wjz304 commented 2 months ago

According to the log, hdddb still rewrites this file, but I don't know why "Invalid json format" appears.

wjz304 commented 2 months ago

Or try disabling the hdddb addon, update the database again, and check the log or the output of the nvme command to see whether T1/T2 is still -273.

marboroman commented 2 months ago

OK, but which command shows -273 now? We only saw -273 in the logs.

wjz304 commented 2 months ago

What does nvme get-feature /dev/nvme0 -f 04 -H return?

wjz304 commented 2 months ago

root@test:~# nvme set-feature /dev/nvme0 -f 0x04 -v 0x0146
set-feature:04 (Temperature Threshold), value:0x000146
root@test:~# nvme get-feature /dev/nvme0 -f 04 -H
get-feature:0x04 (Temperature Threshold), Current value: 0x000146
        Threshold Type Select         (THSEL): 0 - Over Temperature Threshold
        Threshold Temperature Select (TMPSEL): 0 - Composite Temperature
        Temperature Threshold         (TMPTH): 53 C
root@test:~# nvme set-feature /dev/nvme0 -f 0x04 -v 0x0161 
set-feature:04 (Temperature Threshold), value:0x000161
root@test:~# nvme get-feature /dev/nvme0 -f 04 -H
get-feature:0x04 (Temperature Threshold), Current value: 0x000161
        Threshold Type Select         (THSEL): 0 - Over Temperature Threshold
        Threshold Temperature Select (TMPSEL): 0 - Composite Temperature
        Temperature Threshold         (TMPTH): 80 C

I tested it, it should be OK
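
The hex values passed to `-v` follow from the Temperature Threshold feature layout in the NVMe base specification: TMPTH in bits 15:0 (in kelvin), TMPSEL in bits 19:16, and THSEL in bits 21:20. A minimal Python sketch that converts in both directions is shown here; the function names are hypothetical, and the integer offset of 273 is chosen because it reproduces the Celsius values printed in this thread.

```python
def decode_temperature_threshold(value: int) -> dict:
    """Decode the value of NVMe feature 0x04 (Temperature Threshold)."""
    return {
        "tmpth_c": (value & 0xFFFF) - 273,  # threshold, kelvin converted to Celsius
        "tmpsel": (value >> 16) & 0xF,      # 0 = Composite Temperature
        "thsel": (value >> 20) & 0x3,       # 0 = Over Temperature Threshold
    }


def encode_over_temperature(celsius: int) -> int:
    """Build a value for the composite over-temperature threshold."""
    return (celsius + 273) & 0xFFFF


# Values that appear in this thread
for raw in (0x0146, 0x0161, 0x0165):
    print(hex(raw), decode_temperature_threshold(raw)["tmpth_c"], "C")

print(hex(encode_over_temperature(80)))  # 0x161
```

The three values decode to 53 C, 80 C, and 84 C.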

marboroman commented 2 months ago

What does nvme get-feature /dev/nvme0 -f 04 -H return?

get-feature:0x04 (Temperature Threshold), Current value: 0x000165
        Threshold Type Select         (THSEL): 0 - Over Temperature Threshold
        Threshold Temperature Select (TMPSEL): 0 - Composite Temperature
        Temperature Threshold         (TMPTH): 84 C

marboroman commented 2 months ago

The database is still dated 06.06.2024, and now it says it is up to date, although there is a message in Storage Manager saying that the SSD firmware is not recognized and the database needs to be updated. Totally strange behaviour. I will try to scan those SSDs with some kind of live CD.

RROrg / rr

NVMe drives started to disconnect randomly #1185