Open marboroman opened 3 months ago
/var/log/messages /var/log/kern.log /var/log/disk.log /var/log/synobootup.log
Please note: I used the nvmesystem plugin for this install. Maybe that plugin causes such problems.
wjz304, did you check my log files? Please help find out what happened.
Did you execute some other scripts? (For example, https://github.com/007revad/Synology_HDD_db)
Is there an exact time when the disk was lost? There are a lot of logs, and I can't be sure which ones are the key information.
13.06.2024 13:12 GMT+3
other scripts
misc: ""
acpid: ""
reboottoloader: ""
hdddb: ""
storagepanel: TOWER_5_Bay 1X2
# messages
2024-06-13T12:40:59+03:00 ASUNAS synostgd-disk[11390]: disk_monitor.c:289 The temperature[51] of /dev/nvme0n1 >= T2. (T1: -273, T2: -273)
2024-06-13T12:40:59+03:00 ASUNAS synostgd-disk[11392]: disk_monitor.c:289 The temperature[51] of /dev/nvme1n1 >= T2. (T1: -273, T2: -273)
2024-06-13T12:42:00+03:00 ASUNAS kernel: [ 1141.529751] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
2024-06-13T12:42:00+03:00 ASUNAS kernel: [ 1141.541278] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
2024-06-13T12:42:00+03:00 ASUNAS kernel: [ 1141.575029] nvme nvme0: Removing after probe failure status: -19
2024-06-13T12:42:00+03:00 ASUNAS kernel: [ 1141.591977] nvme nvme1: Removing after probe failure status: -19
The preset hard disk temperature thresholds are T1: -273 and T2: -273, which causes the temperature of 51 °C to be treated as exceeding the upper limit.
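A plausible reading of those -273 values (my assumption, not confirmed in this thread): the thresholds are stored internally in Kelvin, and an unset threshold of 0 K prints as -273 °C, so any real temperature exceeds it. A minimal sketch of the conversion:

```shell
# Assumption: thresholds are stored in integer Kelvin; an unset value of
# 0 K displays as -273 C, so every measured temperature is "over" it.
k_to_c() { echo $(( $1 - 273 )); }

k_to_c 0     # -> -273 (the T1/T2 values seen in the log)
k_to_c 324   # -> 51 (the reported composite temperature)
```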
Is this set by hdddb? How can it be fixed?
nvme get-feature /dev/nvme0
What information does it return?
And I checked the temperatures of those NVMes earlier. They were about 51 °C for the whole week, and stable.
I don't know; I haven't encountered this situation. In theory these values are related to the disk firmware.
# get
nvme get-feature /dev/nvme0
# set
nvme set-feature /dev/nvme0 -f 0x04 -v 343 -s 353
# temp
nvme smart-log /dev/nvme0 | grep temperature
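For context on the numbers in the `set-feature` command above (an assumption on my part, not stated in the thread): NVMe Feature 0x04 (Temperature Threshold) takes TMPTH in Kelvin, so 343 and 353 correspond to 70 °C and 80 °C. A minimal sketch of the conversion:

```shell
# NVMe Feature 0x04 (Temperature Threshold) expects TMPTH in Kelvin.
# Hypothetical helper: convert the Celsius threshold you want into the
# Kelvin value to pass to `nvme set-feature ... -v`.
c_to_k() { echo $(( $1 + 273 )); }

c_to_k 70   # -> 343 (the value used above)
c_to_k 80   # -> 353
```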
Let's try it first
nvme get-feature /dev/nvme0
feature-id required param
nvme set-feature /dev/nvme0 -f 0x04 -v 343 -s 353
NVMe Status:FEATURE_NOT_SAVEABLE(410d)
nvme smart-log /dev/nvme0 | grep temperature
temperature : 35 C
I just turned it on; that's why it's 35 °C for a while.
There is another problem: macs is empty. Did you delete it from the log, or is it empty by default?
i deleted sn and macs
OK, nothing, just some "Failed to connect server" errors.
2024-06-06T01:55:22+03:00 ASUNAS synostgdisk[12194]: space_pool_disk_compat.c:176 Failed to copy '/var/lib/space/pool_compatibility' to '/run/space/pool_compatibility'
2024-06-06T01:55:22+03:00 ASUNAS synostgdisk[12194]: space_pool_disk_compat.c:179 Failed to chmod : [/run/space/pool_compatibility] , errno=No such file or directory
2024-06-06T01:55:22+03:00 ASUNAS syno_disk_data_collector[13267]: smartctl/smartctl_nvme_smart_info_get.c:232 Failed to load attribute DB of disk /dev/nvme0n1
2024-06-06T01:55:22+03:00 ASUNAS syno_disk_data_collector[13424]: smartctl/smartctl_nvme_smart_info_get.c:232 Failed to load attribute DB of disk /dev/nvme0n1
2024-06-06T01:55:22+03:00 ASUNAS syno_disk_data_collector[13440]: smartctl/smartctl_nvme_smart_info_get.c:232 Failed to load attribute DB of disk /dev/nvme0n1
2024-06-06T01:55:22+03:00 ASUNAS syno_disk_data_collector[13267]: smartctl/smartctl_nvme_smart_info_get.c:232 Failed to load attribute DB of disk /dev/nvme1n1
2024-06-06T01:55:22+03:00 ASUNAS syno_disk_data_collector[13479]: smartctl/smartctl_nvme_smart_info_get.c:232 Failed to load attribute DB of disk /dev/nvme1n1
2024-06-06T01:55:22+03:00 ASUNAS syno_disk_data_collector[13486]: smartctl/smartctl_nvme_smart_info_get.c:232 Failed to load attribute DB of disk /dev/nvme1n1
2024-06-06T01:55:23+03:00 ASUNAS synostgdisk[13693]: disk/disk_sb_firm_status_get.cpp:141 Invalid json format: {"compatibility_interval":[{"barebone_installable":true,"compatibility":"support","fw_dsm_update_status_notify":false,"not_yet_rolling_status":"support","smart_attr_ignore":false,"smart_test_ignore":false}]}
2024-06-06T01:55:23+03:00 ASUNAS synostgdisk[13693]: disk/disk_sb_firm_status_get.cpp:349 Fail to get tmp firmware node
2024-06-06T01:55:23+03:00 ASUNAS synostgdisk[13693]: space_pool_disk_compat.c:176 Failed to copy '/var/lib/space/pool_compatibility' to '/run/space/pool_compatibility'
2024-06-06T01:55:23+03:00 ASUNAS synostgdisk[13693]: space_pool_disk_compat.c:179 Failed to chmod : [/run/space/pool_compatibility] , errno=No such file or directory
Try updating the DB.
It didn't download automatically, so I did a manual update with the file SynoOfflinePack-sa6400-787.sa. It said the database was updated, but the date didn't change; it is still 06.06.2024.
Look at the log; hdddb will still rewrite this file, but I don't know why "Invalid json format" appears.
Or try disabling the hdddb addon, update it again, and check the log or the return value of the nvme command to see whether T1/T2 is still -273.
OK, but which command shows the -273 now? We only saw -273 in the logs.
nvme get-feature /dev/nvme0 -f 04 -H
What is returned?
root@test:~# nvme set-feature /dev/nvme0 -f 0x04 -v 0x0146
set-feature:04 (Temperature Threshold), value:0x000146
root@test:~# nvme get-feature /dev/nvme0 -f 04 -H
get-feature:0x04 (Temperature Threshold), Current value: 0x000146
Threshold Type Select (THSEL): 0 - Over Temperature Threshold
Threshold Temperature Select (TMPSEL): 0 - Composite Temperature
Temperature Threshold (TMPTH): 53 C
root@test:~# nvme set-feature /dev/nvme0 -f 0x04 -v 0x0161
set-feature:04 (Temperature Threshold), value:0x000161
root@test:~# nvme get-feature /dev/nvme0 -f 04 -H
get-feature:0x04 (Temperature Threshold), Current value: 0x000161
Threshold Type Select (THSEL): 0 - Over Temperature Threshold
Threshold Temperature Select (TMPSEL): 0 - Composite Temperature
Temperature Threshold (TMPTH): 80 C
I tested it; it should be OK.
nvme get-feature /dev/nvme0 -f 04 -H
What is returned?
get-feature:0x04 (Temperature Threshold), Current value: 0x000165
Threshold Type Select (THSEL): 0 - Over Temperature Threshold
Threshold Temperature Select (TMPSEL): 0 - Composite Temperature
Temperature Threshold (TMPTH): 84 C
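That raw value can also be decoded by hand; per the NVMe spec, TMPTH occupies the low 16 bits of the feature value and is expressed in Kelvin:

```shell
# Decode the get-feature value shown above (0x000165).
VALUE=$(( 0x0165 ))
TMPTH_K=$(( VALUE & 0xFFFF ))   # TMPTH: bits 15:0, in Kelvin
TMPTH_C=$(( TMPTH_K - 273 ))
echo "${TMPTH_K} K = ${TMPTH_C} C"   # 357 K = 84 C
```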
The database is still 06.06.2024, and now it says it's up to date, although there is a message in Storage Manager saying that the firmware of the SSD is not recognized and the database needs to be upgraded. Totally strange behaviour. I will try to scan those SSDs with some kind of live CD.
Install ENV:
RR version:
RR: 24.6.0
modules:
misc: ""
acpid: ""
reboottoloader: ""
hdddb: ""
storagepanel: TOWER_5_Bay 1X2
lkms:
prod
DSM:
Issue:
Everything was fine for a week: stable operation without reboots. The bare-metal installation has a storage pool consisting of 2 NVMe disks. Suddenly tonight one NVMe dropped out. The pool kept running on the second disk in degraded status. I rebooted and the first disk appeared again, but the pool needed repair. After repair, everything worked fine for 3 hours, and then suddenly both NVMe disks dropped out at once. After the reboot, one of them appeared and the pool is still operating in degraded status. This is some kind of disaster; everything worked fine for a week!
Logs: please let me know which log files are needed to investigate...