RedPill-TTG / redpill-lkm

Linux kernel module for RedPill
GNU General Public License v3.0

Virtual SATA Disk (ESXi) Health Status Not Supported (Preventing Storage Pool Creation) #14

Closed ilovepancakes95 closed 3 years ago

ilovepancakes95 commented 3 years ago

Can install DSM 7 (3615xs) just fine on ESXi with the latest lkm release; however, when I try to create a storage pool/volume in DSM, despite the virtual SATA disk showing up okay in Storage Manager, DSM blocks use of it because the "Health" status is "Not Supported". I know an actual "health status" and even SMART will not work/be useful with virtual disks, but DSM 7 will at least need to think the virtual disk has a status of "Healthy" in order to let the disk be used. In Jun's loader, the health status shows "Healthy" for virtual disks, even though when you click "Health Info" it shows "Access Error" and no actual health stats.

I am assuming there is a flag somewhere that gets written in DSM indicating whether a disk supports health status or not, and then the actual status of the disk. I did some initial digging and found that "/run/synostorage/disks/sdb" contains files that appear to show disk information and certain compatibility flags. While nothing in there says "health", I compared these files with a real DSM 7 system and changed some of them to reflect "normal", "supported", etc. in the appropriate places. This doesn't produce any immediate changes in DSM or Storage Manager for the disk, so I tried rebooting DSM, but the values in that folder revert. I am assuming there will need to be a way to shim synostorage into thinking the disks are all healthy before DSM is fully loaded. Not sure where to go from here. Some pictures are below.

Screen Shot 2021-08-14 at 4 08 10 PM

Screen Shot 2021-08-14 at 4 08 18 PM

Screen Shot 2021-08-16 at 11 11 37 AM

ttg-public commented 3 years ago

Hm, we don't think the emulation will be needed, as these are normally populated. Good catch that they're not anymore. It may be a problem with a failing hddmon module, or some new mfgBIOS shims may be needed. We're exploring that (or actually will, just after merging v7 on 918).

Edit: another idea (we didn't test it yet), but maybe the problem is even simpler: the disk is beyond the standard supported number of disks in the UI for the 3615. After all, we're artificially forcing more disks than the 3615 supports. That idea comes from the fact that 918+ actually does work and creates arrays... or maybe we hit some accidental bug, as v7 on 918 and v7 on 3615 run a very different kernel and mfgBIOS, hmm...

ilovepancakes95 commented 3 years ago

Edit: another idea (we didn't test it yet), but maybe the problem is even simpler: the disk is beyond the standard supported number of disks in the UI for the 3615. After all, we're artificially forcing more disks than the 3615 supports. That idea comes from the fact that 918+ actually does work and creates arrays... or maybe we hit some accidental bug, as v7 on 918 and v7 on 3615 run a very different kernel and mfgBIOS, hmm...

I'm not sure I follow... you mean because the disk shows up as Disk 2 and nothing shows as Disk 1? The Storage Manager UI does show a graphic of a DS3615xs with 12 drive slots. Is there something in the backend of lkm you're saying that will allow more than 12 disks?

From the way Storage Manager words the error when creating a pool, combined with the way virtual disks appear and work on Jun's loaders (if we can assume for a second nothing else major changed in DSM 7), I think getting DSM to simply report "Healthy" on its own in the below image instead of "Not Supported" for health status should be enough to let the pool creation work. I can't pinpoint where in DSM the decision is being made to mark a disk "Not Supported" vs "Healthy" in Storage Manager.

Screen Shot 2021-08-17 at 9 29 30 AM

ttg-public commented 3 years ago

I'm not sure I follow... you mean because the disk shows up as Disk 2 and nothing shows as Disk 1? The Storage Manager UI does show a graphic of a DS3615xs with 12 drive slots. Is there something in the backend of lkm you're saying that will allow more than 12 disks?

Yes, we set it to 15, similarly to how Jun's loader set it, as it was confirmed to be working with 15 drives + boot (but with our kernel-mode fix it would probably work with the full 16 too). However, that was always a hack and DSM 7 may have a problem with it, as even the graphic doesn't fit (as you said, it shows 12).
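
(Side note for anyone poking at this themselves: the forced drive count ends up in synoinfo.conf. A quick, read-only way to see the values on a running box is sketched below; the key names are the commonly adjusted ones, and exactly which of them the loader patches is an assumption, not something stated in this thread.)

# Hedged sketch: show the drive-count related settings DSM is running with.
grep -E '^(maxdisks|internalportcfg|esataportcfg|usbportcfg)=' /etc.defaults/synoinfo.conf /etc/synoinfo.conf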

From the way Storage Manager words the error when creating a pool, combined with the way virtual disks appear and work on Jun's loaders (if we can assume for a second nothing else major changed in DSM 7), I think getting DSM to simply report "Healthy" on its own in the below image instead of "Not Supported" for health status should be enough to let the pool creation work. I can't pinpoint where in DSM the decision is being made to mark a disk "Not Supported" vs "Healthy" in Storage Manager.

The thing is the healthy/not healthy status does work on v6 (and shows healthy on v6). Additionally v6 doesn't have the hddmon module. So it suggests that v7 did change something.

ilovepancakes95 commented 3 years ago

The thing is the healthy/not healthy status does work on v6 (and shows healthy on v6).

Do you know where in DSM 6 the decision is made by the system to mark a drive "Healthy" or not? I did some preliminary searching around but can't find where that code is implemented. Also, does creating storage pools with virtual disks on Proxmox work okay? I don't have a Proxmox system to test with.

ttg-public commented 3 years ago

From preliminary grepping around on v6:

  1. The webui side calls API method SYNO.Storage.CGI.HddMan
  2. The method seems to be defined in /usr/syno/synoman/webapi/SYNO.Storage.CGI.lib (a JSON file)
  3. The definition file lists /usr/syno/synoman/webapi/lib/libStorage.so as the target for execution of that method
  4. Looking at ldd for that .so reveals quite a list. However, the following ones, guessing by names, may contain what ACTUALLY checks the disk for health:

    • libsynostoragewebutils.so
    • libsynostorage.so.6
    • libsynostoragemgmt.so
    • libsynostoragecore.so.6
    • libhwcontrol.so
    • libstoragemanager.so

One crucial detail worth mentioning is that it doesn't report the disk as failing or anything like that, but as "Not Supported", which indicates it couldn't access that info rather than that it determined the disk is not suitable.
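
(For anyone retracing this walk on their own box, a rough sketch; the webapi path is the v6 one listed above, while the library locations and the strings grepped for are guesses rather than confirmed names:)

# List what the storage webapi library links against, then count health/status
# related strings in the candidate libraries named above.
ldd /usr/syno/synoman/webapi/lib/libStorage.so
for lib in libsynostoragewebutils.so libsynostorage.so.6 libsynostoragemgmt.so libsynostoragecore.so.6 libhwcontrol.so libstoragemanager.so; do
    f=$(find /usr/lib /usr/lib64 -name "$lib" 2>/dev/null | head -1)   # location is an assumption
    [ -n "$f" ] && printf '%s: %s matches\n' "$f" "$(grep -aciE 'not.?supported|smart|health' "$f")"
done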

As for Proxmox + 3615xs + v7, we could SWEAR it was broken and showed the drive as "Not Supported", but we just generated a fresh image, installed v7 and... it does show the drive as Healthy and creates the pool without a problem. The only warning/"error" we get is the standard one about the disk not being on the HCL.

p.s. As for the broken General tab... it's a JS error - the API returns data but the JS on that tab fails and an error is logged to the standard Chrome web console.

ilovepancakes95 commented 3 years ago

I've made some progress on figuring out what's going on here and was able to create a storage pool with a hacky kind of method! I booted up Kali and Burp Suite to try and pinpoint where the DSM GUI is making the request for disk health status. Full write-up below, with some further notes in between the screenshots.

Turns out, when you open Storage Manager, webapi requests are made to the underlying DSM for disk info. I used Burp to intercept the server response from DSM before the GUI picked it up. Noticed there was one response with a boatload of info about the sdb disk (the one data disk I have added to the ESXi VM). Trial and error revealed that changing the intercepted "overview_status" value from "unknown" to "normal" (highlighted with red box) before the GUI received the response works to make the GUI report "Healthy" on the disk. I changed nothing else in the intercepted server response.

Screen Shot 2021-08-17 at 8 27 15 PM

Screen Shot 2021-08-17 at 8 27 41 PM

Screen Shot 2021-08-17 at 8 27 58 PM

Once the GUI reported a "Healthy" disk, Storage Manager let me continue to create a storage pool as normal.

Screen Shot 2021-08-17 at 8 29 13 PM

It does complain about the disk not being on Synology's compatibility list but lets me continue past that error.

Screen Shot 2021-08-17 at 8 29 41 PM

Screen Shot 2021-08-17 at 8 30 01 PM

Screen Shot 2021-08-17 at 8 30 08 PM

Screen Shot 2021-08-17 at 8 30 15 PM

Then I created a shared folder on the volume and it works! Lets me write data, create folders, etc.

Screen Shot 2021-08-17 at 8 31 40 PM

Here's the thing though... Once I stop the intercept in Burp Suite, or once Storage Manager refreshes and pulls the disk info again (such as after the new volume is made), the "overview_status" value of course reverts to "unknown", making the disk's Health Status show "Not Supported" again. BUT, since the volume is already created, DSM allows it to still be used, albeit with a warning in Storage Manager about this "abnormal" status.

Screen Shot 2021-08-17 at 8 19 57 PM

In conclusion, I believe the check in DSM 7 preventing volumes from being created on ESXi is done simply in the GUI/web interface, not on the backend, because changing the HTTP response let the process continue. BUT, of course, this still raises the question of why DSM is returning "unknown" in "overview_status" in the first place. What is generating that response that gets sent to the GUI Storage Manager? Find that out and shim it to always report "normal" instead, and I think we have our fix. That being said, after this whole process I have not looked at the backend DSM logs or anything else to see if it still complains about the volume being hacked together with the above method.
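
(A possible next step, sketched under the assumption that the /run/synostorage files mentioned in the first post are what feeds this status; the file names in that directory vary between DSM versions:)

# Dump what the storage backend itself exposes for the disk, before the web UI
# gets involved; if one of these files carries the "unknown" status, whatever
# writes it is the thing to shim.
ls -l /run/synostorage/disks/sdb/
for f in /run/synostorage/disks/sdb/*; do printf '%s: ' "$f"; cat "$f"; echo; done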

Scoobdriver commented 3 years ago

Also have this issue on ESXi 6.7 (bromolow 7).

MartinKuhl commented 3 years ago

Me too with Parallels on bromolow with DSM 7

ilovepancakes95 commented 3 years ago

One crucial detail worth mentioning is that it doesn't report the disk as failing or anything like that, but as "Not Supported", which indicates it couldn't access that info rather than that it determined the disk is not suitable.

Noticed you posted your reply while I was writing up my research, where I got storage pool creation to work by intercepting some HTTP traffic and modifying a JSON response (see https://github.com/RedPill-TTG/redpill-lkm/issues/14#issuecomment-900732887). This lines up with what you're saying about DSM simply not being able to access the right info to mark the disk as healthy.

I was grepping around on DSM 7 and noticed it contains similar sets of files to the ones you mention for the synostorage module and webapi, so hopefully not too much has changed between 6 and 7.

OrpheeGT commented 3 years ago

@ilovepancakes95 your analysis is very interesting.

With my current Jun's loader 6.2.3 with LSI card passthrough, the disks are detected as real disks, with SMART data working.

It would be interesting to test LSI passthrough on 7.0 bromolow to check how disks are handled, but I don't have spare disks to try, and I'm not ready to risk my production data...

dperez-sct commented 3 years ago

A few instructions to work around this; adapt to your own configuration:

  1. Enable SSH and connect to Syno.
  2. Locate your disk and verify whether the "Syno layout" has been created. If it has, skip to step 3.1. If the disk is empty, go to step 3.2:
/dev/sdb  /dev/sdb1  /dev/sdb2
ash-4.4# fdisk -l /dev/sdb
Disk /dev/sdb: 300 GiB, 322122547200 bytes, 629145600 sectors
Disk model: Virtual SATA Hard Drive
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xbd7c953e

Device     Boot   Start     End Sectors  Size Id Type
/dev/sdb1          2048 4982527 4980480  2.4G fd Linux raid autodetect
/dev/sdb2       4982528 9176831 4194304    2G fd Linux raid autodetect

3.1. Create last partition using free space (Only if Syno layout already exists):

ash-4.4# fdisk /dev/sdb

Welcome to fdisk (util-linux 2.33.2).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.

Command (m for help): n
Partition type
   p   primary (2 primary, 0 extended, 2 free)
   e   extended (container for logical partitions)
Select (default p):

Using default response p.
Partition number (3,4, default 3):
First sector (9176832-629145599, default 9177088):
Last sector, +/-sectors or +/-size{K,M,G,T,P} (9177088-629145599, default 629145599):

Created a new partition 3 of type 'Linux' and of size 295.6 GiB.

Command (m for help): w
The partition table has been altered.
Syncing disks.

3.2. Create Syno layout, only if the disk is empty:

synopartition --part /dev/sdb 12
  4. Create a Basic disk (only tested this; it's surely possible to create RAID 5 or RAID 6, but not tried):

    ash-4.4# mdadm --create /dev/md2 --level=1 --raid-devices=1 --force /dev/sdb3
    mdadm: Note: this array has metadata at the start and
    may not be suitable as a boot device.  If you plan to
    store '/boot' on this device please ensure that
    your boot-loader understands md/v1.x metadata, or use
    --metadata=0.90
    Continue creating array? y
    mdadm: Defaulting to version 1.2 metadata
    mdadm: array /dev/md2 started.

    (if md2 already exists, you should use the next md number; the same with sd*3, of course)

  5. Create the filesystem:

    
    ash-4.4# mkfs.btrfs -f /dev/md2
    btrfs-progs v4.0
    See http://btrfs.wiki.kernel.org for more information.

    WARNING: failed to open /dev/btrfs-control, skipping device registration: No such file or directory
    Label:              (null)
    UUID:               a8d5d1c8-9557-4e46-9e97-ced58471a89e
    Node size:          16384
    Sector size:        4096
    Filesystem size:    295.62GiB
    Block group profiles:
      Data:             single            8.00MiB
      Metadata:         DUP               1.01GiB
      System:           DUP              12.00MiB
    SSD detected:       no
    Incompat features:  extref, skinny-metadata
    Number of devices:  1
    Devices:
       ID        SIZE  PATH
        1   295.62GiB  /dev/md2



6. Finally, reboot. Once rebooted, you should see a new Available Pool in Storage Manager. You only need to click "Online Assemble" and apply to start using it :D

ttg-public commented 3 years ago

TL;DR: it's SMART. v7 requires it. We went ahead and wrote an emulator of SMART for disks without it :D

The shim is part of the commit https://github.com/RedPill-TTG/redpill-lkm/commit/d032ac48c00f78b11516f2f2112970bc51bf68be - can you guys try the newest release and report if it solves your issues? We've tested on ESXi 7.0.2 and it seems to be working flawlessly without any hacks.
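
(To sanity-check the shim from an SSH shell, a short sketch based on the smartctl invocation mentioned further down in this thread; /dev/sdb stands in for whichever data disk you have:)

# Query the drive the same way syno does (ATA device type); with the SMART shim
# active, the identify and attribute sections should come back populated.
smartctl -d ata -i /dev/sdb
smartctl -d ata -A /dev/sdb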

OrpheeGT commented 3 years ago

Hello,

I'm trying to play with new build.

I have a dedicated virtual SATA controller for loader (SATA0:1) and another virtual SATA controller for disks (SATA1:0)

It worked with this before

Here are my current settings:

{ "extra_cmdline": { "pid": "0x0001", "vid": "0x46f4", "sn": "1330LWNXXXXXX", "mac1": "001132XXXXXX", "DiskIdxMap": "1000", "SataPortMap": "4", }, "synoinfo": { "supportsystemperature": "no", "supportsystempwarning": "no" }, "ramdisk_copy": {} }

I want to use a SAS HBA IT controller, so I prepared a loader with supportsas enabled, but currently the SAS card is not plugged into the VM. Only a virtual disk is added to virtual SATA1.

If I add "supportsas": "yes", in synoinfo No disk detected at all at boot, can't install DSM.

If I remove the supportsas line, the disk is detected but the install fails at 55%.

Once booted, the serial log is spammed with: parameter error. gpiobase=00000000, pin=4, pValue=ffff88013549f9bc

7.0_21.log

Serial console log in attachment

thanks

OrpheeGT commented 3 years ago

Actually, it works with the 6.2.4 loader: the disk is available for DSM install and can be used inside the DSM GUI. I will build 7.0 again to check...

6.2.4_21.log

Edit: I confirm it fails on 7.0. I also tried 7.0.1 from jumkey and the same happens: 7.0.1_21.log

kaileu commented 3 years ago

Now it works like a charm even on 7.0.1

Scoobdriver commented 3 years ago

disk is detected but install fails at 55%

I'm seeing the same as @OrpheeGT
on both apollolake and bromolow, 7.0 and 7.0.1

ilovepancakes95 commented 3 years ago

I too am getting the same as @OrpheeGT and @Scoobdriver: trying to install DSM 7 (3615xs) now fails again at 55% after working flawlessly in previous releases. I confirmed I am using the SATA boot menu option, but /var/log/messages shows the following:

Sep 13 16:37:33 install.cgi: Pass checksum of /tmpData/upd@te...
Sep 13 16:37:33 updater: updater.c:6504 Start of the updater...
Sep 13 16:37:33 updater: updater.c:3120 orgBuildNumber = 41222, newBuildNumber=41222
Sep 13 16:37:33 updater: util/updater_util.cpp:86 fail to read company in /tmpRoot//etc.defaults/synoinfo.conf
Sep 13 16:37:33 updater: updater.c:6782 ==== Start flash update ====
Sep 13 16:37:33 updater: updater.c:6786 This is X86 platform
Sep 13 16:37:33 updater: boot/boot_lock.c(228): failed to mount boot device /dev/synoboot2 /tmp/bootmnt (errno:2)
Sep 13 16:37:33 updater: updater.c:6247 Failed to mount boot partition
Sep 13 16:37:33 updater: updater.c:3027 No need to reset reason for v.41222 
Sep 13 16:37:33 updater: updater.c:7389 Failed to accomplish the update! (errno = 21)
Sep 13 16:37:33 install.cgi: ninstaller.c:1454 Executing [/tmpData/upd@te/updater -v /tmpData > /dev/null 2>&1] error[21]
Sep 13 16:37:34 install.cgi: ninstaller.c:121 Mount partion /dev/md0 /tmpRoot
Sep 13 16:37:34 install.cgi: ninstaller.c:1423 Moving updater for configuration upgrade...cmd=[/bin/mv -f /tmpData/upd@te/updater /tmpRoot/.updater > /dev/null 2>&1]
Sep 13 16:37:34 install.cgi: ninstaller.c:150 umount partition /tmpRoot
Sep 13 16:37:34 install.cgi: ErrFHOSTCleanPatchDirFile: After updating /tmpData/upd@te...cmd=[/bin/rm -rf /tmpData/upd@te > /dev/null 2>&1]
Sep 13 16:37:34 install.cgi: ErrFHOSTCleanPatchDirFile: Remove /tmpData/upd@te.pat...
Sep 13 16:37:34 install.cgi: ErrFHOSTDoUpgrade(1702): child process failed, retv=-21
Sep 13 16:37:34 install.cgi: ninstaller.c:1719(ErrFHOSTDoUpgrade) err=[-1]
Sep 13 16:37:34 install.cgi: ninstaller.c:1723(ErrFHOSTDoUpgrade) retv=[-21]
Sep 13 16:37:34 install.cgi: install.c:409 Upgrade by the manual patch fail.
Sep 13 16:37:34 install.cgi: install.c:678 Upgrade by the uploaded patch /tmpData/@autoupdate/upload.pat fail.
Jan  1 00:00:00 install.cgi: ninstaller.c:150 umount partition /tmpData
Jan  1 00:00:00 kernel: [   97.116227] parameter error. gpiobase=00000000, pin=4, pValue=ffff8801348279bc
Jan  1 00:00:01 scemd: scemd.c:921 microP get error
Jan  1 00:00:01 kernel: [   98.375411] parameter error. gpiobase=00000000, pin=4, pValue=ffff8801348279bc
Jan  1 00:00:03 kernel: [  100.379401] parameter error. gpiobase=00000000, pin=4, pValue=ffff8801348279bc
Jan  1 00:00:04 scemd: scemd.c:921 microP get error
Jan  1 00:00:04 kernel: [  101.638575] parameter error. gpiobase=00000000, pin=4, pValue=ffff8801348279bc
Jan  1 00:00:06 kernel: [  103.642590] parameter error. gpiobase=00000000, pin=4, pValue=ffff8801348279bc
labrouss commented 3 years ago

Hi,

It looks like the SATA shim fails with:

[ 9.296466] <redpill/boot_device_shim.c:48> Registering boot device router shim
[ 9.297930] <redpill/native_sata_boot_shim.c:205> Registering native SATA DOM boot device shim
[ 9.298931] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 9.300841] IP: [] register_native_sata_boot_shim+0x33/0x1d0 [redpill]
[ 9.302811] PGD 137f9c067 PUD 137f9b067 PMD 0
[ 9.303770] Oops: 0000 [#1] SMP
[ 9.303944] Modules linked in: redpill(OF+)
[ 9.304117] CPU: 3 PID: 521 Comm: insmod Tainted: GF O 3.10.108 #42214
[ 9.304291] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/29/2019
[ 9.304465] task: ffff880137d68820 ti: ffff880135d4c000 task.ti: ffff880135d4c000
[ 9.314116] RIP: 0010:[] [] register_native_sata_boot_shim+0x33/0x1d0 [redpill]
[ 9.315921] RSP: 0018:ffff880135d4fd68 EFLAGS: 00010296
[ 9.316706] RAX: 0000000000000000 RBX: ffffffffa001a860 RCX: 00000000000000b6
[ 9.316880] RDX: 0000000000000052 RSI: 0000000000000246 RDI: ffffffff81abd668
[ 9.317054] RBP: ffffffffa0023000 R08: ffffffff818a90c8 R09: 00000000000004e4
[ 9.318922] R10: ffffffff816db470 R11: 61735f6576697461 R12: ffffffffa001aa50
[ 9.319096] R13: ffffffffa001aa18 R14: ffff880137e9e5c0 R15: ffffffffa001aa00
[ 9.319270] FS: 00007ff81ab7d740(0000) GS:ffff88013dcc0000(0000) knlGS:0000000000000000
[ 9.319443] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9.319617] CR2: 0000000000000000 CR3: 0000000135d8e000 CR4: 00000000003607e0
[ 9.321922] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 9.335490] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 9.337118] Stack:
[ 9.337292] ffffffffa001a860 ffffffffa00099a8 0000000000000000 ffffffffa00230a3
[ 9.339906] ffffffff810002ea ffff880135d4fec8 ffff880135d4feb8 ffffffffa001aa50
[ 9.341074] ffffffffa001aa18 ffffffff81095149 ffffffff812a2cb0 00007ffffffff000
[ 9.341248] Call Trace:
[ 9.341436] [] ? register_bootshim+0xc8/0x150 [redpill]
[ 9.342819] [] ? init+0xa3/0x16c [redpill]
[ 9.343004] [] ? do_one_initcall+0x2a/0x170
[ 9.343189] [] ? load_module+0x1b89/0x2540
[ 9.343567] [] ? ddebug_proc_write+0xf0/0xf0
[ 9.344728] [] ? SYSC_finit_module+0x7d/0xc0
[ 9.346127] [] ? system_call_fastpath+0x1c/0x21
[ 9.347596] Code: c0 a6 a0 01 a0 b9 cd 00 00 00 48 c7 c2 e7 61 01 a0 48 c7 c6 8a a0 01 a0 48 c7 c7 58 65 01 a0 e8 14 83 49 e1 48 8b 05 6d 37 01 00 <44> 8b 08 41 83 f9 01 0f 85 c0 00 00 00 80 3d 51 37 01 00 00 0f
[ 9.359557] RIP [] register_native_sata_boot_shim+0x33/0x1d0 [redpill]
[ 9.360935] RSP
[ 9.361108] CR2: 0000000000000000
[ 9.361302] ---[ end trace 82cdac642818971f ]---
[ 9.363239] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
[ 9.363941] software IO TLB [mem 0xbbed0000-0xbfed0000] (64MB) mapped at [ffff8800bbed0000-ffff8800bfecffff]
[ 9.365028] Simple Boot Flag at 0x36 set to 0x80
[ 9.371605] <redpill/intercept_driver_register.c:94> driver_register() interception active - no handler observing "alarmtimer" found, calling original driver_register()
[ 9.381719] <redpill/memory_helper.c:18> Disabling memory protection for page(s) at ffffffff81326de0+12/1 (<<ffffffff81326000)
[ 9.385013] <redpill/override_symbol.c:246> Obtaining lock for
[ 9.385954] <redpill/override_symbol.c:246> Writing original code to

ttg-public commented 3 years ago

It should be fixed now - we borked a rebase, causing a variable to be initialized after it's checked rather than before. So practically it affected anybody trying 3615xs with SATA boot.


I have a dedicated virtual SATA controller for loader (SATA0:1) and another virtual SATA controller for disks (SATA1:0)

@OrpheeGT: FYI: you don't need to use two separate controllers (unless you have some other reason).

I want to use a SAS HBA IT controller, so I prepared a loader with supportsas enabled, but currently the SAS card is not plugged into the VM. Only a virtual disk is added to virtual SATA1.

If I add "supportsas": "yes", in synoinfo No disk detected at all at boot, can't install DSM.

@OrpheeGT: As far as we know (we didn't dig deeper into that), you cannot simply add supportsas = yes in synoinfo, as there are kernel parts which are hardcoded differently for models with vs. without SAS. As discussed on the forum, we will probably just mark SAS disks as SATA (like in VirtIO) for models without native SAS support.

Once booted, the serial log is spammed with: parameter error. gpiobase=00000000, pin=4, pValue=ffff88013549f9bc

7.0_21.log

Serial console log in attachment

thanks

@OrpheeGT: Thank you for the log. That GPIO spam means that the module unloaded/crashed, and we can see why in the log.

Edit: I confirm it fails on 7.0. I also tried 7.0.1 from jumkey and the same happens: 7.0.1_21.log

@OrpheeGT: Try again now :)

Hmm, is there a way to check if the SMART shim is correctly loaded? Bildschirmfoto 2021-09-13 um 18 47 54

@kaileu: It looks like you have ESXi with a SCSI disk - that may actually not work, as we didn't test the SCSI+ESXi combo, only SATA+ESXi. You can try again with the new fix. As to SMART, you should run smartctl -d ata -A /dev/sdX, as you want to use the ATAPI interface like syno does.

disk is detected but install fails at 55%

I'm seeing the same as @OrpheeGT on both apollolake and bromolow, 7.0 and 7.0.1

@Scoobdriver: Should now work.

I too am getting the same as @OrpheeGT and @Scoobdriver: trying to install DSM 7 (3615xs) now fails again at 55% after working flawlessly in previous releases. I confirmed I am using the SATA boot menu option, but /var/log/messages shows the following:

...
Jan  1 00:00:06 kernel: [  103.642590] parameter error. gpiobase=00000000, pin=4, pValue=ffff8801348279bc

@ilovepancakes95: A GPIO flood on 918 means that the module didn't load/crashed totally. So when you see something like that, scroll up in the log and see if there are any errors beforehand.

[ 9.297930] <redpill/native_sata_boot_shim.c:205> Registering native SATA DOM boot device shim
[ 9.298931] BUG: unable to handle kernel NULL pointer dereference at (null)

@labrouss: Yup, it's the broken rebase right there. Can you check again?

labrouss commented 3 years ago

Tested the 3615 loader on VMware Workstation:

OrpheeGT commented 3 years ago

@ttg-public Well, as my LSI HBA IT card needs the mpt2sas module, and it is already part of the loader, looking at linuxrc.syno.impl it seemed the best way to load the mpt2sas module natively was to add supportsas=yes to synoinfo (see screenshots).


Edit: about the SATA1 controller: if I remove "DiskIdxMap": "1000" and "SataPortMap": "4" and use the same SATA1 controller for both the loader and the data disk, DSM asks me if I want to erase 2 disks instead of only the one on the SATA2 controller.

ttg-public commented 3 years ago

@ttg-public Well, as my LSI HBA IT card needs the mpt2sas module, and it is already part of the loader, looking at linuxrc.syno.impl it seemed the best way to load the mpt2sas module natively was to add supportsas=yes to synoinfo

The problem with changing supportsas on the 918 platform is that the apollolake kernel wasn't built with the syno SAS modifications (the CONFIG_SYNO_SAS_ kernel options), so you will probably run into some weird issues related to that.

Edit: about the SATA1 controller: if I remove "DiskIdxMap": "1000" and "SataPortMap": "4" and use the same SATA1 controller for both the loader and the data disk, DSM asks me if I want to erase 2 disks instead of only the one on the SATA2 controller.

What is the message saying precisely? This message in the installer is very confusing if you have a single disk. It will say that it will erase "2 disks", where it DOESN'T mean "TWO DISKS" but "DISK TWO". If you have more than one, then the message says "2 4 11 disks", which is clunky but makes sense (so you know that "disks" relates to disks number 2, 4, and 11).

OrpheeGT commented 3 years ago

@ttg-public I'm only using the DS3615xs platform, as my CPU is not compatible with DS918+...

Does it mean that even enabling the mpt2sas module (with supportsas or by loading it manually) is not enough to make an LSI HBA passthrough card work?

You may be right about "TWO DISKS" vs "DISK TWO"; I will test again.

OrpheeGT commented 3 years ago

You are right about the disk number (see screenshot). Actually, the first time it shows disk 3 instead of disk 2.

OrpheeGT commented 3 years ago

And with supportsas = yes enabled, no disk is detected: 7.0.1_21_supportsas.log

As a reminder, my LSI card is not enabled, but I expected to see the virtual 16 GB disk at least. It seems enabling supportsas turns the virtual disk off.

MartinKuhl commented 3 years ago

A few minutes ago I was able to create a volume via the DSM Storage Manager within a Parallels VM. The issue that I have is that nearly every minute the following message appears (see screenshot):

Any idea how to solve that?

ttg-public commented 3 years ago

@OrpheeGT Hm, if you're using the 3615 it should support SAS out of the box, but maybe they filter for only their own SAS controllers. We can try to force all SAS ports to be seen as SATA, but we don't feel confident just publishing it without testing, as we're lacking a free LSI card on hand to test with (but soon we will get one ;)). We saw you created issue https://github.com/RedPill-TTG/redpill-lkm/issues/19, so let's continue the SAS-specific discussion there.

@MartinKuhl can you tell us something more about the config? Does it only happen on Parallels, or did you just test on Parallels? The IDNF error means that syno thinks the drive disconnected and reconnected. We expect that to happen only once during boot (if at all), but never when the web UI is active. If this is a Parallels-only issue not related to ESXi, can you create a new issue with a full dmesg log after it happens? It could be that the Parallels emulated HDD doesn't like SMART commands being sent to it when it doesn't support them. It shouldn't care, as the command is valid according to the ATA/ATAPI spec, but a few of the guys who use Macs moved to VMware (as it's free), since Parallels is really buggy when used with anything other than Windows.

MartinKuhl commented 3 years ago

Hi @ttg-public, this issue only appears with Parallels; it is the only tool I am using for testing. So I will create a new ticket for this.

ilovepancakes95 commented 3 years ago

It should be fixed now - we borked a rebase, causing a variable to be initialized after it's checked rather than before. So practically it affected anybody trying 3615xs with SATA boot.

Yep, confirming it works now with commit 021ed51. Thank you!

ranydb commented 2 years ago

My DSM version: 918+_7.0.1_42218
HBA card: LSI 9400-16i
Problem: storage pool creation is prevented; Health Status is "Not Supported"

WeChat screenshots 20211015121355 and 20211015121535

But using smartctl in the terminal, I get normal information.

root@test:/etc# smartctl -a /dev/sdj
smartctl 6.5 (build date Feb 20 2021) [x86_64-linux-4.4.180+] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Ultrastar
Device Model:     WDC  WUH721818ALE6L4
Serial Number:    3WKVUKLK
LU WWN Device Id: 5 000cca 284f67823
Firmware Version: PCGNW232
User Capacity:    18,000,207,937,536 bytes [18.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   Unknown(0x0ffc) (unknown minor revision code: 0x009c)
SATA Version is:  SATA >3.2 (0x1ff), 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Oct 15 12:16:49 2021 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  101) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (1894) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME                                                   FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate                                              0x000b   100   100   001    Pre-fail  Always       -       0
  2 Throughput_Performance                                           0x0005   100   100   054    Pre-fail  Offline      -       0
  3 Spin_Up_Time                                                     0x0007   099   099   001    Pre-fail  Always       -       25769803845
  4 Start/Stop_Count                                                 0x0012   100   100   000    Old_age   Always       -       1
  5 Reallocated_Sector_Count                                         0x0033   100   100   001    Pre-fail  Always       -       0
  7 Seek_Error_Rate                                                  0x000b   100   100   001    Pre-fail  Always       -       0
  8 Seek_Time_Performance                                            0x0005   100   100   020    Pre-fail  Offline      -       0
  9 Power-On_Hours_Count                                             0x0012   100   100   000    Old_age   Always       -       17
 10 Spin_Retry_Count                                                 0x0013   100   100   001    Pre-fail  Always       -       0
 12 Device_Power_Cycle_Count                                         0x0032   100   100   000    Old_age   Always       -       1
 22 Internal_Environment_Status                                      0x0023   100   100   025    Pre-fail  Always       -       100
192 Power_off_Retract_Count                                          0x0032   100   100   000    Old_age   Always       -       5
193 Load_Cycle_Count                                                 0x0012   100   100   000    Old_age   Always       -       5
194 Temperature                                                      0x0002   046   046   000    Old_age   Always       -       46 (Min/Max 25/47)
196 Reallocation_Event_Count                                         0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector_Count                                     0x0022   100   100   000    Old_age   Always       -       0
198 Off-Line_Scan_Uncorrectable_Sector_Count                         0x0008   100   100   000    Old_age   Offline      -       0
199 Ultra_DMA_CRC_Error_Count                                        0x000a   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

So, any idea?

vnxme commented 2 years ago

Sorry to write in a closed topic. Just wanted to thank @ilovepancakes95 and other contributors and add my 2 cents for those who will google "Access Error" in Synology DSM Storage Manager.

The Storage Manager seems to make a call to smartctl -d ata -A /dev/sdX when evaluating the health of each internal disk. Pay attention to the -d ata part: it specifies the ATA device type only, so smartctl will output no relevant information if your internal drive is of another type, and that's exactly what causes the drive health warning.
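
(A minimal illustration of the difference, with /dev/sdX as a placeholder for a drive behind a USB/SAT bridge:)

smartctl -d ata -A /dev/sdX   # what Storage Manager runs: no useful output for a non-ATA device type
smartctl -d sat -A /dev/sdX   # SAT pass-through: full attribute table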

In my case, I managed to pass an external USB drive off as an internal one so that it could be recognized by Storage Manager, and it is the -d sat option that allowed smartctl to output the relevant information. So I managed to make my fake drive's status be considered healthy by means of the following simple shim:
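
(The wrapper below is an illustrative sketch only, not the shim actually posted; it assumes smartctl lives at /usr/bin/smartctl and that the genuine binary has first been renamed to /usr/bin/smartctl.real, with this script installed in its place.)

#!/bin/sh
# Illustrative sketch, not vnxme's actual shim: rewrite the "ata" device type
# that Storage Manager passes ("-d ata") to "sat" so a SAT-bridged drive
# answers with real SMART data, then hand off to the real binary.
args=""
for a in "$@"; do
    [ "$a" = "ata" ] && a="sat"   # swap only the device-type token
    args="$args $a"
done
exec /usr/bin/smartctl.real $args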