Seagate / ToolBin

All the great tools we have for the field.

SeaChest_NVMe Reporting incorrect data for XPG Spectrix #19

Open NavCC opened 3 years ago

NavCC commented 3 years ago

Hello,

I have a test bench with an XPG SPECTRIX S40G installed on an Intel Z390-P, and SeaChest_NVMe is reporting some of its data incorrectly. The drive should have maybe 24 power-on hours, if that.

SeachestNvmenormalinfo.txt

e.g. Total Bytes Read (ZB), Total Bytes Written, Power On Time.

SeachestBasicVerbose4.txt SeachestNVMEinfo.txt

Please let me know if you require any more information.
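A rough plausibility check shows why a zettabyte-scale total cannot be genuine for a drive with about 24 power-on hours. The sketch below assumes a hypothetical sustained transfer rate of 3.5 GB/s; that figure and the function names are illustrative assumptions, not measured values for this drive or anything taken from SeaChest.

```python
ZETTABYTE = 10**21  # bytes (decimal SI prefix)

# Assumption for illustration only: a generous sustained transfer rate.
ASSUMED_RATE_BYTES_PER_S = 3.5e9


def hours_to_transfer(total_bytes: float, rate: float = ASSUMED_RATE_BYTES_PER_S) -> float:
    """Hours of continuous transfer needed to move total_bytes at the given rate."""
    return total_bytes / rate / 3600


def is_plausible(total_bytes: float, power_on_hours: float,
                 rate: float = ASSUMED_RATE_BYTES_PER_S) -> bool:
    """True if the drive could have moved total_bytes while powered on."""
    return hours_to_transfer(total_bytes, rate) <= power_on_hours
```

Under that assumed rate, moving a single zettabyte would take tens of millions of hours, so `is_plausible(ZETTABYTE, 24)` is false by many orders of magnitude.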

vonericsen commented 3 years ago

Hi @NavCC,

Thanks for the report. Those numbers do look very wrong for bytes read/written and power on time. From a quick review of the verbose information, the identify commands appear to be working as expected and the command to read these fields also completes, but we'll need to do some more testing to figure out whether the drive is actually returning those values, there is a calculation error in the code, or garbage data is coming back due to some other error.
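For reference, here is a minimal sketch of the calculation in question. It is not the openSeaChest code; it assumes the SMART / Health Information log layout from the NVMe specification (a 512-byte page with 16-byte little-endian counters, Data Units Read at byte 32, Data Units Written at byte 48, Power On Hours at byte 128, and one data unit equal to 1000 units of 512 bytes). The function names are hypothetical.

```python
SMART_LOG_SIZE = 512  # size of the SMART / Health Information log page

# (offset, length) of selected counters, per the NVMe specification layout.
FIELDS = {
    "data_units_read": (32, 16),
    "data_units_written": (48, 16),
    "power_on_hours": (128, 16),
}

BYTES_PER_DATA_UNIT = 512 * 1000  # one data unit = 1000 x 512-byte units


def decode_smart_log(buf: bytes) -> dict:
    """Decode selected counters from a raw SMART / Health log buffer."""
    if len(buf) < SMART_LOG_SIZE:
        raise ValueError(f"expected {SMART_LOG_SIZE} bytes, got {len(buf)}")
    out = {
        name: int.from_bytes(buf[off:off + length], "little")
        for name, (off, length) in FIELDS.items()
    }
    out["bytes_read"] = out["data_units_read"] * BYTES_PER_DATA_UNIT
    out["bytes_written"] = out["data_units_written"] * BYTES_PER_DATA_UNIT
    return out


def build_example_log(units_read: int, units_written: int, hours: int) -> bytes:
    """Build a synthetic log page for testing the decoder (not real drive data)."""
    buf = bytearray(SMART_LOG_SIZE)
    buf[32:48] = units_read.to_bytes(16, "little")
    buf[48:64] = units_written.to_bytes(16, "little")
    buf[128:144] = hours.to_bytes(16, "little")
    return bytes(buf)
```

If a decoder like this produces sane numbers from a synthetic buffer but huge ones from the drive's buffer, the raw bytes returned by the drive are the more likely culprit than the arithmetic.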

Can you show the output of SeaChest_SMART -d <handle> --smartAttributes raw? That is another place where we dump the same kind of data, and since these options take different code paths, it may help us determine where in the code the problem lies.

If you are able to, can you share what our tools report about this drive under Linux? Comparing against other tools such as CrystalDiskInfo and smartmontools can also help us understand what is happening.

Anything you can do to get us this info would be very helpful to determine what exactly is going on in this case.

We will do some testing of our own as well, but it will be good to have something else to compare against.

edit: Added --smartAttributes raw output request

NavCC commented 3 years ago

Hello @vonericsen

All the other NVMe drives I have tested have come back with the correct drive information. CrystalDiskInfo looks like it is reporting correctly, as does smartmontools. Here is the SeaChestSMART.txt

Crystal Disk Info.txt

Smartmontools.txt

I'm not sure what output you wanted from smartmontools as I'm not familiar with it, so I went for -x sda.

I will test your tools on Linux when I get the chance and update the issue accordingly.

vonericsen commented 3 years ago

Hi @NavCC,

Thanks for the other information!

@vibhutipratapsingh has been chatting with me as he has tested and reviewed the openSeaChest code, and it looks like the reported values are calculating correctly for display in the "common" layer. We have not been able to repeat the problem with what we have access to right now.

Looking at the smartmontools output, it looks like it has a similar issue with what it is reporting.

It appears to be one of two problems at this point:

  1. Something isn't quite right in the low-level Windows code issuing the commands.
  2. The drive is doing something odd when it is responding to the get log page command.

If you can tell me anything about what driver is installed, that could also help since we have 3 different NVMe methods in Windows depending on how the drive shows up. If you didn't install any specific NVMe drivers on this system, then it's the included Win10 driver which simplifies this question. This appears to be a different system from the other issue you reported (#11), but if you know if the driver is the same that would help too.

NavCC commented 3 years ago

Hello @vonericsen

Very interesting. It's the Windows 10 driver, nothing vendor specific has been installed.

I thought I would remove Windows from the situation altogether, so I compiled the latest release for Linux on a fresh install of Ubuntu 20.04 LTS. nvmeinfo.log nvmeinfoverbose4.log nvmebasicinfo.log nvmebasicinfoverbose4.log

As you can see, the bytes written is incorrect, as is the power on time. I've run both OpenSeaChest_Basic and nvme with verbose set to 4.

Yes, the system is completely different from the one in issue #11. I've deployed this on many systems with different configurations and intend to report any issues I experience along the way; for the majority of drives and configurations it's perfectly fine. We are very happy with the toolset and the support we receive when opening issues.

We use the nvme tools for reporting drive data, entry into a database and creating a graph for the end user.

If there are any other logs you would like me to retrieve in Windows or Linux, I now have the PC dual booting, so it's not an issue.

Thanks @vibhutipratapsingh & @vonericsen for your support.

vonericsen commented 3 years ago

@NavCC, thanks for the feedback, and I'm glad you like the tools! We are doing our best to keep up with the GitHub issues as well as the ones we receive internally. 😄

Since this seems to follow the drive, I think there is something going on from the drive side that I need some more testing/information on before we'll know the solution.

Can you install nvme-cli (sudo apt install nvme-cli) and then run the following commands? You can see the list of drives in nvme-cli with sudo nvme list

  1. sudo nvme smart-log /dev/nvmehandle
  2. sudo nvme smart-log -n 0 /dev/nvmehandle

It should output something similar to this:

    test@test-ubuntu:~$ sudo nvme smart-log /dev/nvme0n1
    Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
    critical_warning : 0
    temperature : 37 C
    available_spare : 100%
    available_spare_threshold : 5%
    percentage_used : 97%
    endurance group critical warning summary: 0
    data_units_read : 425,139
    data_units_written : 525
    host_read_commands : 4,964,108
    host_write_commands : 4,096
    controller_busy_time : 1
    power_cycles : 72
    power_on_hours : 11,972
    unsafe_shutdowns : 19
    media_errors : 0
    num_err_log_entries : 52
    Warning Temperature Time : 0
    Critical Composite Temperature Time : 0
    Thermal Management T1 Trans Count : 0
    Thermal Management T2 Trans Count : 0
    Thermal Management T1 Total Time : 0
    Thermal Management T2 Total Time : 0

    test@test-ubuntu:~$ sudo nvme smart-log -n 0 /dev/nvme0n1
    Smart Log for NVME device:nvme0n1 namespace-id:0
    critical_warning : 0
    temperature : 37 C
    available_spare : 100%
    available_spare_threshold : 5%
    percentage_used : 97%
    endurance group critical warning summary: 0
    data_units_read : 425,139
    data_units_written : 525
    host_read_commands : 4,964,108
    host_write_commands : 4,096
    controller_busy_time : 1
    power_cycles : 72
    power_on_hours : 11,972
    unsafe_shutdowns : 19
    media_errors : 0
    num_err_log_entries : 52
    Warning Temperature Time : 0
    Critical Composite Temperature Time : 0
    Thermal Management T1 Trans Count : 0
    Thermal Management T2 Trans Count : 0
    Thermal Management T1 Total Time : 0
    Thermal Management T2 Total Time : 0

What I'm trying to determine with these commands is whether the drive is responding properly with all namespaces (ffffffff) on this log or if the specific namespace must be specified when reading it. That may explain the behavior and wrong values.
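To make that comparison less error-prone than eyeballing two logs, something like the following could be used. It is only a sketch: it assumes the plain-text "name : value" layout shown in the example output above (other nvme-cli versions may format fields differently), and the function names and abbreviated sample strings are made up for illustration.

```python
def parse_smart_log_text(text: str) -> dict:
    """Parse the 'name : value' lines of nvme smart-log text output."""
    fields = {}
    for line in text.splitlines():
        # Skip the header, which contains colons but is not a field.
        if line.lower().startswith("smart log for"):
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or anything without a separator
        fields[key.strip()] = value.strip()
    return fields


def diff_fields(a: dict, b: dict) -> dict:
    """Return {field: (value_in_a, value_in_b)} for every field that differs."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# Abbreviated samples based on the example output above.
ALL_NAMESPACES = """Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning : 0
data_units_read : 425,139
power_on_hours : 11,972
"""
NAMESPACE_ZERO = ALL_NAMESPACES.replace("namespace-id:ffffffff", "namespace-id:0")

differences = diff_fields(parse_smart_log_text(ALL_NAMESPACES),
                          parse_smart_log_text(NAMESPACE_ZERO))
```

An empty `differences` means the drive answers identically for both namespace values, so the namespace ID would not explain the wrong numbers.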

If that doesn't do it, then my next idea is that it is some sequence of commands that is causing strange behavior from the drive. For example, trying to read feature information before reading this log may cause something strange to happen when the data is read.

NavCC commented 3 years ago

Hello @vonericsen

Apologies for the delay, I'm finding myself under the pump at the moment. Hopefully I'm replying fast enough to hold your interest!

Please find the two requested outputs in separate log files

smartlog.log smartlogn0.log

The output appears to be lacking several of the fields that you got from your NVMe drive.

vonericsen commented 3 years ago

Thanks for the logs @NavCC! And no worries on the delay.

Looking at those outputs, that test of changing the namespace value didn't resolve it either. This also tells us the problem is consistent and not limited to openSeaChest.

I'm not sure what the solution is. Walking through the CrystalDiskInfo code, I didn't see anything unique there either, so I'm not sure why the drive reported correctly there but not in nvme-cli or openSeaChest.

The only other things I can think of are command order, or just getting lucky with whatever is going on in the drive. The next thing to try would be issuing a specific sequence of commands to see if that affects the returned data. Looking through CrystalDiskInfo, it seems to issue only these two commands (from what I can tell):

  1. Controller identify
  2. Read Log (SMART/Health)

You can do this in nvme-cli with the commands below. If the second one doesn't return massive numbers, that means we can implement the same kind of sequence as a workaround:

  1. nvme id-ctrl /dev/nvme???
  2. nvme smart-log /dev/nvme???

Can you try this a few times and see if this reports more accurate data for power on hours, total reads, total writes, etc?
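If it is easier to script than to type repeatedly, here is a rough sketch of that loop. It assumes nvme-cli is installed, the script runs as root, and the text output contains a power_on_hours line as in the example output earlier in this thread. The helper names are hypothetical, and the runner parameter exists only so the loop can be exercised without a drive.

```python
import re
import subprocess
from typing import Callable, List, Optional


def run_nvme(args: List[str]) -> str:
    """Run nvme-cli with the given arguments and return its stdout."""
    result = subprocess.run(["nvme", *args], check=True,
                            capture_output=True, text=True)
    return result.stdout


def power_on_hours(smart_text: str) -> Optional[int]:
    """Extract power_on_hours from smart-log text, or None if it is absent."""
    match = re.search(r"^power_on_hours\s*:\s*([\d,]+)", smart_text, re.MULTILINE)
    return int(match.group(1).replace(",", "")) if match else None


def identify_then_smart(device: str, repeats: int = 3,
                        runner: Callable[[List[str]], str] = run_nvme
                        ) -> List[Optional[int]]:
    """Issue id-ctrl followed by smart-log, several times, and collect the hours."""
    readings = []
    for _ in range(repeats):
        runner(["id-ctrl", device])  # 1. Controller identify
        readings.append(power_on_hours(runner(["smart-log", device])))  # 2. SMART log
    return readings
```

Calling `identify_then_smart("/dev/nvme0")` with your own device path returns one power-on-hours reading per iteration; readings that are small and stable across iterations would suggest the command order matters.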

Also, I did find that there is a firmware update available for this drive (see the XPG S40G downloads page), but the readme makes it sound like the update will erase the drive, which is not great. Unfortunately it does not describe what it fixes, so I cannot be sure that it would fix this reporting, but it could.