JoeSchmuck / Multi-Report

FreeNAS/TrueNAS Script for emailed drive information.

[FR] Add a check for GPT Partition Table validity (using `sgdisk -v /dev/...`) #1

Closed Sophist-UK closed 4 months ago

Sophist-UK commented 5 months ago

I recently had an issue where the Primary Partition Table of my boot drive became corrupted and my TrueNAS SCALE system wouldn't boot.

Anyway, I eventually diagnosed this and used sgdisk to copy the data from the backup partition table.

I asked about tools to prevent or mitigate this (https://forums.truenas.com/t/gpt-partition-table-corruption-on-boot-disk/1371/8) and @stux suggested that it might be something that could be added to Multi-Report.

In essence, to check that the GPT Partition Tables are still valid you need to run `sgdisk -v /dev/...` against every disk and check whether the return code is 2, which indicates a Partition Table issue.

(Or maybe you check for return code not zero and report a variety of errors).
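A minimal sketch of that idea, assuming bash and that `sgdisk` is on the PATH. The names `describe_rc` and `verify_disks` are hypothetical and not part of Multi-Report. The meaning given to exit code 2 is taken from the description above; any other non-zero code is reported without interpretation.

```shell
#!/usr/bin/env bash
# Sketch only: describe_rc and verify_disks are hypothetical helper names.

# Turn an sgdisk exit code into a short status word.
# 2 = partition table issue (as described above); other non-zero codes are
# reported as-is, without guessing what they mean.
describe_rc() {
    case "$1" in
        0) echo "ok" ;;
        2) echo "partition-table-problem" ;;
        *) echo "error-$1" ;;
    esac
}

# Run a read-only verify (-v) on each disk passed as an argument and print
# one status line per disk. Returns 1 if any disk was not "ok".
verify_disks() {
    local disk rc failed=0
    for disk in "$@"; do
        rc=0
        sgdisk -v "$disk" >/dev/null 2>&1 || rc=$?
        echo "$disk: $(describe_rc "$rc")"
        [ "$rc" -eq 0 ] || failed=1
    done
    return "$failed"
}
```

It could be called as `verify_disks /dev/sda /dev/sdb` (as root), with the list of disks supplied by whatever the reporting script already knows about.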

Is this something that you can easily add to Multi-Report?

(I would try to do this myself and submit a PR, but I am not a bash expert and I find the length of the script file somewhat daunting, and also I am unclear what Configuration file changes and Advanced Config UI changes would be needed if any.)

P.S. This has prompted me to implement Multi-Report on my system, and that then prompted me to set SCT on my HDD RAIDZ1 pool which I wouldn't otherwise have known about and then enable this functionality in your script. So your script is already a serious win for me - THANKS.

JoeSchmuck commented 5 months ago

Let me look into it. It would definitely qualify as a useful tool to add.

And thanks for the suggestion.

JoeSchmuck commented 5 months ago

I took a quick look at the tool. I need to figure out how to incorporate it. It appears to only work on HDD/SSD, not NVMe or VM disks. If there is a way to use it on NVMe, I honestly have not looked into it. I had a few minutes to see what it produces and I should be able to use that. But is there a similar program for FreeBSD? Maybe. I will try to incorporate this on the weekend. I have one other item I'm almost finished updating; once it is working as expected, I can focus on this request.

JoeSchmuck commented 4 months ago

I looked a little further into this and here is the result I receive when running sgdisk on SCALE, on my NVMe drives:

root@truenas[/mnt/farm/scripts]# sgdisk -v /dev/nvme0

The specified path is a character device! Verification may miss some problems or report too many!

Problem: The CRC for the main GPT header is invalid. The main GPT header may be corrupt. Consider loading the backup GPT header to rebuild the main GPT header ('b' on the recovery & transformation menu). This report may be a false alarm if you've already corrected other problems.

Problem: The CRC for the main partition table is invalid. This table may be corrupt. Consider loading the backup partition table ('c' on the recovery & transformation menu). This report may be a false alarm if you've already corrected other problems.

Problem: The CRC for the backup GPT header is invalid. The backup GPT header may be corrupt. Consider using the main GPT header to rebuild the backup GPT header ('d' on the recovery & transformation menu). This report may be a false alarm if you've already corrected other problems.

Caution: The CRC for the backup partition table is invalid. This table may be corrupt. This program will automatically create a new backup partition table when you save your partitions.

Problem: The main header's self-pointer doesn't point to itself. This problem is being automatically corrected, but it may be a symptom of more serious problems. Think carefully before saving changes with 'w' or using this disk.

Problem: main GPT header's current LBA pointer (1) doesn't match the backup GPT header's alternate LBA pointer(0).

Problem: Disk is too small to hold all the data! (Disk size is 0 sectors, needs to be 0 sectors.) The 'e' option on the experts' menu may fix this problem.

Problem: Main partition table appears impossibly early on the disk. Using 'j' on the experts' menu may enable fixing this problem.

Warning: There is a gap between the main metadata (sector 1) and the main partition table (sector 0). This is helpful in some exotic configurations, but is generally ill-advised. Using 'j' on the experts' menu can adjust this gap.

Warning: The size of the partition table (0 bytes) is less than the minimum required by the GPT specification. Most OSes and tools seem to work fine on such disks, but this is a violation of the GPT specification and so may cause problems.

Problem: GPT claims the disk is larger than it is! (Claimed last usable sector is 18446744073709551584, but backup header is at 18446744073709551615 and disk size is 0 sectors. The 'e' option on the experts' menu will probably fix this problem

Identified 9 problems!

root@truenas[/mnt/farm/scripts]#

AND I get an exit code of "0". I should have had something other than a "0". I even tried to leave out the "/dev/sda" to cause an invalid command, or include a switch that does not exist, still a "0" return value.

I find this a little problematic. I would like to include checking the validity of the partition table, however I'm not certain I can at this moment. This would only be for SCALE unless it becomes included in CORE; there is a port for it.

First things first, make sure we are getting a proper return value.

In the meantime you could create a small script to run this command and if it exits with anything other than "0", have it send you a small message. If you get that little script working, maybe it would clue me in on what I'm missing. And I'm also curious if this is ZFS related. I just don't know.
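A small script along those lines might look like the sketch below. It assumes bash and `sgdisk` on the PATH. `notify` and `check_and_notify` are hypothetical names, and `notify` is only a placeholder that writes to stderr; how the message is actually delivered (email, log file) is left open.

```shell
#!/usr/bin/env bash
# Sketch only: notify and check_and_notify are hypothetical names.

# Placeholder for whatever delivers the message (email, log file, ...).
# Here it only writes to stderr.
notify() {
    printf 'ALERT: %s\n' "$1" >&2
}

# Verify one disk. On any non-zero exit code, pass the code and the full
# sgdisk output to notify. Returns sgdisk's exit code.
check_and_notify() {
    local disk="$1" output rc=0
    output=$(sgdisk -v "$disk" 2>&1) || rc=$?
    if [ "$rc" -ne 0 ]; then
        notify "sgdisk -v $disk exited with $rc
$output"
    fi
    return "$rc"
}
```

Running `check_and_notify /dev/sda` stays silent on a clean disk and emits the alert, including the sgdisk report, otherwise.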

Sophist-UK commented 4 months ago

Eeeeek!!! This is so fundamentally flawed it is almost unbelievable. I can only assume that it is getting one piece of information (like the disk characteristics) on an API that provides different format for NVME and it is not interpreting that information correctly, and this is leading to it looking in the wrong place for the partition table which then appears to be corrupt. (The alternative explanation is that absolute sector reads on an NVME disk are different from other disks, and that seems far less likely.) But NVME has been around for more than a decade so this seems unlikely too.

According to the changelog gdisk appears to be being actively maintained. Are you using the latest version?

Also beware of unattributed, unmaintained old clone versions such as this one based on 1.0.1. My Ubuntu, which is reasonably up to date, is running 1.0.8 (from June 2021) when there is a later 2022 version (we can allow the Feb 2024 version as being too new).

But I am still surprised at this result.

Turning now to the return codes, I cannot imagine that you are doing the wrong thing to determine what they are, but equally I cannot imagine that they are actually returning zero in all the situations you covered (which seems very abnormal for a linux command).

Are you willing and able to reach out to the author of gdisk and get support?

JoeSchmuck commented 4 months ago

I am using sgdisk v 1.0.9 which is on TrueNAS 24.04-RC.1. That is the version I must use.

This script being a very part time hobby, I am willing to let others do some of the leg work as I have done a lot already. However, if you can provide me a simple bash script that works on your machine and send it my way, I'm willing to see if I can make it work. Feel free to use my email address for this project: joeschmuck2023@hotmail.com and we can exchange information much quicker.

I did a little more with this and here is the result, I need to specify the namespace and partition for the nvme, however the results are still, eh. I am still unable to run a verify.

root@freenas[/mnt/farm2/scripts]# echo "$(sgdisk -v /dev/nvme0) Results="$?

The above command produces results and returns a "2" value and 9 problems, even though I'm fairly certain I do not have 9 problems. When I run the command on a normal drive 'sda' then I get "No problems found" and a result of "0".

I have no idea why I get the proper exit code now and didn't previously. This is on SCALE 24.04 RC.1. I need to switch back to CORE, I thought it didn't have sgdisk, I could be wrong there as well. But it must be sgdisk as it supports scripting where gdisk apparently does not.
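The thread does not establish why the earlier runs showed "0". One common way to lose an exit status, offered here only as a possibility and not as a diagnosis, is to read `$?` after some other command has already run. The sketch below shows the effect with a stand-in function, `fake_verify`, which is not a real command; it only imitates a verify that prints a report and exits with 2.

```shell
#!/usr/bin/env bash
# fake_verify stands in for "sgdisk -v <disk>" on a disk with problems:
# it prints a report and exits with 2. It is not a real command.
fake_verify() {
    echo "Identified 9 problems!"
    return 2
}

# Reliable: capture the output, then read $? immediately.
report=$(fake_verify)
rc_direct=$?

# Misleading: in a pipeline, $? is the status of the LAST command (tail),
# so the 2 from fake_verify is lost.
fake_verify | tail -n 1 >/dev/null
rc_piped=$?

# Misleading: any command run in between (here, echo) overwrites $?.
fake_verify >/dev/null
echo "finished" >/dev/null
rc_late=$?

echo "direct=$rc_direct piped=$rc_piped late=$rc_late"
```

With default shell options this prints `direct=2 piped=0 late=0`, which is why saving `$?` into a variable on the very next line is the safer habit in a script.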

If I get this implemented, I don't think it will be immediate; it could take a few weeks as I need to figure out the best place for it, and I need to test the crap out of it on something I can destroy without fear. Sgdisk is a very powerful and destructive command if used improperly.

May I reach out to you if I have any questions about sgdisk and which results are acceptable and, more importantly, which ones are not acceptable? This is the manpage I'm using as it is specifically for Debian.

Sophist-UK commented 4 months ago

sgdisk is the scriptable version of gdisk from the same developer. You can script the interactive gdisk by piping in the commands you would enter manually and analysing the text output - but that is orders of magnitude more complex - but there are lots of things you can only do with gdisk and not sgdisk like copy backup partition table over the normal one etc.

I think you are (of course) absolutely right to use the version of sgdisk that comes with TrueNAS, which is probably the one which comes with Debian.

As discussed previously (elsewhere?) I do not think that a repair should be scripted - just a notification. And AFAIK sgdisk does not write anything to disk unless it is explicitly told to - so I doubt you will screw up the system by using sgdisk but you (obviously) want to avoid false positives from the script resulting from sgdisk and anything which is designed to flag up issues is much more difficult to test because you have to create these (sometimes obscure) issues to test it.

I am glad the RC issue is sorted - I thought it was odd that you were getting a zero RC with 9 errors.

I think I know what the issue is with nvme as I am able to generate the exact same issue with my NAS box on HDD:

root@TerraMaster[~]# echo "$(sgdisk -v /dev/sda) Results="$?

Caution: Partition 1 doesn't end on a 8-sector boundary. This may
result in problems with some disk encryption tools.

No problems found. 6 free sectors (3.0 KiB) available in 1
segments, the largest of which is 6 (3.0 KiB) in size. Results=0

root@TerraMaster[~]# echo "$(sgdisk -v /dev/sg0) Results="$?

The specified path is a character device!
Verification may miss some problems or report too many!

Problem: The CRC for the main GPT header is invalid. The main GPT header may
be corrupt. Consider loading the backup GPT header to rebuild the main GPT
header ('b' on the recovery & transformation menu). This report may be a false
alarm if you've already corrected other problems.

Problem: The CRC for the main partition table is invalid. This table may be
corrupt. Consider loading the backup partition table ('c' on the recovery &
transformation menu). This report may be a false alarm if you've already
corrected other problems.

Problem: The CRC for the backup GPT header is invalid. The backup GPT header
may be corrupt. Consider using the main GPT header to rebuild the backup GPT
header ('d' on the recovery & transformation menu). This report may be a false
alarm if you've already corrected other problems.

Caution: The CRC for the backup partition table is invalid. This table may
be corrupt. This program will automatically create a new backup partition
table when you save your partitions.

Problem: The main header's self-pointer doesn't point to itself. This problem
is being automatically corrected, but it may be a symptom of more serious
problems. Think carefully before saving changes with 'w' or using this disk.

Problem: main GPT header's current LBA pointer (1) doesn't
match the backup GPT header's alternate LBA pointer(0).

Problem: Disk is too small to hold all the data!
(Disk size is 0 sectors, needs to be 0 sectors.)
The 'e' option on the experts' menu may fix this problem.

Problem: Main partition table appears impossibly early on the disk.
Using 'j' on the experts' menu may enable fixing this problem.

Warning: There is a gap between the main metadata (sector 1) and the main
partition table (sector 0). This is helpful in some exotic configurations,
but is generally ill-advised. Using 'j' on the experts' menu can adjust this
gap.

Warning: The size of the partition table (0 bytes) is less than the minimum
required by the GPT specification. Most OSes and tools seem to work fine on
such disks, but this is a violation of the GPT specification and so may cause
problems.

Problem: GPT claims the disk is larger than it is! (Claimed last usable
sector is 18446744073709551584, but backup header is at
18446744073709551615 and disk size is 0 sectors.
The 'e' option on the experts' menu will probably fix this problem

Identified 9 problems! Results=2

According to the Wikipedia documentation on /dev, /dev/sgx is the device controller (character device) whilst /dev/sda is the block device and /dev/sda1 is the first partition block device on the disk. The equivalent for NVMe is /dev/nvme0 for the device controller (character device), /dev/nvme0n1 for the block device and /dev/nvme0n1p1 for the first partition block device.

So what happens if you do echo "$(sgdisk -v /dev/nvme0n1) Results="$??
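To guard against handing sgdisk a character device by mistake, a script could test the path type first. This is a sketch under the assumption that bash (or any POSIX shell with `local`-free code like this) is in use; `device_kind` is a hypothetical helper name. The `-b` and `-c` tests are the standard shell tests for block and character special files.

```shell
#!/usr/bin/env bash
# Sketch only: device_kind is a hypothetical helper name.

# Print "block", "char", "other" or "missing" for a path.
device_kind() {
    if [ ! -e "$1" ]; then
        echo "missing"
    elif [ -b "$1" ]; then
        echo "block"
    elif [ -c "$1" ]; then
        echo "char"
    else
        echo "other"
    fi
}

# Only block devices are handed to sgdisk; anything else is skipped.
for dev in "$@"; do
    kind=$(device_kind "$dev")
    if [ "$kind" = "block" ]; then
        sgdisk -v "$dev"
        echo "Results=$?"
    else
        echo "Skipping $dev: $kind, not a block device"
    fi
done
```

Called with `/dev/sg0 /dev/sda /dev/nvme0 /dev/nvme0n1`, this would be expected to skip the two controller paths and verify only the two block devices.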

JoeSchmuck commented 4 months ago

root@freenas[/mnt/farm2/scripts]# echo "$(sgdisk -v /dev/nvme0n1) Results="$?

Caution: Partition 1 doesn't end on a 128-sector boundary. This may result in problems with some disk encryption tools.

Caution: Partition 2 doesn't end on a 128-sector boundary. This may result in problems with some disk encryption tools.

No problems found. 221 free sectors (110.5 KiB) available in 2 segments, the largest of which is 127 (63.5 KiB) in size. Results=0

This of course looks much better than yesterday's test. I could have had all kinds of things messed up on my system at the time.

So I'm good with incorporating this added feature, triggering a Caution message, and I should be able to change the Device ID background to Red to flag it to people. It should be pretty easy to incorporate however I need to ensure I select the correct device, which should not be an issue either, it is just something I need to test.

I understand that using '-v' is just a verify, but anytime I add a command that I'm not familiar with, I test. The one thing I never want to be blamed for is destroying someone's data. I'm about to retire from the world of strategic weapons, where we take everything into consideration. I try to never rush into something.

Sophist-UK commented 4 months ago

Please don't get me wrong - I fully understand and agree with your cautious approach - and that is why I stepped away from any automated attempts to fix any errors. It is hard enough to ensure that reporting errors when they occur is working correctly (because most systems don't have these errors), without having to make sure that the fix is not going to make things worse rather than better.

For example, in my situation with a corrupt primary partition table and a valid backup one, you have one shot to copy one table over the other in the correct direction; if you have a bug that does it in the wrong direction you have made everything 1,000,000 times more difficult to fix. (I can't remember which rocket exploded on the launch pad due to a period character having been entered as a comma in the computer code (or vice versa). But there are several examples of this sort of thing. Another: a test pilot had to eject from a fly-by-wire fighter jet the very first time he tried a roll because when he got past 90 degrees of roll, the plane flipped and literally wouldn't turn back over, this time an issue of +/- signs being switched.)

JoeSchmuck commented 4 months ago

I haven't forgotten about this. I believe I have finished Multi_Report v3.0.2 and have it out for a few folks to test. Now I have the time to see if I can incorporate sgdisk.

My first problem: neither gdisk nor sgdisk is on TrueNAS CORE. In the past it was, but not anymore, or at least not in 13.0-U6.1. I do my best to have the same features in both TrueNAS versions. I may not be able to do that here, unless I can manually add either sgdisk (preferred) or gdisk. I will need to examine the executables, and the trick is, if there are dependencies missing, that is a deal breaker. I can still install the program, but then I'm making huge changes and I'm not going to be accused of causing someone to lose data. Even though I would be certain my changes would not cause that, I'm extra cautious.

Problem two is getting around the NVMe drive issue. Maybe I just ignore NVMe drives for now and maybe sgdisk will get an upgrade, or not.

Right now I know I can implement for SCALE and HDD/SSD only.
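Since the tool may or may not be present depending on the platform, a script can probe for it before attempting the check. The sketch below assumes bash; `have_tool` and `pick_gpt_tool` are hypothetical helper names, and the preference for sgdisk over gdisk follows the comment above. Note that `command -v` also reports shell functions and builtins of the same name, which is acceptable for a sketch.

```shell
#!/usr/bin/env bash
# Sketch only: have_tool and pick_gpt_tool are hypothetical helper names.

# Succeeds if the named program can be found by the shell.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Prefer sgdisk (scriptable); fall back to gdisk; otherwise report "none"
# so the caller can skip the partition check instead of failing.
pick_gpt_tool() {
    if have_tool sgdisk; then
        echo "sgdisk"
    elif have_tool gdisk; then
        echo "gdisk"
    else
        echo "none"
    fi
}

echo "GPT tool available: $(pick_gpt_tool)"
```

A caller would then run the partition check only when the result is not `none`, and otherwise print a note that the check was skipped.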

Sophist-UK commented 4 months ago

What is the issue now with NVMe? Didn't the addition of "n1" to the device name make it work?

I agree that ideally Core would have sgdisk installed already - and that you shouldn't install it yourself - so if it really isn't installed on CORE then either you give manual instructions or don't support it. I am unclear why it would be removed. I also haven't checked whether there is even a FreeBSD version.

Sophist-UK commented 4 months ago

And don't worry about timescales.

JoeSchmuck commented 4 months ago

All done. However, TrueNAS CORE 13.0-U6.1 does not come with gdisk/sgdisk. I had to build my own and give the user the option to install them. V3.0.2 should be out shortly as a bug fix, with your GPT Partition Checking, and it also has a few new features which someone tracking data usage would find valuable. I would have been done sooner, however I have one SSD that does its own thing; once I realized it was goofed up, I was able to complete the project. I have someone else taking it for a spin to make sure I didn't break something. A second, third, or more set of eyes helps.

I had a difficult time verifying the sgdisk command really works. I'd need to intentionally cause a failure to test it properly. And the default is for it to be off; the end user will need to set Partition_Check="true" to make it run. That was added in case the thing flagged a false problem, so it could be disabled to stop the messages.
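For illustration, an off-by-default switch of this kind can be sketched as below. Only the variable name `Partition_Check` and its "true" value come from the comment above; how Multi-Report actually reads its configuration is not shown in this thread, so the defaulting logic and the helper name `partition_check_enabled` are assumptions.

```shell
#!/usr/bin/env bash
# Sketch only: partition_check_enabled is a hypothetical helper name, and
# the defaulting below is an assumption about how such a switch could work.

# Default to "false" unless the configuration has already set a value.
Partition_Check="${Partition_Check:-false}"

# Succeeds only when the switch is exactly "true".
partition_check_enabled() {
    [ "$Partition_Check" = "true" ]
}

if partition_check_enabled; then
    echo "GPT partition check: enabled"
else
    echo "GPT partition check: disabled (set Partition_Check=\"true\" to enable)"
fi
```

Treating anything other than the exact string "true" as off keeps a typo in the config from silently enabling the check.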

Sophist-UK commented 4 months ago

That is brilliant. Thank you so much.

Sophist-UK commented 4 months ago

P.S. On the TrueNAS forums I am Protopia in case you hadn't guessed.

JoeSchmuck commented 4 months ago

The Avatar gave it away :)

JoeSchmuck commented 4 months ago

Let me know if the Partition Check works for you. There were some oddities when dealing with NVMe drives, however I think I have them all sorted out. By default the partition checker is turned off. I really need to generate a survey on what the end users want and, more importantly, what I can remove. I'd like to slim the script down a bit. I'm working on a complete rewrite because making changes to the simulation sections has become a nightmare.