Spearfoot / disk-burnin-and-testing

Shell script for burn-in and testing of new or re-purposed drives

How to run the script? #3

Closed nwillems closed 4 years ago

nwillems commented 4 years ago

First of all - thank you for publishing this little gem.

I'm new to this whole NAS game, and bought used disks (16x 2TB), and wanted to be sure I know what I've got on my hands.

So I made myself a bootable USB stick with Ubuntu 18.04, ensured that all the tools are available, and fetched this script. It ran very fast at first, and I wondered: "large disks may take a long time", hmm, what counts as a large disk?

Then I read the entire README, carefully, and lo and behold, hidden there in the middle: "disable dry run". Shame on me for not RTFM, but bubbling this up to the top would be very helpful for newcomers.

Lastly, I came up with a "clever" method of running the tool on many disks (since I have quite a few drives and didn't want to sit and wait for each run to finish):

ls /dev/sd[a-z] | cut -d'/' -f3 | sudo parallel -I{} ./wrapper.sh {}

# Wrapper contains this:
#!/bin/bash -xe
./disk-burnin.sh ${1} > logs/${1}.log

What I'm in doubt about is: is this a good method? Does running in parallel degrade performance or in any way prevent a valid test? I know this also tries to test my CD drive on /dev/sdr, but hey, worst case it fails :-) From this I also feel it would be nice if the script accepted a full device path rather than a bare device name; to me it would be more logical to look in /dev/disk/by-path/ to figure out which disks to test.
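To illustrate what I mean by accepting a full path, something like this (just a sketch, not how the script currently works; readlink resolves the by-path symlink back to /dev/sdX):

#!/bin/sh
# Sketch: accept "sda", "/dev/sda", or a /dev/disk/by-path/... symlink
# and reduce it to the bare device name the script works with today.
arg="$1"
case "$arg" in
    /dev/disk/by-*) dev=$(readlink -f "$arg") ;;
    /dev/*)         dev="$arg" ;;
    *)              dev="/dev/$arg" ;;
esac
drive=$(basename "$dev")
echo "would burn in: $drive (from $arg)"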

I would be more than happy to submit a PR with these changes; I just didn't want to do too much without understanding what I'm actually doing.

EDIT, more questions: it seems the polling logic is not working with smartmontools release 6.6 dated 2016-05-07 at 11:17:46 UTC, due to a changed output format (this might be Ubuntu 18.04 related). Also, in that version there is an option to run the test in the foreground; is there a particular reason for not doing so? (Maybe because the option didn't exist earlier?)
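(I believe the foreground option is smartctl's -C/--captive mode, which blocks until the self-test finishes instead of requiring the caller to poll; a rough sketch of the difference, with /dev/sdX as a placeholder:)

# background mode (what the polling logic assumes): start the test and return immediately
smartctl -t long /dev/sdX

# captive/foreground mode: smartctl waits for the test to complete;
# note the drive may not service normal I/O while a captive test runs
smartctl -t long -C /dev/sdX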

So in summary, the questions are:

- Is running the script on several disks in parallel a valid test, or does it degrade the results?
- Could the script accept a full device path (e.g. from /dev/disk/by-path/) instead of a bare device name?
- Is the polling breakage with smartmontools 6.6 known, and is there a reason not to run the SMART test in the foreground?

I hope this is at least somewhat helpful feedback. :-) /Nwillems

Spearfoot commented 4 years ago

Thank you, sir!

What's a 'large disk'? That's subjective, I suppose. I remember when a 512GB disk was large; I guess that shows I'm getting old. The point is: the larger the disk, the longer it takes to burn in. On the order of a week or more for disks over 10TB.
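Rough back-of-envelope (my assumptions, not measured from the script: badblocks -w writes four patterns and reads each one back, i.e. eight full passes over the disk, and the drive averages ~180 MB/s across the platter):

12 TB / 180 MB/s  ~  66,700 s  ~  18.5 hours per full pass
8 passes          ~  148 hours ~  6+ days for badblocks alone

Add the SMART tests on top of that and you're at a week or more.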

You make a good point about the dry-run flag. Perhaps I should try to make it more prominent. Then again, it's probably a good idea for users to study this tool closely before using it.

No problem at all running several instances of this script simultaneously; I do it all the time (see below). And I see no reason not to run the smartmontools tests in the foreground.

Given the fact that this is a destructive script, I don't think it's a good idea to automate running it on all of the system disks, as you seem to suggest with your code example. I think it best for the user to be very conscious of what they're about to do, and which disks they're about to do it to.

That said, I find it convenient to use tmux to run several simultaneous tests when I have a large batch of disks to burn in. Here's a script to do that:

#!/bin/sh

drives="sdd sde sdf sdg sdh sdi sdj sdk sdl sdm sdn sdo sdp sdq"

for drive in $drives; do
    # double quotes so ${drive} expands here, not in tmux's shell
    tmux new-session -d -s "${drive}" "/root/burnin/disk-burnin.sh ${drive}"
done

I periodically check on the sessions by running, for example:

tmux attach -t sdf

Later I detach from the session by pressing Ctrl-b, then d.
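To see which sessions are still running (plain tmux, nothing specific to the burn-in script):

tmux ls                      # list all sessions
tmux kill-session -t sdf     # clean up a session once its test is finished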

Thanks again!

nwillems commented 4 years ago

That's a nice way to use multiple tmux sessions. It would be nice to have an examples section showing how the script can be run. I should also say that I checked and double-checked which disks would be included for that script, but I like your explicit selection better.

On "dry-run flag", I completely agree, people should be aware what they are about to do. I guess my point about newcomers, equally applies to "old farts", I know I've been in a situation where I'm using TrustyTool^{tm} but it doesn do the thing it usually does, until 2hours later it dawns on you, "Ahh that config flag, hidden over there", and that usually stems from just reading the first 2 lines of the manual and then running tool --help.

On large disks: maybe we could collect a little data on run times for different disk sizes/models. E.g. this forum thread mentions 24 hours for badblocks on 2TB drives: https://www.ixsystems.com/community/threads/hard-drive-burn-in-testing-discussion-thread.21451/ My drives report that the long SMART test should take ~5 hours (Toshiba SAS; will update with info when I have it). And don't worry about age :-) I remember getting a 10GB disk for my desktop; it was too big for the MoBo to handle, so it had to be split into multiple partitions.
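(For collecting that data, the drive's own estimate can be read out of smartctl; the wording differs between ATA and SAS drives, so the pattern below is only a loose sketch.)

smartctl -a /dev/sdX | grep -iE 'self.?test.*(polling time|duration)'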

If you need feedback on wording or structure, feel free to ping/tag me.

Thanks for the quick reply and good info, and THANKS for a nice script that just works 👍

Spearfoot commented 4 years ago

Hmm... I checked some logs, and here are the results:

And that's skipping the first extended SMART test. If you include that step, you'll have to add another 6-12 hours.

I remember 10MB hard disks! The first hard disks I bought were that size: we installed them in our original IBM PCs and thought, wow, we're never going to run out of disk space!

Spearfoot commented 4 years ago

Anyway, you may notice that I edited the README and tried to highlight the 'dry run' issue. I'll tackle some tmux examples when I have some spare time.

Thanks again!

nwillems commented 4 years ago

My test results are coming in now, and I've noticed a few things:

My suggestion would be to not handle logging at all, and instead ensure that everything goes to stdout. The default behaviour of badblocks (according to its man page) is to just write the list of bad blocks to stdout when you don't specify -o. The -v option is also specified, which means the badblocks info is also printed to stderr; maybe that should be redirected to stdout, or just omitted if logging is "not handled", I'm not sure.

Another way around the issue would be to use a case-insensitive match for the grep on line 160, i.e. adding -i. PR coming shortly.
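For context, this is roughly the shape I imagine the poll has (the actual pattern on line 160 may well differ):

# -i makes the match tolerant of capitalization changes between smartmontools versions
smartctl -a /dev/${Drive} | grep -i 'self-test routine completed'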

A whole third avenue of attack would be to use something like mktemp tmp.${Drive}.XXXXXXXXXX and then report the mapping between device and temporary file on stderr, allowing people to handle logging themselves or use the temporary file. mktemp is part of GNU coreutils, so I would guess it's available on most platforms, but I'm not sure.
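Roughly what I have in mind (placeholder names, just a sketch):

# create a per-drive temp log and report the device-to-logfile mapping on stderr
log=$(mktemp "tmp.${Drive}.XXXXXXXXXX")
echo "${Drive}: logging to ${log}" >&2
./disk-burnin.sh "${Drive}" > "${log}" 2>&1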

Once again, thank you for providing this script, and thanks for the very fast replies and helpful insights. It's a pleasure getting into this whole NAS community and feeling welcome. Thanks!

=== START OF INFORMATION SECTION ===
Vendor:               TOSHIBA
Product:              MG03SCA200
Revision:             0108
User Capacity:        2,000,398,934,016 bytes [2.00 TB]
Logical block size:   512 bytes
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000039528d8b624
Serial number:        Y3H0A03NFTP8
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Thu Apr  9 13:58:03 2020 UTC
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Disabled or Not Supported

Spearfoot commented 4 years ago

I tried redirecting all output to stdout, but it wasn't feasible: badblocks emits a tremendous amount of information, since it gives 'real time' progress as it runs.
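For the record, that flood is the -s/-v progress output, which badblocks writes to stderr; the bad-block list itself goes to stdout when -o isn't given. So the two streams can at least be kept apart (flags shown only as an example, not necessarily what the script uses):

badblocks -wsv /dev/sdX > badblocks-list.log 2> badblocks-progress.log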