Open mholl1983 opened 4 years ago
Spare drive is out of the mix now, so the 4 original drives are in place!
The other arrays look fine and are in "activesync" state.
huh, what type of raid is this? if it's not RAID0 you can probably just re-add the partition and have it re-sync. If it is RAID0 i'd have to look up what to do to be honest.
Which array, you mean? It's RAID5.
The degraded array.
I don't know why it would have lost the partition, I also don't know how to force it back. I think you're better off re-adding it like a new partition and having it rebuild. I've never actually done it with RAID5; I think it's just "--add", but you should probably get some sleep and do a little research before trying it.
You should also install smartmontools and check out the health of the drives while we're at it.
"get some sleep and do a little research before trying it" -- good advice as usual. :) Will do just that. And thanks for mentioning smartmontools; will make sure I get that running next time too. Cheers!
I suppose you could also see if it magically assembles on a reboot now that it's in mdadm.conf.
Sadly, it didn't magically assemble but good thought!
Morning! Having another go shortly. :) Looks like this command set might be a clue:
mdadm --stop /dev/mdX
mdadm --assemble --force /dev/mdX /dev/sdX /dev/sdY
(from https://superuser.com/questions/993259/why-is-my-raid-1-disk-inactive)
So, "X" would be 2 given that it's in the partition name. But what would "Y" variable be in this scenario?
I'm also guessing I can't proceed with samba or DLNA server setup until i resolve this inactive partition?
It’s probably partition 6. You can check with "--detail" and see what it lists for the other drives.
Your shell expands wildcards of various types before mdadm sees them, so you could use something like “/dev/sd[abcd]6” instead of listing the partitions.
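The bracket pattern is expanded by the shell before mdadm runs, so it is worth previewing what it matches first. A minimal sketch, assuming the members live on sda through sdd; the helper name `expand_members` and the scratch directory standing in for /dev are made up for illustration:

```sh
# Preview which partitions a bracket pattern expands to before handing it to mdadm.
# An unmatched pattern is passed through literally, hence the -e check.
expand_members() {
    # $1: directory to look in (normally /dev), $2: partition number
    for p in "$1"/sd[abcd]"$2"; do
        [ -e "$p" ] && printf '%s\n' "$p"
    done
    return 0
}

# Demo against a scratch directory standing in for /dev (hypothetical layout).
demo_dir=$(mktemp -d)
touch "$demo_dir/sda6" "$demo_dir/sdb6" "$demo_dir/sdc6" "$demo_dir/sdd6" "$demo_dir/sda5"
expand_members "$demo_dir" 6
```

On the real box the call would be `expand_members /dev 6`, and a drive that has dropped off the bus simply won't appear in the list.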
If you wanted you could set up samba/etc without resolving this and just create shares within your rootfs, but I would think this would be the top priority.
That and looking at the SMART data to make sure the drives are okay.
got it. ran the command and it worked but there's one removed drive:
getting the SMART data momentarily.
SMART data looks good!
hmm, just did an --examine --scan and i don't see inactive partitions now:
did the mount array stuff and just verified the free space. all looks ok to me but just wanted to share in case anything jumped off the page to you:
This is the part where things can go surprisingly wrong, so be careful. Have you added the array to your fstab yet? If you have you'll need to test it and update initrd before rebooting.
I'm just going off the top of my head, so you should probably take a look at some guides/documentation regarding adding drives to degraded RAID5 arrays....but
I think all you need to do is add that partition with something like
mdadm --add /dev/md2 /dev/sda6
and then you can watch the progress with "--detail"
It's rebuilding as we speak -- slowly but surely. Exciting! Thanks again! Will add the array to fstab next.
I'm doing this on a machine where I have to sign into VPN from time to time. Probably quite a n00b question, but if I have to close my SSH session, will the rebuild continue uninterrupted and I can check the status later when I have an active connection?
Yeah, the rebuild will continue on its own if you get disconnected.
I'm pretty sure it would even resume after a reboot, but i wouldn't recommend doing that.
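Since the kernel does the rebuild, you can log back in at any point and read the progress from /proc/mdstat. A rough sketch of pulling out just the percentage; the function name `rebuild_progress` and the sample text are made up, and the sample only imitates the usual mdstat layout:

```sh
# Print the rebuild percentage for one md array from /proc/mdstat-style text.
rebuild_progress() {
    # $1: array name such as md2; $2: file to read (defaults to /proc/mdstat)
    awk -v md="$1" '
        $1 == md   { in_array = 1; next }   # header line of the array we care about
        /^[^ \t]/  { in_array = 0 }         # any other unindented line starts a new section
        in_array && /(recovery|resync) *=/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /%$/) { print $i; exit }
        }
    ' "${2:-/proc/mdstat}"
}

# Made-up sample in the usual /proc/mdstat layout, so the function can be tried anywhere.
mdstat_sample=$(mktemp)
cat > "$mdstat_sample" <<'EOF'
Personalities : [raid1] [raid6] [raid5] [raid4]
md2 : active raid5 sda6[4] sdd6[3] sdc6[2] sdb6[1]
      5833930752 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [_UUU]
      [=========>...........]  recovery = 45.0% (875000000/1944643584) finish=300.0min speed=59000K/sec

md0 : active raid1 sda1[0] sdd1[3] sdc1[2] sdb1[1]
      999424 blocks [4/4] [UUUU]

unused devices: <none>
EOF

rebuild_progress md2 "$mdstat_sample"
```

On the NAS itself, `rebuild_progress md2` with no second argument reads the live /proc/mdstat; an empty result means no rebuild is running on that array.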
Cheers! I'll just let it run in the background. If this works out, I'll be ecstatic and can finally start adding some of the features I'm looking to use.
45% and counting! A labour of love. :)
While I wait, do I only need to add md2 (seems to be my main array based on the space) to fstab? no additional entries?
The installer should have added all the other entries for you. You should only have to add the array.
remember before you reboot:
umount /mnt/array
mount /mnt/array
df -h
update-initramfs -u
###only reboot if the array shows up in df and there aren't any errors.
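The "only reboot if it shows up" check can be scripted so it isn't skipped at 2am. A small sketch, assuming the array is mounted at /mnt/array as in the commands above; `check_mounted` is a made-up helper and the sample mounts table (including the filesystem types) is invented for the demo:

```sh
# Succeeds only if the given mount point appears in the mounts table.
check_mounted() {
    # $1: mount point; $2: mounts table to read (defaults to /proc/mounts)
    awk -v mp="$1" '$2 == mp { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# Made-up mounts table so the check can be tried off the NAS.
mounts_sample=$(mktemp)
cat > "$mounts_sample" <<'EOF'
/dev/md1 / ext4 rw,relatime 0 0
/dev/md0 /boot ext3 rw,relatime 0 0
/dev/md2 /mnt/array ext4 rw,relatime 0 0
EOF

if check_mounted /mnt/array "$mounts_sample"; then
    echo "array is mounted: safe to run update-initramfs -u"
else
    echo "array is NOT mounted: fix fstab before rebooting"
fi
```

On the real system you would run the umount/mount pair first and then call `check_mounted /mnt/array` without the second argument.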
I'm curious... is the CPU pegged at 100%?
Will do! And not sure about the CPU. how can i check?
top
Was about to check but I got this unfortunate result with the raid rebuild. :(
I believe the rebuild got as far as 80%
so it seems to think one of the other drives failed during the rebuild.... that's .... a nightmare.
Makes me wish we'd looked at the array before even transferring the installer. It would make more sense to me that the first drive had fallen out of the array before you started but we can't really know now.
I'm not really sure how to proceed as I've never used raid5 or really had any raid failures before. I'd definitely recommend doing some research before proceeding.
what I would probably do:
- read a bunch of pages about this type of recovery, learn how to force a failed drive back into the array (I think that's possible)
- get another drive of the same size (if that's feasible)
- use ddrescue to try to clone the "faulty" drive to the good drive.
- swap in the cloned drive and try to force start the array (could use the original drive if that isn't an option).
- retry the rebuild and hope for the best.
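For the ddrescue step, the usual GNU ddrescue pattern is two passes against the same mapfile. This is only a sketch that prints the commands instead of running them; `rescue_plan`, the device names /dev/sdX and /dev/sdY, and the mapfile path are placeholders, and getting source and destination backwards would destroy the data, so double-check before running anything for real:

```sh
# Print (not run) a two-pass GNU ddrescue plan for cloning a failing disk.
rescue_plan() {
    # $1: failing source disk, $2: replacement disk, $3: mapfile path
    # Pass 1: -f allows writing to a block device, -n skips the slow scraping of bad areas.
    printf 'ddrescue -f -n %s %s %s\n' "$1" "$2" "$3"
    # Pass 2: -d uses direct disc access, -r3 retries the remaining bad areas three times.
    printf 'ddrescue -f -d -r3 %s %s %s\n' "$1" "$2" "$3"
}

# Placeholder device names: substitute the real failing and replacement disks.
rescue_plan /dev/sdX /dev/sdY /root/sdX.map
```

The mapfile is what lets the second pass (or an interrupted first pass) pick up where it left off, so keep it on a disk that isn't part of the rescue.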
Thanks again. Before I proceed further, would you recommend I do a backup in any specific way at this point? wondering if my box is functional enough to add Samba then transfer files onto another drive (I have a 4 tb USB drive). Hindsight is 20/20 and I wish I was patient enough and did a backup first.
As far as I know (not having used raid5).... one downside is that if you have two drives fail (really the number of parity drives +1) there's no recovery. For a plain drive you could clone it and use recovery software and usually get back a lot minus whatever bad sectors, unless it fails completely. But with raid 5 the data is XOR'd at the block level which would leave every byte missing data without the additional piece (I think).
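The XOR point can be shown with three one-byte "blocks". A toy sketch with arbitrary byte values (real RAID5 works on whole chunks and rotates parity across the drives, but the arithmetic is the same):

```sh
# Three data blocks (one byte each here) and their RAID5-style parity.
d1=$(( 0x5A )); d2=$(( 0xC3 )); d3=$(( 0x0F ))
parity=$(( d1 ^ d2 ^ d3 ))

# Lose one block (d2): XOR of the survivors and the parity restores it.
recovered_d2=$(( d1 ^ d3 ^ parity ))
printf 'parity=0x%02X recovered_d2=0x%02X\n' "$parity" "$recovered_d2"

# Lose two blocks (d2 and d3): only their XOR is recoverable, not each value.
d2_xor_d3=$(( d1 ^ parity ))
printf 'd2^d3=0x%02X (many different byte pairs share this XOR)\n' "$d2_xor_d3"
```

With one block missing the answer is exact; with two missing, every stripe is ambiguous, which is why a second failure during a rebuild is so serious.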
You bring up a good point though (something to research). If you do try force adding it back and starting the array you may want to focus on backing up your data rather than adding in another drive.
You could hire professionals to do data recovery for you, I assume they do some combination of the things we've talked about plus they can probably try to physically repair the drive etc. I assume this is terribly expensive and haven't looked into it.
I don't think all hope is lost but it's not a good state to be in.
A good lesson here. Haven't given up hope yet either. Seems this person hit a similar situation and was able to get another drive going, enough to get the data off. https://unix.stackexchange.com/questions/430529/help-recovering-a-raid5-array
That sounds exactly like your situation
May've got a lucky break! I was able to stop and then start the array and i got 3/4 drives:
Nice! I hope your backing up goes well
Thanks so much. I'm almost at a loss as to how to do such a basic thing like that now. lol. Any tips? I'd like to keep my drives in place in the TS (of course) and try to get the files off via my Windows system onto the 4 tb USB drive attached to it. Is the Samba server my only option? Any value in plugging the USB drive into the TS? I know that was possible in the stock firmware...
Going to sleep on it again and say thanks again for your help!
Part of me is tempted to go back to stock firmware, get my SMB1 share going out of the box, and then backup my files that way, but I feel that's probably yet another calculated risk at this point.
An update: I'm now backing up my files! I'm ecstatic.
I mounted my raid array and then installed webmin and samba. From webmin, i set up the samba share in a few steps. Transferring as we speak and will leave this up and running until everything completes. PHEW! So glad I stuck w/ Debian because this is working so well! Have a good night and hopefully it's onward and upward from here!
Nice!
Maybe I shouldn't be so negative about webmin. I think trying to install the mdadm module is what locked the whole thing up. I still don’t recommend letting it manage your drives but samba should be ok.
Hmmm, it worked for a few hours and then I got an i/o error. :( The transfer speed was quite slow, too -- the files are going over a 1 gb LAN to a USB drive attached to my desktop but the ports are USB 2.0 so that's a bottleneck. Transferring a few files directly to my desktop seems to be slow too.
Any chance I could attach the USB drive directly to the TS and transfer somehow that way?
Hold that thought. Am trying FTP and it seems to be holding steady -- still not the fastest of speeds but it's transferred over more than the Samba share last night.
If you'd like, I can close this thread as I've taken a lot of your time. I may have more follow up questions but I don't want to bog you down with an endless thread. Let me know and thanks for your time! If I can ever help with documentation on a volunteer basis, I'd be happy to!
We can keep it open for now. You’ll probably learn some things about webmin that are worth documenting in the near future.
I assume the i/o error was it experiencing the same issue that caused the resync to fail. I wouldn’t think ftp vs samba would make much difference though ftp being unencrypted may be faster.
Hopefully it’s just some bad sectors in an unimportant file and you’ll get back most/all of your data.
Once you’re done backing up I can show you how I typically test/condition drives which will hopefully give us some data about the “bad” drive.
Appreciate that. The transfer is still coming along nicely (slowly but surely) and it's going straight to my drive which will be an integral part of my backup plan going forward!
Talking of which, I definitely want to test the health of the drives as per your suggestion and remove the one that's bad. I don't want to go with RAID5 again. My understanding is I can either have a RAID array with 2 or 4 drives. I'm wondering what level of RAID you'd recommend for my use case (strictly a DLNA + Samba server that serves media to other devices in the household)?
it depends on your situation.
It also depends a bit on how the device handles things. That's why I asked about cpu usage, I'm curious how much cpu rebuilding the array was taking because RAID5 requires running an XOR operation every time it writes something which can use up a lot of CPU on older devices. This model has dedicated hardware in the CPU to speed that up but it could be a limiting factor.
It sounds like you're already planning to separate your raid strategy from your backup strategy which is good.
personally I rely on backups for all of my recovery. I essentially have one big RAID0 array in my PC and another one in my NAS which I keep up to date with some RSYNC updates. This is a somewhat dangerous way to operate since a single drive failure would make me lose anything that wasn't backed up yet but it maximizes performance and allows each array to be as big as possible. It makes dealing with a single drive going bad a pain because I have to rebuild the whole array when that happens.
If you're okay with how much space you'd have you could move to RAID1+0 which would allow you to recover from a single drive failure while giving you better performance but less total space. You could try raid5 again with some newer drives to get more space, but as we've seen this week you'll need a backup either way.
one other thing you should look into, maybe as a documentation homework assignment....
mdadm can email you when something happens with a raid array. This way you'd get an email right away if your array was degraded which can be a life saver. The application I use to handle this is no longer supported in Debian so I'm not sure my walking you through that would be a great solution. You could see if webmin has something for doing that or research alternatives, it'd make a great guide to add to the wiki.
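As a starting point for that homework: mdadm's monitor mode sends alerts to the address on the MAILADDR line of its config file. A small sketch that checks whether one is set; `mail_target` is a made-up helper, the address is a placeholder, and the sample config reuses the md10 UUID posted later in this thread:

```sh
# Print the alert address from an mdadm config file; fail if none is set.
mail_target() {
    # $1: config file (Debian keeps it at /etc/mdadm/mdadm.conf)
    awk '$1 == "MAILADDR" { print $2; found = 1; exit } END { exit !found }' \
        "${1:-/etc/mdadm/mdadm.conf}"
}

# Sample config for the demo; admin@example.com is a placeholder address.
conf_sample=$(mktemp)
cat > "$conf_sample" <<'EOF'
ARRAY /dev/md10 UUID=42d48385:163a55cd:7bc70b78:e5487c19
MAILADDR admin@example.com
EOF

mail_target "$conf_sample" || echo "no MAILADDR set: mdadm has nowhere to send alerts"
```

Once an address is set, `mdadm --monitor --scan --oneshot --test` should generate a test alert for each array. Actually delivering it still needs some working mail transport on the box, which is the part that needs researching.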
I'll definitely look into that suggestion for docs! Good idea. I'll let you know how it goes. Want me to just capture a draft separately and share with you? Or is there a way I can post it directly to the wiki and then you can provide feedback/approve/reject?
Took a little break after I got the backups underway. I recovered roughly 80% of the files; I'll try for the remaining ones today but I'd chalk this up as a lucky break, for sure. I'll look at rebuilding the array with perhaps a different architecture. If you have time, let me know about the test steps you follow for checking drive health. I'd like to do that before I actively look at a new raid array. Would like to know how to cleanly wipe the drives, too, and just "start over". :) Thanks again!
SMART data is great but it has limitations. One thing to understand is that the drive often won't know there's a problem with a sector until it tries to read it and generally won't try to correct/replace a sector until you try to write it. This means the best way to figure out the health of the drive is to read/write every sector and look at the SMART data (and watch for IO errors).
Once you're done recovering data, I think you should do the following:
update-initramfs -u
to make sure you don't have reboot issues
mdadm --stop /dev/md2
mdadm --zero-superblock /dev/sd[abcd]6
then for each disk (1 at a time)
remove the drive from each raid 1 array
mdadm --fail /dev/md0 /dev/sdd1
mdadm --remove /dev/md0 /dev/sdd1
mdadm --fail /dev/md10 /dev/sdd2
mdadm --remove /dev/md10 /dev/sdd2
mdadm --fail /dev/md1 /dev/sdd3
mdadm --remove /dev/md1 /dev/sdd3
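use the badblocks utility to test every sector of the drive (takes a long time). As written below it only reads; adding -n makes it a non-destructive read-write test, while -w does a destructive write test that wipes the whole drive, partition table included.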
badblocks -b 4096 -c 256 /dev/sdd
if it finishes without throwing i/o errors, look at the SMART data and check for reallocated sectors
smartctl -a /dev/sdd
if it has no reallocated sectors and doesn't throw io errors, re-add to the arrays
mdadm --add /dev/md0 /dev/sdd1
.....etc
watch and make sure they finish resyncing
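For the "check for reallocated sectors" step, the number to look at is the raw value of attribute 5 in the SMART attribute table. A rough sketch that pulls it out of `smartctl -A` output; `reallocated_sectors` is a made-up helper and the sample lines are invented, only imitating the usual column layout:

```sh
# Read `smartctl -A` output on stdin and print the raw Reallocated_Sector_Ct value.
reallocated_sectors() {
    awk '$2 == "Reallocated_Sector_Ct" { print $10 }'
}

# Invented sample in the usual attribute-table layout (RAW_VALUE is the 10th column).
smart_sample=$(mktemp)
cat > "$smart_sample" <<'EOF'
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       8
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
EOF

reallocated_sectors < "$smart_sample"
```

On the NAS this would be `smartctl -A /dev/sdd | reallocated_sectors`. Some drives name or report attributes differently, so treat an empty result as "look at the full output" rather than "zero".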
Thanks. Steps were working out until I hit an error when i get to mdadm --fail /dev/md10 /dev/sdd2
I get:
mdadm: set device faulty failed for /dev/sdd2: No such device
Details on md10:
Raid Devices : 4
Total Devices : 3
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Fri Sep 25 13:03:30 2020
State : clean, degraded
Active Devices : 3
Working Devices : 3
Failed Devices : 0
Spare Devices : 0
Name : TS-XEL-EM02C:10
UUID : 42d48385:163a55cd:7bc70b78:e5487c19
Events : 6764
    Number   Major   Minor   RaidDevice   State
       7       8        5        0        active sync   /dev/sda5
       -       0        0        1        removed
       4       8       53        2        active sync   /dev/sdd5
       5       8       37        3        active sync   /dev/sdc5
the raid info in webmin as well, FWIW.
I've probably got some of the device names wrong, feel free to substitute with the correct ones.
it looks like sdb is already missing from all the arrays, you could start with that one if you want.
Thanks! So just to be clear, the goal is to fail and remove all drives from all existing arrays before wipe/rebuild steps?
I have stock firmware and wanted to try the Debian steps to gain SMB2 for shares. I was able to send the two installer files over ACP commander to my TS-XEL02. I picked the appropriate files for that model and renamed them accordingly.
Then I did the command to restart and now the device does part of the power cycle but then shuts down after a minute. I'm not able to proceed to the installation steps.