Open mitar opened 6 years ago
And use the `smartctl` tool to see the status before and after `dd`, so that you know whether you really fixed the problem.
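One way to do the before/after comparison is to save each `smartctl -a` report to a file and pull out the raw value of a single attribute. This is only a minimal sketch: the helper name `smart_raw` and the file names are made up for illustration, and it assumes the ten-column attribute table that smartctl prints for ATA disks.

```sh
# smart_raw ATTRIBUTE FILE
# Print the RAW_VALUE column (10th field) of one attribute from a saved
# "smartctl -a" report. smart_raw is a hypothetical helper, not part of
# smartmontools.
smart_raw() {
    awk -v name="$1" '$2 == name { print $10; exit }' "$2"
}

# Example (before.txt and after.txt are placeholder file names):
#   smartctl -a -d 3ware,13 /dev/twa0 > before.txt
#   ... run dd ...
#   smartctl -a -d 3ware,13 /dev/twa0 > after.txt
#   smart_raw Current_Pending_Sector before.txt
#   smart_raw Current_Pending_Sector after.txt
```

If the two numbers are the same, the write most likely did not touch the pending sectors.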
Are you saying to physically remove the disk and do this from a different machine, or do you mean something from a shell?
Everything from the shell. You just remove it from the md (software) RAID using the md command (`mdadm`). See `cat /proc/mdstat` to see how the RAID devices are made up and which drives are in them. Read up on md RAID.
I suggest that before you do stuff, you write a proposal here for how to do it. Oh, and the commands for this are probably in the bash history, from when I was doing this last time.
The tail end of something with `/dev/twa0` is at the very beginning of the `.bash_history` archive I made. Based on that + man pages, I ran the following read-only commands:
`sudo lshw | less`, relevant part:
*-storage
description: RAID bus controller
product: 9650SE SATA-II RAID PCIe
vendor: 3ware Inc
physical id: 0
bus info: pci@0000:01:00.0
logical name: scsi2
version: 01
width: 64 bits
clock: 33MHz
capabilities: storage pm msi pciexpress bus_master cap_list rom emulated
configuration: driver=3w-9xxx latency=0
resources: irq:16 memory:ec000000-edffffff memory:ea100000-ea100fff ioport:4000(size=256) memory:ea120000-ea13ffff
*-disk:13
description: SCSI Disk
product: 9650SE-16M DISK
vendor: AMCC
physical id: 0.6.0
bus info: scsi@2:0.6.0
logical name: /dev/sdh
version: 4.10
serial: N057468257C49C009001
size: 2793GiB (2999GB)
capabilities: gpt-1.00 partitioned partitioned:gpt
configuration: ansiversion=5 guid=0d8a65b1-a403-4a9d-bf34-eaab03adbee5 logicalsectorsize=512 sectorsize=512
*-volume
description: RAID partition
vendor: Linux
physical id: 1
bus info: scsi@2:0.6.0,1
logical name: /dev/sdh1
serial: ffb77c6e-f323-471d-8796-eb5037f27b31
capacity: 2793GiB
capabilities: multi
I think this is the relevant part because it's attached to the only device that mentions "3ware" and it says "disk:13". But the disk size and volume capacity are much larger than 750 GB, so I'm not sure of this. It would make more sense if it were `/dev/sdg`, `/dev/sdj`, `/dev/sdk`, `/dev/sdm`, `/dev/sdn`, `/dev/sdo`, or `/dev/sdp` (corresponding to `md7`, `md6`, or `md5`). Hopefully `smartctl` will tell me which it is. But I'll continue as if it's `/dev/sdh`.
`cat /proc/mdstat` output:
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md7 : active raid1 sdo1[0] sdp1[1]
732277568 blocks super 1.2 [2/2] [UU]
md6 : active raid1 sdm1[0] sdn1[1] 732277568 blocks super 1.2 [2/2] [UU]
md5 : active raid1 sdk1[3] sdj1[2] 732277568 blocks super 1.2 [2/2] [UU]
md3 : active raid1 sdl1[1] sdi1[0] 2929542976 blocks super 1.2 [2/2] [UU]
md2 : active raid1 sdh1[1] sdf1[2] 2929542976 blocks super 1.2 [2/2] [UU]
md0 : active raid1 sdb1[3] sdc1[2] 2929542976 blocks super 1.2 [2/2] [UU]
md1 : active raid1 sde1[2] sdd1[3] 2929542976 blocks super 1.2 [2/2] [UU]
unused devices:
This tells me `/dev/sdh1` is on `/dev/md2`, and so I assume `/dev/sdf1` is its mirror.
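To double-check the mapping, and the size concern raised earlier, the mdstat output can be condensed to one line per array with its size in bytes and its members. This is a rough sketch: `md_summary` is a hypothetical helper name, and it assumes that `/proc/mdstat` counts array sizes in 1 KiB blocks.

```sh
# md_summary [FILE]
# For each md array print: name, size in bytes, member partitions.
# Assumes mdstat "blocks" are 1 KiB each. md_summary is a made-up helper.
md_summary() {
    awk '
        $2 == ":" && $1 ~ /^md/ {
            # Array header line: collect members such as "sdh1[1]".
            name = $1
            members = ""
            for (i = 3; i <= NF; i++)
                if ($i ~ /\[[0-9]+\]/) {
                    m = $i
                    sub(/\[.*/, "", m)
                    members = members " " m
                }
        }
        {
            # The block count may be on the header line or on the next one.
            for (i = 2; i <= NF; i++)
                if ($i == "blocks" && name != "") {
                    printf "%s %.0f%s\n", name, $(i - 1) * 1024, members
                    name = ""
                }
        }
    ' "${1:-/proc/mdstat}"
}

# Example: md_summary | grep -w sdh1
```

With the output above, `md2` comes out at 2999852007424 bytes (about 3 TB) and `md5`, `md6` and `md7` at 749852229632 bytes (about 750 GB), which can then be compared with the capacity that `smartctl` reports for the failing disk.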
Proposal for how to proceed:
1. Record the output of `sudo smartctl -a -d 3ware,13 /dev/twa0`, which should give the current status
2. `sudo su`
3. `smartctl -a -d 3ware,13 /dev/twa0 | grep Current_Pending_Sector` (in case the output from 1 was long)
4. Should I run `smartctl -t long -d 3ware,13 /dev/twa0`? That is in your bash history, but since we already know the error message I'm not sure this is needed.
5. Resolve the pending sectors (by telling whatever's "pending" to fail): `mdadm --manage /dev/md2 --fail /dev/sdh1`
6. Remove the offending disk: `mdadm --manage /dev/md2 --remove /dev/sdh1`
7. Take another snapshot of `cat /proc/mdstat`: This time, I expect it to list `/dev/sdf1` as the only disk on `/dev/md2`, and to list `/dev/sdh` under "unused devices".
8. Overwrite the disk with zeros: `dd if=/dev/zero of=/dev/sdh bs=1M oflag=direct,sync`. This block size should work if it's the same kind of disk as when you did this before, but I should probably look at `parted /dev/sdh` to be sure (I've never used `parted` but I'm familiar with `gparted`). `lshw` did say the sector size is 512, and I'm not sure if that's the same thing.
9. On your third bullet point, where is the spare sector being swapped in from? The mirror disk? In any case, I should run `smartctl -a -d 3ware,13 /dev/twa0` again here and hope the output makes sense.
10. Reformat the disk back: `fdisk /dev/sdh` -> `n` for new partition, `p` for primary, defaults for first and last cylinder, and `w` to write and exit
11. Re-attach it back to RAID: `mdadm --manage /dev/md2 --add /dev/sdh1`
12. Repeat `cat /proc/mdstat` to ensure that everything makes sense.
Does that seem reasonable?
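For reference, the non-interactive steps of the proposal could be collected in a small dry-run script, so the exact commands can be reviewed before anything is executed. This is only a sketch of the plan above, not a tested procedure: `run`, `replace_plan` and the `RUN` variable are made-up names, the device names are the ones guessed in this thread, and the interactive `fdisk` step is left out.

```sh
# Dry run by default: commands are only printed.
# Set RUN=1 in the environment to really execute them (destructive!).
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# replace_plan MD_DEVICE PARTITION DISK 3WARE_PORT
replace_plan() {
    md=$1 part=$2 disk=$3 port=$4
    run smartctl -a -d "3ware,$port" /dev/twa0     # status before
    run mdadm --manage "$md" --fail "$part"        # mark member as failed
    run mdadm --manage "$md" --remove "$part"      # take it out of the array
    run cat /proc/mdstat
    run dd if=/dev/zero of="$disk" bs=1M oflag=direct,sync
    run smartctl -a -d "3ware,$port" /dev/twa0     # status after
    # Repartitioning with fdisk is interactive and must be done by hand here.
    run mdadm --manage "$md" --add "$part"         # re-add, triggers resync
    run cat /proc/mdstat
}

# Example: replace_plan /dev/md2 /dev/sdh1 /dev/sdh 13
```

Printing first and executing only with `RUN=1` makes it harder to zero a disk by accident while the device mapping is still uncertain.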
Bump? I think it got worse...
Before:
root@server3:/home/cloyne# sudo smartctl -a -d 3ware,13 /dev/twa0
smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-130-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda ES
Device Model: ST3750640NS
Serial Number: 3QD0B5XJ
Firmware Version: 3.AEE
User Capacity: 750,156,374,016 bytes [750 GB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA/ATAPI-7 (minor revision not indicated)
Local Time is: Sat Oct 20 11:09:59 2018 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 430) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 202) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 105 088 006 Pre-fail Always - 65372609
3 Spin_Up_Time 0x0003 087 085 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 135
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 2
7 Seek_Error_Rate 0x000f 087 060 030 Pre-fail Always - 499263873
9 Power_On_Hours 0x0032 055 055 000 Old_age Always - 39562
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 137
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 061 044 045 Old_age Always In_the_past 39 (Min/Max 38/43)
194 Temperature_Celsius 0x0022 039 056 000 Old_age Always - 39 (0 13 0 0 0)
195 Hardware_ECC_Recovered 0x001a 062 047 000 Old_age Always - 228011716
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 2
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 2
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0000 100 253 000 Old_age Offline - 0
202 Data_Address_Mark_Errs 0x0032 100 253 000 Old_age Always - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 16244 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
`dd if=/dev/zero of=/dev/sdh bs=1M oflag=direct,sync status=progress` terminated with `dd: error writing '/dev/sdh': No space left on device`, as expected. `smartctl -a` after:
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 105 088 006 Pre-fail Always - 65372609
3 Spin_Up_Time 0x0003 087 085 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 135
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 2
7 Seek_Error_Rate 0x000f 087 060 030 Pre-fail Always - 501992622
9 Power_On_Hours 0x0032 055 055 000 Old_age Always - 39586
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 137
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 062 044 045 Old_age Always In_the_past 38 (Min/Max 38/43)
194 Temperature_Celsius 0x0022 038 056 000 Old_age Always - 38 (0 13 0 0 0)
195 Hardware_ECC_Recovered 0x001a 068 047 000 Old_age Always - 204145678
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 2
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 2
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0000 100 253 000 Old_age Offline - 0
202 Data_Address_Mark_Errs 0x0032 100 253 000 Old_age Always - 0
Since it still says `Current_Pending_Sector` as 2, I tried `smartctl -t long -d 3ware,13 /dev/twa0`. The result: `# 1 Extended offline Completed: read failure 90% 39593 1465138515`
So, I tried dd'ing just that sector, with `dd if=/dev/zero of=/dev/ada4 bs=512 count=1 seek=1465138515 conv=noerror,sync`. No errors, but the error counts are unchanged. Now running the long test again.
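The failing LBA can be pulled out of a saved self-test log instead of being copied by hand. A small sketch, assuming the log layout shown in this thread (newest entry first, LBA in the last column); `first_error_lba` and `selftest.txt` are made-up names.

```sh
# first_error_lba FILE
# Print LBA_of_first_error (last column) of the first listed self-test
# entry that ended in a read failure. Hypothetical helper.
first_error_lba() {
    awk '/^#/ && /read failure/ { print $NF; exit }' "$1"
}

# Example (selftest.txt is a placeholder file name):
#   smartctl -l selftest -d 3ware,13 /dev/twa0 > selftest.txt
#   first_error_lba selftest.txt
```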
One important aspect is that the write size has to match the sector size. I think your sector size is 4k, not 512B?
> Bump? I think it got worse...
And yes, now you also have `Offline_Uncorrectable`, but the number is still low. If you see it increasing, then the hard drive has really gone bad. 2 is still an OK number, but yeah, ideally it would all be 0.
I was going by the top of the `smartctl` output, which says `Sector Size: 512 bytes logical/physical`. I did also try with `bs=4096` and `bs=1M`, and they all finished without errors. But the second long test still failed on the same sector:
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 90% 39845 1465138515
# 2 Extended offline Completed: read failure 90% 39593 1465138515
# 3 Extended offline Completed without error 00% 16244 -
root@server3:/proc/sys# dd if=/dev/zero of=/dev/ada4 bs=4096 count=1 seek=1465138515 conv=noerror,sync
1+0 records in
1+0 records out
4096 bytes (4.1 kB, 4.0 KiB) copied, 0.000267878 s, 15.3 MB/s
root@server3:/proc/sys# dd if=/dev/zero of=/dev/ada4 bs=1M count=1 seek=1465138515 conv=noerror,sync
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.00212421 s, 494 MB/s
root@server3:/proc/sys# dd if=/dev/zero of=/dev/ada4 bs=4k count=1 seek=1465138515 conv=noerror,sync
1+0 records in
1+0 records out
4096 bytes (4.1 kB, 4.0 KiB) copied, 0.00023148 s, 17.7 MB/s
root@server3:/proc/sys# dd if=/dev/zero of=/dev/ada4 bs=512 count=1 seek=1465138515 conv=noerror,sync
1+0 records in
1+0 records out
512 bytes copied, 0.000233307 s, 2.2 MB/s
Wait, sorry, clearly `ada4` was a typo. Will retry with the correct dev name.
Hm, also I am not sure if the LBA address directly translates to the `seek` argument of the `dd` command?
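As far as I understand `dd`, `seek=N` skips N blocks of the output block size (`bs`), so it only equals the LBA when `bs` is the same as the sector size the LBA is counted in. A small worked example, assuming the LBA from the self-test log is counted in the 512-byte logical sectors that `smartctl` reports; `lba_to_seek` is a made-up helper name.

```sh
# lba_to_seek LBA SECTOR_SIZE BS
# Print "<seek> <byte offset inside that block>" for dd with the given bs.
# Hypothetical helper; needs 64-bit shell arithmetic.
lba_to_seek() {
    offset=$(( $1 * $2 ))                 # byte offset of the sector on disk
    echo "$(( offset / $3 )) $(( offset % $3 ))"
}

lba_to_seek 1465138515 512 512     # prints "1465138515 0"
lba_to_seek 1465138515 512 4096    # prints "183142314 1536"
```

So with `bs=512` the LBA can be used as `seek` directly, while with `bs=4096` the same sector sits 1536 bytes into block 183142314. Using `seek=1465138515` together with `bs=4096` or `bs=1M` would point far beyond the end of a 750 GB disk.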
Oh, one thing. `dd` fixes `Current_Pending_Sector` to become `Reallocated_Sector_Ct`.
And I think the self-test fails on `Offline_Uncorrectable`. Please check on Google. So I am not sure if writing to `Offline_Uncorrectable` locations helps anything. But the fact that the self-test still tries to do anything with `Offline_Uncorrectable` feels a bit scary to me: does this mean the regular file system might also try to write there? I thought `Offline_Uncorrectable` are sectors the disk will not use at all and not expose to the system.
Maybe it is time to simply remove this disk if self-test is failing.
I retried all the above steps with the correct drive name (`/dev/sdh` instead of `/dev/ada4`), and played around with the block size/seek numbers, but still haven't gotten it to work. I tried re-running `dd` with `bs=512` since that seems to be how the underlying disk self-identifies, but days later it hasn't finished yet and is crawling along at 21.3 kB/s... is that transfer speed in and of itself enough to pronounce it dead?
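To put that speed into perspective, here is a back-of-the-envelope estimate of how long a full pass would take. It assumes 21.3 kB/s means 21300 bytes per second and uses the 750,156,374,016 byte capacity from the `smartctl` output; `eta_days` is a made-up helper name.

```sh
# eta_days TOTAL_BYTES BYTES_PER_SECOND
# Whole days needed to write TOTAL_BYTES at a constant rate (rounded down).
eta_days() {
    echo $(( $1 / $2 / 86400 ))
}

eta_days 750156374016 21300    # prints 407
```

That is more than a year for a single pass at this rate, so waiting for it to finish is not really an option.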
Lol this is slow. I would suggest we remove it, yes.
For some time now server3 has had 2 unreadable (pending) sectors. This is good, because it means the problem is not growing and we can keep those disks. But this should still be fixed.
So the issue here is that the disk detected two sectors to be bad. This is why they are pending. You should resolve those pending sectors and get the disk to remove them from use and replace them with spare sectors. The process is a bit involved, but it is fun:
- [...] (`dd` tool) with write blocks of size equal to disk blocks with zeros
- [...] (`fdisk` tool)

I suggest you go through and figure out all the commands to do this and then log them here.