JacksonWrath / homelab-base

Homelab - A History

Asuna has gremlins #24

Closed by JacksonWrath 2 months ago

JacksonWrath commented 3 months ago

...likely PSU gremlins. It's also possible they have spread to the mobo.

I've had enough HDDs die for this to be more than coincidence, especially now that an almost-brand-new drive is reporting UREs after the resilver. Additionally, that resilver reported the exact same number of checksum errors on all of the healthy drives. That points to a controller issue, which is on the motherboard (drives are direct-connected; I pulled out the LSI controller a while back).

JacksonWrath commented 2 months ago

New PSU has arrived, pending replacement. If this issue doesn't go away, time for a new motherboard; thankfully those are still in stock, and relatively cheap since they're a couple gens old.

New HDD has been ordered. Once replaced, need to return the one that's reporting UREs (sorry reseller on Amazon).

JacksonWrath commented 2 months ago

PSU and HDD swapped. Resilver completed and claimed it found new checksum errors (614 across the good drives) on 2 files. Notably, these were different than the previous ones it listed; the previous files are now supposedly good.

I replaced those 2 files for good measure, ran zpool clear, and ran a scrub. It again found a different number of errors (510 on all drives), but the verbose output claimed no files were affected.

I cleared them again and started a 2nd scrub this morning. A bad motherboard is highly likely, but we'll see what it says for this run.

JacksonWrath commented 2 months ago

2nd scrub reported 510 checksum errors again, and again no actual data errors (supposedly). Before ordering a new motherboard, I did a little more research.

It turns out that RAIDZ stripes each block across drives, so when a block fails its checksum and can't be repaired, ZFS can't tell which drive was the problem, and it reports the error for all drives the block was on (which, in my small array, is always all of them). Therefore, it's not necessarily a controller/motherboard issue.
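To make that reasoning concrete, here is a toy model in Python. It is not ZFS code: the single XOR parity column, the SHA-256 block checksum, and the names `write_block` and `blame` are all assumptions made for illustration. It only shows why a block-level checksum can pin the blame on one drive when a single column is bad, but ends up charging every drive when no single-column rebuild matches.

```python
import hashlib


def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length columns."""
    return bytes(x ^ y for x, y in zip(a, b))


def write_block(data: bytes, n_data: int):
    """Split data into n_data equal columns plus one XOR parity column.

    Returns (columns, checksum). The checksum covers the whole block,
    not the individual columns. len(data) must be divisible by n_data.
    """
    size = len(data) // n_data
    cols = [data[i * size:(i + 1) * size] for i in range(n_data)]
    parity = cols[0]
    for col in cols[1:]:
        parity = xor(parity, col)
    return cols + [parity], hashlib.sha256(data).digest()


def blame(cols, checksum):
    """Return the indices of the columns (drives) charged with an error."""
    n_data = len(cols) - 1
    if hashlib.sha256(b"".join(cols[:n_data])).digest() == checksum:
        return []  # block verifies, nobody is blamed
    # Assume each data column in turn is the bad one, rebuild it from
    # parity and the other columns, and re-check the block checksum.
    for bad in range(n_data):
        rebuilt = cols[n_data]
        for i in range(n_data):
            if i != bad:
                rebuilt = xor(rebuilt, cols[i])
        trial = cols[:bad] + [rebuilt] + cols[bad + 1:n_data]
        if hashlib.sha256(b"".join(trial)).digest() == checksum:
            return [bad]  # exactly one culprit identified
    # No single-column rebuild matches: every column gets the error.
    return list(range(len(cols)))
```

In this model, corrupting one data column lets `blame` name that column, while corrupting two returns every index, which mirrors identical error counts showing up on all drives.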

A bad SATA cable is the likely culprit now. I'm using basically spare cables that have been kicking around with me for damn near a decade, and I recently ordered a couple 5-packs of (hopefully) nicer cables. I'm just going to replace them all and rescrub.

JacksonWrath commented 2 months ago

New SATA cables didn't resolve it. The scrub still shows the 510 checksum errors, without data errors.

I'm gonna do a quick check that it's the same block IDs being flagged (I have the previous output of zpool events -v). If they are, I wonder if it's the old snapshots being flagged but for some reason ZFS doesn't report those files anymore, in which case deleting the old snapshots would clear them.
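A rough sketch of that comparison in Python, assuming both runs of zpool events -v were saved as text. The layout is an assumption: each event starts with an unindented header line ending in the event class, followed by indented `name = 0x...` fields including `zio_objset`, `zio_object`, `zio_level`, and `zio_blkid`. The function names are made up for this example.

```python
import re
from collections import Counter

KEYS = ("zio_objset", "zio_object", "zio_level", "zio_blkid")
FIELD = re.compile(r"^\s+(\w+)\s*=\s*(0x[0-9a-fA-F]+)\s*$")


def flagged_blocks(events_text: str) -> Counter:
    """Count checksum ereports per (objset, object, level, blkid)."""
    counts = Counter()
    fields = None  # fields of the checksum event being read, else None
    # The trailing sentinel acts as a final header so the last event is flushed.
    for line in events_text.splitlines() + ["<end>"]:
        if line and not line[0].isspace():
            # Unindented line: a new event starts, so record the previous one.
            if fields is not None and all(k in fields for k in KEYS):
                counts[tuple(fields[k] for k in KEYS)] += 1
            is_checksum = line.rstrip().endswith("ereport.fs.zfs.checksum")
            fields = {} if is_checksum else None
        elif fields is not None:
            m = FIELD.match(line)
            if m and m.group(1) in KEYS:
                fields[m.group(1)] = int(m.group(2), 16)
    return counts


def compare_scrubs(previous_text: str, latest_text: str):
    """Return (blocks in both scrubs, blocks only in the latest scrub)."""
    previous = flagged_blocks(previous_text)
    latest = flagged_blocks(latest_text)
    return set(latest) & set(previous), set(latest) - set(previous)
```

If the second set comes back empty, everything flagged in the latest scrub was already flagged before. The `Counter` values give the number of reports per block, which is handy for spotting fewer blocks with more hits each.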

If they aren't the same, or deleting the snapshots doesn't help, time for a new motherboard I guess.

JacksonWrath commented 2 months ago

Interestingly, there are fewer distinct blkids with issues, but more errors per blkid. It does appear that anything flagged in the latest scrub was also flagged in the previous scrub, though.

The checksum errors all occur within the last few minutes of the scrub, both times. For the latest scrub, all errors occurred within 2 minutes, then the scrub completed 13 minutes later.
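The timing can be pulled out of the same saved output. This is a sketch only, and it assumes each event in zpool events -v carries an indented `time = 0x<seconds> 0x<nanoseconds>` field and an unindented header line ending in the event class. `checksum_error_window` is a made-up name.

```python
import re

TIME = re.compile(r"^\s+time\s*=\s*(0x[0-9a-fA-F]+)\s+(0x[0-9a-fA-F]+)\s*$")


def checksum_error_window(events_text: str):
    """Return (first, last) epoch seconds of checksum ereports, or None."""
    stamps = []
    in_checksum = False
    for line in events_text.splitlines():
        if line and not line[0].isspace():
            # Header line: only track fields of checksum ereports.
            in_checksum = line.rstrip().endswith("ereport.fs.zfs.checksum")
        elif in_checksum:
            m = TIME.match(line)
            if m:
                stamps.append(int(m.group(1), 16))  # seconds part only
    return (min(stamps), max(stamps)) if stamps else None
```

Subtracting the two values gives the width of the error burst in seconds, which can then be compared against when the scrub finished.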

I'm going to try deleting snapshots up to June 26th (first snapshot after I last replaced any files) and see what happens.
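One cautious way to pick the snapshots is to filter a listing by creation time and print the destroy commands for review before running anything. The Python sketch below assumes the listing came from `zfs list -H -p -t snapshot -o name,creation` (tab-separated, creation as epoch seconds). The function name is hypothetical, and whether the cutoff snapshot itself should go is left to the caller; this version keeps it.

```python
from datetime import datetime


def snapshots_before(listing: str, cutoff: datetime) -> list:
    """Return names of snapshots created strictly before the cutoff.

    listing is the saved text of:
        zfs list -H -p -t snapshot -o name,creation
    """
    cutoff_epoch = cutoff.timestamp()
    names = []
    for line in listing.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, creation = line.split("\t")
        if int(creation) < cutoff_epoch:
            names.append(name)
    return names


def destroy_commands(names) -> list:
    """Build the commands as text only (dry run); nothing is executed."""
    return ["zfs destroy " + name for name in names]
```

Printing the result of `destroy_commands` first makes it easy to eyeball the list before deleting anything for real.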

JacksonWrath commented 2 months ago

Finally. That cleared the checksum errors. No idea why it wouldn't surface that it was the snapshots' files that were problematic when it previously did so. Whatever, I have a clean scrub. (Latest snapshot and replication also succeeded just fine.)

I don't consider the PSU and cable swaps a waste, because I still suspect that was causing my excessive drive failures and random checksum problems. Time will tell on that. If anything else comes up in the near future...new motherboard.