frazer-lab / cluster

Repo for cluster issues.

a drive failure on backup server #228

Closed hirokomatsui closed 7 years ago

hirokomatsui commented 7 years ago

[Attached photo: img_1802]

It looks like the backup server is not currently mounted on fl-hn1.

One of the drives on the server has a red LED lit. In the attached pic, it is the 3rd from the top and 3rd from the left. I believe it's still under warranty.

tatarsky commented 7 years ago

I don't see what you mean by the "not mounted" part.

$ cd /backup01/
fl-hn1 [backup01]$ ls
projects  test
$ df -h .
Filesystem         Size  Used Avail Use% Mounted on
fl-nas1:/backup01  281T  169T  113T  61% /backup01

Please remember these are NFS automounts, which means they get mounted when you access them.
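
The on-demand behavior can be checked with a small sketch like the one below. It assumes a Linux host where the mount table is readable at /proc/mounts; `is_mounted` is a hypothetical helper name, not something that exists on the cluster. It only counts NFS entries, because an automounter may leave its own `autofs` placeholder on the same path before the real mount happens.

```shell
# is_mounted MOUNTPOINT [MOUNT_TABLE]
# Hypothetical helper: returns 0 if MOUNTPOINT currently has an NFS
# filesystem mounted on it, according to MOUNT_TABLE (default: /proc/mounts).
is_mounted() {
    mp=${1%/}                 # drop a trailing slash: /backup01/ -> /backup01
    [ -n "$mp" ] || mp=/      # the root directory itself
    table=${2:-/proc/mounts}
    # Field 2 is the mount point, field 3 the filesystem type (nfs, nfs4, ...).
    awk -v mp="$mp" '$2 == mp && $3 ~ /^nfs/ { found = 1 } END { exit !found }' "$table"
}

# Example session (commented out so the block has no side effects):
#   is_mounted /backup01 || echo "not mounted yet"
#   ls /backup01 >/dev/null      # any access triggers the automount
#   is_mounted /backup01 && echo "mounted now"
```

An idle automount is normally unmounted again after a timeout, so "not mounted" right before an access is expected and not a fault.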

I am checking on the drive state. I got no alerts, so I'm curious what's involved.

tatarsky commented 7 years ago

Hmm. No reported drive failures. Looking closely at the picture.

tatarsky commented 7 years ago

Weird. I show the "locate" mode enabled on that drive, which I can't seem to turn off. But the drive appears fine. Investigating.

tatarsky commented 7 years ago

Please confirm, however, that you see what I mean: /backup01 is mounted on fl-hn1.

tatarsky commented 7 years ago

We'll call this drive in to AHPC regardless. I cannot see any reason why I should not be able to disable "locate" on it. But it appears intact and functioning. Weird. I'll kick them an email.

tatarsky commented 7 years ago

Nice spot BTW! I have no alert for "locate light stuck on" ;)

tatarsky commented 7 years ago

Oh, and remember I have /backup01 only on fl-hn1. If you want it on fl-hn2, just let me know.

tatarsky commented 7 years ago

I may have just turned that light off. But remember the following:

0:0:24:0/Slot15 sdp fault_off locate_off

That was "LOCATE_ON" before. I don't recall turning it on, and I don't think there is any way to press a button on that drive cage to do so. So at the moment I think we will still call in the drive.

The zpool is clean, however, and that drive reports no errors. Still, I would feel better getting it replaced if they will let us.
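
For anyone who wants to watch for a locate light left on, here is a minimal sketch that pulls the locate flag out of a status line shaped like the one above. The line format is taken only from that single example, the thread does not say which tool printed it, and `locate_state` is a hypothetical helper name.

```shell
# locate_state LINE
# Hypothetical helper: prints the locate flag (lowercased) found in a status
# line such as "0:0:24:0/Slot15 sdp fault_off locate_off".
# Returns non-zero if the line carries no locate_* field.
locate_state() {
    printf '%s\n' "$1" | awk '
        {
            for (i = 1; i <= NF; i++)
                if (tolower($i) ~ /^locate_/) { print tolower($i); found = 1 }
        }
        END { exit !found }'
}

# Example:
#   locate_state "0:0:24:0/Slot15 sdp fault_off locate_off"
```

A check built on this could flag any slot still reporting `locate_on`, which is the case no existing alert covered.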

tatarsky commented 7 years ago

smartctl -H /dev/sdp
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-642.6.1.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

SMART Health Status: OK

tatarsky commented 7 years ago

Just noting that for the record. It's of the opinion it's fine, so we'll ask AHPC what our options are.

tatarsky commented 7 years ago

Interestingly, it appears I did turn on locate back when this machine was delivered; I can tell from the root history. I have no idea why I did that. Do you remember anything like that? I'll check my old emails, but I definitely turned on locate on that drive, and I show no evidence that I turned it back off until now.

tatarsky commented 7 years ago

And my command to turn off the light was wrong, so my statement of "I cannot turn off the locate light" is false. I just cannot read the man page for ledctl (the program that turns the light on and off) properly.
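
For reference, ledctl takes a pattern name and a device in the form `pattern=device`, with `locate` and `locate_off` as the two patterns relevant here. The sketch below only prints the command instead of running it, since the real thing needs root and an enclosure that ledctl supports. `locate_led_cmd` is a hypothetical wrapper name, and the exact commands used on fl-nas1 are not shown in this thread.

```shell
# locate_led_cmd on|off DEVICE
# Hypothetical dry-run wrapper: prints the ledctl invocation that would
# switch the locate LED for DEVICE on or off. It does not execute ledctl.
locate_led_cmd() {
    case $1 in
        on)  pattern=locate ;;
        off) pattern=locate_off ;;
        *)   echo "usage: locate_led_cmd on|off DEVICE" >&2; return 2 ;;
    esac
    printf 'ledctl %s=%s\n' "$pattern" "$2"
}

# Example:
#   locate_led_cmd off /dev/sdp
```

Printing first makes it easy to compare the command against the man page before running it as root.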

So at the moment I believe this was "my error" and that the drive is fine. Please advise what you want to do. The zpool is 100% OK.

tatarsky commented 7 years ago

Aha! (The sound of memory cells working.) I have reconstructed part of the chain of events by looking at the commands in the history from before I enabled locate on this drive. When this system was delivered by AHPC, it came up "short" this drive. I had Cesar on the phone; he re-seated the tray and the drive came up (I can see myself checking it with smartctl).

To confirm we were talking about the same drive, I turned on identify, and then configured the drive. But I never turned identify off. So I believe there is no issue. Sorry for the scare.