DDT-INMEGEN / cluster

Documentation and issues for the INMEGEN cluster

Disk array on indra failed. #61

Closed. xihh87 closed this issue 5 years ago

xihh87 commented 5 years ago
xihh@indra:~$ sudo megaraidsas-status
-- Arrays informations --
-- ID | Type | Size | Status

-- Disks informations
-- ID | Model | Status | Warnings
a0e16s0 | ATA WDC WD80PURZ-85Y | BAD
a0e16s1 | ATA WDC WD8001FFWX-6 | BAD
a0e16s2 | ATA WDC WD8001FFWX-6 | BAD
a0e16s3 | ATA WDC WD8001FFWX-6 | BAD
a0e16s4 | ATA WDC WD8001FFWX-6 | BAD
a0e16s5 | ATA WDC WD8001FFWX-6 | BAD
a0e16s6 | ATA WDC WD8001FFWX-6 | BAD
a0e16s7 | ATA WDC WD8001FFWX-6 | BAD
a0e16s8 | ATA WDC WD8001FFWX-6 | BAD
a0e16s9 | ATA WDC WD8001FFWX-6 | BAD
a0e16s10 | ATA WDC WD8001FFWX-6 | BAD
a0e16s11 | ATA WDC WD8001FFWX-6 | BAD
a0e16s12 | ATA WDC WD8001FFWX-6 | BAD
a0e16s13 | ATA WDC WD8001FFWX-6 | BAD
a0e16s14 | ATA WDC WD8001FFWX-6 | BAD
a0e16s15 | ATA WDC WD8001FFWX-6 | BAD
a0e16s16 | ATA WDC WD8001FFWX-6 | BAD
a0e16s17 | ATA WDC WD8001FFWX-6 | BAD
a0e16s18 | ATA WDC WD8001FFWX-6 | BAD
a0e16s19 | ATA WDC WD8001FFWX-6 | BAD
a0e16s20 | ATA WDC WD8001FFWX-6 | BAD
a0e16s21 | ATA WDC WD8001FFWX-6 | BAD
a0e16s22 | ATA WDC WD8001FFWX-6 | BAD
a0e16s23 | ATA WDC WD8001FFWX-6 | BAD

There is at least one disk/array in a NOT OPTIMAL state.
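
For future incidents, the table above can be summarized by status with a small script. This is only a sketch: `parse_disks` and `SAMPLE` are hypothetical names, the sample is a shortened copy of the output above, and it assumes the `ID | Model | Status` layout stays the same.

```python
from collections import Counter

# Shortened copy of the megaraidsas-status output above (illustration only).
SAMPLE = """\
-- Disks informations
-- ID | Model | Status | Warnings
a0e16s0 | ATA WDC WD80PURZ-85Y | BAD
a0e16s1 | ATA WDC WD8001FFWX-6 | BAD

There is at least one disk/array in a NOT OPTIMAL state.
"""


def parse_disks(text):
    """Return (disk_id, model, status) tuples from the disk table."""
    disks = []
    for line in text.splitlines():
        # Skip blank lines and the "-- ..." header lines.
        if not line.strip() or line.startswith("--"):
            continue
        fields = [field.strip() for field in line.split("|")]
        # Lines without at least three "|"-separated fields are not disk rows.
        if len(fields) >= 3:
            disks.append((fields[0], fields[1], fields[2]))
    return disks


disks = parse_disks(SAMPLE)
status_counts = Counter(status for _, _, status in disks)
print(status_counts)
```

When every disk reports the same status at once, as here, the count makes that obvious at a glance.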
xihh87 commented 5 years ago

The problem could be failing memory.

/var/log/kern.log:

[1045294.003934] nfs: server visnu OK
[1046299.820399] megaraid_sas 0000:81:00.0: scanning for scsi6...
[1046405.582285] megaraid_sas 0000:81:00.0: scanning for scsi6...
[1046405.598811] megaraid_sas 0000:81:00.0: 21012 (612911605s/0x0001/CRIT) - VD 00/0 is now DEGRADED
[1046405.599263] megaraid_sas 0000:81:00.0: scanning for scsi6...
[1046405.599387] megaraid_sas 0000:81:00.0: 21017 (612911605s/0x0001/FATAL) - VD 00/0 is now OFFLINE
[1046405.608347] megaraid_sas 0000:81:00.0: 21082 (612911606s/0x0004/CRIT) - Enclosure PD 10(c Port 0 - 3/p1) communication lost
[1046405.608454] megaraid_sas 0000:81:00.0: 21083 (612911606s/0x0004/CRIT) - Enclosure PD 10(c Port 0 - 3/p1) not responding
[1047095.203279] WARNING: Pool 'humongous' has encountered an uncorrectable I/O failure and has been suspended.

[1047097.526388] WARNING: Pool 'humongous' has encountered an uncorrectable I/O failure and has been suspended.

[1047247.003972] WARNING: Pool 'humongous' has encountered an uncorrectable I/O failure and has been suspended.
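
To pull only the controller events out of a noisy kern.log, something like the following could work. It is a sketch under assumptions: the group names (`seq`, `stamp`, `locale`, `severity`) are my own reading of the line layout, not names confirmed by the driver documentation, and the sample is a shortened copy of the lines above.

```python
import re

# Shortened copy of the kern.log excerpt above (illustration only).
SAMPLE = r"""
[1045294.003934] nfs: server visnu OK
[1046405.582285] megaraid_sas 0000:81:00.0: scanning for scsi6...
[1046405.598811] megaraid_sas 0000:81:00.0: 21012 (612911605s/0x0001/CRIT) - VD 00/0 is now DEGRADED
[1046405.599387] megaraid_sas 0000:81:00.0: 21017 (612911605s/0x0001/FATAL) - VD 00/0 is now OFFLINE
[1046405.608347] megaraid_sas 0000:81:00.0: 21082 (612911606s/0x0004/CRIT) - Enclosure PD 10(c Port 0 - 3/p1) communication lost
"""

# Group names are assumptions chosen for readability.
EVENT_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\] megaraid_sas (?P<pci>\S+): "
    r"(?P<seq>\d+) \((?P<stamp>\d+)s/(?P<locale>0x[0-9a-fA-F]+)/(?P<severity>\w+)\) - "
    r"(?P<message>.*)"
)


def parse_events(text):
    """Return one dict per megaraid_sas event line; other lines are ignored."""
    events = []
    for line in text.splitlines():
        match = EVENT_RE.match(line.strip())
        if match:
            events.append(match.groupdict())
    return events


events = parse_events(SAMPLE)
for event in events:
    print(event["uptime"], event["severity"], event["message"])
```

The "scanning for scsi6..." lines and the nfs line do not match the pattern, so only the DEGRADED, OFFLINE and enclosure events remain.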

journalctl -b:

Jun 03 16:02:32 indra smartd[4481]: Device: /dev/sda, failed to read Temperature
Jun 03 16:02:32 indra smartd[4481]: Device: /dev/bus/6 [megaraid_disk_17] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 206 to 214
Jun 03 16:02:32 indra smartd[4481]: Device: /dev/bus/6 [megaraid_disk_18] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 191 to 196
Jun 03 16:02:32 indra smartd[4481]: Device: /dev/bus/6 [megaraid_disk_20] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 187 to 193
Jun 03 16:02:33 indra smartd[4481]: Device: /dev/bus/6 [megaraid_disk_21] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 181 to 193
Jun 03 16:02:34 indra smartd[4481]: Device: /dev/bus/6 [megaraid_disk_22] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 181 to 193
Jun 03 16:02:35 indra smartd[4481]: Device: /dev/bus/6 [megaraid_disk_23] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 206 to 214
Jun 03 16:02:36 indra smartd[4481]: Device: /dev/bus/6 [megaraid_disk_24] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 193 to 200
Jun 03 16:02:36 indra smartd[4481]: Device: /dev/bus/6 [megaraid_disk_25] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 187 to 193
Jun 03 16:02:37 indra smartd[4481]: Device: /dev/bus/6 [megaraid_disk_26] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 176 to 181
Jun 03 16:02:39 indra smartd[4481]: Device: /dev/bus/6 [megaraid_disk_27] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 171 to 181
Jun 03 16:02:39 indra smartd[4481]: Device: /dev/bus/6 [megaraid_disk_28] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 171 to 187
Jun 03 16:02:40 indra smartd[4481]: Device: /dev/bus/6 [megaraid_disk_29] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 206 to 214
Jun 03 16:02:42 indra smartd[4481]: Device: /dev/bus/6 [megaraid_disk_30] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 187 to 200
Jun 03 16:02:43 indra smartd[4481]: Device: /dev/bus/6 [megaraid_disk_32] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 181 to 187
Jun 03 16:02:45 indra smartd[4481]: Device: /dev/bus/6 [megaraid_disk_33] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 176 to 181
Jun 03 16:02:45 indra smartd[4481]: Device: /dev/bus/6 [megaraid_disk_34] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 171 to 181
Jun 03 16:02:46 indra smartd[4481]: Device: /dev/bus/6 [megaraid_disk_35] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 193 to 206
Jun 03 16:02:47 indra smartd[4481]: Device: /dev/bus/6 [megaraid_disk_36] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 181 to 193
Jun 03 16:02:48 indra smartd[4481]: Device: /dev/bus/6 [megaraid_disk_37] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 176 to 193
Jun 03 16:02:49 indra smartd[4481]: Device: /dev/bus/6 [megaraid_disk_38] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 171 to 181
Jun 03 16:02:50 indra smartd[4481]: Device: /dev/bus/6 [megaraid_disk_39] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 171 to 181
Jun 03 16:02:51 indra smartd[4481]: Device: /dev/bus/6 [megaraid_disk_40] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 166 to 181
[…]
Jun 03 16:13:18 indra kernel: megaraid_sas 0000:81:00.0: scanning for scsi6...
Jun 03 16:13:18 indra kernel: megaraid_sas 0000:81:00.0: 21012 (612911605s/0x0001/CRIT) - VD 00/0 is now DEGRADED
Jun 03 16:13:18 indra kernel: megaraid_sas 0000:81:00.0: scanning for scsi6...
Jun 03 16:13:18 indra kernel: megaraid_sas 0000:81:00.0: 21017 (612911605s/0x0001/FATAL) - VD 00/0 is now OFFLINE
Jun 03 16:13:18 indra kernel: megaraid_sas 0000:81:00.0: 21082 (612911606s/0x0004/CRIT) - Enclosure PD 10(c Port 0 - 3/p1) communication lost
Jun 03 16:13:18 indra kernel: megaraid_sas 0000:81:00.0: 21083 (612911606s/0x0004/CRIT) - Enclosure PD 10(c Port 0 - 3/p1) not responding
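
The smartd lines above can be reduced to one record per disk to see which attribute values moved the most. This is a sketch, not part of the original diagnosis: `parse_smartd_changes` is a hypothetical helper, the sample is a shortened copy of the journal, and the numbers are treated as opaque attribute values (smartd normally logs normalized values, so they are probably not degrees Celsius; that is an assumption here).

```python
import re

# Shortened copy of the journalctl excerpt above (illustration only).
SAMPLE = """\
Jun 03 16:02:32 indra smartd[4481]: Device: /dev/sda, failed to read Temperature
Jun 03 16:02:32 indra smartd[4481]: Device: /dev/bus/6 [megaraid_disk_17] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 206 to 214
Jun 03 16:02:39 indra smartd[4481]: Device: /dev/bus/6 [megaraid_disk_28] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 171 to 187
"""

CHANGE_RE = re.compile(
    r"smartd\[\d+\]: Device: (?P<device>\S+) \[(?P<disk>[^\]]+)\] \[SAT\], "
    r"SMART Usage Attribute: (?P<attr_id>\d+) (?P<name>\S+) "
    r"changed from (?P<old>\d+) to (?P<new>\d+)"
)


def parse_smartd_changes(text):
    """Return a dict per 'changed from X to Y' line, with integer values."""
    changes = []
    for line in text.splitlines():
        match = CHANGE_RE.search(line)
        if not match:
            continue  # e.g. the "failed to read Temperature" line
        record = match.groupdict()
        for key in ("attr_id", "old", "new"):
            record[key] = int(record[key])
        changes.append(record)
    return changes


changes = parse_smartd_changes(SAMPLE)
largest = max(changes, key=lambda c: abs(c["new"] - c["old"]))
print(largest["disk"], largest["old"], largest["new"])
```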

There is no dead.letter:

xihh@indra:~$ ls /
100g/  boot/    dev/  home/  lib/    lib64/    lost+found/  mnt/  proc/      reference/  respaldo/  run/   snap/  sys/  usr/  vault/
bin/   castle@  etc/  labs/  lib32/  libexec/  media/       opt/  recovery/  remote/     root/      sbin/  srv/   tmp/  var/
xihh87 commented 5 years ago

The vendor says that in similar cases the problem is the controller.

When the system is rebooted, check which error the controller shows.

The vendor recommends checking the status through the controller's interface.

xihh87 commented 5 years ago
sudo megacli -EncStatus -a0
Enclosure 0

Number of Slots              : 24

Slot                         : 0
Slot Status                  : OK

Slot                         : 1
Slot Status                  : OK

Slot                         : 2
Slot Status                  : OK

Slot                         : 3
Slot Status                  : OK

Slot                         : 4
Slot Status                  : OK

Slot                         : 5
Slot Status                  : OK

Slot                         : 6
Slot Status                  : OK

Slot                         : 7
Slot Status                  : OK

Slot                         : 8
Slot Status                  : OK

Slot                         : 9
Slot Status                  : OK

Slot                         : 10
Slot Status                  : OK

Slot                         : 11
Slot Status                  : OK

Slot                         : 12
Slot Status                  : OK

Slot                         : 13
Slot Status                  : OK

Slot                         : 14
Slot Status                  : OK

Slot                         : 15
Slot Status                  : OK

Slot                         : 16
Slot Status                  : OK

Slot                         : 17
Slot Status                  : OK

Slot                         : 18
Slot Status                  : OK

Slot                         : 19
Slot Status                  : OK

Slot                         : 20
Slot Status                  : OK

Slot                         : 21
Slot Status                  : OK

Slot                         : 22
Slot Status                  : OK

Slot                         : 23
Slot Status                  : OK

Number of Power Supplies     : 0

Number of Fans               : 0

Number of Temperature Sensors : 2

Temp Sensor                  : 0
Temperature                  : 28
Temperature Sensor Status    : OK

Temp Sensor                  : 1
Temperature                  : 69
Temperature Sensor Status    : OK

Number of Alarms             : 0

Number of SIM Modules        : 0

Exit Code: 0x00
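
The enclosure report above is long, so a short script can collapse it into the slots that are not OK plus the sensor readings. This is a sketch under assumptions: `parse_enc_status` is a hypothetical helper, the sample is a shortened copy of the output above, and it relies on the `Key : Value` layout shown there.

```python
# Shortened copy of the "megacli -EncStatus -a0" output above (illustration only).
SAMPLE = """\
Enclosure 0

Slot                         : 0
Slot Status                  : OK

Slot                         : 1
Slot Status                  : OK

Number of Temperature Sensors : 2

Temp Sensor                  : 0
Temperature                  : 28
Temperature Sensor Status    : OK

Temp Sensor                  : 1
Temperature                  : 69
Temperature Sensor Status    : OK

Exit Code: 0x00
"""


def parse_enc_status(text):
    """Return ({slot: status}, {sensor: temperature}) from the report."""
    slots, temps = {}, {}
    current_slot = None
    current_sensor = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # lines without "Key : Value", such as "Enclosure 0"
        key, value = key.strip(), value.strip()
        if key == "Slot":
            current_slot = int(value)
        elif key == "Slot Status" and current_slot is not None:
            slots[current_slot] = value
        elif key == "Temp Sensor":
            current_sensor = int(value)
        elif key == "Temperature" and current_sensor is not None:
            temps[current_sensor] = int(value)
    return slots, temps


slots, temps = parse_enc_status(SAMPLE)
bad_slots = [slot for slot, status in slots.items() if status != "OK"]
print("slots not OK:", bad_slots)
print("sensor readings:", temps)
```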

condor was stopped on indra for inspection.

xihh@indra:~$ sudo systemctl stop condor
xihh@indra:~$ sudo systemctl disable condor
condor.service is not a native service, redirecting to systemd-sysv-install
Executing /lib/systemd/systemd-sysv-install disable condor
insserv: warning: current start runlevel(s) (empty) of script `condor' overrides LSB defaults (2 3 4 5).
insserv: warning: current stop runlevel(s) (0 1 2 3 4 5 6) of script `condor' overrides LSB defaults (0 1 6).
xihh@indra:~$ sudo systemctl poweroff
xihh87 commented 5 years ago

This is how the process was resolved.