Closed xihh87 closed 5 years ago
The problem could be failing memory.
/var/log/kern.log:
[1045294.003934] nfs: server visnu OK
[1046299.820399] megaraid_sas 0000:81:00.0: scanning for scsi6...
[1046405.582285] megaraid_sas 0000:81:00.0: scanning for scsi6...
[1046405.598811] megaraid_sas 0000:81:00.0: 21012 (612911605s/0x0001/CRIT) - VD 00/0 is now DEGRADED
[1046405.599263] megaraid_sas 0000:81:00.0: scanning for scsi6...
[1046405.599387] megaraid_sas 0000:81:00.0: 21017 (612911605s/0x0001/FATAL) - VD 00/0 is now OFFLINE
[1046405.608347] megaraid_sas 0000:81:00.0: 21082 (612911606s/0x0004/CRIT) - Enclosure PD 10(c Port 0 - 3/p1) communication lost
[1046405.608454] megaraid_sas 0000:81:00.0: 21083 (612911606s/0x0004/CRIT) - Enclosure PD 10(c Port 0 - 3/p1) not responding
[1047095.203279] WARNING: Pool 'humongous' has encountered an uncorrectable I/O failure and has been suspended.
[1047097.526388] WARNING: Pool 'humongous' has encountered an uncorrectable I/O failure and has been suspended.
[1047247.003972] WARNING: Pool 'humongous' has encountered an uncorrectable I/O failure and has been suspended.
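To make the sequence in the kern.log excerpt above easier to follow, here is a minimal Python sketch that pulls the event ID, severity and message out of the `megaraid_sas` lines and measures how long the pool kept running after the virtual disk went OFFLINE. The regular expressions are inferred only from the lines pasted here, the function and field names are hypothetical, and the meaning of the hex field (`0x0001`, `0x0004`) is not stated in the log, so it is kept as an opaque `code`.

```python
import re

# Layout inferred from the excerpt (an assumption, not a documented format):
# [uptime] megaraid_sas <pci>: <event id> (<seconds>s/<hex code>/<severity>) - <message>
EVENT_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+megaraid_sas\s+(?P<pci>\S+):\s+"
    r"(?P<event_id>\d+)\s+\((?P<stamp>\d+)s/(?P<code>0x[0-9a-fA-F]+)/"
    r"(?P<severity>[A-Z]+)\)\s+-\s+(?P<message>.*)"
)
SUSPEND_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+WARNING: Pool '(?P<pool>[^']+)' "
    r"has encountered .* has been suspended"
)

# Lines copied from the kern.log excerpt in this issue.
SAMPLE = """\
[1046405.582285] megaraid_sas 0000:81:00.0: scanning for scsi6...
[1046405.598811] megaraid_sas 0000:81:00.0: 21012 (612911605s/0x0001/CRIT) - VD 00/0 is now DEGRADED
[1046405.599387] megaraid_sas 0000:81:00.0: 21017 (612911605s/0x0001/FATAL) - VD 00/0 is now OFFLINE
[1047095.203279] WARNING: Pool 'humongous' has encountered an uncorrectable I/O failure and has been suspended.
"""


def parse_kern_log(text):
    """Return (controller events, pool suspensions) found in the text."""
    events, suspensions = [], []
    for line in text.splitlines():
        m = EVENT_RE.match(line)
        if m:
            events.append({
                "uptime": float(m["uptime"]),
                "event_id": int(m["event_id"]),
                "code": m["code"],
                "severity": m["severity"],
                "message": m["message"],
            })
            continue
        m = SUSPEND_RE.match(line)
        if m:
            suspensions.append((float(m["uptime"]), m["pool"]))
    return events, suspensions


def seconds_until_suspension(events, suspensions):
    """Seconds between the first OFFLINE event and the first pool suspension."""
    offline = next(e for e in events if e["message"].endswith("OFFLINE"))
    return suspensions[0][0] - offline["uptime"]


if __name__ == "__main__":
    events, suspensions = parse_kern_log(SAMPLE)
    for e in events:
        print(e["severity"], e["event_id"], e["message"])
    print(round(seconds_until_suspension(events, suspensions), 1), "seconds")
```

For the lines above, the gap between the OFFLINE event and the first suspension of `humongous` is roughly eleven and a half minutes, which fits the controller dropping the virtual disk first and ZFS giving up afterwards.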
journalctl -b:
Jun 03 16:02:32 indra smartd[4481]: Device: /dev/sda, failed to read Temperature
Jun 03 16:02:32 indra smartd[4481]: Device: /dev/bus/6 [megaraid_disk_17] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 206 to 214
Jun 03 16:02:32 indra smartd[4481]: Device: /dev/bus/6 [megaraid_disk_18] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 191 to 196
Jun 03 16:02:32 indra smartd[4481]: Device: /dev/bus/6 [megaraid_disk_20] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 187 to 193
Jun 03 16:02:33 indra smartd[4481]: Device: /dev/bus/6 [megaraid_disk_21] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 181 to 193
Jun 03 16:02:34 indra smartd[4481]: Device: /dev/bus/6 [megaraid_disk_22] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 181 to 193
Jun 03 16:02:35 indra smartd[4481]: Device: /dev/bus/6 [megaraid_disk_23] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 206 to 214
Jun 03 16:02:36 indra smartd[4481]: Device: /dev/bus/6 [megaraid_disk_24] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 193 to 200
Jun 03 16:02:36 indra smartd[4481]: Device: /dev/bus/6 [megaraid_disk_25] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 187 to 193
Jun 03 16:02:37 indra smartd[4481]: Device: /dev/bus/6 [megaraid_disk_26] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 176 to 181
Jun 03 16:02:39 indra smartd[4481]: Device: /dev/bus/6 [megaraid_disk_27] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 171 to 181
Jun 03 16:02:39 indra smartd[4481]: Device: /dev/bus/6 [megaraid_disk_28] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 171 to 187
Jun 03 16:02:40 indra smartd[4481]: Device: /dev/bus/6 [megaraid_disk_29] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 206 to 214
Jun 03 16:02:42 indra smartd[4481]: Device: /dev/bus/6 [megaraid_disk_30] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 187 to 200
Jun 03 16:02:43 indra smartd[4481]: Device: /dev/bus/6 [megaraid_disk_32] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 181 to 187
Jun 03 16:02:45 indra smartd[4481]: Device: /dev/bus/6 [megaraid_disk_33] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 176 to 181
Jun 03 16:02:45 indra smartd[4481]: Device: /dev/bus/6 [megaraid_disk_34] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 171 to 181
Jun 03 16:02:46 indra smartd[4481]: Device: /dev/bus/6 [megaraid_disk_35] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 193 to 206
Jun 03 16:02:47 indra smartd[4481]: Device: /dev/bus/6 [megaraid_disk_36] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 181 to 193
Jun 03 16:02:48 indra smartd[4481]: Device: /dev/bus/6 [megaraid_disk_37] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 176 to 193
Jun 03 16:02:49 indra smartd[4481]: Device: /dev/bus/6 [megaraid_disk_38] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 171 to 181
Jun 03 16:02:50 indra smartd[4481]: Device: /dev/bus/6 [megaraid_disk_39] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 171 to 181
Jun 03 16:02:51 indra smartd[4481]: Device: /dev/bus/6 [megaraid_disk_40] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 166 to 181
[…]
Jun 03 16:13:18 indra kernel: megaraid_sas 0000:81:00.0: scanning for scsi6...
Jun 03 16:13:18 indra kernel: megaraid_sas 0000:81:00.0: 21012 (612911605s/0x0001/CRIT) - VD 00/0 is now DEGRADED
Jun 03 16:13:18 indra kernel: megaraid_sas 0000:81:00.0: scanning for scsi6...
Jun 03 16:13:18 indra kernel: megaraid_sas 0000:81:00.0: 21017 (612911605s/0x0001/FATAL) - VD 00/0 is now OFFLINE
Jun 03 16:13:18 indra kernel: megaraid_sas 0000:81:00.0: 21082 (612911606s/0x0004/CRIT) - Enclosure PD 10(c Port 0 - 3/p1) communication lost
Jun 03 16:13:18 indra kernel: megaraid_sas 0000:81:00.0: 21083 (612911606s/0x0004/CRIT) - Enclosure PD 10(c Port 0 - 3/p1) not responding
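The smartd lines in the journal excerpt above are easier to compare once they are reduced to one record per disk. The following is a minimal Python sketch, assuming the message layout seen in this paste; the function names are hypothetical. Note that the log does not say whether these numbers are raw or normalized values. As far as I know smartd reports normalized attribute values by default, so they should not be read as degrees Celsius without checking.

```python
import re

# Layout inferred from the journal lines pasted in this issue.
SMARTD_RE = re.compile(
    r"smartd\[\d+\]: Device: (?P<device>\S+) \[(?P<disk>megaraid_disk_\d+)\] \[SAT\], "
    r"SMART Usage Attribute: (?P<attr_id>\d+) (?P<attr_name>\S+) "
    r"changed from (?P<old>\d+) to (?P<new>\d+)"
)

# Lines copied from the journalctl excerpt in this issue.
SAMPLE = """\
Jun 03 16:02:32 indra smartd[4481]: Device: /dev/sda, failed to read Temperature
Jun 03 16:02:32 indra smartd[4481]: Device: /dev/bus/6 [megaraid_disk_17] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 206 to 214
Jun 03 16:02:39 indra smartd[4481]: Device: /dev/bus/6 [megaraid_disk_28] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 171 to 187
Jun 03 16:02:51 indra smartd[4481]: Device: /dev/bus/6 [megaraid_disk_40] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 166 to 181
"""


def parse_smartd_changes(text):
    """Return one dict per 'attribute changed' line; other lines are skipped."""
    changes = []
    for line in text.splitlines():
        m = SMARTD_RE.search(line)
        if not m:
            continue
        old, new = int(m["old"]), int(m["new"])
        changes.append({
            "disk": m["disk"],
            "attr_id": int(m["attr_id"]),
            "attr_name": m["attr_name"],
            "old": old,
            "new": new,
            "delta": new - old,
        })
    return changes


def largest_change(changes):
    """Return the record with the biggest absolute change."""
    return max(changes, key=lambda c: abs(c["delta"]))


if __name__ == "__main__":
    changes = parse_smartd_changes(SAMPLE)
    for c in changes:
        print(c["disk"], c["old"], "->", c["new"], f"({c['delta']:+d})")
    print("largest change:", largest_change(changes)["disk"])
```

Every disk behind the controller reports a change in the same 20-second window, which points at something shared (enclosure, controller, ambient temperature) rather than at one drive.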
There is no dead.letter:
xihh@indra:~$ ls /
100g/ boot/ dev/ home/ lib/ lib64/ lost+found/ mnt/ proc/ reference/ respaldo/ run/ snap/ sys/ usr/ vault/
bin/ castle@ etc/ labs/ lib32/ libexec/ media/ opt/ recovery/ remote/ root/ sbin/ srv/ tmp/ var/
The vendor says that in similar cases the problem is the controller.
When rebooting the system, check which error the controller reports.
They recommend checking the status through the controller's interface.
sudo megacli -EncStatus -a0
Enclosure 0
Number of Slots : 24
Slot : 0
Slot Status : OK
Slot : 1
Slot Status : OK
Slot : 2
Slot Status : OK
Slot : 3
Slot Status : OK
Slot : 4
Slot Status : OK
Slot : 5
Slot Status : OK
Slot : 6
Slot Status : OK
Slot : 7
Slot Status : OK
Slot : 8
Slot Status : OK
Slot : 9
Slot Status : OK
Slot : 10
Slot Status : OK
Slot : 11
Slot Status : OK
Slot : 12
Slot Status : OK
Slot : 13
Slot Status : OK
Slot : 14
Slot Status : OK
Slot : 15
Slot Status : OK
Slot : 16
Slot Status : OK
Slot : 17
Slot Status : OK
Slot : 18
Slot Status : OK
Slot : 19
Slot Status : OK
Slot : 20
Slot Status : OK
Slot : 21
Slot Status : OK
Slot : 22
Slot Status : OK
Slot : 23
Slot Status : OK
Number of Power Supplies : 0
Number of Fans : 0
Number of Temperature Sensors : 2
Temp Sensor : 0
Temperature : 28
Temperature Sensor Status : OK
Temp Sensor : 1
Temperature : 69
Temperature Sensor Status : OK
Number of Alarms : 0
Number of SIM Modules : 0
Exit Code: 0x00
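All 24 slots report OK, so the interesting part of this output is the two temperature sensors. Here is a minimal Python sketch that parses the `Key : Value` lines of the `-EncStatus` output above and lists sensors over a limit. The parser is written only against the output pasted here, the function names are hypothetical, and the limit of 60 is an arbitrary illustrative number, not a vendor threshold. The output does not state units either.

```python
# Shortened copy of the -EncStatus output pasted in this issue.
SAMPLE = """\
Enclosure 0
Number of Slots : 24
Slot : 0
Slot Status : OK
Slot : 1
Slot Status : OK
Number of Temperature Sensors : 2
Temp Sensor : 0
Temperature : 28
Temperature Sensor Status : OK
Temp Sensor : 1
Temperature : 69
Temperature Sensor Status : OK
Exit Code: 0x00
"""


def parse_enc_status(text):
    """Return (slots, sensors) dictionaries keyed by slot / sensor number."""
    slots, sensors = {}, {}
    current_slot = current_sensor = None
    for raw in text.splitlines():
        if ":" not in raw:
            continue  # e.g. the "Enclosure 0" header line
        key, value = (part.strip() for part in raw.split(":", 1))
        if key == "Slot":
            current_slot = int(value)
        elif key == "Slot Status":
            slots[current_slot] = value
        elif key == "Temp Sensor":
            current_sensor = int(value)
            sensors[current_sensor] = {}
        elif key == "Temperature":
            sensors[current_sensor]["temperature"] = int(value)
        elif key == "Temperature Sensor Status":
            sensors[current_sensor]["status"] = value
    return slots, sensors


def sensors_over(sensors, limit):
    """Sensor numbers whose reading is above `limit` (hypothetical threshold)."""
    return [n for n, s in sorted(sensors.items()) if s["temperature"] > limit]


if __name__ == "__main__":
    slots, sensors = parse_enc_status(SAMPLE)
    bad_slots = [n for n, status in slots.items() if status != "OK"]
    print("slots not OK:", bad_slots)
    print("sensors over 60:", sensors_over(sensors, 60))
```

Sensor 1 reads 69 while the controller still marks it as OK, so the controller's own status alone would not have flagged it.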
condor was stopped on indra so the machine could be inspected.
xihh@indra:~$ sudo systemctl stop condor
xihh@indra:~$ sudo systemctl disable condor
condor.service is not a native service, redirecting to systemd-sysv-install
Executing /lib/systemd/systemd-sysv-install disable condor
insserv: warning: current start runlevel(s) (empty) of script `condor' overrides LSB defaults (2 3 4 5).
insserv: warning: current stop runlevel(s) (0 1 2 3 4 5 6) of script `condor' overrides LSB defaults (0 1 6).
xihh@indra:~$ sudo systemctl poweroff
The problem was resolved this way.