Closed: UtkarshBhatthere closed this issue 1 month ago.
Here are the unabridged logs (too large to upload directly to GitHub): https://drive.google.com/file/d/1_jHzVzPzUmXgCH6YWQrMsVl5LL7qIeMO/view?usp=sharing
I realize this was a long time ago -- apologies for not looking at this sooner.
From the mon log I can, for example, see messages about slow ops and a leader election taking about 70 seconds around 2024-02-18T01:33:00:
2024-02-18T01:34:03.851+0000 7fe970160640 0 log_channel(cluster) log [DBG] : mds.0 [v2:10.10.0.40:6800/3127883264,v1:10.10.0.40:6801/3127883264] up:active
2024-02-18T01:34:03.851+0000 7fe970160640 1 mon.meatwad@0(leader).mds e414 check_health: resetting beacon timeouts due to mon delay (slow election?) of 70.2781 seconds
2024-02-18T01:34:03.851+0000 7fe970160640 0 log_channel(cluster) log [DBG] : fsmap nabasny:1 {0=meatwad=up:active}
2024-02-18T01:34:05.455+0000 7fe97516a640 -1 mon.meatwad@0(leader) e2 get_health_metrics reporting 1 slow ops, oldest is log(1 entries from seq 347245 at 2024-02-18T01:33:15.790258+0000)
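As an aside for anyone triaging a similar incident, a short script along these lines could pull the slow-op and election-delay messages out of a mon log and print them in time order. This is only a sketch: the default log path (ceph-mon.meatwad.log.txt) is a guess, and the regexes only match the message formats quoted above.

# Sketch: extract slow-op and election-delay messages from a Ceph mon log.
# The log path and the exact message wording are assumptions based on the
# excerpts quoted above.
import re
import sys

SLOW_OPS = re.compile(r"get_health_metrics reporting (\d+) slow ops")
MON_DELAY = re.compile(r"resetting beacon timeouts due to mon delay \(slow election\?\) of ([\d.]+) seconds")

def scan(path):
    with open(path, errors="replace") as f:
        for line in f:
            if not line.strip():
                continue
            ts = line.split()[0]  # mon log lines start with an ISO timestamp
            m = SLOW_OPS.search(line)
            if m:
                print(f"{ts}  slow ops reported: {m.group(1)}")
            m = MON_DELAY.search(line)
            if m:
                print(f"{ts}  mon delay / slow election: {m.group(1)}s")

if __name__ == "__main__":
    scan(sys.argv[1] if len(sys.argv) > 1 else "ceph-mon.meatwad.log.txt")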
I can also see heartbeat timeouts in the OSD logs. Subsequently, the OSDs committed suicide, i.e. terminated themselves:
ceph-osd.0.log.txt:2024-02-18T01:34:11.640+0000 7f27f6cda640 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f27dba47640' had suicide timed out after 150.000000000s
ceph-osd.0.log.txt:2024-02-18T01:35:43.357+0000 7f27dd24a640 1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f27dd24a640' had suicide timed out after 150.000000000s
ceph-osd.0.log.txt:2024-02-18T01:35:43.357+0000 7f27dca49640 1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f27dca49640' had suicide timed out after 150.000000000s
ceph-osd.0.log.txt:2024-02-18T01:35:43.357+0000 7f27dc248640 1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f27dc248640' had suicide timed out after 150.000000000s
ceph-osd.0.log.txt:2024-02-18T01:35:57.729+0000 7f27db246640 1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f27db246640' had suicide timed out after 150.000000000s
...
ceph-osd.1.log.txt:./src/common/HeartbeatMap.cc: 85: ceph_abort_msg("hit suicide timeout")
ceph-osd.1.log.txt:./src/common/HeartbeatMap.cc: 85: ceph_abort_msg("hit suicide timeout")
ceph-osd.1.log.txt:./src/common/HeartbeatMap.cc: 85: ceph_abort_msg("hit suicide timeout")
ceph-osd.1.log.txt:./src/common/HeartbeatMap.cc: 85: ceph_abort_msg("hit suicide timeout")
This is expected behaviour for OSDs. When an OSD detects that a particular operation or worker thread has not made progress within a configured timeout period, known as the "suicide timeout," it aborts itself rather than continuing in a potentially wedged state.
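To check whether the OSDs all hit these timeouts in the same window (which would point at a shared cause rather than a single bad disk), something like the sketch below could collect the heartbeat-timeout lines from every OSD log and sort them by timestamp. The file glob and patterns are assumptions based on the lines quoted above, not an established tool.

# Sketch: gather heartbeat/suicide-timeout events from all OSD logs and print
# them in time order. The glob pattern and message wording are assumptions
# based on the log excerpts quoted above.
import glob
import re

ISO_TS = re.compile(r"^\d{4}-\d{2}-\d{2}T")
HEARTBEAT = re.compile(r"heartbeat_map .* had (?:suicide )?timed out after [\d.]+s")

events = []
for path in glob.glob("ceph-osd.*.log.txt"):
    with open(path, errors="replace") as f:
        for line in f:
            first = line.split(None, 1)[0] if line.strip() else ""
            if ISO_TS.match(first) and HEARTBEAT.search(line):
                events.append((first, path, line.strip()))

for ts, path, line in sorted(events):
    print(f"{ts}  {path}  {line}")

If all OSDs show the timeouts within the same minute or two, that would support a host-wide problem such as the network or resource starvation mentioned below.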
The election and op slowness, as well as the OSD terminations, could be caused by a network hit, or alternatively by resource starvation that prevented the cluster services from communicating properly.
Would it be possible that something like this happened?
Yes, that could be. Since then I've upgraded the RAM on that machine and haven't noticed any more OSD outages.
Thanks, I'll close this then -- feel free to reopen if this pops up again
Issue report
What happens (observed behaviour)?
Abrupt Ceph OSD daemon shutdown.
What were you expecting to happen?
Ceph services should not die.
Relevant logs, error output, etc.
osd0.log (attached)