canonical / microceph

MicroCeph is snap-deployed Ceph with built-in clustering
https://snapcraft.io/microceph
GNU Affero General Public License v3.0

MicroCeph abrupt OSD daemon shutdown #314

Closed by UtkarshBhatthere 1 month ago

UtkarshBhatthere commented 7 months ago

Issue report

What happens (observed behaviour)?

Abrupt Ceph OSD daemon shutdown.

What were you expecting to happen?

Ceph services should not die.

Relevant logs, error output, etc.

Relevant logs: osd0.log

slapcat commented 7 months ago

Here are the unabridged logs (too large to upload directly to GitHub): https://drive.google.com/file/d/1_jHzVzPzUmXgCH6YWQrMsVl5LL7qIeMO/view?usp=sharing

sabaini commented 1 month ago

I realize this was a long time ago -- apologies for not looking at it sooner.

From the mon log I can see, for example, messages about slow ops and a leader election taking ~70 s around 2024-02-18T01:33:00:

2024-02-18T01:34:03.851+0000 7fe970160640  0 log_channel(cluster) log [DBG] : mds.0 [v2:10.10.0.40:6800/3127883264,v1:10.10.0.40:6801/3127883264] up:active
2024-02-18T01:34:03.851+0000 7fe970160640  1 mon.meatwad@0(leader).mds e414 check_health: resetting beacon timeouts due to mon delay (slow election?) of 70.2781 seconds
2024-02-18T01:34:03.851+0000 7fe970160640  0 log_channel(cluster) log [DBG] : fsmap nabasny:1 {0=meatwad=up:active}
2024-02-18T01:34:05.455+0000 7fe97516a640 -1 mon.meatwad@0(leader) e2 get_health_metrics reporting 1 slow ops, oldest is log(1 entries from seq 347245 at 2024-02-18T01:33:15.790258+0000)

In the OSD logs I can also see heartbeat timeouts. Subsequently the OSDs committed suicide, i.e. terminated themselves:

ceph-osd.0.log.txt:2024-02-18T01:34:11.640+0000 7f27f6cda640  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f27dba47640' had suicide timed out after 150.000000000s
ceph-osd.0.log.txt:2024-02-18T01:35:43.357+0000 7f27dd24a640  1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f27dd24a640' had suicide timed out after 150.000000000s
ceph-osd.0.log.txt:2024-02-18T01:35:43.357+0000 7f27dca49640  1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f27dca49640' had suicide timed out after 150.000000000s
ceph-osd.0.log.txt:2024-02-18T01:35:43.357+0000 7f27dc248640  1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f27dc248640' had suicide timed out after 150.000000000s
ceph-osd.0.log.txt:2024-02-18T01:35:57.729+0000 7f27db246640  1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f27db246640' had suicide timed out after 150.000000000s
...
ceph-osd.1.log.txt:./src/common/HeartbeatMap.cc: 85: ceph_abort_msg("hit suicide timeout")
ceph-osd.1.log.txt:./src/common/HeartbeatMap.cc: 85: ceph_abort_msg("hit suicide timeout")
ceph-osd.1.log.txt:./src/common/HeartbeatMap.cc: 85: ceph_abort_msg("hit suicide timeout")
ceph-osd.1.log.txt:./src/common/HeartbeatMap.cc: 85: ceph_abort_msg("hit suicide timeout")

This is expected behaviour for OSDs. When an OSD detects that one of its worker threads has not made progress within the configured "suicide timeout", it aborts itself rather than keep running in a potentially wedged state.
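For reference, the 150 s in the aborts above is the OSD op-thread suicide timeout. A minimal way to confirm the values a cluster is actually running with (a sketch, assuming MicroCeph's microceph.ceph CLI alias is available; a stock ceph client works the same way):

# non-fatal op-thread timeout behind the heartbeat warnings (default 15 s)
microceph.ceph config get osd osd_op_thread_timeout

# fatal timeout that produced the ceph_abort_msg("hit suicide timeout") above (default 150 s)
microceph.ceph config get osd osd_op_thread_suicide_timeout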

The slow election, the slow ops, and the OSD terminations could all be explained by a network hit, or alternatively by resource starvation (e.g. memory or CPU pressure) that prevented the cluster services from communicating in time.
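If it helps narrow things down, one way to check for resource starvation after the fact (a sketch; snap.microceph.osd.service is an assumed unit name based on the usual snap.<snap>.<app> convention, adjust to whatever systemctl lists on the host):

# any OOM-killer activity recorded by the kernel
dmesg -T | grep -iE 'killed process|out of memory'

# host-side view of the OSD service around the incident window (unit name assumed)
journalctl -u snap.microceph.osd.service --since '2024-02-18 01:25' --until '2024-02-18 01:45'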

Would it be possible that something like this happened?

slapcat commented 1 month ago

Yes, that could be. Since then I've upgraded the RAM on that machine and haven't noticed any more OSD outages.
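For completeness, a quick way to confirm nothing has silently flapped since the upgrade (a sketch, same microceph.ceph alias assumption as above):

# crash reports recorded by the mgr crash module, if any
microceph.ceph crash ls

# all OSDs should be reported as up and in
microceph.ceph osd stat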

sabaini commented 1 month ago

Thanks, I'll close this then -- feel free to reopen if this pops up again