ceph / ceph-medic

find common issues in ceph clusters

checks: check for OSD suicide timeouts #67

Open haklein opened 7 years ago

haklein commented 7 years ago

OSDs can hit suicide timeouts for different reasons. It would be great if ceph-medic could highlight such events from the OSD log files.

haklein commented 7 years ago

Example messages for thread timeouts:

2015-11-18 18:22:05.040871 7fc1cc1d8700 1 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7fc1c21c4700' had timed out after 60
2015-11-18 18:22:05.040875 7fc1cc1d8700 1 heartbeat_map is_healthy 'OSD::disk_tp thread 0x7fc1c21c4700' had suicide timed out after 60

This can happen in different threads (filestore, op, disk, ..), so the check should be very generic and match only on the phrases "had timed out after" and "had suicide timed out after".
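
A minimal sketch of what such a generic match might look like in Python. This is not existing ceph-medic code: the names `TIMEOUT_RE` and `find_timeouts` are hypothetical, and the pattern assumes the quoted thread name and trailing number look like the two example lines above.

```python
import re

# Matches both "had timed out after N" and "had suicide timed out after N",
# regardless of which thread pool (filestore, op, disk, ..) reported it.
# Assumption: the thread name is wrapped in single quotes, as in the examples.
TIMEOUT_RE = re.compile(
    r"'(?P<thread>[^']+)' had (?P<suicide>suicide )?timed out after (?P<seconds>\d+)"
)


def find_timeouts(lines):
    """Yield (thread, is_suicide, seconds) for every matching log line."""
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            yield (
                match.group("thread"),
                match.group("suicide") is not None,
                int(match.group("seconds")),
            )


sample = [
    "2015-11-18 18:22:05.040871 7fc1cc1d8700 1 heartbeat_map is_healthy "
    "'OSD::disk_tp thread 0x7fc1c21c4700' had timed out after 60",
    "2015-11-18 18:22:05.040875 7fc1cc1d8700 1 heartbeat_map is_healthy "
    "'OSD::disk_tp thread 0x7fc1c21c4700' had suicide timed out after 60",
    # Made-up line for illustration; it should not match.
    "2015-11-18 18:22:06.000000 7fc1cc1d8700 1 some unrelated message",
]

results = list(find_timeouts(sample))
for thread, is_suicide, seconds in results:
    kind = "suicide timeout" if is_suicide else "timeout"
    print(f"{kind}: {thread} after {seconds}s")
```

Because `find_timeouts` only iterates over its argument, an open file object can be passed in directly, so a log is read line by line rather than loaded into memory at once.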

alfredodeza commented 7 years ago

Is it possible to avoid looking at log files? Ceph logs can be tremendously large, so ideally there would be some command that could tell us this.