canonical / hotsos

Software analysis toolkit. Define checks in high-level language and leverage library to perform analysis of common Cloud applications.

Apache License 2.0

Ceph: Detect small DB partition sizes and unused partitions #974

Open lathiat opened 1 month ago

lathiat commented 1 month ago

A common fault in Ceph deployments is that the DB devices are incorrectly configured (missed, or allocated from the wrong device) or not big enough. Most of the time these would be picked up by looking for:

DB partitions which are obviously far too small, e.g. the default 1GB. Ideally we'd report the DB-to-OSD size ratio informationally
Empty partitions that have not been used
Empty space on a disk that is not partitioned
Volume groups which are not mostly used (basically the same as empty space on a disk)
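
As a rough illustration of the first item, the ratio report could look something like the sketch below. Everything in it is an assumption made for illustration: `OsdLayout`, `db_size_findings`, and both thresholds are hypothetical names and values, not part of hotsos or Ceph, and the sizes are expected to have been collected already from the host.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class OsdLayout:
    """Hypothetical per-OSD record; not a hotsos or Ceph type."""
    osd_id: int
    data_bytes: int  # size of the main OSD device, assumed > 0
    db_bytes: int    # size of the DB partition/LV, 0 if there is none


def db_size_findings(osds, min_ratio=0.04, min_db_bytes=2 * GIB):
    """Return (osd_id, ratio, message) for each OSD whose DB looks too small.

    Both thresholds are illustrative assumptions, not values from this issue.
    """
    findings = []
    for osd in osds:
        if osd.db_bytes == 0:
            continue  # no separate DB at all; that is a different check
        ratio = osd.db_bytes / osd.data_bytes
        if osd.db_bytes < min_db_bytes:
            findings.append((osd.osd_id, ratio, "DB partition is tiny"))
        elif ratio < min_ratio:
            findings.append((osd.osd_id, ratio, "DB-to-OSD ratio is low"))
    return findings


example = [
    OsdLayout(0, 4000 * GIB, 1 * GIB),    # the 1GB default case
    OsdLayout(1, 4000 * GIB, 200 * GIB),  # ratio 0.05, passes
    OsdLayout(2, 4000 * GIB, 60 * GIB),   # ratio 0.015, flagged
]
for osd_id, ratio, message in db_size_findings(example):
    print(f"osd.{osd_id}: {message} (ratio {ratio:.4f})")
```

Reporting the ratio alongside the message keeps the output informational, so a reader can judge borderline cases rather than rely on the assumed thresholds.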

pponnuvel commented 1 month ago

Shouldn't this be handled by the deployer, i.e. the Ceph charms and/or the FE team? A hotsos check might be useful, but it is probably not the most effective place for this.

lathiat commented 1 month ago

Yes, ideally, but in practice it keeps getting missed, so we need to catch it, both when analysing new deployments and when detecting the issue on old ones.

lathiat commented 1 month ago

It can also happen because the charm will create OSDs with no DB if it can't find any space, so if you add new OSDs and the old DB devices are full, this could happen to a customer silently. It can also happen in a field deployment due to a weird issue, even if the layout was designed correctly.
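
One way to catch that silent case is to look for hosts where some OSDs have a separate DB and others do not. The sketch below assumes the OSD-to-DB mapping has already been gathered into a plain dict; `osds_missing_db` and the device paths are hypothetical and do not reflect any existing hotsos or Ceph interface.

```python
def osds_missing_db(osd_db_devices):
    """Return the ids of OSDs with no separate DB on a host that otherwise uses one.

    osd_db_devices maps an OSD id to its DB device path, or to None when the
    OSD has no separate DB. This input shape is an assumption of the sketch.
    """
    with_db = [osd for osd, db in osd_db_devices.items() if db]
    without_db = sorted(osd for osd, db in osd_db_devices.items() if not db)
    # A host where no OSD has a separate DB may be intentional, so only a
    # mixed layout is reported.
    if with_db and without_db:
        return without_db
    return []


host = {0: "/dev/vg-db/osd0", 1: "/dev/vg-db/osd1", 2: None}
print(osds_missing_db(host))
```

Treating only mixed hosts as suspicious is a design choice of this sketch: it avoids flagging deployments that deliberately colocate the DB with the data device.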

dosaboy commented 1 month ago

@pponnuvel I agree that the charm should be doing this as a first port of call, and we should open a bug on the charm to get this done. In the interim, if it is a small enough addition to the checks, we could add this to cover the cases where the charm does not yet support it. It has been cropping up repeatedly in deployments, and flagging the issue at the start will help reduce analysis time.

dosaboy commented 1 month ago

@lathiat this almost looks like several checks, and it might make sense to break it into smaller chunks to make it easier to implement.