Open lathiat opened 1 month ago
Shouldn't this be handled by the deployer i.e. the Ceph charms and/or the FE team? A hotsos check might be useful but probably not the most effective place for this.
Yes, ideally, but in practice it keeps getting missed, so we need to catch it, both when analysing new deployments and when detecting the issue on old ones.
It can also happen because the charm will create OSDs with no DB if it can't find any space, so if you add new OSDs and the old DB devices are full, a customer could silently end up in this state. It can also happen in a field deployment due to a weird issue, even if the deployment was designed correctly.
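A minimal sketch of how the "OSD silently created with no DB" case might be flagged, assuming the check has the JSON output of `ceph osd metadata` available and that each BlueStore OSD entry carries a `bluefs_dedicated_db` field with the string value "1" or "0". The function name and sample data are hypothetical, and the field names should be verified against the Ceph release in use. This is not how hotsos implements anything today.

```python
import json


def osds_without_dedicated_db(osd_metadata_json):
    """Return the ids of BlueStore OSDs that report no dedicated DB device.

    Assumes the input is the JSON list printed by `ceph osd metadata`.
    """
    missing = []
    for osd in json.loads(osd_metadata_json):
        # Only BlueStore OSDs have a RocksDB/BlueFS DB device to check.
        if osd.get("osd_objectstore") != "bluestore":
            continue
        # Metadata values are strings; anything other than "1" is treated
        # as "DB shares the main data device" (assumption).
        if osd.get("bluefs_dedicated_db", "0") != "1":
            missing.append(osd["id"])
    return sorted(missing)


# Hypothetical sample: osd.0 has a dedicated DB, osd.1 does not.
sample = json.dumps([
    {"id": 0, "osd_objectstore": "bluestore", "bluefs_dedicated_db": "1"},
    {"id": 1, "osd_objectstore": "bluestore", "bluefs_dedicated_db": "0"},
])
print(osds_without_dedicated_db(sample))
```

A mix of OSDs with and without a dedicated DB in the same cluster would be the signal worth raising, since it suggests the newer OSDs were added after the DB devices filled up.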
@pponnuvel I agree that the charm should handle this as a first port of call, and we should open a bug on the charm to get that done. In the interim, if it is a small enough addition to the checks, we could add it to cover the cases where the charm does not yet support it. It has been cropping up repeatedly in deployments, and flagging the issue at the start will help reduce analysis time.
@lathiat this almost looks like several checks, and it might make sense to break it into smaller chunks to make it easier to implement.
A common fault in Ceph deployments is that the DB devices are incorrectly configured (missed or allocated from the wrong device) or not big enough. The majority of the time these would be picked up by looking for: