Currently, the only way to verify that the Contrail DB is in a good state is to run db_manage.py check manually, since no alarm is raised when inconsistencies are found.
Since we have been told that abrupt controller shutdowns or other hardware failures can introduce DB inconsistencies, I believe it is crucial to have a reliable way to monitor the state of the DB. Monitoring would also record when inconsistencies first appear, and correlating that time with other events would help us understand what caused them.
I propose creating an NRPE check that alerts on inconsistencies reported by the check function. Since the check function may need several seconds to complete, it would make sense to decouple alert generation from the checking itself by having the NRPE check only look at cached output. A cron job could periodically run db_manage.py check to refresh the cache.
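To make the proposal concrete, here is a minimal sketch of what the NRPE side could look like. It is only an illustration under several assumptions that are not part of db_manage.py: the cache path, the maximum cache age, and the cache format are all hypothetical. In particular, it assumes the cron wrapper writes a status word (OK or INCONSISTENT) on the first line of the cache file, followed by whatever detail it wants to keep. How the wrapper derives that status from the db_manage.py check output is deliberately left open.

```python
#!/usr/bin/env python3
"""Sketch of an NRPE check that only reads cached db_manage.py results."""
import os
import sys
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Hypothetical defaults; neither value comes from Contrail itself.
CACHE_FILE = "/var/lib/nagios/contrail_db_check.cache"
MAX_AGE_SECONDS = 2 * 60 * 60


def evaluate(cache_path, max_age, now=None):
    """Return (exit_code, message) based on the cached check result.

    Assumed cache format (written by a cron wrapper, not by db_manage.py):
    first line is "OK" or "INCONSISTENT", remaining lines are free-form detail.
    """
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(cache_path)
        with open(cache_path) as fh:
            lines = fh.read().splitlines()
    except OSError as exc:
        return UNKNOWN, "UNKNOWN: cannot read cache: %s" % exc

    # A stale cache means the cron job stopped refreshing it.
    age = now - mtime
    if age > max_age:
        return UNKNOWN, "UNKNOWN: cache is %d seconds old" % age

    if not lines:
        return UNKNOWN, "UNKNOWN: cache file is empty"

    status = lines[0].strip()
    detail = "; ".join(line.strip() for line in lines[1:] if line.strip())
    if status == "OK":
        return OK, "OK: no inconsistencies reported"
    if status == "INCONSISTENT":
        return CRITICAL, "CRITICAL: inconsistencies reported: %s" % detail
    return UNKNOWN, "UNKNOWN: unexpected status %r in cache" % status


def main():
    """Print the one-line status and return the Nagios exit code."""
    code, message = evaluate(CACHE_FILE, MAX_AGE_SECONDS)
    print(message)
    return code
```

To use this as a plugin, the file would end with a call such as sys.exit(main()). Reporting a stale or missing cache as UNKNOWN rather than OK is a design choice: it keeps a broken cron job from silently hiding inconsistencies.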