Design flaw in log retention destroys reporting information resulting in inappropriate "Undetermined".

BUG REPORT INFORMATION

Centreon Web version: Verified on 2.7.8 and 2.8.26

Centreon Engine version:

Centreon Broker version:

OS: CentOS release 6.8 (Final) and CentOS Linux release 7.3.1611

Additional environment details (AWS, VirtualBox, physical, etc.): kvm virtual hosts.

Steps to reproduce the issue:
1. Monitor any stable service for a given time without any events.
2. Define "Retention duration for logs" shorter than the stable period above (we have 180 days).
3. Rebuild the state tables as suggested in several similar bug reports about excessive Undetermined time (eventReportBuilder -r ; dashboardBuilder -r).

Describe the results you received: Reports for the service go to 100% Undetermined.

Describe the results you expected: 100% OK as before the rebuild.

Additional information you think important (e.g. issue happens only occasionally): There are a number of other variants depending on the event history in the logs.

If there is an event in the log at half the retention duration back in time, the time before this event is considered Undetermined.
If a log entry is lost, the wrong state is reported from the previous log event up to the next event or the current time, even if the "last_state_change" of the current service state indicates an overlapping, different state. A new event is required to correct the current state.

The root cause is the way retention works. Every night, all log data older than the retention duration is deleted. When the nightly eventReportBuilder runs, it only rebuilds the last day, which is fine because old servicestateevents are kept. However, rebuilding with -r truncates the servicestateevents table, so all history before the retention limit is lost, with a possible exception for the "last_state_change" info in the current state. When the initial event record for a service without any other events is lost, all history for this service is lost and it appears as 100% "Undetermined" until a new event occurs.
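
To make the mechanism concrete, here is a toy model in Python. It is only a sketch, not the actual eventReportBuilder logic: the `(ctime, state)` rows, the day-based timestamps, and the `purge` and `rebuild` helpers are all simplifications I made up for illustration.

```python
from typing import List, Tuple

Event = Tuple[int, str]  # (ctime in days, state): hypothetical simplified log row


def purge(log: List[Event], now: int, retention: int) -> List[Event]:
    """Drop every log row older than the retention limit, like the nightly purge."""
    limit = now - retention
    return [(t, s) for (t, s) in log if t >= limit]


def rebuild(log: List[Event], start: int, end: int) -> List[Tuple[int, int, str]]:
    """Rebuild state intervals for [start, end) from log rows only (the -r case)."""
    intervals = []
    state, cursor = "UNDETERMINED", start  # nothing known until a row says otherwise
    for t, s in sorted(log):
        if t >= end:
            break
        if t <= start:
            state = s  # last known state at the start of the period
            continue
        if t > cursor:
            intervals.append((cursor, t, state))
        state, cursor = s, t
    intervals.append((cursor, end, state))
    return intervals


# A stable service: one initial OK event at day 0, nothing since. Today is day 200.
log = [(0, "OK")]
print(rebuild(log, 170, 200))                   # [(170, 200, 'OK')]
print(rebuild(purge(log, 200, 180), 170, 200))  # [(170, 200, 'UNDETERMINED')]
```

With an extra event pair at day 110 in the log, the same rebuild after purge reports everything before day 110 as Undetermined, which matches the "half the retention duration" variant described above.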

The same problem exists for hoststate.

I am testing a fix for this and will post details when verified. Changes:

Do not truncate *stateevents. Instead, delete all entries after the oldest log entry, then keep only the last remaining entry for each service and set the "last_update" flag on these.
When rebuilding, check if there is a recorded state for the service. If not, insert one based on the current state. Insert a "fake" INITIAL record in the logs to make it persist.
Check whether the current state matches the last recorded state if "last_state_change" overlaps the processing period (day). If not, insert a state record based on the current state. Insert a "fake" record in the logs to make it persist.
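
The first two changes could look roughly like the Python sketch below. This is not the patch itself: the dict keys are simplified stand-ins for the real table columns, and `trim_state_events` and `seed_from_current_state` are hypothetical helper names. The "fake" log records and the third check are left out.

```python
from typing import Dict, List, Tuple


def trim_state_events(events: List[dict], oldest_log_ctime: int) -> List[dict]:
    """Per service, keep only the newest state event that starts before the oldest log row."""
    newest: Dict[int, dict] = {}
    for ev in events:
        if ev["start_time"] >= oldest_log_ctime:
            continue  # covered by the logs, so it will be rebuilt from them
        best = newest.get(ev["service_id"])
        if best is None or ev["start_time"] > best["start_time"]:
            newest[ev["service_id"]] = ev
    # Flag the survivors so the rebuild continues from them.
    return [dict(ev, last_update=1) for ev in newest.values()]


def seed_from_current_state(
    kept: List[dict], current: Dict[int, Tuple[str, int]]
) -> List[dict]:
    """Add a state event for every service that has no surviving record.

    `current` maps service_id -> (state, last_state_change).
    """
    known = {ev["service_id"] for ev in kept}
    seeded = list(kept)
    for service_id, (state, last_state_change) in current.items():
        if service_id not in known:
            seeded.append({
                "service_id": service_id,
                "start_time": last_state_change,
                "state": state,
                "last_update": 1,
            })
    return seeded
```

The point of the design is that the rebuild always has a starting state for each service, either from a surviving state event or from the current state, so a service with no log rows no longer falls back to Undetermined.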

My fix seems to recover as much history as possible. However, the underlying problem is with purge. 2.8 introduced partitioned tables. In 2.7 one could just save the last event before the retention limit for each service. In 2.8 these saved log entries will span a lot of partitions, but all "old" partitions are deleted by purge. A strategy could be:

Find the time limit R for the oldest partition Po to keep (from the name of the youngest partition to delete?).
For each service or host, find the youngest log entry Ey with ctime < R (among the records to be deleted).
- Skip to the next service or host if an entry E with ctime = R already exists.
- Create a new "initial" log entry Ei with ctime = R and Ei(status) = Ey(status). (Make sure Ei is recorded in Po in 2.8.x.)
Purge as before.
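
The strategy above can be sketched on in-memory rows in Python. This is only an illustration of the idea, not working purge code: the `(object_id, ctime, status)` tuples and the `carry_forward_before_purge` name are assumptions, and partition handling is ignored entirely.

```python
from typing import Dict, List, Set, Tuple

Row = Tuple[str, int, str]  # (object_id, ctime, status): hypothetical simplified log row


def carry_forward_before_purge(log: List[Row], r: int) -> List[Row]:
    """Return the rows that survive a purge at limit r, plus synthetic rows at ctime == r."""
    youngest_old: Dict[str, Row] = {}  # object -> youngest row with ctime < r (Ey)
    has_row_at_limit: Set[str] = set()
    kept: List[Row] = []
    for row in log:
        obj, ctime, _status = row
        if ctime < r:
            prev = youngest_old.get(obj)
            if prev is None or ctime > prev[1]:
                youngest_old[obj] = row
        else:
            kept.append(row)
            if ctime == r:
                has_row_at_limit.add(obj)
    for obj, (_, _, status) in youngest_old.items():
        if obj not in has_row_at_limit:
            kept.append((obj, r, status))  # Ei: carries Ey's status forward to ctime == r
    return sorted(kept, key=lambda row: (row[1], row[0]))


log = [
    ("svc1", 0, "OK"),        # only event ever seen, would be lost by a plain purge
    ("svc2", 5, "CRITICAL"),
    ("svc2", 20, "OK"),       # already has a row at the limit, nothing to add
]
print(carry_forward_before_purge(log, 20))
# [('svc1', 20, 'OK'), ('svc2', 20, 'OK')]
```

Because every synthetic row has ctime = R, it lands in the oldest kept range, which is the property the 2.8 partitioned tables would need.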

Any better ideas?

