centreon / centreon-archived

Centreon is a network, system and application monitoring tool. Centreon is the only AIOps Platform Providing Holistic Visibility to Complex IT Workflows from Cloud to Edge.
https://www.centreon.com
GNU General Public License v2.0

Design flaw in log retention destroys reporting information resulting in inappropriate "Undetermined". #6802

Open oyvjel opened 6 years ago

oyvjel commented 6 years ago

BUG REPORT INFORMATION

Centreon Web version: Verified on 2.7.8 and 2.8.26

Centreon Engine version:

Centreon Broker version:

OS: CentOS release 6.8 (Final) and CentOS Linux release 7.3.1611

Additional environment details (AWS, VirtualBox, physical, etc.): kvm virtual hosts.

Steps to reproduce the issue:
1. Monitor any stable service for a given time without any events.
2. Define "Retention duration for logs" shorter than the stable period above (we have 180 days).
3. Rebuild the state tables as suggested in several similar bug reports about excessive Undetermined time (eventReportBuilder -r ; dashboardBuilder -r).

Describe the results you received: Reports for the service go to 100% Undetermined.

Describe the results you expected: 100% OK as before the rebuild.

Additional information you think important (e.g. issue happens only occasionally): There are a number of other variants depending on the event history in the logs.

The root cause is the way retention works. Every night, all log data older than the retention duration is deleted. When the nightly eventReportBuilder runs, it only rebuilds the last day, which is fine because old servicestateevents are kept. However, rebuilding with -r truncates the servicestateevents table. All history before the retention limit is therefore lost, with a possible exception for the "last_state_change" info in the current state. When the initial event record for a service without any other events is lost, all history for that service is lost and it appears as 100% "Undetermined" until a new event occurs.
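To make the failure mode concrete, here is a small Python simulation of the sequence described above. It is only a sketch under simplifying assumptions: the `LogEntry` type and the `purge`, `rebuild_state_events` and `report` functions are hypothetical stand-ins, not Centreon code, and the real tables and scripts are more involved.

```python
from dataclasses import dataclass

DAY = 86400  # seconds


@dataclass
class LogEntry:
    service: str
    time: int   # seconds since monitoring started (simplified clock)
    state: str  # e.g. "OK", "CRITICAL"


def purge(logs, now, retention_days):
    """Nightly purge (simplified): drop entries older than the retention limit."""
    limit = now - retention_days * DAY
    return [e for e in logs if e.time >= limit]


def rebuild_state_events(logs, now):
    """Full rebuild (simplified): derive state intervals only from surviving logs."""
    by_service = {}
    for e in sorted(logs, key=lambda e: e.time):
        by_service.setdefault(e.service, []).append(e)
    events = []
    for service, entries in by_service.items():
        for cur, nxt in zip(entries, entries[1:] + [None]):
            end = nxt.time if nxt else now  # last state lasts until "now"
            events.append((service, cur.state, cur.time, end))
    return events


def report(events, service, start, end):
    """Percentage of [start, end) per state; uncovered time is Undetermined."""
    total = end - start
    seconds = {}
    for svc, state, s, e in events:
        if svc != service:
            continue
        overlap = min(e, end) - max(s, start)
        if overlap > 0:
            seconds[state] = seconds.get(state, 0) + overlap
    covered = sum(seconds.values())
    if covered < total:
        seconds["Undetermined"] = total - covered
    return {k: 100.0 * v / total for k, v in seconds.items()}


now = 365 * DAY
logs = [LogEntry("ping", 0, "OK")]  # one initial event, then a year of stability

# State events built while the initial log entry still existed
before = report(rebuild_state_events(logs, now), "ping", now - 30 * DAY, now)

# Purge with 180 days retention, then rebuild from scratch
purged = purge(logs, now, 180)
after = report(rebuild_state_events(purged, now), "ping", now - 30 * DAY, now)

print(before)
print(after)
```

In this model the only record proving the service was OK is older than the retention limit, so a rebuild from the purged logs has nothing to work with and the whole reporting period falls into Undetermined.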

The same problem exists for hoststate.

I am testing a fix for this and will post details when verified. Changes:

oyvjel commented 5 years ago

My fix seems to recover as much history as possible. However, the underlying problem is with purge. 2.8 introduced partitioned tables. In 2.7 one could just save the last event before the retention limit for each service. In 2.8 these saved log entries will span a lot of partitions, but all "old" partitions are deleted by purge. A strategy could be:

Any better ideas?
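For discussion, the 2.7-style idea mentioned above (keeping the last event before the retention limit for each service) could look roughly like this. It is a hypothetical Python sketch, not the actual fix and not Centreon's purge code: the `(service, time, state)` tuple layout and the `purge_keeping_last` name are assumptions, and it does not address the partition problem in 2.8.

```python
DAY = 86400  # seconds


def purge_keeping_last(logs, now, retention_days):
    """Purge old entries but keep, per service, the newest entry older than the limit.

    `logs` is a list of (service, time, state) tuples; this layout is an
    assumption made for the sketch, not the real log table schema.
    """
    limit = now - retention_days * DAY
    anchors = {}  # service -> newest entry older than the limit
    kept = []     # entries inside the retention window
    for entry in logs:
        service, time, _state = entry
        if time >= limit:
            kept.append(entry)
        elif service not in anchors or time > anchors[service][1]:
            anchors[service] = entry
    return sorted(list(anchors.values()) + kept, key=lambda e: e[1])


now = 365 * DAY
logs = [
    ("ping", 0, "OK"),
    ("disk", 10 * DAY, "OK"),
    ("disk", 20 * DAY, "CRITICAL"),
    ("disk", 300 * DAY, "OK"),
]
result = purge_keeping_last(logs, now, 180)
print(result)
```

With one anchor entry per service surviving the purge, a later full rebuild would still know the state each service was in at the start of the retention window.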