Icinga / icinga-reports

Icinga Reports 1.x MySQL (EOL 31.12.2018)
GNU General Public License v2.0

[dev.icinga.com #4511] telling 100% down if no data is found #60

Open icinga-migration opened 11 years ago

icinga-migration commented 11 years ago

This issue has been migrated from Redmine: https://dev.icinga.com/issues/4511

Created by tkoeberl on 2013-08-05 09:22:48 +00:00

Assignee: (none) Status: New Target Version: Backlog Last Update: 2015-08-21 21:35:31 +00:00 (in Redmine)

Icinga Version: 1.8.4
DB Type: MySQL
DB Version: mysql Ver 14.14 Distrib 5.1.66, for debian-linux-gnu (x86_64) using readline 6.1
Jasper Version: 5.0.0

It looks like the reporting shows 100% down if no data is found for the selected timeframe.

Using this data:

Select * from icinga_statehistory where icinga_statehistory.object_id=3215 and state_time >'2013-04-01 00:00:00'
statehistory_id    instance_id    state_time    state_time_usec    object_id    state_change    state    state_type    current_check_attempt    max_check_attempts    last_state    last_hard_state    output    long_output
769059    1    08/04/2013 12:58:10    575576    3215    1    2    0    1    3    0    0    CHECK_NRPE: Socket timeout after 10 seconds.    (null)
769061    1    08/04/2013 12:59:01    77334    3215    1    0    0    2    3    2    0    PROCS OK: 1 process with args 'DAP_edit_srv01'    (null)
783632    1    11/04/2013 21:51:24    203963    3215    1    2    1    1    3    0    2    CHECK_NRPE: Socket timeout after 10 seconds.    (null)
783681    1    11/04/2013 21:56:15    26459    3215    1    0    1    1    3    2    2    PROCS OK: 1 process with args 'DAP_edit_srv01'    (null)

I get 100% downtime for the timeframe from 12.04 00:00 to 13.04 00:00:

object_id    name1    name2    state    sla
3215    XXXXX    srv01_process    0    0.00000000
3215    XXXXX    srv01_process    1    1.00000000

and correct data for the timeframe from 11.04 to 13.04:

object_id    name1    name2    state    sla
3215    XXXXX    srv01_process    0    0.99831600
3215    XXXXX    srv01_process    1    0.00168400
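
For the 12.04 - 13.04 timeframe there are indeed no statehistory rows at all. A minimal check (reusing the object_id from the query above; the expected count is an assumption based on the data shown) would be:

-- No state changes were recorded for this object inside the window that is
-- reported as 100% down.
SELECT COUNT(*)
FROM icinga_statehistory
WHERE object_id = 3215
  AND state_time BETWEEN '2013-04-12 00:00:00' AND '2013-04-13 00:00:00';
-- expected: 0 for the data shown above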

Applying the patch from https://dev.icinga.org/issues/4152 did not change anything. See also http://www.monitoring-portal.org/wbb/index.php?page=Thread&threadID=28658


icinga-migration commented 11 years ago

Updated by mfriedrich on 2013-08-05 09:24:15 +00:00

icinga-migration commented 11 years ago

Updated by mfriedrich on 2013-08-05 09:25:02 +00:00

wrong section, moved.

icinga-migration commented 11 years ago

Updated by tgelf on 2013-08-20 13:17:37 +00:00

Has a downtime (icinga_downtimehistory) been set in the chosen time period?

icinga-migration commented 11 years ago

Updated by tkoeberl on 2013-09-05 11:56:34 +00:00

No: select * from icinga_downtimehistory where icinga_downtimehistory.object_id=3215 returns only later dates.

downtimehistory_id  instance_id downtime_type   object_id   entry_time  author_name comment_data    internal_downtime_id    triggered_by_id is_fixed    duration    scheduled_start_time    scheduled_end_time  was_started actual_start_time   actual_start_time_usec  actual_end_time actual_end_time_usec    was_cancelled   is_in_effect    trigger_time
2167    1   1   3215    06/05/2013 12:51:05 user    comment 957 0   1   7200    06/05/2013 12:50:51 06/05/2013 14:50:51 1   06/05/2013 12:51:15 102855  06/05/2013 14:50:51 240559  0   1   06/05/2013 12:51:15
2590    1   1   3215    10/07/2013 11:33:40 user    comment 1380    0   1   720 10/07/2013 11:33:07 10/07/2013 11:45:07 1   10/07/2013 11:33:51 895930  10/07/2013 11:45:07 177049  0   1   10/07/2013 11:33:51

icinga-migration commented 11 years ago

Updated by tkoeberl on 2013-09-05 13:02:44 +00:00

Maybe this helps. Here is everything for the same host/service over the last month. Using this data:

statehistory_id instance_id state_time  state_time_usec object_id   state_change    state   state_type  current_check_attempt   max_check_attempts  last_state  last_hard_state output  long_output
1036773 1   12/08/2013 16:37:16 864225  3215    1   2   1   1   3   0   2   CHECK_NRPE: Socket timeout after 10 seconds.    (null)
1036918 1   12/08/2013 16:40:52 179496  3215    1   0   1   1   3   2   2   PROCS OK: 1 process with args 'DAP_edit_srv01'  (null)
1039456 1   13/08/2013 14:01:40 942330  3215    1   2   1   1   3   0   2   CHECK_NRPE: Socket timeout after 10 seconds.    (null)
1039501 1   13/08/2013 14:05:52 91958   3215    1   0   1   1   3   2   2   PROCS OK: 1 process with args 'DAP_edit_srv01'  (null)
1072083 1   28/08/2013 13:18:20 160623  3215    1   2   1   1   3   0   2   CHECK_NRPE: Socket timeout after 10 seconds.    (null)
1072516 1   28/08/2013 13:23:56 652165  3215    1   2   0   1   3   2   2   CHECK_NRPE: Socket timeout after 10 seconds.    (null)
1072599 1   28/08/2013 13:24:11 230734  3215    1   0   0   2   3   2   2   PROCS OK: 1 process with args 'DAP_edit_srv01'  (null)

I get 11% downtime from '2013-08-01 00:00:00' to '2013-08-31 23:59:59'

object_id   name1   name2   state   sla
3215    xxx DAP_srv01_process   0   0.88867700
3215    xxx DAP_srv01_process   1   0.11132300

Now some maths. The total amount of time is:

select time_to_sec(timediff('2013-08-31 23:59:59', '2013-08-01 00:00:00' )) / 3600;
743.9997

Downtime in hours:
743.9997 * 0.11132300 = 82.8242786031
This almost exactly matches:

select time_to_sec(timediff('2013-08-31 23:59:59', '2013-08-28 13:24:11' )) / 3600;
82.5967
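
The same comparison in a single statement (just recombining the numbers above; the column aliases are mine):

-- Reported downtime hours vs. the gap between the last recorded state change
-- and the end of the report window (values taken from this comment).
SELECT
    TIME_TO_SEC(TIMEDIFF('2013-08-31 23:59:59', '2013-08-01 00:00:00')) / 3600 * 0.11132300 AS reported_downtime_hours,  -- ~82.82
    TIME_TO_SEC(TIMEDIFF('2013-08-31 23:59:59', '2013-08-28 13:24:11')) / 3600 AS hours_since_last_event;                -- ~82.60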

To me it looks like the calculation does NOT assume that the service is running right now, but it is ;)

icinga-migration commented 9 years ago

Updated by berk on 2015-05-18 12:17:46 +00:00

icinga-migration commented 9 years ago

Updated by mdetrano on 2015-08-21 21:35:31 +00:00

I came across this problem too, and was able to get the reports working a little better with the attached patch applied to the availability.sql script that creates the stored icinga_availability function in MySQL.

The patch makes sure at least an initial state record exists at the "start" time of the report, by using data from the most recent event in the statehistory table prior to that time.
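
Roughly sketched in SQL (table and column names as used in the queries earlier in this issue; the actual statement in the attached patch may differ):

-- Take the last event recorded before the report start as the initial state.
SELECT state, state_time
FROM icinga_statehistory
WHERE object_id = 3215
  AND state_time <= '2013-08-01 00:00:00'
ORDER BY state_time DESC
LIMIT 1;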

The patch also deals with an issue where downtimes scheduled after the start of the report, but never triggered, caused the report to show "100% ok" even when that wasn't the case.
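
Such never-started downtimes can be spotted via the was_started flag, for example (again only a sketch, not the exact logic of the patch):

-- Scheduled downtimes inside the report window that were never actually started.
SELECT downtimehistory_id, scheduled_start_time, scheduled_end_time, was_started
FROM icinga_downtimehistory
WHERE object_id = 3215
  AND scheduled_start_time >= '2013-08-01 00:00:00'
  AND was_started = 0;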

There are probably better ways to fix this issue, but I hope it helps as a workaround for now for anyone looking to fix the availability reports.