eoliveros closed this 1 year ago.
I am not sure how this helps with the alarm being too sensitive?
We need a metric that reports 1 or 0: 0 = ok, 1 = breached.
The TICKscript would count the number of times the breach happens in x minutes. If the count is >= 60 (15 minutes' worth of samples), it would send the notification. At the moment, the metric to count does not exist in InfluxDB. The current alarm looks at the difference between the local blockheight and the remote blockheight and sends the alarm if it is greater than 0.
If the remote does respond, then it's the blockheight - 1, which basically meant it would alarm straight away because the resulting value is over 60. I also configured the threshold not to be the full 15 minutes; I lowered it to 14 minutes.
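A minimal Python sketch of the count-over-window rule, assuming one 0/1 breach sample every 15 seconds (4 per minute, so a full 15-minute window holds 60 samples). It shows the 14-minute relaxation as a lower threshold. The function and constant names are hypothetical; the real check lives in the TICKscript, not here.

```python
from __future__ import annotations

SAMPLES_PER_MINUTE = 4  # assumption: one sample every 15 seconds


def should_alert(flags: list[int], minutes: int = 14) -> bool:
    """Return True if the breach count in the window reaches the threshold.

    flags: the 0/1 breach samples for the last 15 minutes (1 = breached).
    minutes: how many minutes of breaches are required (14 instead of the full 15).
    """
    threshold = minutes * SAMPLES_PER_MINUTE  # 14 * 4 = 56, or 15 * 4 = 60
    return sum(flags) >= threshold


window = [1] * 58 + [0] * 2  # hypothetical: 58 of 60 samples breached
print(should_alert(window))              # True with the 14-minute threshold
print(should_alert(window, minutes=15))  # False with the full 15 minutes
```

Lowering the threshold to 14 minutes lets a window with a couple of healthy samples still alarm, which is the trade-off described above.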
You already have a metric for that in blockheight_diff:
blockheight_diff = 0 : ok
blockheight_diff != 0 : breached
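The rule above maps directly onto a 0/1 flag derived from the existing field. A hypothetical Python helper (not part of the telegraf plugin) could look like this:

```python
def breach_flag(blockheight_diff: int) -> int:
    """Map blockheight_diff to a 0/1 flag: 0 = ok, 1 = breached."""
    return 0 if blockheight_diff == 0 else 1


diffs = [0, 0, 1, 0, 3]  # hypothetical blockheight_diff samples
flags = [breach_flag(d) for d in diffs]
print(flags)  # [0, 0, 1, 0, 1]
```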
SQL:
> select count("blockheight_diff") FROM "telegraf"."autogen"."lightning_info" WHERE "host" = 'be.bitforge.me' and "blockheight_diff" != 0 and "time" > now() -5m group by time(1m)
>
>
It's not returning anything; does that mean it's ok or not? If I remember correctly, when no value is returned, Kapacitor does not send a notification for the "ok" state. That also means that if the state turns critical and no value is returned afterwards, it will remain critical forever.
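A toy Python model of the behavior described above, assuming the alert state is re-evaluated only when the query returns a value. This is not Kapacitor's implementation; it only illustrates why a window that returns nothing can leave the state stuck at critical. The function name is hypothetical.

```python
from __future__ import annotations

from typing import Optional


def run_alert(windows: list[Optional[int]]) -> list[str]:
    """Replay per-window query results and return the alert state after each.

    Each entry is the breach count for one window, or None when the
    query returned no rows at all.
    """
    state = "OK"
    history = []
    for result in windows:
        if result is not None:  # only a returned value can change the state
            state = "CRITICAL" if result > 0 else "OK"
        history.append(state)  # None leaves the previous state in place
    return history


# The WHERE clause drops non-breach points, so a recovered node yields None, not 0.
print(run_alert([0, 3, None, None]))  # ['OK', 'CRITICAL', 'CRITICAL', 'CRITICAL']
```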
The only way I've found to alarm on this correctly is to create a metric that records whether the breach has happened or not (similar to a keepalive) and then take the "sum" of this metric over the last 15 minutes. If the sum is > 14 x 4 (4 samples per minute), send the notification.
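A minimal sketch of the keepalive idea, assuming a hypothetical collector that writes a 0/1 point every 15 seconds whether or not a breach occurred. Because a point is always written, the 15-minute sum is always defined (0 when healthy), so the transition back to OK can happen. The 14 x 4 = 56 threshold follows the comment above; the class and constant names are illustrative only.

```python
from collections import deque

SAMPLES_PER_MINUTE = 4                # assumption: one point every 15 s
WINDOW = 15 * SAMPLES_PER_MINUTE      # last 15 minutes = 60 points
THRESHOLD = 14 * SAMPLES_PER_MINUTE   # alarm when the sum is > 56


class KeepaliveAlarm:
    """Hypothetical rolling-window alarm over a 0/1 keepalive metric."""

    def __init__(self) -> None:
        self.points = deque(maxlen=WINDOW)  # oldest point drops off automatically
        self.state = "OK"

    def record(self, blockheight_diff: int) -> str:
        # Always store a point: 1 if breached, 0 if ok (the "keepalive").
        self.points.append(1 if blockheight_diff != 0 else 0)
        self.state = "CRITICAL" if sum(self.points) > THRESHOLD else "OK"
        return self.state


alarm = KeepaliveAlarm()
states = [alarm.record(1) for _ in range(57)]
print(states[-1])  # CRITICAL once the sum exceeds 56
for _ in range(WINDOW):
    alarm.record(0)
print(alarm.state)  # OK again after a window of healthy points
```

Feeding 57 consecutive breaches flips the state to CRITICAL, and a run of healthy points brings it back to OK, which is the recovery the empty-query approach could not produce.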
Changed to draft.
Made the changes above and will see whether it alarms at all, for both the critical and ok notifications.
I will need to wait for an alarm to happen before proceeding.
Changed the count|sum function to mean. Also changed the operator to >= 1.
So the only time it should alarm is when the mean is greater than or equal to 1.
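Why mean >= 1 is strict: if the averaged metric is the 0/1 breach flag (an assumption; the thread does not say which field the mean is taken over), the mean can only reach 1 when every sample in the window is a breach. A quick Python check with a hypothetical helper:

```python
from __future__ import annotations

from statistics import mean


def mean_alert(flags: list[int]) -> bool:
    """Alarm only when the mean of the 0/1 flags is >= 1, i.e. all samples breached."""
    return bool(flags) and mean(flags) >= 1


print(mean_alert([1] * 60))        # True: every sample breached
print(mean_alert([1] * 59 + [0]))  # False: one ok sample pulls the mean below 1
```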
Going to cancel this PR.
This is to fix the blockheight alarm being too sensitive.