bitforge-me / beryllium


added the blockheight_diff_threshold metric to the lightningd_info #119

Closed eoliveros closed 1 year ago

eoliveros commented 1 year ago

this is to fix the blockheight alarm being too sensitive.

djpnewton commented 1 year ago

i am not sure how this helps with the alarm being too sensitive?

eoliveros commented 1 year ago

need a metric that says 1 or 0: 0 = ok, 1 = breached.

the TICK script would count the number of times the breach happens within a window of x minutes. if the count is >= 60 (15 minutes' worth of samples), it would send the notification. At the moment the metric to count does not exist in influxdb. the current alarm looks at the difference between the local blockheight and the remote blockheight and sends the alarm if it's greater than 0.

if the remote does respond then it's the blockheight - 1, which basically means it would alarm straight away because the resulting value is over 60. i also configured the threshold to not be the full 15 minutes; i lessened this to 14 minutes.
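A minimal sketch of the 0/1 metric described in this comment, assuming a breach simply means the local and remote heights differ (as in the current alarm). The function name `breach_flag` and the sample heights are hypothetical and do not come from the PR.

```python
def breach_flag(local_height: int, remote_height: int) -> int:
    """Return 1 (breached) when the two heights differ, else 0 (ok).

    Assumption: "breached" means blockheight_diff != 0; the PR may
    define the threshold differently.
    """
    blockheight_diff = local_height - remote_height
    return 0 if blockheight_diff == 0 else 1


# Hypothetical samples: (local, remote) block heights.
samples = [(800_000, 800_000), (799_998, 800_000), (800_000, 800_000)]
flags = [breach_flag(local, remote) for local, remote in samples]
print(flags)
```

Because a flag is written for every sample, including the ok ones, a later query over a time window always has points to work with.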

djpnewton commented 1 year ago

you already have a metric for that in blockheight_diff

blockheight_diff = 0: ok
blockheight_diff != 0: breached

eoliveros commented 1 year ago

> you already have a metric for that in blockheight_diff
>
> blockheight_diff = 0: ok
> blockheight_diff != 0: breached

sql:

> SELECT count("blockheight_diff") FROM "telegraf"."autogen"."lightning_info" WHERE "host" = 'be.bitforge.me' AND "blockheight_diff" != 0 AND "time" > now() - 5m GROUP BY time(1m)

it's not returning anything. does that mean it's ok or not? if i remember correctly, when no value is returned, kapacitor does not send a notification for the "ok" state, which also means that if the state turns critical and no value is returned afterwards, it will remain critical forever.
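The concern here can be illustrated with a toy model. This is not Kapacitor's actual implementation; `ToyAlert` and `filter_nonzero` are hypothetical names, and the only assumption is the one stated in the comment: the alert re-evaluates only when points arrive.

```python
def filter_nonzero(points):
    """Mimic WHERE "blockheight_diff" != 0: keep only breaching samples."""
    return [p for p in points if p != 0]


class ToyAlert:
    """Toy alert that changes state only when it receives points."""

    def __init__(self):
        self.state = "ok"

    def feed(self, points):
        # An empty result means nothing is evaluated, so the state is kept.
        if points:
            self.state = "critical"
        return self.state


alert = ToyAlert()
print(alert.feed(filter_nonzero([0, 2, 3])))  # breaching samples arrive
print(alert.feed(filter_nonzero([0, 0, 0])))  # recovered, but filter returns nothing
```

In this model the second call still reports critical, because the filter removes exactly the samples that would show recovery.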

The only way i've found to alarm on this correctly is to create a metric that records whether the breach has happened or not (similar to a keepalive) and then do a "sum" on this metric. the resulting sum over the last 15 minutes would then be checked, and if it's > 14x4 (4 samples per minute) send the notification.
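The sum-based check described above can be sketched as follows. It assumes 4 samples per minute (inferred from the "14x4" and "60 (15 minutes)" figures in this thread) and a 0/1 breach flag per sample; the constant and function names are hypothetical.

```python
SAMPLES_PER_MINUTE = 4                 # assumption inferred from "14x4"
WINDOW_MINUTES = 15
WINDOW_SIZE = WINDOW_MINUTES * SAMPLES_PER_MINUTE   # 60 samples
THRESHOLD = 14 * SAMPLES_PER_MINUTE                 # 56 breached samples


def should_notify(flags):
    """Sum the 0/1 breach flags of the last 15 minutes and compare."""
    window = flags[-WINDOW_SIZE:]
    return sum(window) > THRESHOLD


# A full window of breaches triggers; a single short blip does not.
print(should_notify([1] * 60))
print(should_notify([0] * 56 + [1] * 4))
```

With these numbers, the node has to be out of sync for more than 14 of the last 15 minutes before a notification goes out, and every window still contains points when the node is healthy.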

eoliveros commented 1 year ago

changed to draft.

made the changes above and will see if it alarms at all, for both the critical and ok notifications.

will need to wait for an alarm to happen before proceeding.

eoliveros commented 1 year ago

changed the count|sum function to mean. also changed the operator to >= 1.

so it should alarm only when the mean is greater than or equal to 1.
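A small sketch of what a `mean >= 1` condition implies. The thread does not say which field the mean is taken over, so both readings are shown; `mean_alarm` is a hypothetical helper, not code from the PR.

```python
def mean_alarm(values):
    """Alarm when the window mean is >= 1; an empty window never alarms."""
    if not values:
        return False
    return sum(values) / len(values) >= 1


# Reading 1: values are 0/1 breach flags.
# The mean reaches 1 only if every sample in the window is breached.
print(mean_alarm([1] * 60))
print(mean_alarm([1] * 59 + [0]))

# Reading 2: values are raw blockheight_diff samples.
# The mean reaches 1 when the node is, on average, at least one block off.
print(mean_alarm([0, 0, 3, 0]))
print(mean_alarm([2, 0, 2, 0]))
```

Under the second reading a single brief lag is averaged away, which may be why a separate 0/1 metric was no longer needed.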

going to cancel this pr.