Magickbase / infra


Add external monitoring message alarm #38

Open mdzh521 opened 1 year ago

mdzh521 commented 1 year ago

To send fault alerts to a sister team, we first need the sister team's email address. There are two ways to deliver the alerts:

  1. Through our existing alert system: add their email address and create a separate alert branch that notifies them in advance when a fault is expected.
  2. Notify them through Better Uptime.
mdzh521 commented 1 year ago

Alert rules for the sister team:

  1. When the CKB node has not produced a block for three minutes, send a block production delay notice (attention only). When no block has been produced for five minutes, send an alert that a failure has occurred and emergency maintenance is underway.
  2. When the CKB explorer has not updated its block for 3m, notify that block updates are delayed and an on-chain rollback is in progress. After 5m without a new block, alert that the CKB explorer has failed and is under emergency maintenance.
  3. When the explorer returns a non-200 status for 30s, send an alert at the same time that the explorer status is abnormal and emergency maintenance is underway.
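
A minimal sketch of how the thresholds in rules 1 and 3 could be evaluated, assuming the monitor already knows how many seconds have passed since the last block and how long the explorer has been returning non-200 responses. The function names and severity labels are hypothetical and are not taken from our alert system or from Better Uptime.

```python
from typing import Optional

# Thresholds taken from the rules above, in seconds.
NOTICE_AFTER_S = 3 * 60      # rule 1: notice after three minutes without a block
ALERT_AFTER_S = 5 * 60       # rule 1: failure alert after five minutes
HTTP_ALERT_AFTER_S = 30      # rule 3: non-200 status for 30 seconds


def node_block_severity(seconds_since_last_block: float) -> Optional[str]:
    """Return "notice", "alert", or None for the CKB node block rule."""
    if seconds_since_last_block >= ALERT_AFTER_S:
        return "alert"       # failure, emergency maintenance underway
    if seconds_since_last_block >= NOTICE_AFTER_S:
        return "notice"      # block production delay, attention only
    return None


def explorer_http_severity(seconds_non_200: float) -> Optional[str]:
    """Return "alert" once the explorer has been non-200 for 30s, else None."""
    return "alert" if seconds_non_200 >= HTTP_ALERT_AFTER_S else None
```

Checking the longer threshold first keeps the five-minute alert from being masked by the three-minute notice.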

Maintenance notifications for the sister team:

  1. Send a version upgrade email through the one-stop alert platform. Template: CKB Testnet node is being upgraded, duration 10m, please pay attention to the node status.
  2. When the service has been down and the repair is complete, send a repair completion email. Template: CKB Testnet node can be used normally again, we apologize for the inconvenience. @Keith-CY
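
The two email templates could be parameterized along these lines. This is only a sketch using Python's standard `string.Template`; the placeholder names `service` and `duration`, the helper names, and the exact wording are assumptions for illustration, not the format the alert platform requires.

```python
from string import Template

# Hypothetical templates modeled on the two maintenance emails above.
UPGRADE_NOTICE = Template(
    "$service is being upgraded, duration $duration, "
    "please pay attention to the node status."
)
REPAIR_NOTICE = Template(
    "$service can be used normally again, we apologize for the inconvenience."
)


def render_upgrade_notice(service: str, duration: str) -> str:
    """Fill in the version upgrade email body."""
    return UPGRADE_NOTICE.substitute(service=service, duration=duration)


def render_repair_notice(service: str) -> str:
    """Fill in the repair completion email body."""
    return REPAIR_NOTICE.substitute(service=service)
```

For example, `render_upgrade_notice("CKB Testnet node", "10m")` produces the upgrade message for the testnet node.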
Keith-CY commented 1 year ago

> 2. When the CKB explorer has not updated its block for 3m, notify that block updates are delayed and an on-chain rollback is in progress. After 5m without a new block, alert that the CKB explorer has failed and is under emergency maintenance.

On-chain rollback has its own signal and is not strongly related to the 3m no-update condition, so there should be two rules:

  1. 3m without an update and no rollback: notify about data latency
  2. 3m without an update and a rollback in progress: notify about the rollback
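
The split into two rules can be sketched as below, assuming the rollback signal is available to the monitor as a boolean. The parameter `rollback_in_progress` and the returned labels are hypothetical names chosen for illustration.

```python
from typing import Optional

NO_UPDATE_THRESHOLD_S = 3 * 60  # 3m without an explorer block update


def explorer_no_update_notice(
    seconds_since_update: float, rollback_in_progress: bool
) -> Optional[str]:
    """Pick which 3m notice applies, or None if the explorer is up to date."""
    if seconds_since_update < NO_UPDATE_THRESHOLD_S:
        return None
    # The rollback signal is independent of the no-update timer, so it only
    # decides which message is sent once the timer has fired.
    return "rollback" if rollback_in_progress else "data latency"
```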

Besides, I'm not sure whether treating 5m as an emergency maintenance alert is suitable, because CKB explorer may take time to process data. Any suggestions, @ShiningRay?