elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[Uptime] Monitor status alert is broken for multi step browser monitors #118998

Closed shahzad31 closed 2 years ago

shahzad31 commented 2 years ago

Kibana version: master

If you have a monitor with multiple steps, and one step fails while the others succeed, the alert configured for that monitor flip-flops between the recovered and down states.

This is especially true if one of the steps hits a timeout. For example, if the alert interval is 1 minute and the failed step takes 60 seconds to execute, the alert will flip-flop, since it only evaluates the status of a single step.

The main issue is that the query uses the monitor.status field to check the down status of the monitor.

This part of the code is the main problem:

https://github.com/elastic/kibana/blob/main/x-pack/plugins/uptime/server/lib/requests/get_monitor_status.ts#L61

            {
              term: {
                'monitor.status': STATUS,
              },
            },
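To illustrate why a filter on monitor.status alone is ambiguous, here is a rough in-memory sketch. The `PingDoc` shape, the sample run, and the helper names are hypothetical simplifications, and the sketch assumes (as described above) that per-step documents carry their own monitor.status while only the final document of a run carries a summary.

```typescript
// Hypothetical, heavily simplified document shape (an assumption for this sketch).
interface PingDoc {
  monitor: { status: 'up' | 'down' };
  summary?: { up: number; down: number };
}

// One run of a two-step monitor: the first step passes, the second fails,
// and a single summary document closes the run.
const sampleRun: PingDoc[] = [
  { monitor: { status: 'up' } },
  { monitor: { status: 'down' } },
  { monitor: { status: 'down' }, summary: { up: 0, down: 1 } },
];

// Mimics a term filter on monitor.status.
function matchStatus(docs: PingDoc[], status: 'up' | 'down'): PingDoc[] {
  return docs.filter((d) => d.monitor.status === status);
}

// Mimics an exists filter on the summary field.
function onlySummary(docs: PingDoc[]): PingDoc[] {
  return docs.filter((d) => d.summary !== undefined);
}

// The same failed run matches both an "up" and a "down" status filter.
console.log(matchStatus(sampleRun, 'up').length); // 1
console.log(matchStatus(sampleRun, 'down').length); // 2
// Restricting to summary documents leaves exactly one document per run.
console.log(onlySummary(sampleRun).length); // 1
```

Under these assumptions, whichever step document the alert happens to evaluate decides the outcome, whereas the summary document gives one unambiguous result per run.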

There are two ways we can fix this. One is to add a filter for the summary document:


            {
              exists: {
                field: 'summary',
              },
            },

Alternatively, we can update the existing filter to look for the summary.down count:

            {
              range: {
                'summary.down': {
                  gt: '0',
                },
              },
            },

Both solutions should be fine.
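For comparison, the two options could be sketched as filter builders like this. The `FilterClause` type and both function names are hypothetical and are not taken from get_monitor_status.ts; they only show how each option changes the `bool.filter` array.

```typescript
// Minimal clause types for this sketch; these are not the official Elasticsearch client types.
type FilterClause =
  | { term: Record<string, string> }
  | { exists: { field: string } }
  | { range: Record<string, { gt: number }> };

// Option 1: keep the status check, but only consider summary documents.
function statusFiltersWithSummary(status: string): FilterClause[] {
  return [
    { term: { 'monitor.status': status } },
    { exists: { field: 'summary' } },
  ];
}

// Option 2: replace the status check with a check on the summary's down count.
// A numeric 0 is used here; the snippet in this issue passes the string '0'.
function summaryDownFilters(): FilterClause[] {
  return [{ range: { 'summary.down': { gt: 0 } } }];
}

// Print both resulting bool queries side by side.
console.log(JSON.stringify({ bool: { filter: statusFiltersWithSummary('down') } }));
console.log(JSON.stringify({ bool: { filter: summaryDownFilters() } }));
```

One difference worth noting: option 1 still works for any STATUS value, while option 2 as written only expresses the "down" check.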

This is the config I used to reproduce this:


- type: browser
  id: cnn-monitor
  name: cnn-monitor
  schedule: '@every 1m'
  source:
    inline:
      script: |-
        step('load homepage', async () => {
          await page.goto('https://www.elastic.co');
        });
        step('hover over products menu', async () => {
          const cookieBanner = await page.$('#iubenda-cs-banner');
          await page.waitForTimeout(30*1000);
          await page.hover('css=[44data-nav-item=products]');
        });

elasticmachine commented 2 years ago

Pinging @elastic/uptime (Team:uptime)

dominiqueclarke commented 2 years ago

Thanks for finding this. It's interesting to me that the issue here is similar to the one I reported about the last successful screenshot. It seems we should be filtering for heartbeat/summary documents in more places.