cjcenizal opened 2 years ago
Pinging @elastic/platform-deployment-management (Team:Deployment Management)
The problem can also occur when manually executing a Watch via Dev Tools and a timeout occurs (because the report takes more time than the Kibana timeout):
```
POST _watcher/watch/<watchID>/_execute
```
Hi @jeanfabrice, could you please provide some more specific instructions for reproducing the bug (e.g. how can it be reproduced on Elastic Cloud, what URL should be used in the reporting section of the json, etc.)? Thanks!
Hello @ElenaStoeva - I checked this with @jeanfabrice and the way to reproduce should be:
```json
{
  "trigger": {
    "schedule": {
      "interval": "80000m"
    }
  },
  "input": {
    "none": {}
  },
  "condition": {
    "always": {}
  },
  "actions": {
    "email_admin": {
      "email": {
        "profile": "standard",
        "attachments": {
          "report.pdf": {
            "reporting": {
              "url": "PASTE HERE THE URL YOU GOT AT STEP 5",
              "retries": 1,
              "interval": "300s",
              "auth": {
                "basic": {
                  "username": "elastic",
                  "password": "TYPE HERE THE elastic PASSWORD"
                }
              }
            }
          }
        },
        "from": "TYPE HERE a valid email allowed in your Elastic cloud account",
        "to": [
          "TYPE HERE a valid email allowed in your Elastic cloud account"
        ],
        "subject": "Elasticsearch - test email watcher",
        "body": {
          "text": "test"
        }
      }
    }
  }
}
```
You should get multiple emails, or more generally you might see multiple reports being generated in Kibana / Reporting (but you'll need to log in as the elastic user to see them; do not use SSO). Note this is not strictly related to Elastic Cloud.
@sabarasaba and I met today to reproduce the bug and brainstorm what the cause might be. We followed the instructions above in Elastic Cloud on staging and observed behaviour similar to that described in the issue: sometimes we would get one generated report with no emails, sometimes we received just one email, and other times we received multiple emails sent with some delay (a couple of minutes between every two consecutive emails).
Upon checking the Watcher UI codebase, we saw that the watch execute/simulate requests it sends to ES contain no looping and no retries after failure. Also, the Watcher codebase doesn't contain any logic about reporting at all; it just sends the request to ES with the JSON as provided by the user. Given that, and the fact that this problem also occurs when the watch is executed manually in Dev Console, we think this bug is either related to the reporting feature or caused by ES retrying the watch execute requests if they fail.
The reporting feature is owned by @elastic/appex-sharedux. @tsullivan, I was told that you are in this team, would you be able to take a look and let us know if you think that this is related to the reporting feature?
Regarding the other possibility (ES retrying the failing watcher execute request), @jakelandis, would you be able to refer us to someone who works on the Watcher APIs?
I've not digged enough on this but it might be possible to enable the APM Agent features in Kibana and track the requests being executed. Maybe we can get a better picture of what's happening behind the scenes.
> Regarding the other possibility (ES retrying the failing watcher execute request), @jakelandis, would you be able to refer us to someone who works on the Watcher APIs?

@masseyke - can you help out here?
Hi @elastic/appex-sharedux and @masseyke, just following up on this, would you be able to take a look and see if the issue is related to the Reporting feature or the Watcher APIs?
Sorry for the delay. I'm trying to reproduce it now, but it's not failing for me (at least not in the way described) -- I just get a single failure due to `552 5.3.4 Error: message file too big`. I'll try with some smaller data.
Hmm, after making the dashboard small enough that the PDF would be accepted by the email system, I'm now getting one email for each execute call I make. It takes about 5 minutes, so I think it's pretty similar to what you were seeing. But I'm not getting the "report storm". Are you still seeing this with the latest code? Did we accidentally fix this?
Since I haven't been able to reproduce the problem, I looked into the code. I do not see any overall timeout/retry logic in Watcher. There is retry logic if the PDF download fails (up to 40 times by default, based on `xpack.notification.reporting.retries`), but that would only result in at most a single email being sent. We don't retry emails, but it's possible that the email server sends them multiple times. That seems pretty unlikely though. I'll keep trying to reproduce this, but I think I'm missing something.
Thank you @masseyke. Are you using multiple nodes in the cluster? My suspicion is that the retry is somehow done by the Kibana Elasticsearch client when calling the Watcher API. There is a linked internal issue above (2867) where we even enabled HTTP tracing to spot whether the call is made multiple times, and it seems the calls are effectively repeated.
> Are you using multiple nodes in the cluster?
It's a 3-node Elasticsearch cluster (I just took the Cloud defaults and increased the Kibana instance to 8 GB). Think I need more? And yeah, I saw from the log above that it looked like something called the watcher exactly every 30 seconds on a different node. That does look like a client doing retries after a 30-second timeout.
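The 30-second cadence is consistent with Kibana's default Elasticsearch request timeout. As a sketch (the value shown is the documented default, worth double-checking for your Kibana version):

```yaml
# kibana.yml (sketch): timeout after which Kibana's ES client gives up
# waiting for a response; 30000 ms is the documented default.
elasticsearch.requestTimeout: 30000
```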
@ElenaStoeva

> Upon checking the Watcher UI codebase, we saw that the watch execute/simulate requests it sends to ES contain no looping and no retries after failure.
I'm double-checking this and looking at the code that handles requests in Kibana from the Watcher UI. I believe the call to ES is here: https://github.com/elastic/kibana/blob/a02c00b/x-pack/plugins/watcher/server/routes/api/watch/register_execute_route.ts#L29-L33. The `watcher.executeWatch` method of the ES API can take an optional second `TransportRequestOptionsWithOutMeta` argument. To ensure the call to ES never retries after failure, you can modify the call to something like:
```ts
return dataClient.asCurrentUser.watcher
  .executeWatch({ body }, { maxRetries: 0 })
  .then((returnValue) => returnValue);
```
I couldn't find any documentation about this, but it seems the default value for `TransportRequestOptions.maxRetries` is 3. I think that could explain the report storm.
I tried reproducing the problem and was also not able to. When clicking the simulate button in the Watcher UI in my local development environment, I got an error of `Cannot simulate watch [1:793] [reporting_attachment] failed to parse field [auth]` -- but I'm not seeing where my watch definition has any errors.
I looked into this issue once again and reproduced it a couple of times in QA. When simulating the watch, I get a `backend closed connection` error in the browser console, and then I receive 2 to 5 emails (the number is different every time) in a 1-hour timeframe.
Note that at the Simulate tab I need to select the "execute" mode instead of "simulate" (otherwise I would not receive any emails):
Unfortunately, I'm not able to reproduce this locally (probably because of the lack of an email server), so I can't check whether your proposed solution, @tsullivan (setting `maxRetries` to 0), resolves the problem. However, I noticed that when I simulate the watch in a locally running Kibana, I see the following error in the logs:
```
Error: JSON argument must contain a watchStatusJson property
    at buildServerWatchStatusModel (watch_status_model.ts:36:21)
    at new WatchHistoryItem (watch_history_item.js:27:51)
    at Function.fromUpstreamJson (watch_history_item.js:84:12)
    at register_execute_route.ts:65:51
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at Router.handle (router.ts:207:30)
    at handler (router.ts:162:13)
    at exports.Manager.execute (/Users/elenastoeva/elastic/kibana/node_modules/@hapi/hapi/lib/toolkit.js:60:28)
    at Object.internals.handler (/Users/elenastoeva/elastic/kibana/node_modules/@hapi/hapi/lib/handler.js:46:20)
    at exports.execute (/Users/elenastoeva/elastic/kibana/node_modules/@hapi/hapi/lib/handler.js:31:20)
    at Request._lifecycle (/Users/elenastoeva/elastic/kibana/node_modules/@hapi/hapi/lib/request.js:371:32)
    at Request._execute (/Users/elenastoeva/elastic/kibana/node_modules/@hapi/hapi/lib/request.js:281:9)
```
I'm not sure if this error is related to the report storm, I will investigate it further.
Pinging @elastic/kibana-management (Team:Kibana Management)
@jeanfabrice reported this bug. He was able to reproduce it on 8.1.3 and 8.2.0. He created a watch that sends an email and generates a report from a dashboard containing a simple visualization. The watcher `retries` and `interval` were 1 and 300s, respectively.
Kibana log evidence showing that Kibana keeps scheduling a report task when running a single simulate:
ES logs: