cjcenizal opened 2 years ago
Pinging @elastic/platform-deployment-management (Team:Deployment Management)
The problem can also occur when manually executing a Watch via Dev Tools and a timeout occurs (because the report takes more time than the Kibana timeout):
```
POST _watcher/watch/<watchID>/_execute
```
Hi @jeanfabrice, could you please provide some more specific instructions for reproducing the bug (e.g. how can it be reproduced on Elastic Cloud, what URL should be used in the reporting section of the json, etc.)? Thanks!
Hello @ElenaStoeva - I checked this with @jeanfabrice and the way to reproduce should be:
```json
{
  "trigger": {
    "schedule": {
      "interval": "80000m"
    }
  },
  "input": {
    "none": {}
  },
  "condition": {
    "always": {}
  },
  "actions": {
    "email_admin": {
      "email": {
        "profile": "standard",
        "attachments": {
          "report.pdf": {
            "reporting": {
              "url": "PASTE HERE THE URL YOU GOT AT STEP 5",
              "retries": 1,
              "interval": "300s",
              "auth": {
                "basic": {
                  "username": "elastic",
                  "password": "TYPE HERE THE elastic PASSWORD"
                }
              }
            }
          }
        },
        "from": "TYPE HERE a valid email allowed in your Elastic cloud account",
        "to": [
          "TYPE HERE a valid email allowed in your Elastic cloud account"
        ],
        "subject": "Elasticsearch - test email watcher",
        "body": {
          "text": "test"
        }
      }
    }
  }
}
```
You should get multiple emails, or more generally you might see multiple reports being generated in Kibana / Reporting (but you'll need to log in as the elastic user to see them; do not use SSO). Note this is not strictly related to Elastic Cloud.
@sabarasaba and I met today to reproduce the bug and brainstorm what the cause might be. We followed the instructions above in Elastic Cloud on staging and observed behaviour similar to that described in the issue: sometimes we would get one generated report with no emails, sometimes we received just one email, and other times we received multiple emails sent with some delay (a couple of minutes between every two consecutive emails).
Upon checking the Watcher UI codebase, we saw that the watch execute/simulate requests it sends to ES contain no looping and no retries after failure. Also, the Watcher codebase doesn't contain any logic about reporting at all; it just sends the request to ES with the JSON as provided by the user. Given that, and the fact that this problem also occurs when the watch is executed manually in Dev Console, we think this bug is either related to the reporting feature or caused by ES retrying the watch execute requests if they fail.
The reporting feature is owned by @elastic/appex-sharedux. @tsullivan, I was told that you are in this team, would you be able to take a look and let us know if you think that this is related to the reporting feature?
Regarding the other possibility (ES retrying the failing watcher execute request), @jakelandis, would you be able to refer us to someone who works on the Watcher APIs?
I've not digged enough on this but it might be possible to enable the APM Agent features in Kibana and track the requests being executed. Maybe we can get a better picture of what's happening behind the scenes.
> Regarding the other possibility (ES retrying the failing watcher execute request), @jakelandis, would you be able to refer us to someone who works on the Watcher APIs?

@masseyke - can you help out here?
Hi @elastic/appex-sharedux and @masseyke, just following up on this, would you be able to take a look and see if the issue is related to the Reporting feature or the Watcher APIs?
Sorry for the delay. I'm trying to reproduce it now, but it's not failing for me (at least not in the way described) -- I just get a single failure due to `552 5.3.4 Error: message file too big`. I'll try with some smaller data.
Hmm, after making the dashboard small enough that the PDF would be accepted by the email system, I'm now getting one email for each execute call I make. It takes about 5 minutes, so I think it's pretty similar to what you were seeing. But I'm not getting the "report storm". Are you still seeing this with the latest code? Did we accidentally fix this?
Since I haven't been able to reproduce the problem, I looked into the code. I do not see any overall timeout/retry logic in Watcher. There is retry logic if the PDF download fails (up to 40 times by default, based on `xpack.notification.reporting.retries`), but that would only result in at most a single email being sent. We don't retry emails, but it's possible that the email server sends them multiple times. That seems pretty unlikely though. I'll keep trying to reproduce this, but I think I'm missing something.
Thank you @masseyke. Are you using multiple nodes in the cluster? My suspicion is that the retry is somehow done by the Kibana Elasticsearch client when calling the Watcher API. There is a linked internal issue above (2867) where we even enabled HTTP tracing to spot whether the call is made multiple times, and it seems the calls are effectively repeated.
> Are you using multiple nodes in the cluster?
It's a 3-node Elasticsearch cluster (I just took the Cloud defaults and increased the Kibana instance to 8 GB). Think I need more? And yeah, I saw from the log above that it looked like something called the watcher exactly every 30 seconds on a different node. That does look like a client doing retries after a 30-second timeout.
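The 30-second cadence is consistent with Kibana's default Elasticsearch request timeout. As a sketch (the value shown is the documented default, worth double-checking for your Kibana version):

```yaml
# kibana.yml (sketch): timeout after which Kibana's ES client gives up
# waiting for a response; 30000 ms is the documented default.
elasticsearch.requestTimeout: 30000
```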
@ElenaStoeva

> Upon checking the Watcher UI codebase, we saw that the watch execute/simulate requests it sends to ES contain no looping and no retries after failure.
I'm double-checking this and looking at the code that handles requests in Kibana from the Watcher UI. I believe the call to ES is here: https://github.com/elastic/kibana/blob/a02c00b/x-pack/plugins/watcher/server/routes/api/watch/register_execute_route.ts#L29-L33. The `watcher.executeWatch` method of the ES API can take an optional second `TransportRequestOptionsWithOutMeta` argument. To ensure the call to ES never retries after failure, you can modify the call to something like:
```ts
return dataClient.asCurrentUser.watcher
  .executeWatch({ body }, { maxRetries: 0 })
  .then((returnValue) => returnValue);
```
I couldn't find any documentation about this, but it seems the default value for `TransportRequestOptions.maxRetries` is 3. I think that could explain the report storm.
I tried reproducing the problem and was also not able to. When clicking the simulate button in the Watcher UI in my local development environment, I got an error of `Cannot simulate watch [1:793] [reporting_attachment] failed to parse field [auth]` -- but I'm not seeing where my watch definition has any errors.
I looked into this issue once again and reproduced it a couple of times in QA. When simulating the watch, I get a `backend closed connection` error in the browser console, and then I receive 2 to 5 emails (the number is different every time) in a 1-hour timeframe.
Note that at the Simulate tab I need to select the "execute" mode instead of "simulate" (otherwise I would not receive any emails):
Unfortunately, I'm not able to reproduce this locally (probably because of the lack of an email server), so I can't check whether your proposed solution, @tsullivan (setting `maxRetries` to 0), resolves the problem. However, I noticed that when I simulate the watch in a locally running Kibana, I see the following error in the logs:
```
Error: JSON argument must contain a watchStatusJson property
    at buildServerWatchStatusModel (watch_status_model.ts:36:21)
    at new WatchHistoryItem (watch_history_item.js:27:51)
    at Function.fromUpstreamJson (watch_history_item.js:84:12)
    at register_execute_route.ts:65:51
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at Router.handle (router.ts:207:30)
    at handler (router.ts:162:13)
    at exports.Manager.execute (/Users/elenastoeva/elastic/kibana/node_modules/@hapi/hapi/lib/toolkit.js:60:28)
    at Object.internals.handler (/Users/elenastoeva/elastic/kibana/node_modules/@hapi/hapi/lib/handler.js:46:20)
    at exports.execute (/Users/elenastoeva/elastic/kibana/node_modules/@hapi/hapi/lib/handler.js:31:20)
    at Request._lifecycle (/Users/elenastoeva/elastic/kibana/node_modules/@hapi/hapi/lib/request.js:371:32)
    at Request._execute (/Users/elenastoeva/elastic/kibana/node_modules/@hapi/hapi/lib/request.js:281:9)
```
I'm not sure if this error is related to the report storm, I will investigate it further.
Pinging @elastic/kibana-management (Team:Kibana Management)
@jeanfabrice reported this bug. He was able to reproduce it on 8.1.3 and 8.2.0. He created a watch that sends an email and generates a report from a dashboard containing a simple visualization. The watcher `retries` and `interval` were 1 and 300s, respectively.
Kibana log evidence showing that Kibana keeps scheduling a report task when running a single simulate:
ES logs: