Graylog2 / graylog2-server

Free and open log management
https://www.graylog.org

Loading recent message for extractor fails with warning about too many shards #4510

Closed lennartkoopmann closed 6 years ago

lennartkoopmann commented 6 years ago

[Screenshot from 2018-01-23 20-06-49]

Loading a message to create an extractor fails with the following error:

Unable to perform search query Trying to query 2280 shards, which is over the limit of 1000. This limit exists because querying many shards at the same time can make the job of the coordinating node very CPU and/or memory intensive. It is usually a better idea to have a smaller number of larger shards. Update [action.search.shard_count.limit] to a greater value if you really want to query that many shards at the same time.

The URL looks like no time range is applied to the query. I think it should be bound to only search in the last 1 or 2 hours or so.

joschi commented 6 years ago

The RecentMessageLoader should already be restricted to the last hour (as the help text mentions), also see #3367. https://github.com/Graylog2/graylog2-server/blob/0f97411137b2f5b36f3a24d0a74c409ef469df39/graylog2-web-interface/src/components/messageloaders/RecentMessageLoader.jsx#L28-L29

@lennartkoopmann Are the index ranges in your setup up-to-date and do the indices in your setup contain messages which are older than 1 hour?
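
The question about index ranges matters because a time-bounded search can only skip an index if the recorded time range for that index is known and current. A minimal sketch of that selection step follows. It is not Graylog's actual implementation: the `IndexRange` record, the `indices_for_relative_range` function, and the index names and timestamps are all hypothetical and exist only to illustrate the idea.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class IndexRange:
    # Hypothetical record: an index name plus the timestamps of the
    # oldest and newest message assumed to be stored in that index.
    index_name: str
    begin: datetime
    end: datetime


def indices_for_relative_range(ranges, range_seconds, now):
    """Return the names of indices whose recorded range overlaps
    the window [now - range_seconds, now]."""
    window_start = now - timedelta(seconds=range_seconds)
    return [
        r.index_name
        for r in ranges
        if r.end >= window_start and r.begin <= now
    ]


# Made-up example data: three indices covering consecutive periods.
now = datetime(2018, 1, 23, 20, 0, tzinfo=timezone.utc)
ranges = [
    IndexRange("graylog_98", now - timedelta(days=3), now - timedelta(days=2)),
    IndexRange("graylog_99", now - timedelta(days=2), now - timedelta(hours=5)),
    IndexRange("graylog_100", now - timedelta(hours=5), now),
]

# Only the newest index overlaps the last hour.
print(indices_for_relative_range(ranges, 3600, now))
```

If the recorded ranges are stale or missing, a filter like this cannot exclude older indices, and the number of shards touched by the query grows accordingly.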

bernd commented 6 years ago

Fixed in 2.4 via #4514. The fix for master is not done yet, so I am leaving the issue open but moving it to the 3.0 milestone.

J-Camping commented 6 years ago

I noticed that in 2.4.3, editing GROK extractors and showing recent messages for inputs don't apply a range limit to their queries, so my Elasticsearch cluster chokes when trying to look up the messages. Can those searches also be modified to limit the amount of data being requested from ES?

edmundoa commented 6 years ago

@TheJCamping we would really appreciate it if you could provide some more details about the issue you are facing, namely:

  1. In which page and part of the application exactly are you seeing the issue? If it's hard to describe you can upload a screenshot.
  2. Are there any errors in your browser's developer console and/or server logs when the issue occurs?

Thank you in advance.

J-Camping commented 6 years ago

@edmundoa

On the /system/inputs page, selecting "show recent messages" for an input results in either a page that never loads or this error message:

Error Message: Unable to perform search query
Details: Search status code: 500
Search response: cannot GET http://192.168.13.37:12900/search/universal/relative?query=gl2_source_input%3A54863813a78e39c792b058d1&range=0&limit=150&sort=timestamp%3Adesc (500)

In Chrome's developer tools, the Network tab shows that a request to this address times out: http://graylog-01:12900/search/universal/relative?query=gl2_source_input%3A54863813a78e39c792b058d1&range=0&limit=150&sort=timestamp%3Adesc. If I change the range to 3000, the request loads.

A similar thing happens when I go to edit an extractor and it tries to load an example message. It goes to a spinning wheel that says "Loading...". Example of what shows in the network tab: [Screenshot: 2018-01-26 10_27_43-2018-01-26 10_25_52-graylog2]

In both instances, the Elasticsearch cluster becomes bogged down because the search runs against the entire cluster without a range limit. I know this is partially due to a lack of resources for the Elasticsearch cluster. I have recent (30 days) indices on fast storage and older ones on slower storage.

Cluster info: 101 indices with ~5 billion messages using 9.6 TB, on Elasticsearch 5.6.6 and Graylog 2.4.3.

edmundoa commented 6 years ago

@TheJCamping Thank you for the detailed report!

The first issue you mentioned is only partially related to this one, so I opened https://github.com/Graylog2/graylog2-server/issues/4533 and I kindly ask you to continue the conversation in there.

The second issue in your previous comment seems to be the same one reported here, so I will reopen the issue so that we can take another look.

First of all, could you please try editing the extractor in another browser? I know it sounds silly, but I just want to see if for some reason your web interface is running some cached version of the code.

J-Camping commented 6 years ago

@edmundoa

I just tried it in Firefox as well as Chrome on another system. This is also happening with all extractors on all inputs.

Thank you

dennisoelkers commented 6 years ago

@TheJCamping We fixed this issue (loading a message before creating an extractor) for 2.4.3. Did you restart your server/reload your web interface before trying again? I have just verified that it works reliably in 2.4.3, so I am closing this issue for now. If you are sure that your server and web interface are up to date (including a full page refresh of the web interface), then reopen the issue please.

J-Camping commented 6 years ago

@dennisoelkers

I am sure that the server and web interface are up to date; I have restarted the server again just to be sure. All of my nodes are at 2.4.3. I have tried on 3 separate systems with 2 different browsers, and in incognito mode as well.

I think the error message I posted earlier shows that RecentMessageLoader is using relative, not range, which is the change that was made here, right? https://github.com/Graylog2/graylog2-server/pull/4513/

Please let me know if there is any other info I can provide.

edmundoa commented 6 years ago

@TheJCamping that's not exactly the change. The URL in the information you provided goes to the right endpoint (it should use /search/universal/relative), but it should also contain a range=3600 query parameter, which is missing in your case and which caused the initial issue as well.

When you updated your Graylog setup, did you also update all plugins bundled with it? It would be really helpful if you could share the exact version of each plugin in your system (an ls in the plugins directory should be enough for that).

Thank you in advance!

J-Camping commented 6 years ago

@edmundoa

Thank you for the explanation.

I am using the yum packages to update Graylog.

Here is the result from the ls:

graylog-plugin-beats-2.4.3.jar
graylog-plugin-collector-2.4.3.jar
graylog-plugin-map-widget-2.4.3.jar
graylog-plugin-pipeline-processor-2.4.3.jar
graylog-plugin-aws-2.4.3.jar
graylog-plugin-cef-2.4.3.jar
graylog-plugin-enterprise-integration-2.4.3.jar
graylog-plugin-netflow-2.4.3.jar
graylog-plugin-threatintel-2.4.3.jar

edmundoa commented 6 years ago

@TheJCamping

To be sure I'm running the same code as you are, I downloaded the OVA image for Graylog 2.4.3, and I could not reproduce the issue. I tried a couple of browsers, and when I load a recent message from an input in the extractors page, the URL used to request the message looks right:

Request URL: http://graylog:9000/api/search/universal/relative?query=gl2_source_input%3A5a719c11d73f9505ef1832d6%20OR%20gl2_source_radio_input%3A5a719c11d73f9505ef1832d6&range=3600&limit=1&decorate=false

As you can see it contains the range=3600 query parameter which will limit the search to the last hour.
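
For illustration, the shape of that request can be sketched as a small URL builder. This is not the web interface's code: `build_relative_search_url` is a hypothetical helper, and only the endpoint path and the parameter names (`query`, `range`, `limit`, `decorate`) are taken from the URL quoted above.

```python
from urllib.parse import quote, urlencode


def build_relative_search_url(api_base, input_id, range_seconds=3600, limit=1):
    """Build a relative search URL that is always bounded in time.

    Hypothetical helper: it refuses non-positive ranges so that a caller
    cannot accidentally issue an unbounded search.
    """
    if range_seconds <= 0:
        raise ValueError("range_seconds must be a positive number of seconds")
    query = f"gl2_source_input:{input_id} OR gl2_source_radio_input:{input_id}"
    params = {
        "query": query,
        "range": range_seconds,
        "limit": limit,
        "decorate": "false",
    }
    # quote_via=quote encodes spaces as %20 (not '+'), as in the quoted URL.
    return f"{api_base}/search/universal/relative?" + urlencode(params, quote_via=quote)


print(build_relative_search_url("http://graylog:9000/api", "5a719c11d73f9505ef1832d6"))
```

With the input ID from the quoted request, the printed URL should have the same query string as the one shown above, including `range=3600`.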

Could you please provide some more details about the setups where you find this issue? I'm especially interested in the number of nodes, and also whether there is any load balancer or proxy in between.

J-Camping commented 6 years ago

@edmundoa

I just downloaded the OVA onto my system to verify. I created a new TCP Syslog input and started sending traffic to it. I created a GROK extractor and then went to edit it. It worked, since the Elasticsearch cluster only has a few thousand messages, but the request URL was this:

'http://192.168.15.46:9000/api/search/universal/relative?query=gl2_source_input%3A5a7285c3e7a1d606088725f7%20OR%20gl2_source_radio_input%3A5a7285c3e7a1d606088725f7&limit=1'

Still no range in the request.
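
A quick way to compare the request URLs quoted in this thread is to parse out the `range` parameter. The sketch below uses only the Python standard library. `describe_range` is a hypothetical helper, and treating `range=0` as an unbounded search is an assumption based on the behavior reported earlier in the thread (the request with `range=0` timed out, while `range=3000` loaded).

```python
from urllib.parse import parse_qs, urlparse


def describe_range(url):
    """Classify the time bound of a relative search URL by its range parameter."""
    values = parse_qs(urlparse(url).query).get("range")
    if not values:
        return "missing"
    seconds = int(values[0])
    # Assumption: a range of 0 means "no time limit".
    return "unbounded" if seconds == 0 else f"last {seconds} seconds"


urls = [
    "http://graylog:9000/api/search/universal/relative?query=gl2_source_input%3A5a719c11d73f9505ef1832d6%20OR%20gl2_source_radio_input%3A5a719c11d73f9505ef1832d6&range=3600&limit=1&decorate=false",
    "http://graylog-01:12900/search/universal/relative?query=gl2_source_input%3A54863813a78e39c792b058d1&range=0&limit=150&sort=timestamp%3Adesc",
    "http://192.168.15.46:9000/api/search/universal/relative?query=gl2_source_input%3A5a7285c3e7a1d606088725f7%20OR%20gl2_source_radio_input%3A5a7285c3e7a1d606088725f7&limit=1",
]
for url in urls:
    print(describe_range(url))
```

Only the first URL carries an explicit one-hour bound; the other two either ask for everything or leave the range unspecified.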

Were you editing a GROK extractor?

To answer your previous question, I have 3 nodes in my Graylog cluster and no load balancer or proxy in front of the web interfaces.

edmundoa commented 6 years ago

@TheJCamping That makes the issue clear. I could reproduce it now; it is related to this one but slightly different, which is why we didn't see it until now. I opened another issue for it to avoid confusion, feel free to add any comments there: #4553. I'll also close this issue now.

Thank you!