My thought so far is to use the REST API to query results with an absolute time range (use whatever scripting language you like to auto-generate the search timestamps) and check whether the results are empty, as they would be when searching via the web interface. If they are empty, have a condition that triggers an index range rebuild.
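A rough sketch of that check with curl, assuming a Graylog 1.x REST API on port 12900 (the hostname, credentials, and timestamps below are placeholders):

```bash
# Query an absolute time range and extract the hit count. A total_results of 0
# for a window that should contain messages is a hint that the ranges are out of sync.
from="2015-11-30T00:00:00.000Z"   # generate these timestamps from your scripting language
to="2015-11-30T01:00:00.000Z"
curl -s -G --user username:password \
  "http://graylog.fullqualified.name:12900/search/universal/absolute" \
  --data-urlencode "query=*" \
  --data-urlencode "from=${from}" \
  --data-urlencode "to=${to}" \
  --data-urlencode "limit=1" \
  | grep -oP '"total_results":\d+'
```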
Ideally the bug would be fixed instead. There are quite a few issues open on it and some supposed fixes, but nothing seems to correct this so far.
Took a stab at a proof of concept that I just tested with positive results. Sorry it is crude!
```bash
#!/bin/bash
# Proof of concept: search the last hour via the Graylog REST API and trigger
# an index range rebuild if no messages are returned.
graylogServerHttp="http"
graylogServerHostName="graylog.fullqualified.name"
graylogServerPort="12900"
triggerAlert=0

# Keyword search over the last hour; extract the total_results count from the JSON.
indexrangeresult=$(curl -s --user username:password \
  "${graylogServerHttp}://${graylogServerHostName}:${graylogServerPort}/search/universal/keyword?query=*&keyword=last%201%20hour&limit=100" \
  | grep -oP '"total_results":\d+' | grep -oP '[0-9]+')
if [ $? -ne 0 ] || [ -z "$indexrangeresult" ]; then
    echo "graylog-server not running"
else
    #echo "Elasticsearch is running."
    if [ "$indexrangeresult" = "0" ]; then
        echo "range empty, recalc"
        triggerAlert=1
    else
        echo "range ok"
    fi
fi

if [ "$triggerAlert" -eq 1 ]; then
    echo "Running curl to recalculate ranges"
    curl -s -X POST --user username:password \
      "${graylogServerHttp}://${graylogServerHostName}:${graylogServerPort}/system/indices/ranges/rebuild"
fi
```
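I just run it from cron; a minimal example entry (the path, schedule, and log file are placeholders):

```
*/10 * * * * /usr/local/bin/graylog-range-check.sh >> /var/log/graylog-range-check.log 2>&1
```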
The index ranges (which are really just the min and max timestamps of the messages in an index) usually don't change once they've been calculated after an index has been rotated.
Either you're modifying those indices externally (which you really shouldn't do) or this is another bug that we couldn't reproduce until now. In the latter case, recalculating the index ranges simply increases load on your Elasticsearch cluster and doesn't help find or fix the root cause of the problem you have.
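To illustrate what such a range boils down to, this is roughly the kind of Elasticsearch query that yields the min and max timestamps for one index (the host and index name are placeholders, and this sketch is only for illustration, not necessarily how the ranges are computed internally):

```bash
# Min/max timestamp of the messages in a single index (placeholder host and index name).
curl -s "http://elasticsearch.example.com:9200/gl2cluster_114/_search" -d '{
  "size": 0,
  "aggs": {
    "ts_min": { "min": { "field": "timestamp" } },
    "ts_max": { "max": { "field": "timestamp" } }
  }
}'
```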
I'm positive that neither I nor others who have reported the same issue are externally modifying those ranges. It seems to be most prevalent when a new index is created, such as when the maximum number of messages for that index is reached. Is there anything in the server log that could be helpful, or is there a more verbose level of logging that I could enable?
@drewmmiranda Thank you very much for your answer and for your script, it's going to help us a lot. It is indeed probably the only way to do it for now. @joschi Like drewmmiranda, we didn't touch the indices at all.
@joschi I've been keeping an eye on how often this script fires and manually recalculates the index ranges. There does seem to be a pattern. About every 6 hours it detects the range is out of sync.
Time | Action |
---|---|
2015-11-29 11:37:11.995 -06:00 | Retention Strategy Run |
2015-11-29 11:50:02.322 -06:00 | Re-calculating index ranges. (unsure if this is from the retention strategy or not) |
2015-11-29 18:00:01.801 -06:00 | Re-calculating index ranges. (Script ran) |
2015-11-30 00:10:01.418 -06:00 | Re-calculating index ranges. (Script ran) |
2015-11-30 06:20:01.690 -06:00 | Re-calculating index ranges. (Script ran) |
Any ideas? It is probably a coincidence that my timezone offset is also 6 hours.
@drewmmiranda How often are your indices rotated (i.e. which index_retention_strategy and index_rotation_strategy have been configured)? How long do the index range calculations take (you'll find that information in your logs and in the web interface on the System -> Indices page)?
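If it helps, the stored index ranges can also be retrieved directly from the REST API to see what the server currently has (same placeholder credentials as in the script above; the exact response fields may vary between versions):

```bash
# Dump the stored index ranges as JSON (index name, begin/end of the range,
# and when it was calculated, depending on your Graylog version).
curl -s --user username:password \
  "http://graylog.fullqualified.name:12900/system/indices/ranges" | python -m json.tool
```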
@joschi
I also notice inconsistent ranges between indices, where an index contains messages from a larger time frame than it actually should (e.g. index 114 spans from a month ago [30 days?] to 4 days ago).
Index | Messages Range | Size / Count |
---|---|---|
gl2cluster_116 | Contains messages up to a few seconds ago | 5.2GiB / 8,072,963 messages |
gl2cluster_115 | Contains messages from 19 days ago up to 18 hours ago | 11.1GiB / 20,000,602 messages |
gl2cluster_114 | Contains messages from a month ago up to 4 days ago | 12.2GiB / 20,000,054 messages |
gl2cluster_113 | Contains messages from a month ago up to 6 days ago | 16.3GiB / 20,002,007 messages |
gl2cluster_112 | Contains messages from 10 months ago up to 10 days ago | 12.7GiB / 20,001,265 messages |
gl2cluster_111 | Contains messages from 2 months ago up to 12 days ago | 13.0GiB / 20,000,619 messages |
gl2cluster_110 | Contains messages from 2 months ago up to 14 days ago | 11.5GiB / 20,000,581 messages |
gl2cluster_109 | Contains messages from 2 months ago up to 17 days ago | 12.9GiB / 20,002,461 messages |
gl2cluster_108 | Contains messages from a month ago up to 19 days ago | 16.2GiB / 20,002,734 messages |
gl2cluster_107 | Contains messages from 2 months ago up to 22 days ago | 17.2GiB / 20,001,681 messages |
gl2cluster_106 | Contains messages from a month ago up to 24 days ago | 12.9GiB / 20,000,486 messages |
gl2cluster_105 | Contains messages from 2 months ago up to a month ago | 16.2GiB / 20,000,502 messages |
gl2cluster_104 | Contains messages from a month ago up to a month ago | 13.8GiB / 20,000,408 messages |
gl2cluster_103 | Contains messages from 8 years ago up to a month ago | 10.9GiB / 20,001,321 messages |
gl2cluster_102 | Contains messages from 2 months ago up to a month ago | 12.1GiB / 20,000,411 messages |
gl2cluster_101 | Contains messages from 2 months ago up to a month ago | 10.1GiB / 20,000,555 messages |
gl2cluster_100 | Contains messages from 2 months ago up to a month ago | 10.2GiB / 20,000,211 messages |
gl2cluster_99 | Contains messages from 2 months ago up to a month ago | 8.9GiB / 20,001,222 messages |
gl2cluster_98 | Contains messages from 2 months ago up to 2 months ago | 9.0GiB / 20,006,764 messages |
gl2cluster_97 | Contains messages from 2 months ago up to 2 months ago | 9.7GiB / 20,000,146 messages |
gl2cluster_96 | Contains messages from 2 months ago up to 2 months ago | 9.9GiB / 20,000,205 messages |
gl2cluster_95 | Contains messages from 2 months ago up to 2 months ago | 9.3GiB / 20,000,239 messages |
gl2cluster_94 | Contains messages from 3 months ago up to 2 months ago | 9.0GiB / 20,000,666 messages |
gl2cluster_93 | Contains messages from 3 months ago up to 2 months ago | 8.8GiB / 20,000,277 messages |
gl2cluster_92 | Contains messages from a year ago up to 2 months ago | 9.1GiB / 20,000,324 messages |
gl2cluster_91 | Contains messages from 2 months ago up to 2 months ago | 9.1GiB / 20,000,455 messages |
gl2cluster_90 | Contains messages from 2 months ago up to 2 months ago | 8.6GiB / 20,000,072 messages |
gl2cluster_89 | Contains messages from 3 months ago up to 2 months ago | 9.0GiB / 20,000,670 messages |
gl2cluster_88 | Contains messages from 3 months ago up to 3 months ago | 9.1GiB / 20,000,071 messages |
gl2cluster_87 | Contains messages from 3 months ago up to 3 months ago | 8.9GiB / 20,000,390 messages |
gl2cluster_86 | Contains messages from 3 months ago up to 3 months ago | 9.1GiB / 20,000,728 messages |
gl2cluster_85 | Contains messages from 3 months ago up to 3 months ago | 9.3GiB / 20,001,507 messages |
gl2cluster_84 | Contains messages from 3 months ago up to 3 months ago | 9.2GiB / 20,001,574 messages |
gl2cluster_83 | Contains messages from 3 months ago up to 3 months ago | 8.9GiB / 20,000,525 messages |
gl2cluster_82 | Contains messages from 3 months ago up to 3 months ago | 9.1GiB / 20,001,051 messages |
gl2cluster_81 | Contains messages from 3 months ago up to 3 months ago | 8.9GiB / 20,000,601 messages |
gl2cluster_80 | Contains messages from 3 months ago up to 3 months ago | 8.8GiB / 20,000,899 messages |
gl2cluster_79 | Contains messages from 3 months ago up to 3 months ago | 9.1GiB / 20,000,258 messages |
gl2cluster_78 | Contains messages from 3 months ago up to 3 months ago | 8.8GiB / 20,000,916 messages |
gl2cluster_77 | Contains messages from 4 months ago up to 3 months ago | 9.0GiB / 20,000,381 messages |
gl2cluster_76 | Contains messages from 3 months ago up to 3 months ago | 9.0GiB / 20,000,678 messages |
gl2cluster_75 | Contains messages from 4 months ago up to 3 months ago | 8.9GiB / 20,000,981 messages |
gl2cluster_74 | Contains messages from 4 months ago up to 3 months ago | 9.1GiB / 20,000,751 messages |
gl2cluster_73 | Contains messages from 4 months ago up to 4 months ago | 8.8GiB / 20,000,070 messages |
gl2cluster_72 | Contains messages from 4 months ago up to 4 months ago | 8.9GiB / 20,001,329 messages |
gl2cluster_71 | Contains messages from 11 months ago up to 4 months ago | 9.0GiB / 20,001,443 messages |
gl2cluster_70 | Contains messages from 4 months ago up to 4 months ago | 8.9GiB / 20,000,650 messages |
gl2cluster_69 | Contains messages from 4 months ago up to 4 months ago | 8.5GiB / 20,000,028 messages |
gl2cluster_68 | Contains messages from 4 months ago up to 4 months ago | 8.9GiB / 20,000,964 messages |
gl2cluster_67 | Contains messages from 4 months ago up to 4 months ago | 9.1GiB / 20,000,567 messages |
gl2cluster_66 | Contains messages from 4 months ago up to 4 months ago | 9.1GiB / 20,002,624 messages |
gl2cluster_65 | Contains messages from 4 months ago up to 4 months ago | 8.6GiB / 20,000,102 messages |
gl2cluster_64 | Contains messages from 7 months ago up to in a month | 9.2GiB / 20,000,133 messages |
gl2cluster_63 | Contains messages from 4 months ago up to 4 months ago | 9.1GiB / 20,000,556 messages |
gl2cluster_62 | Contains messages from 7 months ago up to 4 months ago | 9.3GiB / 20,000,480 messages |
Something interesting: it appears that my Elasticsearch master shows high CPU spikes at the same 6-hour intervals at which my empty-range detection script detects empty ranges.
I verified that this was not caused by manual range recalculations because the patterns are present for dates prior to implementing the script.
I attempted to Google for Elasticsearch maintenance internals but wasn't able to find anything relevant. Any ideas on what could be running as a refresh or cleanup once every 6 hours?
Same or very similar issue here. Relative searches work, but absolute searches return nothing until we rebuild the indices. This only fixes it for a while. Please let me know if I can provide any info to help debug this. This is a huge issue for us right now.
I noticed this issue stopped occurring when I deleted some older indices and I am no longer at the maximum number of indices. Any chance you could delete a couple of older indices and see if the issue goes away until you hit your max indices count?
This isn't ideal, but the proof-of-concept script I posted above has been working well for me, with the added bonus that I can see a timestamp of when it runs.
The underlying problem should be fixed in Graylog 1.3.3 (#1679).
We recently were not able to search logs using the absolute search, and rebuilding the index ranges fixed the problem. However, we noticed that the same problem occurs in different environments, which scares me since it means that our Prod environment might face it too. I found an API call in the Graylog docs to rebuild the index ranges, which could help me automate the recovery. Nevertheless, I need a way to detect when the index ranges are out of sync so I can resync them. According to the Graylog docs, "Graylog will notify you when the stored metadata about index time ranges has run out of sync", but I actually don't see any notification about it. How can I know that the index ranges are not in sync?
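The notification the docs refer to should also be visible via the REST API, so one way to check for it (and to automate the recovery) would be something like this rough sketch; the host, credentials, and the notification type string I grep for are guesses on my part:

```bash
#!/bin/bash
# Poll Graylog's notifications via the REST API and trigger an index range rebuild
# if an index-ranges notification shows up. The "index_ranges" type string checked
# below is an assumption; adjust it to whatever your server actually reports.
base="http://graylog.example.com:12900"
auth="username:password"

notifications=$(curl -s --user "$auth" "${base}/system/notifications")

if echo "$notifications" | grep -qi "index_ranges"; then
    echo "Index ranges appear to be out of sync, triggering a rebuild"
    curl -s -X POST --user "$auth" "${base}/system/indices/ranges/rebuild"
else
    echo "No index range notification found"
fi
```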