Graylog2 / graylog2-server

Free and open log management
https://www.graylog.org

How to detect index out of sync #1578

Closed lchaubert closed 8 years ago

lchaubert commented 8 years ago

We recently were unable to search logs using the absolute search, and rebuilding the index ranges fixed the problem. However, we noticed that the same problem occurs in different environments, which worries me since it means that our production environment might face it too. I found an API call in the Graylog docs to rebuild the index ranges, which could help me automate the recovery. Nevertheless, I need a way to detect when the index ranges are out of sync so I can resync them. According to the Graylog docs, "Graylog will notify you when the stored metadata about index time ranges has run out of sync", but I actually don't see any notification about it. How can I know that the index ranges are not in sync?

drewmmiranda commented 8 years ago

My thought so far is to use the REST API to query results with an absolute time range (use whatever scripting language you like to auto-generate the search timestamps) and check whether the results are empty, as they would be when searching via the web interface. If they are empty, have a condition that executes an index range rebuild.
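A minimal sketch of that detection step, using a canned JSON response in place of a live curl call so the parsing logic is visible on its own. The sample payload is made up for illustration; only the `total_results` field name is taken from the Graylog search response used later in this thread.

```shell
#!/bin/bash
# Sketch: decide whether an index-range rebuild is needed based on the
# "total_results" field of a Graylog search response. The response below is
# a hard-coded sample standing in for the output of a real curl call.
response='{"query":"*","total_results":0,"time":3,"messages":[]}'

# Extract the numeric total_results value (GNU grep with -P/PCRE assumed).
total=$(echo "$response" | grep -oP '"total_results":\d+' | grep -oP '[0-9]+')

if [ "$total" = "0" ]; then
    echo "range empty, recalc needed"   # would trigger the rebuild call here
else
    echo "range ok"
fi
```

In the real script the `response` variable would come from a curl call against the search API, as in the proof of concept below.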

Ideally the bug would be fixed instead. There are quite a few issues open on it and some supposed fixes, but nothing seems to correct this so far.

drewmmiranda commented 8 years ago

Took a stab at a proof of concept that I just tested with positive results. Sorry it is crude!

#!/bin/bash

graylogServerHttp="http"
graylogServerHostName="graylog.fullqualified.name"
graylogServerPort="12900"

triggerAlert=0
# Search the last hour and extract the total_results count from the response.
indexrangeresult=$(curl -s --user username:password "${graylogServerHttp}://${graylogServerHostName}:${graylogServerPort}/search/universal/keyword?query=*&keyword=last%201%20hour&limit=100" | grep -oP '"total_results":\d+' | grep -oP '[0-9]+')
if [ $? -ne 0 ]; then
    echo "graylog-server not running"
else
    if [ "$indexrangeresult" = "0" ]; then
        echo "range empty, recalc"
        triggerAlert=1
    else
        echo "range ok"
    fi
fi

if [ $triggerAlert -eq 1 ]; then
    echo "Running curl to recalculate ranges"
    curl -X POST --user username:password "${graylogServerHttp}://${graylogServerHostName}:${graylogServerPort}/system/indices/ranges/rebuild"
fi

joschi commented 8 years ago

The index ranges (which are really just the min and max timestamps of messages in this index) usually don't change once they've been calculated when an index is rotated.

Either you're modifying those indices externally (which you really shouldn't do) or this is another bug that we couldn't reproduce until now. In the latter case, recalculating the index ranges simply increases load on your Elasticsearch cluster and doesn't help find or fix the root cause of the problem you have.

drewmmiranda commented 8 years ago

I'm positive that I, and others who have reported the same issue, are not externally modifying those ranges. It seems to be most prevalent when a new index is created, such as when the maximum number of messages for that index is reached. Is there anything in the server log that could be helpful, or is there a more verbose level of logging that I could enable?

lchaubert commented 8 years ago

@drewmmiranda Thank you very much for your answer and for your script, it's going to help us a lot. It is indeed probably the only way to do it for now. @joschi like drewmmiranda, we didn't touch the indices at all.

drewmmiranda commented 8 years ago

@joschi I've been keeping an eye on how often this script fires and manually recalculates the index ranges. There does seem to be a pattern. About every 6 hours it detects the range is out of sync.

Time Action
2015-11-29 11:37:11.995 -06:00 Retention Strategy Run
2015-11-29 11:50:02.322 -06:00 Re-calculating index ranges. (unsure if this is from the retention strategy or not)
2015-11-29 18:00:01.801 -06:00 Re-calculating index ranges. (Script ran)
2015-11-30 00:10:01.418 -06:00 Re-calculating index ranges. (Script ran)
2015-11-30 06:20:01.690 -06:00 Re-calculating index ranges. (Script ran)

Any ideas? It is probably a coincidence that my timezone offset is also 6 hours.

joschi commented 8 years ago

@drewmmiranda How often are your indices rotated (i. e. which index_retention_strategy and index_rotation_strategy has been configured)? How long do the index range calculations take (you'll find that information in your logs and in the web interface on the System -> Indices page)?
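For context, these strategies are set in Graylog's server.conf. A count-based rotation producing ~20-million-message indices like the ones listed in the next comment would look roughly like this (key names per the Graylog 1.x example configuration; the values are illustrative assumptions, not the reporter's actual settings):

```ini
# Rotate the active write index after a fixed number of messages.
rotation_strategy = count
elasticsearch_max_docs_per_index = 20000000

# How many indices to keep, and what to do with the oldest when over the limit.
elasticsearch_max_number_of_indices = 20
retention_strategy = delete
```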

drewmmiranda commented 8 years ago

@joschi

I also notice inconsistent ranges between indexes, where an index contains messages from a larger time frame than it actually should (e.g. index 114 ranges from a month ago [30 days?] to 4 days ago).

Index    Message Range    Size / Count
gl2cluster_116 Contains messages up to a few seconds ago 5.2GiB / 8,072,963 messages
gl2cluster_115 Contains messages from 19 days ago up to 18 hours ago 11.1GiB / 20,000,602 messages
gl2cluster_114 Contains messages from a month ago up to 4 days ago 12.2GiB / 20,000,054 messages
gl2cluster_113 Contains messages from a month ago up to 6 days ago 16.3GiB / 20,002,007 messages
gl2cluster_112 Contains messages from 10 months ago up to 10 days ago 12.7GiB / 20,001,265 messages
gl2cluster_111 Contains messages from 2 months ago up to 12 days ago 13.0GiB / 20,000,619 messages
gl2cluster_110 Contains messages from 2 months ago up to 14 days ago 11.5GiB / 20,000,581 messages
gl2cluster_109 Contains messages from 2 months ago up to 17 days ago 12.9GiB / 20,002,461 messages
gl2cluster_108 Contains messages from a month ago up to 19 days ago 16.2GiB / 20,002,734 messages
gl2cluster_107 Contains messages from 2 months ago up to 22 days ago 17.2GiB / 20,001,681 messages
gl2cluster_106 Contains messages from a month ago up to 24 days ago 12.9GiB / 20,000,486 messages
gl2cluster_105 Contains messages from 2 months ago up to a month ago 16.2GiB / 20,000,502 messages
gl2cluster_104 Contains messages from a month ago up to a month ago 13.8GiB / 20,000,408 messages
gl2cluster_103 Contains messages from 8 years ago up to a month ago 10.9GiB / 20,001,321 messages
gl2cluster_102 Contains messages from 2 months ago up to a month ago 12.1GiB / 20,000,411 messages
gl2cluster_101 Contains messages from 2 months ago up to a month ago 10.1GiB / 20,000,555 messages
gl2cluster_100 Contains messages from 2 months ago up to a month ago 10.2GiB / 20,000,211 messages
gl2cluster_99 Contains messages from 2 months ago up to a month ago 8.9GiB / 20,001,222 messages
gl2cluster_98 Contains messages from 2 months ago up to 2 months ago 9.0GiB / 20,006,764 messages
gl2cluster_97 Contains messages from 2 months ago up to 2 months ago 9.7GiB / 20,000,146 messages
gl2cluster_96 Contains messages from 2 months ago up to 2 months ago 9.9GiB / 20,000,205 messages
gl2cluster_95 Contains messages from 2 months ago up to 2 months ago 9.3GiB / 20,000,239 messages
gl2cluster_94 Contains messages from 3 months ago up to 2 months ago 9.0GiB / 20,000,666 messages
gl2cluster_93 Contains messages from 3 months ago up to 2 months ago 8.8GiB / 20,000,277 messages
gl2cluster_92 Contains messages from a year ago up to 2 months ago 9.1GiB / 20,000,324 messages
gl2cluster_91 Contains messages from 2 months ago up to 2 months ago 9.1GiB / 20,000,455 messages
gl2cluster_90 Contains messages from 2 months ago up to 2 months ago 8.6GiB / 20,000,072 messages
gl2cluster_89 Contains messages from 3 months ago up to 2 months ago 9.0GiB / 20,000,670 messages
gl2cluster_88 Contains messages from 3 months ago up to 3 months ago 9.1GiB / 20,000,071 messages
gl2cluster_87 Contains messages from 3 months ago up to 3 months ago 8.9GiB / 20,000,390 messages
gl2cluster_86 Contains messages from 3 months ago up to 3 months ago 9.1GiB / 20,000,728 messages
gl2cluster_85 Contains messages from 3 months ago up to 3 months ago 9.3GiB / 20,001,507 messages
gl2cluster_84 Contains messages from 3 months ago up to 3 months ago 9.2GiB / 20,001,574 messages
gl2cluster_83 Contains messages from 3 months ago up to 3 months ago 8.9GiB / 20,000,525 messages
gl2cluster_82 Contains messages from 3 months ago up to 3 months ago 9.1GiB / 20,001,051 messages
gl2cluster_81 Contains messages from 3 months ago up to 3 months ago 8.9GiB / 20,000,601 messages
gl2cluster_80 Contains messages from 3 months ago up to 3 months ago 8.8GiB / 20,000,899 messages
gl2cluster_79 Contains messages from 3 months ago up to 3 months ago 9.1GiB / 20,000,258 messages
gl2cluster_78 Contains messages from 3 months ago up to 3 months ago 8.8GiB / 20,000,916 messages
gl2cluster_77 Contains messages from 4 months ago up to 3 months ago 9.0GiB / 20,000,381 messages
gl2cluster_76 Contains messages from 3 months ago up to 3 months ago 9.0GiB / 20,000,678 messages
gl2cluster_75 Contains messages from 4 months ago up to 3 months ago 8.9GiB / 20,000,981 messages
gl2cluster_74 Contains messages from 4 months ago up to 3 months ago 9.1GiB / 20,000,751 messages
gl2cluster_73 Contains messages from 4 months ago up to 4 months ago 8.8GiB / 20,000,070 messages
gl2cluster_72 Contains messages from 4 months ago up to 4 months ago 8.9GiB / 20,001,329 messages
gl2cluster_71 Contains messages from 11 months ago up to 4 months ago 9.0GiB / 20,001,443 messages
gl2cluster_70 Contains messages from 4 months ago up to 4 months ago 8.9GiB / 20,000,650 messages
gl2cluster_69 Contains messages from 4 months ago up to 4 months ago 8.5GiB / 20,000,028 messages
gl2cluster_68 Contains messages from 4 months ago up to 4 months ago 8.9GiB / 20,000,964 messages
gl2cluster_67 Contains messages from 4 months ago up to 4 months ago 9.1GiB / 20,000,567 messages
gl2cluster_66 Contains messages from 4 months ago up to 4 months ago 9.1GiB / 20,002,624 messages
gl2cluster_65 Contains messages from 4 months ago up to 4 months ago 8.6GiB / 20,000,102 messages
gl2cluster_64 Contains messages from 7 months ago up to in a month 9.2GiB / 20,000,133 messages
gl2cluster_63 Contains messages from 4 months ago up to 4 months ago 9.1GiB / 20,000,556 messages
gl2cluster_62 Contains messages from 7 months ago up to 4 months ago 9.3GiB / 20,000,480 messages
drewmmiranda commented 8 years ago

Something interesting: it appears that my Elasticsearch master shows high CPU spikes at the same 6-hour intervals at which my empty-range detection script is detecting empty ranges.

I verified that this was not caused by manual range recalculations because the patterns are present for dates prior to implementing the script.

I tried googling for Elasticsearch maintenance intervals but wasn't able to find anything relevant. Any ideas on what could be running as a refresh or cleanup once every 6 hours?

rrtj3 commented 8 years ago

Same or very similar issue here. Relative searches work but absolute returns nothing until we rebuild the indices. This only fixes it for a while. Please let me know if I can provide any info to help debug this. This is a huge issue for us right now.

drewmmiranda commented 8 years ago

I noticed this issue stopped occurring when I deleted some older indices and I am no longer at the maximum number of indices. Any chance you could delete a couple of older indices and see if the issue goes away until you hit your max indices count?

This isn't ideal, but the proof-of-concept script I posted above has been working well for me, with the added bonus that I can see a timestamp of when it runs.
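One way to run such a script on a schedule is a cron entry. The path and log file below are hypothetical; the 10-minute cadence is an assumption that matches the :00/:10/:20 run timestamps in the log excerpt earlier in the thread:

```
# Run the range check every 10 minutes; script path and log path are examples.
*/10 * * * * /usr/local/bin/graylog-check-ranges.sh >> /var/log/graylog-check-ranges.log 2>&1
```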

joschi commented 8 years ago

The underlying problem should be fixed in Graylog 1.3.3 (#1679).