Graylog2 / graylog2-server

Free and open log management
https://www.graylog.org

How to detect index out of sync #1578

Closed lchaubert closed 8 years ago

lchaubert commented 8 years ago

We recently were unable to search logs using the absolute search, and rebuilding the index ranges fixed the problem. However, we noticed that the same problem occurs in different environments, which worries me since it means that our production environment might face it too. I found an API call in the Graylog docs to rebuild the index ranges, which could help me automate the recovery. Nevertheless, I need a way to detect when the index ranges are out of sync so I can resync them. According to the Graylog docs, "Graylog will notify you when the stored metadata about index time ranges has run out of sync", but I actually don't see any notification about it. How can I know that the index ranges are not in sync?

drewmmiranda commented 8 years ago

My thought so far is to use the REST API to query results with an absolute time range (use whatever scripting language you like to auto-generate the search timestamps) and check whether the results are empty, as they would be when searching via the web interface. If they are empty, have a condition that executes an index range rebuild.
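A minimal sketch of that detection step, using a canned JSON response in place of a live curl call so the parsing logic is visible on its own. The sample payload is made up for illustration; only the `total_results` field name is taken from the Graylog search response used later in this thread.

```shell
#!/bin/bash
# Sketch: decide whether an index-range rebuild is needed based on the
# "total_results" field of a Graylog search response. The response below is
# a hard-coded sample standing in for the output of a real curl call.
response='{"query":"*","total_results":0,"time":3,"messages":[]}'

# Extract the numeric total_results value (GNU grep with -P/PCRE assumed).
total=$(echo "$response" | grep -oP '"total_results":\d+' | grep -oP '[0-9]+')

if [ "$total" = "0" ]; then
    echo "range empty, recalc needed"   # would trigger the rebuild call here
else
    echo "range ok"
fi
```

In the real script the `response` variable would come from a curl call against the search API, as in the proof of concept below.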

Ideally the bug would be fixed instead. There are quite a few issues open on it and some supposed fixes, but nothing seems to correct this so far.

drewmmiranda commented 8 years ago

Took a stab at a proof of concept that I just tested with positive results. Sorry it is crude!

#!/bin/bash

graylogServerHttp="http"
graylogServerHostName="graylog.fullqualified.name"
graylogServerPort="12900"

triggerAlert=0
# Search the last hour and extract the total_results count from the response.
indexrangeresult=$(curl -s --user username:password "${graylogServerHttp}://${graylogServerHostName}:${graylogServerPort}/search/universal/keyword?query=*&keyword=last%201%20hour&limit=100" | grep -oP '"total_results":\d+' | grep -oP '[0-9]+')
if [ $? -ne 0 ]; then
    echo "graylog-server not running"
else
    if [ "$indexrangeresult" = "0" ]; then
        echo "range empty, recalc"
        triggerAlert=1
    else
        echo "range ok"
    fi
fi

if [ $triggerAlert -eq 1 ]; then
    echo "Running curl to recalculate ranges"
    curl -X POST --user username:password "${graylogServerHttp}://${graylogServerHostName}:${graylogServerPort}/system/indices/ranges/rebuild"
fi

joschi commented 8 years ago

The index ranges (which are really just the min and max timestamps of messages in this index) usually don't change once they've been calculated when an index is rotated.

Either you're modifying those indices externally (which you really shouldn't do) or this is another bug that we couldn't reproduce until now. In the latter case, recalculating the index ranges simply increases load on your Elasticsearch cluster and doesn't help find or fix the root cause of the problem you have.

drewmmiranda commented 8 years ago

I'm positive that I, and others who have reported the same issue, are not externally modifying those ranges. It seems to be most prevalent when a new index is created, such as when the maximum number of messages for that index is reached. Is there anything in the server log that could be helpful, or is there a more verbose level of logging that I could enable?

lchaubert commented 8 years ago

@drewmmiranda Thank you very much for your answer and for your script, it's going to help us a lot. It is indeed probably the only way to do it for now. @joschi like drewmmiranda, we didn't touch the indices at all.

drewmmiranda commented 8 years ago

@joschi I've been keeping an eye on how often this script fires and manually recalculates the index ranges. There does seem to be a pattern. About every 6 hours it detects the range is out of sync.

Time Action
2015-11-29 11:37:11.995 -06:00 Retention Strategy Run
2015-11-29 11:50:02.322 -06:00 Re-calculating index ranges. (unsure if this is from the retention strategy or not)
2015-11-29 18:00:01.801 -06:00 Re-calculating index ranges. (Script ran)
2015-11-30 00:10:01.418 -06:00 Re-calculating index ranges. (Script ran)
2015-11-30 06:20:01.690 -06:00 Re-calculating index ranges. (Script ran)

Any ideas? It is probably a coincidence that my timezone offset is also 6 hours.

joschi commented 8 years ago

@drewmmiranda How often are your indices rotated (i. e. which index_retention_strategy and index_rotation_strategy has been configured)? How long do the index range calculations take (you'll find that information in your logs and in the web interface on the System -> Indices page)?
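For context, these strategies are set in Graylog's server.conf. A count-based rotation producing ~20-million-message indices like the ones listed in the next comment would look roughly like this (key names per the Graylog 1.x example configuration; the values are illustrative assumptions, not the reporter's actual settings):

```ini
# Rotate the active write index after a fixed number of messages.
rotation_strategy = count
elasticsearch_max_docs_per_index = 20000000

# How many indices to keep, and what to do with the oldest when over the limit.
elasticsearch_max_number_of_indices = 20
retention_strategy = delete
```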

drewmmiranda commented 8 years ago

@joschi

I also notice inconsistent ranges between indexes, where an index contains messages from a larger time frame than it actually should (e.g. index 114 ranges from a month ago [30 days?] to 4 days ago).

Index    Message Range    Size / Count
gl2cluster_116 Contains messages up to a few seconds ago 5.2GiB / 8,072,963 messages
gl2cluster_115 Contains messages from 19 days ago up to 18 hours ago 11.1GiB / 20,000,602 messages
gl2cluster_114 Contains messages from a month ago up to 4 days ago 12.2GiB / 20,000,054 messages
gl2cluster_113 Contains messages from a month ago up to 6 days ago 16.3GiB / 20,002,007 messages
gl2cluster_112 Contains messages from 10 months ago up to 10 days ago 12.7GiB / 20,001,265 messages
gl2cluster_111 Contains messages from 2 months ago up to 12 days ago 13.0GiB / 20,000,619 messages
gl2cluster_110 Contains messages from 2 months ago up to 14 days ago 11.5GiB / 20,000,581 messages
gl2cluster_109 Contains messages from 2 months ago up to 17 days ago 12.9GiB / 20,002,461 messages
gl2cluster_108 Contains messages from a month ago up to 19 days ago 16.2GiB / 20,002,734 messages
gl2cluster_107 Contains messages from 2 months ago up to 22 days ago 17.2GiB / 20,001,681 messages
gl2cluster_106 Contains messages from a month ago up to 24 days ago 12.9GiB / 20,000,486 messages
gl2cluster_105 Contains messages from 2 months ago up to a month ago 16.2GiB / 20,000,502 messages
gl2cluster_104 Contains messages from a month ago up to a month ago 13.8GiB / 20,000,408 messages
gl2cluster_103 Contains messages from 8 years ago up to a month ago 10.9GiB / 20,001,321 messages
gl2cluster_102 Contains messages from 2 months ago up to a month ago 12.1GiB / 20,000,411 messages
gl2cluster_101 Contains messages from 2 months ago up to a month ago 10.1GiB / 20,000,555 messages
gl2cluster_100 Contains messages from 2 months ago up to a month ago 10.2GiB / 20,000,211 messages
gl2cluster_99 Contains messages from 2 months ago up to a month ago 8.9GiB / 20,001,222 messages
gl2cluster_98 Contains messages from 2 months ago up to 2 months ago 9.0GiB / 20,006,764 messages
gl2cluster_97 Contains messages from 2 months ago up to 2 months ago 9.7GiB / 20,000,146 messages
gl2cluster_96 Contains messages from 2 months ago up to 2 months ago 9.9GiB / 20,000,205 messages
gl2cluster_95 Contains messages from 2 months ago up to 2 months ago 9.3GiB / 20,000,239 messages
gl2cluster_94 Contains messages from 3 months ago up to 2 months ago 9.0GiB / 20,000,666 messages
gl2cluster_93 Contains messages from 3 months ago up to 2 months ago 8.8GiB / 20,000,277 messages
gl2cluster_92 Contains messages from a year ago up to 2 months ago 9.1GiB / 20,000,324 messages
gl2cluster_91 Contains messages from 2 months ago up to 2 months ago 9.1GiB / 20,000,455 messages
gl2cluster_90 Contains messages from 2 months ago up to 2 months ago 8.6GiB / 20,000,072 messages
gl2cluster_89 Contains messages from 3 months ago up to 2 months ago 9.0GiB / 20,000,670 messages
gl2cluster_88 Contains messages from 3 months ago up to 3 months ago 9.1GiB / 20,000,071 messages
gl2cluster_87 Contains messages from 3 months ago up to 3 months ago 8.9GiB / 20,000,390 messages
gl2cluster_86 Contains messages from 3 months ago up to 3 months ago 9.1GiB / 20,000,728 messages
gl2cluster_85 Contains messages from 3 months ago up to 3 months ago 9.3GiB / 20,001,507 messages
gl2cluster_84 Contains messages from 3 months ago up to 3 months ago 9.2GiB / 20,001,574 messages
gl2cluster_83 Contains messages from 3 months ago up to 3 months ago 8.9GiB / 20,000,525 messages
gl2cluster_82 Contains messages from 3 months ago up to 3 months ago 9.1GiB / 20,001,051 messages
gl2cluster_81 Contains messages from 3 months ago up to 3 months ago 8.9GiB / 20,000,601 messages
gl2cluster_80 Contains messages from 3 months ago up to 3 months ago 8.8GiB / 20,000,899 messages
gl2cluster_79 Contains messages from 3 months ago up to 3 months ago 9.1GiB / 20,000,258 messages
gl2cluster_78 Contains messages from 3 months ago up to 3 months ago 8.8GiB / 20,000,916 messages
gl2cluster_77 Contains messages from 4 months ago up to 3 months ago 9.0GiB / 20,000,381 messages
gl2cluster_76 Contains messages from 3 months ago up to 3 months ago 9.0GiB / 20,000,678 messages
gl2cluster_75 Contains messages from 4 months ago up to 3 months ago 8.9GiB / 20,000,981 messages
gl2cluster_74 Contains messages from 4 months ago up to 3 months ago 9.1GiB / 20,000,751 messages
gl2cluster_73 Contains messages from 4 months ago up to 4 months ago 8.8GiB / 20,000,070 messages
gl2cluster_72 Contains messages from 4 months ago up to 4 months ago 8.9GiB / 20,001,329 messages
gl2cluster_71 Contains messages from 11 months ago up to 4 months ago 9.0GiB / 20,001,443 messages
gl2cluster_70 Contains messages from 4 months ago up to 4 months ago 8.9GiB / 20,000,650 messages
gl2cluster_69 Contains messages from 4 months ago up to 4 months ago 8.5GiB / 20,000,028 messages
gl2cluster_68 Contains messages from 4 months ago up to 4 months ago 8.9GiB / 20,000,964 messages
gl2cluster_67 Contains messages from 4 months ago up to 4 months ago 9.1GiB / 20,000,567 messages
gl2cluster_66 Contains messages from 4 months ago up to 4 months ago 9.1GiB / 20,002,624 messages
gl2cluster_65 Contains messages from 4 months ago up to 4 months ago 8.6GiB / 20,000,102 messages
gl2cluster_64 Contains messages from 7 months ago up to in a month 9.2GiB / 20,000,133 messages
gl2cluster_63 Contains messages from 4 months ago up to 4 months ago 9.1GiB / 20,000,556 messages
gl2cluster_62 Contains messages from 7 months ago up to 4 months ago 9.3GiB / 20,000,480 messages
drewmmiranda commented 8 years ago

Something interesting: it appears that my Elasticsearch master shows high CPU spikes at the same 6-hour intervals at which my empty-range detection script is detecting empty ranges.

I verified that this was not caused by manual range recalculations because the patterns are present for dates prior to implementing the script.

I tried googling for Elasticsearch maintenance intervals but wasn't able to find anything relevant. Any ideas on what could be running as a refresh or cleanup once every 6 hours?

rrtj3 commented 8 years ago

Same or very similar issue here. Relative searches work but absolute returns nothing until we rebuild the indices. This only fixes it for a while. Please let me know if I can provide any info to help debug this. This is a huge issue for us right now.

drewmmiranda commented 8 years ago

I noticed this issue stopped occurring when I deleted some older indices and I am no longer at the maximum number of indices. Any chance you could delete a couple of older indices and see if the issue goes away until you hit your max indices count?

This isn't ideal, but the proof-of-concept script I posted above has been working well for me, with the added bonus that I can see a timestamp of when it runs.
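One way to run such a script on a schedule is a cron entry. The path and log file below are hypothetical; the 10-minute cadence is an assumption that matches the :00/:10/:20 run timestamps in the log excerpt earlier in the thread:

```
# Run the range check every 10 minutes; script path and log path are examples.
*/10 * * * * /usr/local/bin/graylog-check-ranges.sh >> /var/log/graylog-check-ranges.log 2>&1
```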

joschi commented 8 years ago

The underlying problem should be fixed in Graylog 1.3.3 (#1679).