CharlieEriksen opened this issue 8 years ago
@CharlieEriksen Take a look at http://docs.graylog.org/en/2.0/pages/configuration/elasticsearch.html#custom-index-mappings for information about custom index mappings in Elasticsearch (a short sketch of the idea follows below).
I do agree, though, that we need to make mapping errors way more obvious in the UI so that users can take action and don't have to check the logs all the time.
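For reference, the custom mapping from that page boils down to PUTting an index template into Elasticsearch so the field type is pinned before any document arrives. A minimal sketch using Python's `requests` (the URL, template name, and `response_time` field are placeholders, not anything from this report):

```python
import json
import requests

ES_URL = "http://localhost:9200"  # assumption: local single-node Elasticsearch

template = {
    # Match all Graylog-managed indices (default prefix is "graylog_").
    "template": "graylog_*",
    "mappings": {
        # Graylog stores messages under the "message" mapping type.
        "message": {
            "properties": {
                # Pin the type explicitly so it isn't guessed from whichever
                # document happens to reach a fresh index first.
                "response_time": {"type": "integer"},
            }
        }
    },
}

resp = requests.put(
    ES_URL + "/_template/graylog-custom-mapping",
    headers={"Content-Type": "application/json"},
    data=json.dumps(template),
)
resp.raise_for_status()
```

Note that a template only affects indices created after it is registered, so it takes effect on the next index rotation, not on existing data.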
Well, I sort of see a few things that probably should be fixed:
Bump. Just ran into this again. Scenario:
This seems like a pretty bad issue if you can't gracefully recover from a silent extractor error without deleting an input.
In my opinion there are two separate issues here. The first is, naturally, improving user feedback and handling mapping errors.
The second is the way the appliance is set up for cyclical logging:
I have had some very bad experiences with Graylog going into a logging coma while trying to handle the logs created by its own log handling. It does not matter how powerful a setup you have; that situation takes everything down in a couple of seconds.
Naturally, every situation that triggers this phenomenon is a bug and must be handled as such. However, being prone to behavior that escalates the impact of said bugs is a design flaw. It would be advisable to consider design changes that mitigate the effects of cyclical logging situations.
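To make the suggested mitigation concrete: any component that re-ingests Graylog's own logs could sit behind a simple rate guard, so a burst of self-generated errors sheds load instead of snowballing. A generic sketch (this is not a Graylog API; the class, limits, and `ingest` hook are all hypothetical):

```python
import time


class TokenBucket:
    """Allow at most `rate` messages per second, with bursts up to
    `capacity`; everything beyond that is shed."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Hypothetical hook: cap ingestion of Graylog's own log files so an
# extractor-error storm cannot feed back into itself at ~1000 lines/sec.
self_log_guard = TokenBucket(rate=50, capacity=200)

def ingest(line: str) -> None:
    if not self_log_guard.allow():
        return  # shed load instead of amplifying the feedback loop
    print(line)  # stand-in for handing the line to the real pipeline
```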
There are definitely several issues at play here. When talking to Lennart, he suggested filing it all as one issue, though.
I've continued to experience this issue repeatedly, though, especially the one listed above, where a bad extractor puts the Graylog instance into an unrecoverable situation that can only be fixed by deleting the input and recreating it. This makes mapping fields to integers REALLY DANGEROUS, since it carries such a high risk of forcing downtime on the cluster; a concrete demonstration follows below.
It also seems like a big issue that changes to extractors do not apply to messages already in the journal (at least that's my observation of how it works).
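To show what makes the integer case so dangerous, here is a standalone demonstration against a bare Elasticsearch (the `mapping-demo` index and `took_ms` field are made up for illustration): once a field is mapped as an integer, any later message carrying a non-numeric value is rejected with a mapper parsing error, and the mapping of an existing index cannot be changed after the fact.

```python
import json
import requests

ES = "http://localhost:9200"
HEADERS = {"Content-Type": "application/json"}

# Create a demo index whose "took_ms" field is mapped as an integer.
# In Graylog the mapping is instead fixed by whichever document reaches
# the active "graylog_N" index first.
requests.put(ES + "/mapping-demo", headers=HEADERS, data=json.dumps({
    "mappings": {"message": {"properties": {"took_ms": {"type": "integer"}}}}
}))

ok = requests.post(ES + "/mapping-demo/message", headers=HEADERS,
                   data=json.dumps({"took_ms": 42}))
print(ok.status_code)   # 201: integer value is accepted

bad = requests.post(ES + "/mapping-demo/message", headers=HEADERS,
                    data=json.dumps({"took_ms": "fast"}))
print(bad.status_code)  # 400: mapper parsing error, the document is rejected
```

Every rejected document is a lost message unless something catches the error, which is exactly the kind of rejection that filled the Elasticsearch logs in the report below.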
Problem description
A series of really unfortunate events occurred today on a Graylog instance I was working on, which exposed an interesting set of failure conditions.
First, an extractor was defined which had two components:
Normally I don't think this should have been a big issue, as I imagine extraction failures should be tolerable. But what then occurred is that, because of all these extractor errors, the Elasticsearch logs started filling up with exceptions, which were then picked up by the default appliance input.
Now, the default appliance input does not seem to handle multi-line exception messages, so each failed attempt at parsing a message generated 10-20 log lines from the Elasticsearch logs. This quickly put the appliance into overtime, with the process and output buffers at 100% and the Elasticsearch log generating close to 1000 lines per second.
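For context on the multiplication effect: a Java stack trace is one logical event spread over 10-20 physical lines, and a line-oriented input turns each line into its own message. The usual heuristic for collapsing them, sketched standalone in Python (this is not what the appliance input does; it illustrates the missing behavior):

```python
import re

# Continuation lines of a Java stack trace: indented frames ("\tat ..."),
# "Caused by: ..." chains, and "... 12 more" elisions.
CONT = re.compile(r"^(\s+|Caused by:|\.\.\.)")

def collapse_multiline(lines):
    """Merge stack-trace continuation lines into the preceding message,
    so one exception becomes one event instead of 10-20."""
    buf = []
    for line in lines:
        if buf and CONT.match(line):
            buf.append(line)
        else:
            if buf:
                yield "\n".join(buf)
            buf = [line]
    if buf:
        yield "\n".join(buf)

log = [
    'MapperParsingException: failed to parse [took_ms]',
    '\tat org.elasticsearch.index.mapper.core.NumberFieldMapper.parse(...)',
    'Caused by: java.lang.NumberFormatException: For input string: "fast"',
    'next, unrelated log line',
]
print(list(collapse_multiline(log)))  # two events, not four messages
```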
This meant the journal grew very large very quickly. The extractors were still failing and generating more logs. To stop this, we tried to remove the extractors that were throwing exceptions, but the exceptions kept occurring, seemingly because the extractors were being "cached" while parsing the journal. We also stopped the input itself.
We had to remove the actual input and restart graylog-server for the exceptions to stop occurring, so that the extractors were no longer applied to the journal backlog. Once the journal was caught up, we could resume normal operation.
In addition, in the scenario where datetimes fail to parse correctly, the messages are silently dropped, lots of exception messages are generated, and again the logs balloon.
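If the drop happens at the Elasticsearch indexing stage, one possible mitigation (a sketch, not something Graylog ships) is Elasticsearch's `ignore_malformed` mapping parameter: an unparseable value then drops only that field rather than rejecting the whole document. Again with hypothetical names:

```python
import json
import requests

ES_URL = "http://localhost:9200"  # assumption: local Elasticsearch

# Hypothetical template: a malformed timestamp drops only the field,
# so the message itself is still indexed and searchable.
template = {
    "template": "graylog_*",
    "mappings": {
        "message": {
            "properties": {
                "event_time": {"type": "date", "ignore_malformed": True},
            }
        }
    },
}

resp = requests.put(
    ES_URL + "/_template/graylog-lenient-dates",
    headers={"Content-Type": "application/json"},
    data=json.dumps(template),
)
resp.raise_for_status()
```

The trade-off is silent field loss instead of silent message loss, so it only papers over the parsing problem; surfacing the error in the UI, as discussed above, is still the real fix.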
Steps to reproduce the problem 1
Steps to reproduce the problem 2