Graylog2 / graylog2-server

Free and open log management
https://www.graylog.org

Run-away load when extractors fail casting on appliances #2423

Open CharlieEriksen opened 8 years ago

CharlieEriksen commented 8 years ago

Problem description

A series of really unfortunate events occurred today on a Graylog instance I was working on, which exposed an interesting set of failure conditions.

Firstly, an extractor was defined, which had 2 components:

  1. It extracted a timestamp from a log4net file, which randomly failed to parse
  2. It had a field defined as {DATA:level;string}, but the Graylog 2.0.1 OVA apparently already defines level as integer via the default appliance input extractors, causing lots of exceptions because values like DEBUG, INFO, or ERROR cannot be cast to integer (a mapping-check sketch follows this list)
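
For context, this is roughly how one can check which type Elasticsearch has already mapped for the field; the index alias, host, and port below are assumptions for a stock single-node appliance:

```sh
# Ask Elasticsearch which type is already mapped for "level" in the current
# write index (graylog_deflector is the default write alias; host/port assumed).
curl -s 'http://localhost:9200/graylog_deflector/_mapping/field/level?pretty'
```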

Normally I don't think this would have been a big issue, since I imagine extraction failures are supposed to be harmless. But because of all these extractor errors, the Elasticsearch logs started filling up with exceptions, which were then picked up by the default appliance input.

Now, the default appliance input does not seem to handle multi-line exception messages, so each failed attempt to index a message generated 10-20 separate log lines from the Elasticsearch logs. This suddenly put the appliance into overdrive, with the process and output buffers at 100% and the Elasticsearch log generating close to 1,000 lines per second.

This meant that the journal grew very large very quickly. The extractors were still failing and generating more logs. To stop this, we tried to remove the extractors that were throwing exceptions, but the exceptions kept occurring, seemingly because the extractors are "cached" when the journal is replayed. We also stopped the input itself.

We had to remove the actual input and restart graylog-server before the exceptions stopped, since only then were the extractors no longer applied to the journal backlog. Once the journal had caught up, we could resume normal operation.

In addition, in the scenario where datetimes fail to parse correctly, the messages are silently dropped, lots of exception messages are generated, and again the logs become huge.

Steps to reproduce the problem 1

  1. Create an extractor that defines a field as string when that field already exists defined as integer
  2. Start pushing messages at this input. At this point, you should get Elasticsearch casting/conversion errors
  3. You should now see a lot of exceptions being read back in through the default appliance input, which aren't parsed as multi-line (because of the stack traces) and thus generate a lot of messages
  4. You should see no indication in the UI that anything is really wrong, other than the buffers being full and the journal filling up (see the journal-check sketch after these steps)
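
To confirm the build-up without relying on the UI, the journal state can be polled through the Graylog REST API; a minimal sketch, assuming a 2.0 node with the API on its default port 12900 (credentials are placeholders):

```sh
# Show uncommitted journal entries and journal size for this node
# (URL, port and credentials are assumptions for a stock appliance).
curl -s -u admin:admin 'http://127.0.0.1:12900/system/journal?pretty=true'
```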

Steps to reproduce the problem 2

  1. Create an extractor which tries to parse a string as a datetime, but where the actual format doesn't match the format configured in the extractor
  2. Start pushing messages at this input, with the badly formatted timestamp
  3. You should now observe that exceptions are occurring and that messages are silently being dropped

• Graylog Version: 2.0.1 OVA
joschi commented 8 years ago

@CharlieEriksen Take a look at http://docs.graylog.org/en/2.0/pages/configuration/elasticsearch.html#custom-index-mappings for information about custom index mappings in Elasticsearch.
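
Roughly, the approach described on that page is to install a custom Elasticsearch index template for the conflicting field; a minimal sketch, assuming Elasticsearch on localhost:9200 and that you want level stored as a plain (not analyzed) string:

```sh
# Illustrative custom index template forcing "level" to string in all
# future graylog_* indices (field, template name and host are assumptions).
cat > graylog-custom-mapping.json <<'EOF'
{
  "template": "graylog_*",
  "mappings": {
    "message": {
      "properties": {
        "level": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}
EOF
curl -X PUT -d @graylog-custom-mapping.json 'http://localhost:9200/_template/graylog-custom-mapping?pretty'
```

Note that a template only affects indices created after it is installed, so you have to cycle the deflector (rotate the active write index) before the new mapping takes effect.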

I do agree, though, that we need to make mapping errors way more obvious in the UI so that users can take action and don't have to check the logs all the time.

CharlieEriksen commented 8 years ago

Well, I sort of see a few things that probably should be fixed:

  1. When a message fails to be indexed due to a casting error (such as int vs. string, or an unparseable datetime string), it shouldn't just spam the logs; it should raise the error in the UI and not drop the message.
  2. When stack traces are generated in Elasticsearch on the appliance, the default appliance input should parse the logs as multi-line, since it currently generates a lot of messages per exception, causing undue load.
  3. When replaying the journal, the processor should respect changes made to the input's extractors.

CharlieEriksen commented 8 years ago

Bump. Just ran into this again. Scenario:

  1. Changed the type of an extractor from ;int to ;long, and cycled the deflector to change the mapping (see the sketch after this list).
  2. Saw the journal start building up due to indexing issues. Messages were silently dropped, process buffer totally full.
  3. Changed the field back to int. No improvement in the buffer or performance.
  4. Deleted the extractor; no improvement either.
  5. Deleted the input and recreated it with the extractor again.
  6. Suddenly the journal started going down.
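
For anyone reproducing this: cycling the deflector in step 1 can be done through the REST API; a minimal sketch, assuming a Graylog 2.x node with the API on its default port (URL and credentials are placeholders):

```sh
# Rotate the active write index so that new mapping/template changes apply
# to the next index (endpoint as in the Graylog 2.x REST API docs).
curl -s -X POST -u admin:admin 'http://127.0.0.1:12900/system/deflector/cycle'
```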

This seems like a pretty bad issue, if you can't gracefully recover from a silent extractor error without deleting an input.

mikkolehtisalo commented 7 years ago

In my opinion there are two separate issues here. The first is, of course, improving user feedback and the handling of mapping errors.

The second is the way the appliance is set up, which creates cyclical logging:

  1. Action X generates a logged event
  2. Event is forwarded to Graylog
  3. Graylog processes message, causing Action X

I have had some very bad experiences with Graylog going into a logging coma while trying to handle logs created by its own log handling. It does not matter how powerful a setup you have; that situation takes everything down in a couple of seconds.

Naturally, every situation that triggers the phenomenon is a bug and must be handled as such. However, being prone to behavior that escalates the impact of said bugs is a design flaw. It would be advisable to consider design changes that mitigate the effects of cyclical logging situations.

CharlieEriksen commented 7 years ago

There are definitely several issues at play here. When I talked to Lennart, though, he suggested filing it all as one issue.

I've continued to run into this issue repeatedly, especially the scenario listed above, where a bad extractor puts the Graylog instance into an unrecoverable state that can only be fixed by deleting the input and recreating it. This makes mapping fields to integers REALLY DANGEROUS, since it carries such a high risk of forcing downtime for the cluster.

It also seems like a big issue that changes to extractors do not apply to messages already in the journal (at least that's my observation of how it works).