Graylog2 / graylog2-server

Free and open log management
https://www.graylog.org

Streams appearing not to trigger consistently #3004

Closed: urban-moniker closed this issue 7 years ago

urban-moniker commented 7 years ago

We are currently testing Graylog and have found an issue that we have traced back to streams not triggering consistently. To validate this, we have created a test scenario that hopefully removes any other variables:

Expected Behavior

- Firewall logs are sent to Graylog at circa 300 messages/second.
- A dedicated input tags the incoming logs with a custom field (tech_type: firewall).
- A stream should include all firewall logs by matching on that tag.
- A pipeline tags any entries from that stream with another field.

(Note that the issue we originally saw was technically in the pipeline portion, but we traced it to the log entries not being processed in the stream.)

Current Behavior

Approximately 10% of log entries from the firewalls are not processed by the stream, even though they do carry the input tag (tech_type: firewall).

Context

We have a load-balanced 3-node cluster. We have disabled processing on two nodes (to rule out an issue with the streams not processing properly on one node) and confirmed that we still see the same issue. Graylog, Elasticsearch and MongoDB are all installed on each node, and currently each has only 4 GB RAM, so if you say it's likely a memory issue that's fine; we will upgrade if we go into production.

Your Environment

Ubuntu 16.04 LTS
Graylog 2.1.1+01d50e5
Oracle Corporation 1.8.0_91 on Linux 4.4.0-45-generic

edmundoa commented 7 years ago

Hi @urban-moniker,

Would you be so kind as to share with us the stream rules you use to match messages, as well as the pipeline rules that transform those messages? Please also check your Graylog server logs for any error messages.

Thank you.

kroepke commented 7 years ago

@urban-moniker I've just had a look at the code and couldn't find anything obvious; however, I have one question:

Is your Graylog server experiencing longer pauses from garbage collection (unlikely with this load, but still), or is it running on a VM that might pause execution? There's a timeout attached to stream rule matching (by default 2 seconds) that triggers whenever the wall-clock time exceeds the threshold.
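For reference, the timeout is configurable in server.conf; a sketch with what I believe are the 2.x defaults (adjust the path for your installation):

    # /etc/graylog/server/server.conf
    # Maximum wall-clock time (ms) a stream rule match may take before it
    # counts as a fault -- the 2-second timeout mentioned above.
    stream_processing_timeout = 2000
    # Number of timeout faults after which Graylog pauses the stream.
    stream_processing_max_faults = 3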

This can potentially lead to stream pausing, which could produce the effect you are seeing. You can check the metrics on the affected node by looking at the meter ruleTimeouts. If it is missing, there are no timeouts on that node; otherwise it will give you a count and 1/5/15-minute rates.
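One way to check is to dump the node's metrics over the REST API and grep for the meter name; a rough sketch, assuming the default 2.1 REST listener on port 12900 and admin credentials (adjust host, port and password for your setup):

    # Dump all metrics from the affected node and look for the timeout meter.
    curl -s -u admin:password 'http://graylog-node:12900/system/metrics' \
      | grep -io '"[^"]*ruletimeouts[^"]*"'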

Thanks!

urban-moniker commented 7 years ago

Hi,

The stream rule is a simple attribute check:

    Rule 1: tech_type must match exactly firewall

where this custom message attribute is inserted by the input.

The pipeline rule is as below:

_rule "Tor Check" when has_field("DestinationIP") && has_field("SourceIP") && to_string($message.SourceZone) == "Internet" then set_field("Threat_Checked", "Yes"); //let torintel = tor_lookup(to_string($message.SourceIP)); //set_field("Threat_SrcIP_Tor_Exit", torintel.exit_nodeindicated); end

Note that I have currently disabled the additional function elements, so all the rule does is add the 'static' attribute; the logic being that if all expected log messages are tagged with tech_type=firewall and Threat_Checked=Yes, then we know that all the main stages have processed correctly, without adding the impact of the Tor lookup.

Working backwards from when we first noted that not all entries seemed to have the right tags present, we noticed:

- Not all firewall log entries from SourceZone:Internet had been tagged Threat_Checked.
- Expanding those log entries, we noted that they didn't appear to have been associated with the firewall stream.

I am trying to build a proper repeatable test to remove all variables (we have a 3-node cluster fronted by an HA load balancer), and will enable debug monitoring to see if I can correlate any errors with the 'missing' tags. I do not currently have any metric entries for ruleTimeouts, but I have recently rebooted, so that may not be a true representation.

I should note that we are seeing prolonged periods where we get 'No Master' errors in the logs which, as all 3 nodes are NTP-synced to within ~20 ms drift, looks to be related to low memory conditions. We are therefore looking to increase the memory on these nodes and/or split the Elasticsearch and Graylog services onto separate servers, as the current setup may be causing us more issues behind the scenes (each server has 4 GB RAM and runs Elasticsearch, Graylog and MongoDB).

In general we only seem to be able to sustain about 1,200 EPS per node (tested by pausing processing, letting a backlog of 100k messages build up, and timing how long it takes to drain), which seems low given that we are running as VMs on hyperconverged hardware that can sustain tens of thousands of IOPS at good latency for fully random writes. So I'm thinking I should fix the issues I know about first, then see if this issue fixes itself (which I suspect it will).

Thanks for the feedback

kroepke commented 7 years ago

What order are the "filter chain" and "pipeline processor" in? You can check on the System -> Configuration page.

Normally the "filter chain" (i.e. static fields/extractors/stream routing) are run first, and then the pipelines are invoked. If you have a stream rule checking for something the pipelines add, then it won't work the way you expect it to. However, you can also route message to streams from the pipeline rule, check out the route_to_stream function, which either takes the internal ID of the stream or its name (the latter being a bit slower).

urban-moniker commented 7 years ago

It is the filter chain followed by the pipeline processor, which I believe is the right order, i.e.:

  1. Static Field added by Input
  2. Stream triggered by presence of Static Field
  3. Pipeline attached to Stream and adds another static field

I'm going to force all firewall logs to a specific node and simplify the pipeline config to just add a second field if the first one is present, to reduce the variables. If that triggers reliably, I'll disable processing to build a backlog and let it surge through, and if that works, start re-adding the additional pipeline checks. All the while I'll monitor the logs and metrics to see if anything untoward jumps out.
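For reference, the simplified test rule would be something like this minimal sketch (field names as per our setup; the rule name is illustrative):

    rule "firewall tag check"
    when
      has_field("tech_type") && to_string($message.tech_type) == "firewall"
    then
      set_field("Threat_Checked", "Yes");
    end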

I'll probably do this first without adding additional RAM/servers, to see if I can reproduce the issue under 'controlled' conditions.

kroepke commented 7 years ago

Yes, that works. I was just checking that we aren't missing the obvious :)

urban-moniker commented 7 years ago

Cheers - at least 10% of the time I miss the blindingly obvious!

I've just started the tests as above and so far it's looking 'clean': all incoming firewall logs are being tagged, directed to the stream, and tagged by the pipeline rule, and this is with larger volumes (up to 1,000 EPS) because it's our working day. I'm actually wondering if the issue was down to the pre-release version of the Threat Intelligence plugin I had installed at the time (since upgraded). I've also just tried re-adding the Tor function in the pipeline: before, it would add a 'false' threat_indicated field for all messages, whereas the latest version doesn't (I assume because it's not enabled - a different issue!).

kroepke commented 7 years ago

Makes sense. If you suspect that a failing function is responsible, also keep an eye on the error counts of the rules; those should be shown on the pipeline pages, too. If an exception is thrown, not all of the rule might be executed, depending on where the exception occurred.

jalogisch commented 7 years ago

@urban-moniker can you reproduce this or give us some more input?

urban-moniker commented 7 years ago

@jalogisch Sorry (or happily!) but I've been unable to replicate this to date. It may or may not have been fixed since I upgraded the Threat Intel plugin, but given that we are testing and changing things, I'm satisfied that this may have been 'operator error', so I will close the issue.

If we are able to replicate it in future, I'll re-open this issue or open a new one.