ankiit / logstash

Automatically exported from code.google.com/p/logstash

Corrupt data coming out of logstash grok filter #47

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
I'm using the following configuration:

inputs:
  all:
  - amqp://REDACTED/logstash/fanout/raw_logs
filters:
- date:
    syslog:
      timestamp: "%b %e %H:%M:%S"
      timestamp8601: ISO8601
    apache-access:
      timestamp: "%d/%b/%Y:%H:%M:%S %Z"
    apache-error:
      timestamp: "%a %b %d %H:%M:%S %Y"
    fanforce-request:
      timestamp: "%Y-%m-%dT%H:%M:%S.%f"
      timestamp8601: ISO8601
- grok:
    syslog:
      patterns:
      - %{SYSLOGLINE}
    apache-error:
      patterns:
      - %{APACHE_ERROR_LOG}
    apache-combined:
      patterns:
      - %{COMBINEDAPACHELOG_TIPPR}
    myapp-request:
      patterns:
      - %{MYAPP_STATS_EVENT}
      - %{MYAPP_LOG_GENERAL}
outputs:
- elasticsearch://REDACTED/logstash/events_river?method=river&type=rabbitmq&host=REDACTED&user=logstash&pass=REDACTED&vhost=logstash&queue=elasticsearch&exchange=parsed_logs&exchange_type=fanout&durable=true

...and the following nondefault grok patterns:

URIPATH_LOCAL (?:/[A-Za-z0-9$.+!*'(),~:#%_=-]*)+
URIPARAM_LOCAL \?[?A-Za-z0-9$.+!*'(),~#%&/=:;_-]*
URIPATHPARAM_LOCAL %{URIPATH_LOCAL}(?:%{URIPARAM_LOCAL})?
URI_LOCAL %{URIPROTO}://(?:%{USER}(?::[^@]*)?@)?(?:%{URIHOST})?(?:%{URIPATHPARAM_LOCAL})?

APACHE_LOG_LEVEL (?:emerg|alert|crit|error|warn|notice|info|debug)
APACHE_ERROR_LOG \[%{DATESTAMP_OTHER:timestamp}\] \[%{APACHE_LOG_LEVEL:level}\] %{GREEDYDATA:message}
COMBINEDAPACHELOG_LOCAL %{IPORHOST:clientip} %{USER:ident} %{USER:auth} \[%{HTTPDATE:timestamp}\] "%{WORD:verb} %{URIPATHPARAM_LOCAL:request} HTTP/%{NUMBER:httpversion:float}" %{NUMBER:response:int} (?:%{NUMBER:bytes:int}|-) "(?:%{URI_LOCAL:referrer}|-?)" %{QS:agent}(?: <%{NUMBER:bytes_in:int} >%{NUMBER:bytes_out:int} %{NUMBER:response_time:int}ms "%{HOSTNAME:domain}"(?: %{NUMBER:seconds_time:int})?)?(?: FF:"(?:%{IPORHOST:forwarded_for}|-)")?

MYAPP_LEVELNAME (?:DEBUG|INFO|WARNING|ERROR|CRITICAL)
MYAPP_TIMESTAMP %{YEAR}-%{MONTHNUM}-%{MONTHDAY}T%{HOUR}:%{MINUTE}:%{SECOND}[.][0-9]+
MYAPP_MODULE fanforce[.a-z]+
MYAPP_EMAIL [^ ]+@[^ ]+

MYAPP_REQUEST %{MYAPP_TIMESTAMP} %{MYAPP_MODULE:module} %{MYAPP_LEVELNAME:level} (?:%{IPORHOST:source_address}|-) (?:%{HOSTNAME:domain}|-) %{URIPATH:path} (?:%{MYAPP_EMAIL:email}|anonymous) %{GREEDYDATA:message}

MYAPP_NAME [a-zA-Z0-9_-]+
MYAPP_LOG_HEADER %{MYAPP_TIMESTAMP} %{MYAPP_MODULE:module} %{MYAPP_LEVELNAME:level} (?:%{IPORHOST:source_address}|-) (?:%{HOSTNAME:domain}|-) %{URIPATH:path} (?:%{MYAPP_EMAIL:email}|anonymous)
MYAPP_LOG_GENERAL %{MYAPP_LOG_HEADER} %{GREEDYDATA:message}
MYAPP_STATS_EVENT %{MYAPP_LOG_HEADER} stats-event %{MYAPP_NAME:stats_event_type}(?: \[amount:%{NUMBER:amount:float}\])?(?: \[channel:(?:%{MYAPP_NAME:channel})\])?(?: \[geography:(?:%{MYAPP_NAME:geography})?\])?(?: \[offer:%{MYAPP_NAME:offer}\])?(?: \[publisher:%{MYAPP_NAME:publisher}\])?

A substantial subset of my apache-combined logs leave the grok filter with data 
which not only fails to match the pattern in question, but which is not even 
valid UTF-8. An example of such an invalid message:

{"@source_host":"www8.corp.myapp.com","@source_path":"/var/log/myapp/request.log","@tags":[],"@timestamp":"2011-03-03T19:41:07.860691Z","@type":"myapp-request","@source":"file://www8.corp.myapp.com/var/log/myapp/request.log","@message":"2011-03-03T19:41:07.860103 myapp.core.publisher.views INFO 128.83.44.159 redactedpartnerdomain.com /a/channel/subscribe/ anonymous attempting to subscribe redacted_customer@hotmail.com to redactedpartner-austin","@fields":{"message":["attempting to subscribe redacted_customer@hotmail.com to redactedpartner-austin"],"HOSTNAME":["128.83.44.159"],"module":["myapp.core.publisher.views"],"SECOND":["07"],"domain":["redactedpartnerdomain.com"],"level":["INFO"],"HOUR":["19"],"FANFORCE_TIMESTAMP":["\xc0\xbdJ\\npAn\xb703T19:41:07.860103"],"FANFORCE_LOG_HEADER":["2011-03-03T19:41:07.860103 myapp.core.publisher.views INFO 128.83.44.159 redactedpartnerdomain.com /a/channel/subscribe/ anonymous"],"source_address":["128.83.44.159"],"MONTHDAY":["03"],"MONTHNUM":["An"],"IP":[],"YEAR":["\xc0\xbdJ\\n"],"path":["/a/channel/subscribe/"],"MINUTE":["41"],"email":[]}}

Clearly, many of the fields do not in any way match the message.
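
For anyone wanting to spot affected events in bulk, a rough check along these lines may help. This is only a sketch and not part of logstash or jls-grok: the helper name `suspicious_fields` is made up for illustration, the sample values are abbreviated from the event above, and it assumes every captured value should be valid UTF-8 and appear verbatim somewhere in `@message`.

```ruby
# Hypothetical helper (not a logstash or jls-grok API): report captured
# fields whose values are not valid in their encoding, or which do not occur
# anywhere in the original message. Assumes message and values are UTF-8.
def suspicious_fields(message, fields)
  fields.each_with_object({}) do |(name, values), bad|
    values.each do |value|
      text = value.to_s
      if !text.valid_encoding?
        (bad[name] ||= []) << :invalid_utf8
      elsif !message.include?(text)
        (bad[name] ||= []) << :not_in_message
      end
    end
  end
end

# Values abbreviated from the corrupt event shown above.
message = "2011-03-03T19:41:07.860103 myapp.core.publisher.views INFO 128.83.44.159"
fields = {
  "HOUR"     => ["19"],              # genuine capture
  "MONTHNUM" => ["An"],              # valid text, but not in the message
  "YEAR"     => ["\xC0\xBDJ\\n"],    # bytes that are not valid UTF-8
}

p suspicious_fields(message, fields)
```

In this sample the check flags `MONTHNUM` and `YEAR` and leaves `HOUR` alone. Fields with an empty capture list, such as `IP` and `email` above, are simply skipped.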

Please contact me directly if unredacted data is necessary for debugging.

Original issue reported on code.google.com by charles....@gmail.com on 4 Mar 2011 at 1:43

GoogleCodeExporter commented 9 years ago
This is a bug in how we use FFI; I found the problem.

I will update the unit tests to try and cover this and push a new jls-grok gem.

Original comment by jls.semi...@gmail.com on 4 Mar 2011 at 3:34

GoogleCodeExporter commented 9 years ago
Regression test added (ruby/test/regression/grokmatch-subject-garbagecollected-early.rb) and run in the test suite. Passes now.

New version pushed with this fix: jls-grok 0.4.4

Original comment by jls.semi...@gmail.com on 4 Mar 2011 at 3:36