tootedom opened 9 years ago
A small knock-on effect of no longer being able to specify the data type in the Oniguruma pattern style is that logstash no longer converts a "-" (a hyphen) to the base data type's default value.
So if you set up a template mapping in elasticsearch, i.e.
"ngx_upstream_response_time" : { "index": "not_analyzed", "type": "float" },
but logstash sends a "-" (missing data), elasticsearch throws a NumberFormatException and returns a 400 to logstash, which drops the message and writes it to logstash.log (which, if you have millions of logs, eats all the disk on the instance :-)
Example:
input {
  stdin { }
}
filter {
  grok {
    patterns_dir => "/opt/logstash/patterns/"
    match => [ "message", "upstream_response_time:(?<ngx_upstream_response_time:float>[0-9\-.]{1,999})" ]
  }
}
output {
  stdout { codec => rubydebug }
}
1.4.2 would convert to 0.0 as follows:
echo "upstream_response_time:-" | logstash-1.4.2/bin/logstash -f filter.conf
{
    "message" => "upstream_response_time:-",
    "@version" => "1",
    "@timestamp" => "2015-07-04T10:20:53.177Z",
    "ngx_upstream_response_time" => 0.0
}
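The 0.0 presumably comes from Ruby's standard numeric coercion, which returns the type's default value for input that does not parse as a number. A minimal illustration (this is a guess at the mechanism, not the actual logstash coercion code):

```ruby
# Ruby's String#to_f / String#to_i return the numeric default (0.0 / 0)
# for unparseable input -- likely why 1.4.2 emitted 0.0 for "-".
value = "-"
puts value.to_f  # => 0.0
puts value.to_i  # => 0
```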
Given the following pattern
upstream_response_time:(?<ngx_upstream_response_time:float>[0-9\-.]{1,999})
In logstash 1.4.2, the named capture, when passed to the compile_capture_handler (https://github.com/elastic/logstash/blob/1.4/lib/logstash/filters/grok.rb#L341), looked like this:
_:ngx_upstream_response_time:float
I'm guessing (only guessing) that it is something the Grok::Pile regex compilation did to the named capture.
In logstash 1.5.2 the Regexp named capture does not have this prefix added (https://github.com/jordansissel/ruby-grok/blob/master/lib/grok-pure.rb#L139), so the named capture, when sent to the split(":"), is this:
ngx_upstream_response_time:float
In compile_captures_func in 1.5.2, would it cause any issues if the split were done as follows instead:
pattern, name, coerce = "_:#{name}".split(":")
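To illustrate why the prefix matters, here is a standalone sketch of the split behaviour in both versions (the variable names mirror the filter code, but this is an illustration, not the actual source):

```ruby
# 1.4.2: the capture name arrives with a leading "_:" segment,
# so split(":") lines up as [pattern, name, coerce].
pattern, name, coerce = "_:ngx_upstream_response_time:float".split(":")
# name   => "ngx_upstream_response_time"
# coerce => "float"

# 1.5.x: the prefix is gone, so everything shifts left --
# the type lands in "name" and coerce is nil.
pattern, name, coerce = "ngx_upstream_response_time:float".split(":")
# name   => "float"
# coerce => nil

# Proposed fix: re-add a dummy prefix before splitting.
capture = "ngx_upstream_response_time:float"
pattern, name, coerce = "_:#{capture}".split(":")
# name   => "ngx_upstream_response_time"
# coerce => "float"
```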
This issue still affects logstash 2.3.2; our only workaround is to use a patterns file.
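For reference, the patterns-file workaround looks roughly like this. The pattern name NGX_TIME and the file path are examples of mine, not our actual config; the point is that the %{SYNTAX:name:type} form still honours the type suffix:

```
# Hypothetical pattern file /opt/logstash/patterns/nginx containing:
#   NGX_TIME [0-9\-.]{1,999}

filter {
  grok {
    patterns_dir => "/opt/logstash/patterns/"
    match => [ "message", "upstream_response_time:%{NGX_TIME:ngx_upstream_response_time:float}" ]
  }
}
```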
This Oniguruma pattern: (?<throughputtime:int>[0-9]+) now results in a field named "int" in elasticsearch, rather than a field "throughputtime" with an int type.
We noticed this issue after upgrading from 1.4.2 to 1.5.2. The problem first appeared in logstash 1.5.0.
Replication is as follows:
message:
Config:
Running in 1.4.2 results in the following:
Running in 1.5.0 results in the following:
Simple command line examples:
1.4.3
1.5.0
1.5.2
I've looked at CHANGELOG.md and nothing indicates a breaking change with regard to grok field matching and type specification. Is this a known issue?
thanks /dom