elastic / logstash

Logstash - transport and process your logs, events, or other data
https://www.elastic.co/products/logstash
Other
78 stars 3.5k forks source link

Regexp patterns cannot end with backslash (even if backslash-escaped) #9701

Open yaauie opened 6 years ago

yaauie commented 6 years ago

An issue in the treetop grammar can cause the regexp to be too aggressive at capturing, which can ultimately cause the pipeline compilation to fail when a pattern ends with a backslash (even if that backslash is backslash-escaped).

I believe I know how to fix this, but would like to discuss it to consider the risk before we tackle the compilation of the treetop grammar, since the tooling to compile the grammar does not immediately appear to be working for me and fixing it may take some effort.


The treetop grammar currently defines a regexp as:

  rule regexp
    ( '/' ( '\/' / !'/' . )* '/'  <LogStash::Config::AST::RegExp>)
  end

In plain English that is:

note: rule 2b uses a literal "Negative Lookahead Assertion" to prevent it from matching forward slashes

Using this definition, in a config file the sequence of characters /foo\\bar/ will correctly generate a Regexp that matches the literal string foo\bar, but the pattern /fubar\\/, which attempts to match the literal string fubar\ will capture the intended (3) closing forwardslash as (2a), fail to close, and continue capturing until it either hits EOL or an unescaped forward slash from another expression, typically resulting in a pipeline compilation error.

The below table shows where the parser captures too eagerly; each parenthised group in the parse column indicates the rule that caused the pattern to match, followed by the match.

pattern parse state
/foo/ (1/)(2bf)(2bo)(2bo)(3/) OK
/foo\\bar/ (1/)(2bf)(2bo)(2bo)(2b\)(2b\)(2bb)(2ba)(2br)(3/) OK
/foo\/bar/ (1/)(2bf)(2bo)(2bo)(2a\/)(2bb)(2ba)(2br)(3/) OK
/fubar\\/ (1/)(2bf)(2bu)(2bb)(2ba)(2br)(2b\)(2a\/)... OPEN

I believe that we can redefine the grammar for regexp to capture escape sequences better without significant risk:

   rule regexp
-    ( '/' ( '\/' / !'/' . )* '/'  <LogStash::Config::AST::RegExp>)
+    ( '/' ( '\' . / !'/' . )* '/'  <LogStash::Config::AST::RegExp>)
   end

In plain English that is:

This will slightly change the meaning of how escape sequences get parsed (e.g., sequences that matched 2b twice before may match 2a now), but because backslashes and the character that followed them were always captured and this change does nothing to process the escape sequences, this fix poses no external risk.

pattern parse state
/foo/ (1/)(2bf)(2bo)(2bo)(3/) OK
/foo\\bar/ (1/)(2bf)(2bo)(2bo)(2a\\)(2bb)(2ba)(2br)(3/) OK
/foo\/bar/ (1/)(2bf)(2bo)(2bo)(2a\/)(2bb)(2ba)(2br)(3/) OK
/fubar\\/ (1/)(2bf)(2bu)(2bb)(2ba)(2br)(2a\/)(3\/)... OK
yaauie commented 6 years ago

(note: I believe our grammar behaves in similar ways with double- and single-quoted strings that end in backslashes, and that a similar fix may be non-breaking there too)

honzakral commented 6 years ago

I believe I am hitting this when trying to set field_split in kv filter to just \.

The proposed fix works only when config.support_escapes is set to true but if it is false then even '\' should be a valid string according to the docs. The setting needs to be taken into account when parsing, not after like now.

yaauie commented 6 years ago

I believe I am hitting this when trying to set field_split in kv filter to just /.

Did you mean forward-slash here? A forward-slash works just fine as a field splitter as evidenced below, but a lone backslash will fail in parsing as described above (regardless of the value of config.support_escapes).

╭─{ yaauie@castrovel:~/src/elastic/pr-scratch/logstash-9701-backslashes }
╰─○ echo "foo=bar/baz=bingo" | logstash-6.2.4/bin/logstash -e 'filter { kv { field_split => "/" source => "message" } }'
...
[2018-06-14T17:23:37,265][INFO ][logstash.agent           ] Pipelines running {:count=>1, :pipelines=>["main"]}
{
           "baz" => "bingo",
           "foo" => "bar",
          "host" => "castrovel.local",
    "@timestamp" => 2018-06-14T17:23:37.223Z,
          "type" => "stdin",
       "message" => "foo=bar/baz=bingo",
      "@version" => "1"
}
[2018-06-14T17:23:37,559][INFO ][logstash.pipeline        ] Pipeline has terminated {:pipeline_id=>"main", :thread=>"#<Thread:0x128167ed run>"}
[success (17.000s)]

even '\' should be a valid string according to the docs

@HonzaKral which docs? I haven't been able to find unambiguous documentation on this bit, and whether or not we have the config.support_escapes enabled, the parser has always understood a backslash-escaped quote to be a part of a quoted string and not its termination. Without the config.support_escapes enabled it simply failed to process the escape sequence.

honzakral commented 6 years ago

Sorry, I meant backslash of course, it was a typo, I edited my comment and fixed it as not to confuse more people, my bad!

It did understand it but then the docs for support_escapes say:

\" becomes a literal double quotation mark. \' becomes a literal quotation mark.

which implies that without this setting it is not so.

yaauie commented 6 years ago

which implies that without this setting it is not so

Correct, but there is more than one way for it to be "not so" :weary:

Unfortunately, as with most escaping problems, this one is like an onion with all of the layers.

For both strings and regexps, the parser has always allowed users to escape a close-quote character from the parser (' or " in single- and double-quoted strings, / in regexps) by prefixing it with a backslash; this merely prevents the parser from identifying that quote character as a close-quote for the string or regexp it is in the middle of parsing, so the whole escape sequence is emitted from the parser without modification.

Our Regexp engine will gladly take the passthrough escape sequence, or it will ignore meaningless escapes, which combine to make it extra forgiving.

On Strings though, emitting un-processed escape sequences from the parser was problematic. Many plugins either worked around the above behaviour (by implementing their own escape-sequence processing) or exploited it (e.g., by not escaping strings that are used to create plugin-internal regexp, like kv's field_split directive), which makes fixing it globally problematic. Additionally, Windows users who are less familiar with backslash-as-escape-character tend to use it literally and unescaped in file paths). So we made support for processing escape sequences in strings an opt-in feature with config.support_escapes.


In order to specify a literal backslash in a quoted string or regexp in a pipeline config, you need to do one of two things:

While the former logically prevents strings from ending with unescaped backslashes, this ticket revolves around a bug with the latter, where an escaped backslash at the end of a string or regexp is being interpreted by the parser as literal-backslash-escaped-quote instead of escaped-backslash-close-quote`.

In your case, once this bug is fixed, with config.support_escapes enabled, you will be able to specify a solitary literal backslash in kv's field_split with the following:

filter {
  kv {
    field_split => "\\\\"
  }
}

Yes, that is four backslashes:

phase interpretation
config open-quote escape backslash escape backslash close-quote
plugin string holding (literal-backslash literal-backslash)
regexp escape backslash

Without config.support_escapes enabled, that would only be two backslashes:

filter {
  kv {
    field_split => "\\"
  }
}
phase interpretation
config open-quote escape backslash close-quote
plugin string holding (literal-backslash literal-backslash)
regexp escape backslash
colinsurprenant commented 6 years ago

Thanks @yaauie for the very detailed analysis of the problem. What do you recommend moving forward? Your suggested changes to the regexp rule? We should probably also improve the documentation around that?

@HonzaKral does the above explanation and solution makes sense with your issue?

honzakral commented 6 years ago

@colinsurprenant as long as I can do what I needed, I am ok. It would be great if it were documented in a way that would allow people to reason about what's going on - the way I undertsand it is that the escaping is done/interpretted twice, once (optionally) by the parser and once by the filter (I would also assume that different filters have different behaviors?)

colinsurprenant commented 6 years ago

@HonzaKral yeah, this is what I understand and one of the specific problem is related to plugins that require a regex option like the kv filter. The config parser has no notion of a regex option for plugins so a regex must be passed as a string option.

@yaauie our parser already has a notion of regex but for config expressions in conditionals, maybe we could look into introducing a regex option type for plugins? Do you think that could help?

honzakral commented 6 years ago

@colinsurprenant the option for kv is not a regex, just a string, a list of characters (0). Solving this for regexes won't help here.

The ultimate issue here is imho that both the parser and then the individual filters (sometimes) try and resolve the escaping. This leads to at best the need for double quotes at worst the inability to do something like specifying / as a string value. The incosistency where the parser always (regardless of escaping settings) parses \" but doesn't actually evaluate it as " is just adding confusion.

I believe the proper solution would be to unify the escaping rules by having the parser deal with it and hand over already unescaped string values to the plugins. It would also be possible to leave this to the plugins only but That seems less ideal to me as it would be implemented in many places instead of one.

0 - https://www.elastic.co/guide/en/logstash/current/plugins-filters-kv.html#plugins-filters-kv-field_split

colinsurprenant commented 6 years ago

@HonzaKral ah yes sorry, my regex comment was for options like "field_split_pattern" and for some reason I forgot your original comment was with "field_split".

yaauie commented 6 years ago

@yaauie our parser already has a notion of regex but for config expressions in conditionals, maybe we could look into introducing a regex option type for plugins? Do you think that could help? -- @colinsurprenant in issue comment

@colinsurprenant this issue is around literal regexp in the pipelines; literal strings, which are interpreted by a plugin to be a regexp, fall into the config.support_escapes camp (literal strings are also affected by a similar bug)


What do you recommend moving forward? -- @colinsurprenant in issue comment

If you see no major flaw in my analysis of the regexp and why we fail to parse strings- or regexp-ending-in-escaped-backslashes, I can push forward and get to a PR with the updated treetop grammar and generated parser.


I believe the proper solution would be to unify the escaping rules by having the parser deal with it and hand over already unescaped string values to the plugins. It would also be possible to leave this to the plugins only but That seems less ideal to me as it would be implemented in many places instead of one. -- @HonzaKral in issue comment

Your solution is that of config.support_escapes, and the issue outlined in this ticket would resolve the one remaining edge-case: when attempting to end a string with a backslash.

honzakral commented 6 years ago

@yaauie even with support_escapes this calls for \\\\ which is not intuitive and trips people up, that is why I would not consider it the best solution which in my mind still lies in consolidating the escape handling into the parser where it imho belongs.

Of course I understand that that is a big change and this solution will fix the corner case which is causing problems.

yaauie commented 6 years ago

I agree that "\\\\" isn't the most intuitive way to express a single backslash, and am open to unambiguous ways to make it better.

In this case, the kv plugin really shouldn't be injecting verbatim inputs into a regexp, and should perform the necessary escaping of any inputs it receives instead of requiring the user to add a layer of escaping into the current options because of this shortcoming. I've filed logstash-plugins/logstash-filter-kv#66 as a potential solution.

Additionally, @colinsurprenant's suggestion of making regexp an option type for plugins could also be helpful. I'll open up a separate design discussion around that, since there may a bit of a chicken-vs-egg situation implementing in plugins that can be installed into Logstashes that don't yet have it implemented.

colinsurprenant commented 6 years ago

@yaauie

since there may a bit of a chicken-vs-egg situation implementing in plugins that can be installed into Logstashes that don't yet have it implemented.

Right. This is off-topic here but maybe for 7.0 we could discuss how to improve plugins compatibility vs logstash versions/features. That was the original intent of the logstash-core-plugin-api versioning. 7.0 could be an opportunity to reboot that correctly.

colinsurprenant commented 6 years ago

@HonzaKral WDYT about the field_split_char idea in https://github.com/logstash-plugins/logstash-filter-kv/issues/66 ?

@yaauie

If you see no major flaw in my analysis of the regexp and why we fail to parse strings- or regexp-ending-in-escaped-backslashes, I can push forward and get to a PR with the updated treetop grammar and generated parser.

Yes, I think we should move forward.

honzakral commented 6 years ago

@colinsurprenant sounds good, though I would consider it more of a bug fix so I wouldn't mind the field_split arg to be fixed. After all specifying \\\\ will will work since the regext will then just contain the backlash twice which is functionally the same

blookot commented 4 years ago

I'm referencing tests I'm running on dissect: https://discuss.elastic.co/t/dissect-with-a-backslash-at-the-end-of-the-log-line/216630 any case anyone wants to contribute to the discuss thread as well :-) Thanks!

blookot commented 4 years ago

PS This issue does not apply to regex! I haven't set config.support_escapes: true (I think it's false by default). I'm using \ only to escape " and this works perfectly well except when there is a \ at the very end of the string. I tried setting config.support_escapes: true and using a \\ before the " that ends my dissect mapping but that wouldn't work either. Thanks in advance for any update on this issue!

honzakral commented 3 years ago

@TheVastyDeep Not if you are using it for splitting, then there is a vast difference when you add .* at the end