Open yaauie opened 6 years ago
(note: I believe our grammar behaves in similar ways with double- and single-quoted strings that end in backslashes, and that a similar fix may be non-breaking there too)
I believe I am hitting this when trying to set field_split
in kv
filter to just \
.
The proposed fix works only when config.support_escapes
is set to true
but if it is false then even '\'
should be a valid string according to the docs. The setting needs to be taken into account when parsing, not after like now.
I believe I am hitting this when trying to set
field_split
inkv
filter to just/
.
Did you mean forward-slash here? A forward-slash works just fine as a field splitter as evidenced below, but a lone backslash will fail in parsing as described above (regardless of the value of config.support_escapes
).
╭─{ yaauie@castrovel:~/src/elastic/pr-scratch/logstash-9701-backslashes }
╰─○ echo "foo=bar/baz=bingo" | logstash-6.2.4/bin/logstash -e 'filter { kv { field_split => "/" source => "message" } }'
...
[2018-06-14T17:23:37,265][INFO ][logstash.agent ] Pipelines running {:count=>1, :pipelines=>["main"]}
{
"baz" => "bingo",
"foo" => "bar",
"host" => "castrovel.local",
"@timestamp" => 2018-06-14T17:23:37.223Z,
"type" => "stdin",
"message" => "foo=bar/baz=bingo",
"@version" => "1"
}
[2018-06-14T17:23:37,559][INFO ][logstash.pipeline ] Pipeline has terminated {:pipeline_id=>"main", :thread=>"#<Thread:0x128167ed run>"}
[success (17.000s)]
even
'\'
should be a valid string according to the docs
@HonzaKral which docs? I haven't been able to find unambiguous documentation on this bit, and whether or not we have the config.support_escapes
enabled, the parser has always understood a backslash-escaped quote to be a part of a quoted string and not its termination. Without the config.support_escapes
enabled it simply failed to process the escape sequence.
Sorry, I meant backslash of course, it was a typo, I edited my comment and fixed it as not to confuse more people, my bad!
It did understand it but then the docs for support_escapes
say:
\"
becomes a literal double quotation mark.\'
becomes a literal quotation mark.
which implies that without this setting it is not so.
which implies that without this setting it is not so
Correct, but there is more than one way for it to be "not so" :weary:
Unfortunately, as with most escaping problems, this one is like an onion with all of the layers.
For both strings and regexps, the parser has always allowed users to escape a close-quote character from the parser ('
or "
in single- and double-quoted strings, /
in regexps) by prefixing it with a backslash; this merely prevents the parser from identifying that quote character as a close-quote for the string or regexp it is in the middle of parsing, so the whole escape sequence is emitted from the parser without modification.
Our Regexp engine will gladly take the passthrough escape sequence, or it will ignore meaningless escapes, which combine to make it extra forgiving.
On Strings though, emitting un-processed escape sequences from the parser was problematic. Many plugins either worked around the above behaviour (by implementing their own escape-sequence processing) or exploited it (e.g., by not escaping strings that are used to create plugin-internal regexp, like kv
's field_split
directive), which makes fixing it globally problematic. Additionally, Windows users who are less familiar with backslash-as-escape-character tend to use it literally and unescaped in file paths). So we made support for processing escape sequences in strings an opt-in feature with config.support_escapes
.
In order to specify a literal backslash in a quoted string or regexp in a pipeline config, you need to do one of two things:
"
or '
for strings, /
for regexp); ORconfig.support_escapes
to true
to ensure escape sequences are processed in strings).While the former logically prevents strings from ending with unescaped backslashes, this ticket revolves around a bug with the latter, where an escaped backslash at the end of a string or regexp is being interpreted by the parser as literal-backslash
-escaped-quote
instead of escaped-backslash
-close-quote`.
In your case, once this bug is fixed, with config.support_escapes
enabled, you will be able to specify a solitary literal backslash in kv
's field_split
with the following:
filter {
kv {
field_split => "\\\\"
}
}
Yes, that is four backslashes:
phase | interpretation |
---|---|
config | open-quote escape backslash escape backslash close-quote |
plugin | string holding (literal-backslash literal-backslash ) |
regexp | escape backslash |
Without config.support_escapes
enabled, that would only be two backslashes:
filter {
kv {
field_split => "\\"
}
}
phase | interpretation |
---|---|
config | open-quote escape backslash close-quote |
plugin | string holding (literal-backslash literal-backslash ) |
regexp | escape backslash |
Thanks @yaauie for the very detailed analysis of the problem. What do you recommend moving forward? Your suggested changes to the regexp rule? We should probably also improve the documentation around that?
@HonzaKral does the above explanation and solution makes sense with your issue?
@colinsurprenant as long as I can do what I needed, I am ok. It would be great if it were documented in a way that would allow people to reason about what's going on - the way I undertsand it is that the escaping is done/interpretted twice, once (optionally) by the parser and once by the filter (I would also assume that different filters have different behaviors?)
@HonzaKral yeah, this is what I understand and one of the specific problem is related to plugins that require a regex option like the kv filter. The config parser has no notion of a regex option for plugins so a regex must be passed as a string option.
@yaauie our parser already has a notion of regex but for config expressions in conditionals, maybe we could look into introducing a regex option type for plugins? Do you think that could help?
@colinsurprenant the option for kv
is not a regex, just a string, a list of characters (0). Solving this for regexes won't help here.
The ultimate issue here is imho that both the parser and then the individual filters (sometimes) try and resolve the escaping. This leads to at best the need for double quotes at worst the inability to do something like specifying /
as a string value. The incosistency where the parser always (regardless of escaping settings) parses \"
but doesn't actually evaluate it as "
is just adding confusion.
I believe the proper solution would be to unify the escaping rules by having the parser deal with it and hand over already unescaped string values to the plugins. It would also be possible to leave this to the plugins only but That seems less ideal to me as it would be implemented in many places instead of one.
@HonzaKral ah yes sorry, my regex comment was for options like "field_split_pattern" and for some reason I forgot your original comment was with "field_split".
@yaauie our parser already has a notion of regex but for config expressions in conditionals, maybe we could look into introducing a regex option type for plugins? Do you think that could help? -- @colinsurprenant in issue comment
@colinsurprenant this issue is around literal regexp in the pipelines; literal strings, which are interpreted by a plugin to be a regexp, fall into the config.support_escapes
camp (literal strings are also affected by a similar bug)
What do you recommend moving forward? -- @colinsurprenant in issue comment
If you see no major flaw in my analysis of the regexp and why we fail to parse strings- or regexp-ending-in-escaped-backslashes, I can push forward and get to a PR with the updated treetop grammar and generated parser.
I believe the proper solution would be to unify the escaping rules by having the parser deal with it and hand over already unescaped string values to the plugins. It would also be possible to leave this to the plugins only but That seems less ideal to me as it would be implemented in many places instead of one. -- @HonzaKral in issue comment
Your solution is that of config.support_escapes
, and the issue outlined in this ticket would resolve the one remaining edge-case: when attempting to end a string with a backslash.
@yaauie even with support_escapes
this calls for \\\\
which is not intuitive and trips people up, that is why I would not consider it the best solution which in my mind still lies in consolidating the escape handling into the parser where it imho belongs.
Of course I understand that that is a big change and this solution will fix the corner case which is causing problems.
I agree that "\\\\"
isn't the most intuitive way to express a single backslash, and am open to unambiguous ways to make it better.
In this case, the kv plugin really shouldn't be injecting verbatim inputs into a regexp, and should perform the necessary escaping of any inputs it receives instead of requiring the user to add a layer of escaping into the current options because of this shortcoming. I've filed logstash-plugins/logstash-filter-kv#66 as a potential solution.
Additionally, @colinsurprenant's suggestion of making regexp an option type for plugins could also be helpful. I'll open up a separate design discussion around that, since there may a bit of a chicken-vs-egg situation implementing in plugins that can be installed into Logstashes that don't yet have it implemented.
@yaauie
since there may a bit of a chicken-vs-egg situation implementing in plugins that can be installed into Logstashes that don't yet have it implemented.
Right. This is off-topic here but maybe for 7.0 we could discuss how to improve plugins compatibility vs logstash versions/features. That was the original intent of the logstash-core-plugin-api
versioning. 7.0 could be an opportunity to reboot that correctly.
@HonzaKral WDYT about the field_split_char
idea in https://github.com/logstash-plugins/logstash-filter-kv/issues/66 ?
@yaauie
If you see no major flaw in my analysis of the regexp and why we fail to parse strings- or regexp-ending-in-escaped-backslashes, I can push forward and get to a PR with the updated treetop grammar and generated parser.
Yes, I think we should move forward.
@colinsurprenant sounds good, though I would consider it more of a bug fix so I wouldn't mind the field_split
arg to be fixed. After all specifying \\\\
will will work since the regext will then just contain the backlash twice which is functionally the same
I'm referencing tests I'm running on dissect: https://discuss.elastic.co/t/dissect-with-a-backslash-at-the-end-of-the-log-line/216630 any case anyone wants to contribute to the discuss thread as well :-) Thanks!
PS This issue does not apply to regex!
I haven't set config.support_escapes: true
(I think it's false
by default).
I'm using \
only to escape "
and this works perfectly well except when there is a \
at the very end of the string.
I tried setting config.support_escapes: true
and using a \\
before the "
that ends my dissect mapping but that wouldn't work either.
Thanks in advance for any update on this issue!
@TheVastyDeep Not if you are using it for splitting, then there is a vast difference when you add .*
at the end
An issue in the treetop grammar can cause the
regexp
to be too aggressive at capturing, which can ultimately cause the pipeline compilation to fail when a pattern ends with a backslash (even if that backslash is backslash-escaped).I believe I know how to fix this, but would like to discuss it to consider the risk before we tackle the compilation of the treetop grammar, since the tooling to compile the grammar does not immediately appear to be working for me and fixing it may take some effort.
The treetop grammar currently defines a regexp as:
In plain English that is:
'/'
)'\/'
); OR!'/' .
)'/'
)note: rule 2b uses a literal "Negative Lookahead Assertion" to prevent it from matching forward slashes
Using this definition, in a config file the sequence of characters
/foo\\bar/
will correctly generate a Regexp that matches the literal stringfoo\bar
, but the pattern/fubar\\/
, which attempts to match the literal stringfubar\
will capture the intended (3) closing forwardslash as (2a), fail to close, and continue capturing until it either hits EOL or an unescaped forward slash from another expression, typically resulting in a pipeline compilation error.The below table shows where the parser captures too eagerly; each parenthised group in the parse column indicates the rule that caused the pattern to match, followed by the match.
/foo/
/
)(2bf
)(2bo
)(2bo
)(3/
)/foo\\bar/
/
)(2bf
)(2bo
)(2bo
)(2b\
)(2b\
)(2bb
)(2ba
)(2br
)(3/
)/foo\/bar/
/
)(2bf
)(2bo
)(2bo
)(2a\/
)(2bb
)(2ba
)(2br
)(3/
)/fubar\\/
/
)(2bf
)(2bu
)(2bb
)(2ba
)(2br
)(2b\
)(2a\/
)...I believe that we can redefine the grammar for regexp to capture escape sequences better without significant risk:
In plain English that is:
'/'
)'\' .
); OR!'/' .
)'/'
)This will slightly change the meaning of how escape sequences get parsed (e.g., sequences that matched 2b twice before may match 2a now), but because backslashes and the character that followed them were always captured and this change does nothing to process the escape sequences, this fix poses no external risk.
/foo/
/
)(2bf
)(2bo
)(2bo
)(3/
)/foo\\bar/
/
)(2bf
)(2bo
)(2bo
)(2a\\
)(2bb
)(2ba
)(2br
)(3/
)/foo\/bar/
/
)(2bf
)(2bo
)(2bo
)(2a\/
)(2bb
)(2ba
)(2br
)(3/
)/fubar\\/
/
)(2bf
)(2bu
)(2bb
)(2ba
)(2br
)(2a\/
)(3\/
)...