"\"-character differences between String vs Regex

drosehn commented 8 years ago

Here's me tripping over another minor difference between ruby and crystal.

Given the code:

#   Octal \007 = \a in some contexts
tab_str = "MiscLabel"
new_str = "\e]1;%s\a" % tab_str
new_str =~ /\a/

Crystal playground will show:

"MiscLabel" : String
"\e]1;MiscLabela" : String
nil : (Int32 | Nil)

In strings, ruby treats '\a' as the '\007' character, aka BEL. Crystal does not, which isn't necessarily a problem. But crystal makes use of some other library for doing regexes, and in that library '\a' is treated the same as '\007'. You can see that the string ends with a literal 'a', and the regex fails because it's trying to find the BEL character that /\a/ stands for. So let's say we change the string to use the octal value '\007':

tab_str = "MiscLabel"
new_str = "\e]1;%s\007" % tab_str
new_str =~ /\a/

Crystal playground now shows:

"MiscLabel" : String
"\e]1;MiscLabel\u{7}" : String
13 : (Int32 | Nil)

So, that works, if the user knows to use '\007' in strings, and '\a' in regexes. But the playground shows that character value as \u{7} when it displays the string, so maybe the cleaner and more consistent thing for the programmer to do is to use that form in both contexts. So try:

tab_str = "MiscLabel"
new_str = "\e]1;%s\u{7}" % tab_str
new_str =~ /\u{7}/

but that gives you:

"MiscLabel" : String
"\e]1;MiscLabel\u{7}" : String
Syntax error in :12: invalid regex: PCRE does not support \L, \l, \N{name}, \U, or \u at 1

(I'm cheating a little there, because you won't actually see those first two lines unless you comment-out the regex, and then uncomment it and re-run to get the error message).

I won't show it here, but "\007" works fine because it results in the same value in both contexts.

I happened to be copying some code which had '\a' so that's the one I know about. I didn't check any of the other less-common "\"-character codes. And for what it's worth, this was in code that I was copying from a bash shell, which also supports "\a" for "\007".

asterite commented 8 years ago

Side comment: to have nice syntax coloring for code, please use:

```cr
code
```

So it will be shown like this:

# Octal \007 = \a in some contexts
tab_str = "MiscLabel"
new_str = "\e]1;%s\a" % tab_str
new_str =~ /\a/

drosehn commented 8 years ago

Ah, yes, that's nicer than doing it on a per-line basis (as I did). And easier, too! I didn't realize that three "`"s in a row would preserve the newlines. Thanks!

drosehn commented 8 years ago

I edited the comment and changed just the first block of code, to make sure I understood how to do it. And good that I did it, because I had overlooked the 'cr'!

asterite commented 8 years ago

It seems that \a is understood by PCRE as that codepoint... I don't know if there is anything we can do. To make it work we can either add \a to our escapes list (I wouldn't mind this) or implement our own regex library (kind of hard :-))

drosehn commented 8 years ago

Comparing what is supported by 'echo' in bash, by strings in ruby, and by the PCRE package that crystal uses, I think that it would be good for crystal strings to include support for:

\a     alert (bell) = "\x07"
\b     backspace    = "\x08"
\xHH   the eight-bit character whose value is the
       hexadecimal value HH (one or two hex digits)

On that last one, here's what crystal currently does with '\xHH':

#  \x41 is the hex-value for 'A'
ch = "x41" ;  chkstr = "==\x41=="
if chkstr =~ /\x41/ ; printf "Match %s on '%s'\n", ch, chkstr
else                  printf "Miss  %s on '%s'\n", ch, chkstr ; end

ch = "x41a" ; chkstr = "==AA=="
if chkstr =~ /\x41/ ; printf "Match %s on '%s'\n", ch, chkstr
else                  printf "Miss  %s on '%s'\n", ch, chkstr ; end

Results in:

Miss  x41 on '==x41=='
Match x41a on '==AA=='

which shows that PCRE is replacing the \xHH with the hex-based character.

I also wonder if maybe crystal should complain about '\' followed by any alphabetic-character which crystal does not have special meaning for, in case it turns out you want to add a special meaning to that escape at some time in the future.

asterite commented 8 years ago

We don't support \x.. because we want string literals to always be valid UTF-8 strings.

drosehn commented 8 years ago

Maybe have crystal treat \x.. in strings the same way PCRE treats \u{..} in regexes: make it a compile-time error, and don't allow it.

straight-shoota commented 6 years ago

As an update, \xHH escape is supported in string literals since 0.21.0.

pp "a\a" =~ /\a/,         # => nil
   "a\007" =~ /\007/,     # => 1
   "a\x07" =~ /\x07/      # => 1
   # "a\u{7}" =~ /\u{7}/, # => PCRE does not support \L, \l, \N{name}, \U, or \u

Creating a Regex from a string instead of a regex literal uses Crystal string escape rules, which means \a is consistently interpreted as an a character instead of \007. This also allows using \u in the regex.

pp "a\a" =~ Regex.new("\a"),       # => 0
   "a\007" =~ Regex.new("\007"),   # => 1
   "a\u{7}" =~ Regex.new("\u{7}")  # => 1

These results are actually all consistent except for "a\a" =~ /\a/ because of the different treatment of \a.

Could we just parse these regex literals as Crystal strings before giving them to PCRE instead of relaying them verbatim? This would also enable the use of \u escapes. Of course, this would need some extra handling for regex special characters and such... but interpolation is already supported in regex literals, so they seem to be modified by the compiler anyway?

straight-shoota commented 6 years ago

Could just add \a as a control sequence to Crystal string literals as well. I don't know... it doesn't seem to be as widely supported as the others, though. It was added in C89. Ruby, Python etc. have it, too.

asterite commented 6 years ago

\a was also added recently.

That leaves \u{...} as the only remaining problem. We should probably parse the regex contents almost as a string literal, and then convert that to the equivalent pcre string.
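
As a rough illustration of that idea (only a sketch, not how the compiler does or would do it): the hypothetical helper `pcre_source` below rewrites \u{...} in regex source text into the \x{...} form that PCRE does accept, and the result is then passed to Regex.new. The helper name is made up for this example, and it deliberately ignores the case of an escaped backslash in front of the u.

```cr
# Hypothetical helper, not part of Crystal's standard library.
# Rewrites \u{HHHH} in regex source text to PCRE's \x{HHHH} form.
# Simplification: an escaped backslash before "u" (as in \\u{7}) is
# not detected and would be rewritten too.
def pcre_source(source : String) : String
  source.gsub(/\\u\{([0-9a-fA-F]+)\}/) { |_, match| "\\x{#{match[1]}}" }
end

# "\\u{7}" is the six characters \u{7}, i.e. what a user would type
# between the slashes of a regex literal.
translated = pcre_source("\\u{7}")
puts translated # prints \x{7}

# The translated source compiles and matches the BEL character.
re = Regex.new(translated)
p "a\u{7}" =~ re # => 1
```

A real implementation would have to walk the pattern escape by escape rather than use a single gsub, so that sequences PCRE already understands are passed through untouched.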

HertzDevil commented 1 year ago

PCRE2 accepts the \u... and \u{...} syntaxes if PCRE2_EXTRA_ALT_BSUX is set when a pattern is compiled:

Support is available for some ECMAScript (aka JavaScript) escape sequences via two compile-time options. If PCRE2_ALT_BSUX is set, the sequence \x followed by { is not recognized. Only if \x is followed by two hexadecimal digits is it recognized as a character escape. Otherwise it is interpreted as a literal "x" character. In this mode, support for code points greater than 256 is provided by \u, which must be followed by four hexadecimal digits; otherwise it is interpreted as a literal "u" character.

PCRE2_EXTRA_ALT_BSUX has the same effect as PCRE2_ALT_BSUX and, in addition, \u{hhh..} is recognized as the character specified by hexadecimal code point. There may be any number of hexadecimal digits. This syntax is from ECMAScript 6.

Also see https://www.pcre.org/current/doc/html/pcre2api.html#extracompileoptions:

PCRE2_EXTRA_ALT_BSUX

The original option PCRE2_ALT_BSUX causes PCRE2 to process \U, \u, and \x in the way that ECMAscript (aka JavaScript) does. Additional functionality was defined by ECMAscript 6; setting PCRE2_EXTRA_ALT_BSUX has the effect of PCRE2_ALT_BSUX, but in addition it recognizes \u{hhh..} as a hexadecimal character code, where hhh.. is any number of hexadecimal digits.

We only have bindings for LibPCRE2::ALT_BSUX, not the extra compilation options yet, which are defined here for example; nor code to use the extra options themselves (pcre2_compile_context_create, pcre2_set_compile_extra_options).

straight-shoota commented 1 year ago

Implementation-wise I don't think it should be much effort to create a compile context and set the extra options.

But foremost we'll need to figure out how to transition the regex syntax to PCRE2 (#12857).

crystal-lang / crystal

"\"-character differences between String vs Regex #3078