Open drosehn opened 8 years ago
Side comment: to have nice syntax coloring for code, please use:
```cr code ```
So it will be shown like this:
# Octal \007 = \a in some contexts
tab_str = "MiscLabel"
new_str = "\e]1;%s\a" % tab_str
new_str =~ /\a/
Ah, yes, that's nicer than doing it on a per-line basis (as I did). And easier, too! I didn't realize that three "`"s in a row would preserve the newlines. Thanks!
I edited the comment and changed just the first block of code, to make sure I understood how to do it. And good that I did it, because I had overlooked the 'cr'!
It seems that `\a` is understood by PCRE as that codepoint... I don't know if there is anything we can do. To make it work we can either add `\a` to our escapes list (I wouldn't mind this) or implement our own regex library (kind of hard :-))
Comparing what is supported by 'echo' in bash, by strings in ruby, and by the PCRE package that crystal uses, I think that it would be good for crystal strings to include support for:
\a alert (bell) = "\x07"
\b backspace = "\x08"
\xHH the eight-bit character whose value is the
hexadecimal value HH (one or two hex digits)
On that last one, here's what crystal currently does with '\xHH':
# \x41 is the hex-value for 'A'
ch = "x41" ; chkstr = "==\x41=="
if chkstr =~ /\x41/ ; printf "Match %s on '%s'\n", ch, chkstr
else printf "Miss %s on '%s'\n", ch, chkstr ; end
ch = "x41a" ; chkstr = "==AA=="
if chkstr =~ /\x41/ ; printf "Match %s on '%s'\n", ch, chkstr
else printf "Miss %s on '%s'\n", ch, chkstr ; end
Results in:
Miss x41 on '==x41=='
Match x41a on '==AA=='
which shows that PCRE is replacing the \xHH with the hex-based character.
I also wonder if maybe crystal should complain about '\' followed by any alphabetic-character which crystal does not have special meaning for, in case it turns out you want to add a special meaning to that escape at some time in the future.
We don't support `\x..` because we want string literals to always be valid UTF-8 strings.
Maybe have crystal treat `\x..` in strings the same way PCRE treats `\u{..}` in regexes. Make it a compile-time error, and don't allow it.
As an update, the `\xHH` escape has been supported in string literals since 0.21.0.
pp "a\a" =~ /\a/, # => nil
"a\007" =~ /\007/, # => 1
"a\x07" =~ /\x07/ # => 1
# "a\u{7}" =~ /\u{7}/, # => PCRE does not support \L, \l, \N{name}, \U, or \u
Creating a `Regex` from a string instead of a regex literal uses Crystal's string escape rules, which means `\a` is consistently interpreted as an `a` character instead of `\007`. This also allows using `\u` in the regex.
pp "a\a" =~ Regex.new("\a"), # => 0
"a\007" =~ Regex.new("\007"), # => 1
"a\u{7}" =~ Regex.new("\u{7}"), # => 1
These results are actually all consistent except for `"a\a" =~ /\a/`, because of the different treatment of `\a`.
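As a minimal sketch of the `Regex.new` route, applied to the terminal-title string from the original report: building the pattern from a Crystal string lets the string literal rules resolve `\u{7}` before PCRE sees it. The variable name `bel` is only illustrative.

```cr
tab_str = "MiscLabel"
new_str = "\e]1;%s\u{7}" % tab_str

# The string literal is resolved by Crystal first, so PCRE receives the
# actual BEL character (U+0007) rather than the escape text `\u{7}`.
bel = Regex.new("\u{7}")

p new_str =~ bel # => 13
```

The match index is the same one the playground reports for the `\007` / `\a` combination, since the resulting string is identical.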
Could we just parse these regex literals as Crystal strings before giving them to PCRE instead of relaying them verbatim? This would also enable the use of `\u` escapes.
Of course, this would need some extra handling for regex special characters and such... but interpolation is already supported in regex literals, so they seem to be modified by the compiler anyway?
Could just add `\a` as a control sequence to Crystal string literals as well. I don't know... it doesn't seem to be as widely supported as the others, though.
It was added in C89. Ruby, Python etc. have it, too.
`\a` was also added recently.
That leaves `\u{...}` as the only remaining problem. We should probably parse the regex contents almost as a string literal, and then convert that to the equivalent PCRE string.
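One way to picture that conversion step, as a rough sketch only: a hypothetical helper (`expand_unicode_escapes` is an invented name, not part of Crystal or its compiler) that rewrites `\u{...}` in a pattern source into the literal character before handing it to `Regex.new`. It assumes every other backslash sequence should reach PCRE untouched.

```cr
# Hypothetical helper for illustration; not part of Crystal's standard library.
# Replaces each `\u{HEX}` in a regex source with the (regex-escaped) character
# it names, and leaves every other backslash sequence for PCRE to interpret.
def expand_unicode_escapes(source : String) : String
  # The second alternative consumes any other escaped character, so an
  # escaped backslash followed by `u{...}` is not mistaken for a `\u` escape.
  source.gsub(/\\u\{([0-9a-fA-F]{1,6})\}|\\./) do |match, m|
    if hex = m[1]?
      # `chr` raises for values that are not valid Unicode code points.
      Regex.escape(hex.to_i(16).chr.to_s)
    else
      match
    end
  end
end

# "\\u{7}" is the escape text as it would appear inside /.../
bel = Regex.new(expand_unicode_escapes("\\u{7}"))
p "a\u{7}" =~ bel # => 1
```

A real implementation inside the compiler would presumably work on the lexer's token stream rather than on a string, and would have to deal with interpolation as well; this only shows the shape of the translation.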
Support is available for some ECMAScript (aka JavaScript) escape sequences via two compile-time options. If PCRE2_ALT_BSUX is set, the sequence \x followed by { is not recognized. Only if \x is followed by two hexadecimal digits is it recognized as a character escape. Otherwise it is interpreted as a literal "x" character. In this mode, support for code points greater than 256 is provided by \u, which must be followed by four hexadecimal digits; otherwise it is interpreted as a literal "u" character.
PCRE2_EXTRA_ALT_BSUX has the same effect as PCRE2_ALT_BSUX and, in addition, \u{hhh..} is recognized as the character specified by hexadecimal code point. There may be any number of hexadecimal digits. This syntax is from ECMAScript 6.
Also see https://www.pcre.org/current/doc/html/pcre2api.html#extracompileoptions:
PCRE2_EXTRA_ALT_BSUX
The original option PCRE2_ALT_BSUX causes PCRE2 to process \U, \u, and \x in the way that ECMAscript (aka JavaScript) does. Additional functionality was defined by ECMAscript 6; setting PCRE2_EXTRA_ALT_BSUX has the effect of PCRE2_ALT_BSUX, but in addition it recognizes \u{hhh..} as a hexadecimal character code, where hhh.. is any number of hexadecimal digits.
We only have bindings for `LibPCRE2::ALT_BSUX`, not the extra compilation options yet, which are defined here for example; nor code to use the extra options themselves (`pcre2_compile_context_create`, `pcre2_set_compile_extra_options`).
Implementation-wise I don't think it should be much effort to add code that creates a compile context and sets the extra options.
But foremost we'll need to figure out how to transition the regex syntax to PCRE2 (#12857).
Here's me tripping over another minor difference between ruby and crystal.
Given the code:
# Octal \007 = \a in some contexts
tab_str = "MiscLabel"
new_str = "\e]1;%s\a" % tab_str
new_str =~ /\a/
Crystal playground will show:
"MiscLabel"
:String
"\e]1;MiscLabela"
:String
nil
:(Int32 | Nil)
In strings, ruby treats '\a' as the '\007' character, aka BEL. Crystal does not, which isn't necessarily a problem. But crystal makes use of some other library for doing regex's, and in that library '\a' is treated the same as '\007'. You can see that the string ends with an 'a', and the regex fails because it's trying to find /\a/. So let's say we change the string to use the octal value '\007':
tab_str = "MiscLabel"
new_str = "\e]1;%s\007" % tab_str
new_str =~ /\a/
Crystal playground now shows:
"MiscLabel"
:String
"\e]1;MiscLabel\u{7}"
:String
13
:(Int32 | Nil)
So, that works, if the user knows to use '\007' in strings, and '\a' in regex's. But the playground shows that character value as `\u{7}` when it displays the string, so maybe the cleaner and more consistent thing for the programmer to do is to use that form in both contexts. So try:
tab_str = "MiscLabel"
new_str = "\e]1;%s\u{7}" % tab_str
new_str =~ /\u{7}/
but that gives you:
"MiscLabel"
:String
"\e]1;MiscLabel\u{7}"
:String
Syntax error in :12: invalid regex: PCRE does not support \L, \l, \N{name}, \U, or \u at 1
(I'm cheating a little there, because you won't actually see those first two lines unless you comment-out the regex, and then uncomment it and re-run to get the error message).
I won't show it here, but "\007" works fine because it results in the same value in both contexts.
I happened to be copying some code which had '\a' so that's the one I know about. I didn't check any of the other less-common "\"-character codes. And for what it's worth, this was in code that I was copying from a bash shell, which also supports "\a" for "\007".