Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/

quotemeta() fails to quote literal non-word character under utf8 #10602

Closed p5pRT closed 12 years ago

p5pRT commented 14 years ago

Migrated from rt.perl.org#77654 (status was 'resolved')

Searchable as RT77654$

p5pRT commented 14 years ago

From mncharity@vendian.org

Created by mncharity@vendian.org

quotemeta() fails to quote a CENT SIGN when, using utf8, the string is created with a literal CENT SIGN character, instead of with \xA2 .

----

    use utf8;
    use Test;
    plan( tests => 21 );

    # Bug Synopsis

    # quotemeta() fails to quote a CENT SIGN when,
    # using utf8, the string is created with
    # a literal CENT SIGN character, instead of with \xA2 .

    ok("¢","\xA2"); # ok
    ok(quotemeta("\xA2"),"\\¢"); # ok

    ok(quotemeta("¢"),"\\¢"); # NOT OK
    ok(quotemeta("¢"),quotemeta("\xA2")); # NOT OK

    # Bug Demonstration

    my $a = "¢";
    my $b = "\xA2";
    ok($a,$b);
    ok(($a eq $b),1);
    ok(quotemeta($a),quotemeta($b)); # NOT OK
    my $quoted = "\\\xA2";
    ok("\\".$a,$quoted);
    ok("\\".$b,$quoted);
    ok(quotemeta($a),$quoted); # NOT OK
    ok(quotemeta($b),$quoted); # ok

    # Additional notes

    # CENT SIGN is \xA2
    ok("¢","\xA2");
    # CENT SIGN is not a word character
    ok("a"=~/\w/,1);
    ok("a"=~/\W/,"");
    ok("¢"=~/\p{IsWord}/,"");
    ok("¢"=~/\P{IsWord}/,1);
    ok("¢"=~/\w/,"");
    ok("¢"=~/\W/,1);
    # Regexps behave correctly
    my $s;
    $s = "¢"; $s =~ s/([^A-Za-z_0-9])/\\$1/g; ok($s,$quoted);
    $s = "¢"; $s =~ s/(\P{IsWord})/\\$1/g; ok($s,$quoted);
    $s = "¢"; $s =~ s/(\W)/\\$1/g; ok($s,$quoted);

----

    1..21
    # Running under perl version 5.012001 for linux
    # Current time local: Thu Sep 2 15:42:28 2010
    # Current time GMT: Thu Sep 2 19:42:28 2010
    # Using Test.pm version 1.25_02
    ok 1
    ok 2
    not ok 3
    # Test 3 got: "\xA2" (./bug.pl at line 14)
    # Expected: "\\\xA2"
    # ./bug.pl line 14 is: ok(quotemeta("¢"),"\\¢"); # NOT OK
    not ok 4
    # Test 4 got: "\xA2" (./bug.pl at line 15)
    # Expected: "\\\xA2"
    # ./bug.pl line 15 is: ok(quotemeta("¢"),quotemeta("\xA2")); # NOT OK
    ok 5
    ok 6
    not ok 7
    # Test 7 got: "\xA2" (./bug.pl at line 24)
    # Expected: "\\\xA2"
    # ./bug.pl line 24 is: ok(quotemeta($a),quotemeta($b)); # NOT OK
    ok 8
    ok 9
    not ok 10
    # Test 10 got: "\xA2" (./bug.pl at line 28)
    # Expected: "\\\xA2"
    # ./bug.pl line 28 is: ok(quotemeta($a),$quoted); # NOT OK
    ok 11
    ok 12
    ok 13
    ok 14
    ok 15
    ok 16
    ok 17
    ok 18
    ok 19
    ok 20
    ok 21

Perl Info

```
Flags:
    category=core
    severity=medium

Site configuration information for perl 5.12.1:

Configured by mncharity at Sun Jul 4 18:40:05 EDT 2010.

Summary of my perl5 (revision 5 version 12 subversion 1) configuration:

  Platform:
    osname=linux, osvers=2.6.32-22-generic, archname=x86_64-linux
    uname='linux pencil 2.6.32-22-generic #36-ubuntu smp thu jun 3 19:31:57 utc 2010 x86_64 gnulinux '
    config_args='-des -Dprefix=/usr/local/perl512'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=define, use64bitall=define, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2',
    cppflags='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
    ccversion='', gccversion='4.4.3', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -fstack-protector -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib /lib64 /usr/lib64
    libs=-lnsl -ldl -lm -lcrypt -lutil -lc
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
    libc=/lib/libc-2.11.1.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.11.1'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -L/usr/local/lib -fstack-protector'

Locally applied patches:

@INC for perl 5.12.1:
    /usr/local/perl512/lib/site_perl/5.12.1/x86_64-linux
    /usr/local/perl512/lib/site_perl/5.12.1
    /usr/local/perl512/lib/5.12.1/x86_64-linux
    /usr/local/perl512/lib/5.12.1
    .

Environment for perl 5.12.1:
    HOME=/home/mncharity
    LANG=en_US.utf8
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games
    PERL_BADLANG (unset)
    SHELL=/bin/bash
```
p5pRT commented 13 years ago

From @iabyn

On Thu, Sep 02, 2010 at 12:58:16PM -0700, Mitchell N Charity wrote:

quotemeta() fails to quote a CENT SIGN when, using utf8, the string is created with a literal CENT SIGN character, instead of with \xA2 .

This appears to be down to a difference in behaviour of quotemeta depending on whether the string is internally UTF-8 encoded or not.

For non-utf8 strings, all chars *except* isALNUM() are \\-escaped; in particular, chars with ords in the range 128-255 are always quoted.

For utf8 strings, chars with ord > 127 are never quoted. I think this is a bug that needs fixing, but can anyone confirm or deny? In particular this would be a significant change in behaviour, since currently the myriad of codepoints above 255 are not escaped, including "letters" from non-latin character ranges. I would assume that all these should be quoted.

The current docs make it clear that all chars except [A-Za-z_0-9] should be escaped.
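To see the distinction described above without depending on the source file's encoding, here is a minimal sketch that builds the same CENT SIGN twice, once with the internal UTF-8 flag off and once with it on, and dumps what quotemeta returns for each. The helper name `quoted_hex` is made up for illustration. On the 5.12.1 build from the report the two dumps would be expected to differ (the second lacking the 5C backslash); other Perl versions may well print the same thing for both.

```perl
use strict;
use warnings;

# Hex dump of quotemeta's result, so an added backslash (5C) is visible.
# quoted_hex is a hypothetical helper, not part of Perl.
sub quoted_hex {
    my ($s) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, quotemeta($s);
}

my $bytes    = "\xA2";      # CENT SIGN, internal UTF-8 flag off
my $upgraded = "\xA2";
utf8::upgrade($upgraded);   # same character, internal UTF-8 flag on

printf "equal=%d flags=%d/%d\n",
    ($bytes eq $upgraded)    ? 1 : 0,
    utf8::is_utf8($bytes)    ? 1 : 0,
    utf8::is_utf8($upgraded) ? 1 : 0;
print "bytes:    ", quoted_hex($bytes),    "\n";
print "upgraded: ", quoted_hex($upgraded), "\n";
```

The two strings compare equal with `eq`; only the internal representation differs, which is why the original test script sees `ok($a,$b)` pass while the quotemeta comparisons fail.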

-- Monto Blanco... scorchio!

p5pRT commented 13 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 13 years ago

From tchrist@perl.com

For utf8 strings, chars with ord > 127 are never quoted. I think this is a bug that needs fixing, but can anyone confirm or deny?

I believe Unicode makes some guarantees regarding the stability of the Pattern_Syntax set of characters for just such an occasion, but one should ask Karl for details there.

--tom

p5pRT commented 13 years ago

From tchrist@perl.com

For utf8 strings, chars with ord > 127 are never quoted. I think this is a bug that needs fixing, but can anyone confirm or deny?

I've been thinking about this a bit more, rereading UAX#31, UTS#18, and UTR#18. My first thought was that for Unicode, that all \W characters should be quotemeta'd no matter what block their code points fall into. (That may be a problem, though, as I'll explain below.)

I *think* that is what Dave is suggesting, not that merely all [^\x00-\x7F] also be quoted, since that would violate certain first principles of what is and is not a metacharacter in a Perl pattern: see elaboration underneath my signature.

But I have encountered a problem with that idea. Unicode defines certain characters as being Pattern_Syntax characters. It also defines certain characters as being Pattern_White_Space characters. It further guarantees that this set will never change, so that you can future-proof your program.

The important bits are from:

  http://unicode.org/reports/tr31/#Pattern_Syntax

  As of Unicode 4.1, two Unicode character properties are defined to provide for stable syntax: Pattern_White_Space and Pattern_Syntax. Particular pattern languages may, of course, override these recommendations, for example, by adding or removing other characters for compatibility with ASCII usage.

  For stability, the values of these properties are absolutely invariant, not changing with successive versions of Unicode. Of course, this does not limit the ability of the Unicode Standard to encode more symbol or whitespace characters, but the syntax and whitespace code points recommended for use in patterns will not change.

  When *generating* rules or patterns, all whitespace and syntax code points that are to be literals require quoting, using whatever quoting mechanism is available. For readability, it is recommended practice to quote or escape all literal whitespace and default ignorable code points as well.

There's more there, which should probably be studied before we do anything.

One would think that backslashing all \p{Pattern_Syntax} characters would be the right thing to do. There are 2417 Pattern_Syntax code points, all of which are in the BMP, none in the astral planes. But there is one code point which is both \w and yet considered pattern syntax; it's a \p{Lm} character:

    % unichars -c '\p{Pattern_Syntax}' '\w'
     ⸯ 11823 2E2F GC=Lm VERTICAL TILDE

I don't know whether that is a mistake or not. Karl?

There are also two code points that are Pattern_White_Space but not White_Space:

    % unichars -c '\p{Pattern_White_Space}' '\P{White_Space}'
     -- 8206 200E GC=Cf LEFT-TO-RIGHT MARK
     -- 8207 200F GC=Cf RIGHT-TO-LEFT MARK

Which I'm not sure what to make of.
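For readers without the unichars tool, queries like the two above can be approximated with a brute-force loop over the BMP. This is only a sketch: `bmp_matching` is a made-up name, it checks one code point at a time against each pattern, and unlike unichars without `-u` it does not filter out unassigned code points, so counts for other properties may not match the figures quoted in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: return the BMP code points whose character
# matches every one of the given patterns.
sub bmp_matching {
    my @patterns = @_;
    my @hits;
    CODEPOINT:
    for my $cp (0x0000 .. 0xFFFF) {
        next if $cp >= 0xD800 && $cp <= 0xDFFF;    # skip surrogates
        my $char = chr $cp;
        for my $re (@patterns) {
            next CODEPOINT unless $char =~ $re;
        }
        push @hits, $cp;
    }
    return @hits;
}

# Pattern_White_Space but not White_Space
my @marks = bmp_matching(qr/\p{Pattern_White_Space}/, qr/\P{White_Space}/);
printf "U+%04X\n", $_ for @marks;
```

Since Pattern_White_Space is one of the properties guaranteed never to change, this prints U+200E and U+200F, the same two marks listed above.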

For what it's worth (which probably is nothing), there are 63 \p{Default_Ignorable_Code_Point} chars in the BMP, or 49 if you discount \p{Cn} and HANGUL FILLER. There are rather more than that up in the astral planes because of the TAG and VARIATION SELECTOR stuff in the 0E0000 plane.

I believe there to be no changes to the sets of things I've talked about here for Unicode 6.0; at least, I could find none.

--tom

ELABORATION:

The reason Perl quotes all \W characters is because of first principles about what is and is not able to be a metacharacter in Perl patterns.

That principle is that, in patterns:

    *  a \w character never means anything special
    *  a \W character might mean something special

Whence it follows that

    *  backslashing a \w character might mean something special
    *  backslashing a \W character never means anything special

In point of fact, there are uniquely 12 and 12 only metacharacters in Perl regexes, the dirty dozen of:

    \ | ( ) [ { ^ $ * + ? .

The question becomes whether we want the flexibility to someday extend our set of metacharacters beyond those 12. The quotemeta behavior of backslashing any and all \W characters no matter what, while always leaving inviolate all \w characters, was designed to provide for that.
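For the ASCII range, where nobody disputes the behavior, the principle can be checked directly. The following is a small sketch, not a test from the Perl distribution: it confirms that quotemeta puts a backslash in front of each of the twelve metacharacters, that the backslashed form matches only the literal character, that ASCII \w characters are left alone, and that ASCII \W characters outside the dozen get quoted as well.

```perl
use strict;
use warnings;

my @dozen = ('\\', '|', '(', ')', '[', '{', '^', '$', '*', '+', '?', '.');

my $all_literal = 1;
for my $meta (@dozen) {
    my $quoted = quotemeta $meta;    # e.g. "." becomes "\."
    # The quoted form must be backslash + char, and must match the char literally.
    $all_literal = 0
        unless $quoted eq "\\$meta" && $meta =~ /\A$quoted\z/;
}
print "dirty dozen quoted and literal: ", ($all_literal ? "yes" : "no"), "\n";

# ASCII \w characters are left alone ...
print quotemeta("abc_123"), "\n";
# ... while ASCII \W characters that are not (yet) metacharacters are quoted too.
print quotemeta("~&<>"), "\n";
```

The last line is the "reservoir" in action: none of `~`, `&`, `<`, `>` means anything special in a Perl 5 pattern, yet quotemeta backslashes them anyway.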

We've never drawn upon our \W reservoir for other pattern matching operations in Perl5, but Perl6 has. For one thing, it uses "<expr>" for circumfix quoting of subrules, as in Perl5 one uses "(?&expr)", so both "<" and ">" are metachars. It also uses this notation for Unicode properties, with a colon in front of the property name:

    <:Letter>                           # \pL
    <:!Letter>                          # \PL

    <:East_Asian_Width<Narrow>>         # \p{EA=N}
    <:!Blk<ASCII>>                      # \P{Blk=ASCII}

For another thing, Perl6 uses "~" for matching nested subrules and uses "&" for conjunction. Both of those, and "|", can also be doubled, but doubling doesn't extend the set. The "~~" can be negated with a "!~~", but all those are still ASCII \W characters.

I do not know whether one can add new metacharacters in Perl6 patterns. I wouldn't put it past them, considering you can do so for regular operators, but I can't figure out whether you can. If somebody reading http://perlcabal.org/syn/S05.html can find something that says for sure that this either *is* or else that it is *not* possible, I'd be interested in knowing.

I have proposed that we adopt a way to specify character class union, intersection, and subtraction. The Unicode documents talk about these using simple + and -, which one can actually use in Perl when defining one's own property subroutines, like

    sub IsKana {
        return <<'END';
    +utf8::InHiragana
    +utf8::InKatakana
    -utf8::IsCn
    END
    }

which was used back before we had a proper Kana property (we now do).
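As a runnable illustration of the + and - notation, here is the same IsKana subroutine used from a pattern. The three sample code points are chosen for this sketch and are not from the thread: HIRAGANA LETTER A, KATAKANA LETTER KA, and LATIN CAPITAL LETTER A.

```perl
use strict;
use warnings;

# User-defined property: union of two blocks, minus unassigned code points.
sub IsKana {
    return <<'END';
+utf8::InHiragana
+utf8::InKatakana
-utf8::IsCn
END
}

for my $cp (0x3042, 0x30AB, 0x0041) {
    printf "U+%04X %s\n", $cp,
        chr($cp) =~ /\p{IsKana}/ ? "kana" : "not kana";
}
```

The subroutine has to live in the package where the pattern is compiled (here `main`), and its name has to start with `Is` or `In` for `\p{...}` to find it.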

Even if we did something like Java's character class set mechanics (as I have informally proposed), which uses [a[b]] for union, [a&&b] for intersection, and the ungainly [a&&[^b]] for subtraction because [a-b] was already taken, whatever we may elect to do would likely fall within square brackets and so follow a different ruleset.

The Unicode documents use a cleaner syntax than Java's for talking about these things, looking more like Perl6, although not quite the same; for example, Unicode has separate "--" and "~~" operators for set and symmetric difference respectively.

p5pRT commented 13 years ago

From @khwilliamson

Tom Christiansen wrote:

For utf8 strings, chars with ord > 127 are never quoted. I think this is a bug that needs fixing, but can anyone confirm or deny?

I've been thinking about this a bit more, rereading UAX#31, UTS#18, and UTR#18.

I believe that UTR18 and UTS18 are now the same document.

My first thought was that for Unicode, that all \W characters should be quotemeta'd no matter what block their code points fall into. (That may be a problem, though, as I'll explain below.)

I *think* that is what Dave is suggesting, not that merely all [^\x00-\x7F] also be quoted, since that would violate certain first principles of what is and is not a metacharacter in a Perl pattern: see elaboration underneath my signature.

But I have encountered a problem with that idea. Unicode defines certain characters as being Pattern_Syntax characters. It also defines certain characters as being Pattern_White_Space characters. It further guarantees that this set will never change, so that you can future-proof your program.

Note that some of the code points in the sets are still unassigned, so that gives Unicode some leeway to add things.

The important bits are from:

  http://unicode.org/reports/tr31/#Pattern_Syntax

  As of Unicode 4.1, two Unicode character properties are defined to provide for stable syntax: Pattern_White_Space and Pattern_Syntax. Particular pattern languages may, of course, override these recommendations, for example, by adding or removing other characters for compatibility with ASCII usage.

  For stability, the values of these properties are absolutely invariant, not changing with successive versions of Unicode. Of course, this does not limit the ability of the Unicode Standard to encode more symbol or whitespace characters, but the syntax and whitespace code points recommended for use in patterns will not change.

  When *generating* rules or patterns, all whitespace and syntax code points that are to be literals require quoting, using whatever quoting mechanism is available. For readability, it is recommended practice to quote or escape all literal whitespace and default ignorable code points as well.

There's more there, which should probably be studied before we do anything.

One would think that backslashing all \p{Pattern_Syntax} characters would be the right thing to do. There are 2417 Pattern_Syntax code points, all of which are in the BMP, none in the astral planes. But there is one code point which is both \w and yet considered pattern syntax; it's a \p{Lm} character:

    % unichars -c '\p{Pattern_Syntax}' '\w'
     ⸯ 11823 2E2F GC=Lm VERTICAL TILDE

I don't know whether that is a mistake or not. Karl?

I have emailed Unicode about this apparent discrepancy.

There are also two code points that are Pattern_White_Space but not White_Space:

    % unichars -c '\p{Pattern_White_Space}' '\P{White_Space}'
     -- 8206 200E GC=Cf LEFT-TO-RIGHT MARK
     -- 8207 200F GC=Cf RIGHT-TO-LEFT MARK

Which I'm not sure what to make of.

They are, however, default ignorable code points, so it is recommended that they be quoted. See the discussion in section 2.3 of #31. Some implementations might want to allow them; I imagine that is why they aren't pattern white space.

For what it's worth (which probably is nothing), there are 63 \p{Default_Ignorable_Code_Point} chars in the BMP, or 49 if you discount \p{Cn} and HANGUL FILLER. There are rather more than that up in the astral planes because of the TAG and VARIATION SELECTOR stuff in the 0E0000 plane.

I believe there to be no changes to the sets of things I've talked about here for Unicode 6.0; at least, I could find none.

--tom

ELABORATION:

The reason Perl quotes all \W characters is because of first principles about what is and is not able to be a metacharacter in Perl patterns.

That principle is that, in patterns:

    *  a \w character never means anything special
    *  a \W character might mean something special

Whence it follows that

    *  backslashing a \w character might mean something special
    *  backslashing a \W character never means anything special

In point of fact, there are uniquely 12 and 12 only metacharacters in Perl regexes, the dirty dozen of:

    \ | ( ) [ { ^ $ * + ? .

The question becomes whether we want the flexibility to someday extend our set of metacharacters beyond those 12. The quotemeta behavior of backslashing any and all \W characters no matter what, while always leaving inviolate all \w characters, was designed to provide for that.

We've never drawn upon our \W reservoir for other pattern matching operations in Perl5, but Perl6 has. For one thing, it uses "<expr>" for circumfix quoting of subrules, as in Perl5 one uses "(?&expr)", so both "<" and ">" are metachars. It also uses this notation for Unicode properties, with a colon in front of the property name:

    <:Letter>                           # \pL
    <:!Letter>                          # \PL

    <:East_Asian_Width<Narrow>>         # \p{EA=N}
    <:!Blk<ASCII>>                      # \P{Blk=ASCII}

For another thing, Perl6 uses "~" for matching nested subrules and uses "&" for conjunction. Both of those, and "|", can also be doubled, but doubling doesn't extend the set. The "~~" can be negated with a "!~~", but all those are still ASCII \W characters.

I do not know whether one can add new metacharacters in Perl6 patterns. I wouldn't put it past them, considering you can do so for regular operators, but I can't figure out whether you can. If somebody reading http://perlcabal.org/syn/S05.html can find something that says for sure that this either *is* or else that it is *not* possible, I'd be interested in knowing.

I have proposed that we adopt a way to specify character class union, intersection, and subtraction. The Unicode documents talk about these using simple + and -, which one can actually use in Perl when defining one's own property subroutines, like

    sub IsKana {
        return <<'END';
    +utf8::InHiragana
    +utf8::InKatakana
    -utf8::IsCn
    END
    }

which was used back before we had a proper Kana property (we now do).

Even if we did something like Java's character class set mechanics (as I have informally proposed), which uses [a[b]] for union, [a&&b] for intersection, and the ungainly [a&&[^b]] for subtraction because [a-b] was already taken, whatever we may elect to do would likely fall within square brackets and so follow a different ruleset.

The Unicode documents use a cleaner syntax than Java's for talking about these things, looking more like Perl6, although not quite the same; for example, Unicode has separate "--" and "~~" operators for set and symmetric difference respectively.

So I don't know what to do. This may be complicated by the fact that Perl botched what are considered identifiers. My guess from the comments is that it stems from the fact that Unicode botched the definition of alpha between v1.9 and 3.0.1. sprout has gone in in 5.13 and fixed the definition so that it doesn't hang the parser, but for backwards compatibility, it doesn't match the Unicode identifier definition, and that is somewhat bothersome to me.

The Unicode recommendation is to only quote the pattern white space and identifier characters plus the default ignorable code points. That means most controls would not get quoted.

I believe Tom has a better handle on the implications than me. I await his further ideas.

p5pRT commented 13 years ago

From tchrist@perl.com

SUMMARY: I believe that if nothing substantial can be gained by using the broader \W over what UAX#31 says to quote, we should use UAX#31's suggestions to implement quotemeta() and \Q on Unicode.

I also think we should use those suggestions if there were some error that \W might introduce. The code points I show below which are both on UAX#31's things to quote list but which also happen to be \w characters suggest that there may be.

Karl wrote:

One would think that backslashing all \p{Pattern_Syntax} characters would be the right thing to do. There are 2417 Pattern_Syntax code points, all of which are in the BMP, none in the astral planes. But there is one code point which is both \w and yet considered pattern syntax; it's a \p{Lm} character:

    % unichars -c '\p{Pattern_Syntax}' '\w'
     ⸯ 11823 2E2F GC=Lm VERTICAL TILDE

I don't know whether that is a mistake or not. Karl?

I have emailed Unicode about this apparent discrepancy.

Good, thank you.

So I don't know what to do. This may be complicated by the fact that Perl botched what are considered identifiers. My guess from the comments is that it stems from the fact that Unicode botched the definition of alpha between v1.9 and 3.0.1. sprout has gone in in 5.13 and fixed the definition so that it doesn't hang the parser, but for backwards compatibility, it doesn't match the Unicode identifier definition, and that is somewhat bothersome to me.

Could you please explain what that means, that Unicode botched the definition of alpha between v1.9 and 3.0.1?

My working definitions of an alpha and an identifier charclass in Java work out to these:

    alphabetic_charclass =
        "["
      +     "\\pL"            /* all Letters    */
      +     "\\pM"            /* all Marks      */
      +     "\\p{Nl}"         /* Letter Number  */
      + "]";

    identifier_charclass =
        "["
      +     "\\pL"          /* all Letters      */
      +     "\\pM"          /* all Marks        */
      +     "\\p{Nd}"       /* Decimal Number   */
      +     "\\p{Nl}"       /* Letter Number    */
      +     "\\p{Pc}"       /* Connector Punctuation           */
      +     "["             /*    or else chars which are both */
      +         "\\p{InEnclosedAlphanumerics}"
      +         "&&"        /*    and also      */
      +         "\\p{So}"   /* Other Symbol     */
      +     "]"
      + "]";

Now, that's not quite the way #31's section 2 reads, but it may be (close to) equivalent; I haven't checked. Hm, I'm pretty sure that I have ZWJ and ZWNJ issues there, something that I addressed in working out extended grapheme clusters but never backported to regular old identifier-class characters.

What part of the sense of "alpha" or "identifier" did Perl and Unicode part ways on? Is this perhaps only in Perl's parser, not in its notions of properties? Does it have to do with #31 section 2, or something else?

The Unicode recommendation is to only quote the pattern white space and identifier characters plus the default ignorable code points. That means most controls would not get quoted.

That would save on space.

That principle is that, in patterns:

    *  a \w character never means anything special
    *  a \W character might mean something special

Whence it follows that

    *  backslashing a \w character might mean something special
    *  backslashing a \W character never means anything special

In point of fact, there are uniquely 12 and 12 only metacharacters in Perl regexes, the dirty dozen of:

    \ | ( ) [ { ^ $ * + ? .

Modulo the problematic U+2E2F, I believe that quoting all \W characters is both safe and a proper superset of UAX#31. My only question is whether there is anything to be gained by reducing that superset down to quoting only those code points with any of

    Pattern_Syntax
    Pattern_White_Space
    Default_Ignorable_Code_Point

Let's for this discussion call those the Pattern_Quotable set, or PQ.
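To make the PQ idea concrete, here is a rough pure-Perl sketch of what quoting only that set could look like. The function name `quotemeta_pq` is invented for this illustration; it is not an existing Perl builtin and says nothing about how quotemeta itself is or would be implemented.

```perl
use strict;
use warnings;

# Hypothetical helper: backslash only code points in the PQ set, i.e.
# Pattern_Syntax, Pattern_White_Space, or Default_Ignorable_Code_Point.
sub quotemeta_pq {
    my ($s) = @_;
    $s =~ s/([\p{Pattern_Syntax}\p{Pattern_White_Space}\p{Default_Ignorable_Code_Point}])/\\$1/g;
    return $s;
}

# Show each result as hex so the added backslashes (5C) are visible.
for my $sample ("a.b c", "\xA2", "\x{3042}_", "\x01") {
    printf "%s\n", join ' ',
        map { sprintf '%04X', ord($_) } split //, quotemeta_pq($sample);
}
```

Under this sketch the CENT SIGN from the original report is quoted, because U+00A2 is a Pattern_Syntax code point, while a control character such as U+0001 is left alone, which is the difference from the "quote every \W" approach that Karl pointed out.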

The considerations are time and space. On space, there are certainly more \W characters than there are PQ characters. In the BMP:

    % unichars '[\p{Pattern_Syntax}\p{Pattern_White_Space}\p{Default_Ignorable_Code_Point}]' | wc -l
        2475

    % unichars '\W' | wc -l
        4137

Adding Unassigned, PrivateUse, Han, and InHangulSyllables produces a slight gain on the PQ set:

    % unichars -u '[\p{Pattern_Syntax}\p{Pattern_White_Space}\p{Default_Ignorable_Code_Point}]' | wc -l
        2832

And of course a substantial gain on the \W set:

    % unichars -u '\W' | wc -l
       13556

I am somewhat surprised to see more identifier characters in the PQ set than I had reckoned with:

    % unichars '[\p{Pattern_Syntax}\p{Pattern_White_Space}\p{Default_Ignorable_Code_Point}]' '\w'
       847 034F COMBINING GRAPHEME JOINER
      4447 115F HANGUL CHOSEONG FILLER
      4448 1160 HANGUL JUNGSEONG FILLER
      6155 180B MONGOLIAN FREE VARIATION SELECTOR ONE
      6156 180C MONGOLIAN FREE VARIATION SELECTOR TWO
      6157 180D MONGOLIAN FREE VARIATION SELECTOR THREE
     11823 2E2F VERTICAL TILDE
     12644 3164 HANGUL FILLER
     65024 FE00 VARIATION SELECTOR-1
     65025 FE01 VARIATION SELECTOR-2
     65026 FE02 VARIATION SELECTOR-3
     65027 FE03 VARIATION SELECTOR-4
     65028 FE04 VARIATION SELECTOR-5
     65029 FE05 VARIATION SELECTOR-6
     65030 FE06 VARIATION SELECTOR-7
     65031 FE07 VARIATION SELECTOR-8
     65032 FE08 VARIATION SELECTOR-9
     65033 FE09 VARIATION SELECTOR-10
     65034 FE0A VARIATION SELECTOR-11
     65035 FE0B VARIATION SELECTOR-12
     65036 FE0C VARIATION SELECTOR-13
     65037 FE0D VARIATION SELECTOR-14
     65038 FE0E VARIATION SELECTOR-15
     65039 FE0F VARIATION SELECTOR-16
     65440 FFA0 HALFWIDTH HANGUL FILLER

I believe Tom has a better handle on the implications than me.

Maybe\, maybe not.

I await his further ideas.

It would save us on space to make quotemeta work on the smaller PQ set than on the entire \W set, but would it save us anything on time?

I mean apart from the obvious that it takes time to allocate more stuff that would need quoting; I mean the time involved in looking up properties.

I don't really know much how the swatches work, nor the true costs of looking up properties in general, so I can't guess there. My instinct is to just quickly backslash any \W, but that's an ASCII-only instinct established before we had guidelines for the PQ set, let alone for valid identifier characters.

If there is nothing substantial to be gained by using the broader \W over the previously defined Pattern_Quotable set, then I think we should use PQ. I also think we should use PQ if there were some error that \W might introduce; the outliers that are both PQ and \w suggest there might be.

--tom

p5pRT commented 13 years ago

From @khwilliamson

Tom Christiansen wrote:

SUMMARY: I believe that if nothing substantial can be gained by using the broader \W over what UAX#31 says to quote, we should use UAX#31's suggestions to implement quotemeta() and \Q on Unicode.

I also think we should use those suggestions if there were some error that \W might introduce. The code points I show below which are both on UAX#31's things to quote list but which also happen to be \w characters suggest that there may be.

Karl wrote:

One would think that backslashing all \p{Pattern_Syntax} characters would be the right thing to do. There are 2417 Pattern_Syntax code points, all of which are in the BMP, none in the astral planes. But there is one code point which is both \w and yet considered pattern syntax; it's a \p{Lm} character:

    % unichars -c '\p{Pattern_Syntax}' '\w'
     ⸯ 11823 2E2F GC=Lm VERTICAL TILDE

I don't know whether that is a mistake or not. Karl?

I have emailed Unicode about this apparent discrepancy.

Good\, thank you.

I've gotten a (rapid) preliminary response. Their definition of \w appears to be flawed, and likely should be revised to exclude U+2E2F.

So I don't know what to do. This may be complicated by the fact that Perl botched what are considered identifiers. My guess from the comments is that it stems from the fact that Unicode botched the definition of alpha between v1.9 and 3.0.1. sprout has gone in in 5.13 and fixed the definition so that it doesn't hang the parser, but for backwards compatibility, it doesn't match the Unicode identifier definition, and that is somewhat bothersome to me.

Could you please explain what that means, that Unicode botched the definition of alpha between v1.9 and 3.0.1?

Here are my comments in mktables, added when I researched the problem:

    # The number of code points in \p{alpha} halved in 2.1.9. It turns out
    # that the reason is that the CJK block starting at 4E00 was removed
    # from PropList, and was not put back in until 3.1.0

And here are the comments from handy.h:

    /* The ID_Start of Unicode was originally quite limiting: it assumed an
     * L-class character (meaning that you could not have, say, a CJK
     * character). So, instead, perl has for a long time allowed ID_Continue
     * but not digits.
     * We still preserve that for backward compatibility. But we also make sure
     * that it is alphanumeric, so S_scan_word in toke.c will not hang. See
     * http://rt.perl.org/rt3/Ticket/Display.html?id=74022
     * for more detail than you ever wanted to know about. */

My working definitions of an alpha and an identifier charclass in Java work out to these:

    alphabetic_charclass =
        "["
      +     "\\pL"            /* all Letters    */
      +     "\\pM"            /* all Marks      */
      +     "\\p{Nl}"         /* Letter Number  */
      + "]";

    identifier_charclass =
        "["
      +     "\\pL"          /* all Letters      */
      +     "\\pM"          /* all Marks        */
      +     "\\p{Nd}"       /* Decimal Number   */
      +     "\\p{Nl}"       /* Letter Number    */
      +     "\\p{Pc}"       /* Connector Punctuation           */
      +     "["             /*    or else chars which are both */
      +         "\\p{InEnclosedAlphanumerics}"
      +         "&&"        /*    and also      */
      +         "\\p{So}"   /* Other Symbol     */
      +     "]"
      + "]";

Now, that's not quite the way #31's section 2 reads, but it may be (close to) equivalent; I haven't checked. Hm, I'm pretty sure that I have ZWJ and ZWNJ issues there, something that I addressed in working out extended grapheme clusters but never backported to regular old identifier-class characters.

What part of the sense of "alpha" or "identifier" did Perl and Unicode part ways on? Is this perhaps only in Perl's parser, not in its notions of properties? Does it have to do with #31 section 2, or something else?

See the comments above. Perl doesn't use IDStart at all. Instead it currently uses:

    #define isIDFIRST_utf8(p) \
        (is_utf8_idcont(p) && !is_utf8_digit(p) && is_utf8_alnum(p))

That definition has only been in place for some of the 5.13.X releases. Prior to that, the definition was:

    #define isIDFIRST_utf8(p) (is_utf8_idcont(p) && !is_utf8_digit(p))

This caused the parser to loop on some inputs. The details are in the trouble ticket mentioned above. My own view is that it would be better to move to the Unicode definition, but there is the backward compatibility issue.

The Unicode recommendation is to only quote the pattern white space and identifier characters plus the default ignorable code points. That means most controls would not get quoted.

That would save on space.

That principle is that, in patterns:

    *  a \w character never means anything special
    *  a \W character might mean something special

Whence it follows that

    *  backslashing a \w character might mean something special
    *  backslashing a \W character never means anything special

In point of fact, there are uniquely 12 and 12 only metacharacters
in Perl regexes, the dirty dozen of:

    \ | ( ) [ { ^ $ * + ? .
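As a rough illustration of the principle above, the sketch below compares the builtin quotemeta on ASCII text with a helper that backslashes only the twelve metacharacters. The name `quote_dozen` is hypothetical and exists only for this example; it is not part of Perl.

```perl
use strict;
use warnings;

# Hypothetical helper: backslash only the twelve metacharacters listed above.
sub quote_dozen {
    my ($str) = @_;
    (my $quoted = $str) =~ s/([\\|()\[\{^\$*+?.])/\\$1/g;
    return $quoted;
}

my $raw = 'cost: $5.00 (approx)';

# quotemeta backslashes every ASCII character outside [A-Za-z_0-9],
# including the colon and the spaces; quote_dozen leaves those alone.
print quotemeta($raw),   "\n";
print quote_dozen($raw), "\n";

# Either quoted form matches the original text literally.
print "quotemeta ok\n" if $raw =~ /^\Q$raw\E$/;
my $dozen = quote_dozen($raw);
print "quote_dozen ok\n" if $raw =~ /^$dozen$/;
```

Both forms are safe today; the difference only matters if a \W character outside the dozen ever gains a special meaning.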

Modulo the problematic U+2E2F, I believe that quoting all \W characters is both safe and a proper superset of UAX#31. My only question is whether there is anything to be gained by reducing that superset down to quoting only those code points with any of

Pattern_Syntax
Pattern_White_Space
Default_Ignorable_Code_Point

Let's for this discussion call those the Pattern_Quotable set, or PQ.

The considerations are time and space.

On space, there are certainly more \W characters than there are PQ characters. In the BMP:

% unichars '[\p{Pattern_Syntax}\p{Pattern_White_Space}\p{Default_Ignorable_Code_Point}]' | wc -l
    2475

% unichars '\W' | wc -l
    4137

Adding Unassigned, PrivateUse, Han, and InHangulSyllables produces a slight gain on the PQ set:

% unichars -u '[\p{Pattern_Syntax}\p{Pattern_White_Space}\p{Default_Ignorable_Code_Point}]' | wc -l
    2832

And of course a substantial gain on the \W set:

% unichars -u '\W' | wc -l
   13556

I am somewhat surprised to see more identifier characters in the PQ set than I had reckoned with:

% unichars '[\p{Pattern_Syntax}\p{Pattern_White_Space}\p{Default_Ignorable_Code_Point}]' '\w'
   847 034F COMBINING GRAPHEME JOINER
  4447 115F HANGUL CHOSEONG FILLER
  4448 1160 HANGUL JUNGSEONG FILLER
  6155 180B MONGOLIAN FREE VARIATION SELECTOR ONE
  6156 180C MONGOLIAN FREE VARIATION SELECTOR TWO
  6157 180D MONGOLIAN FREE VARIATION SELECTOR THREE
 11823 2E2F VERTICAL TILDE
 12644 3164 HANGUL FILLER
 65024 FE00 VARIATION SELECTOR-1
 65025 FE01 VARIATION SELECTOR-2
 65026 FE02 VARIATION SELECTOR-3
 65027 FE03 VARIATION SELECTOR-4
 65028 FE04 VARIATION SELECTOR-5
 65029 FE05 VARIATION SELECTOR-6
 65030 FE06 VARIATION SELECTOR-7
 65031 FE07 VARIATION SELECTOR-8
 65032 FE08 VARIATION SELECTOR-9
 65033 FE09 VARIATION SELECTOR-10
 65034 FE0A VARIATION SELECTOR-11
 65035 FE0B VARIATION SELECTOR-12
 65036 FE0C VARIATION SELECTOR-13
 65037 FE0D VARIATION SELECTOR-14
 65038 FE0E VARIATION SELECTOR-15
 65039 FE0F VARIATION SELECTOR-16
 65440 FFA0 HALFWIDTH HANGUL FILLER

There's something wrong if this includes only the first 16 variation selectors as all 256 are Default Ignorable. If you don't include the DI characters in PQ\, I suspect you get close to your reckoning.

I believe Tom has a better handle on the implications than me.

Maybe\, maybe not.

I await his further ideas.

It would save us on space to make quotemeta working on the smaller PQ set than on the entire \W set\, but would it save us anything on time?

I mean apart from the obvious that it takes time to allocate more stuff that would need quoting; I mean the time involved in looking up properties.

I don't really know much about how the swatches work, nor the true costs of looking up properties in general, so I can't guess there. My instinct is just to quickly backslash any \W, but that's an ASCII-only instinct established before we had guidelines for the PQ set, let alone for valid identifier characters.

If there is nothing substantial to be gained by using the broader \W over the previously defined Pattern_Quotable set, then I think we should use PQ. I also think we should use PQ if there were some error that \W might introduce; the outliers that are both PQ and \w suggest there might be.

I think the differences in time/space are in the noise. swashes aren't the right data structure to use anyway\, and I'm planning to replace them for 5.16\, but even if that doesn't happen in that release\, we shouldn't base this decision on something we intend to remove.

--tom

p5pRT commented 13 years ago

From @ikegami

On Thu, Dec 16, 2010 at 2:47 PM, Tom Christiansen <tchrist@perl.com> wrote:

In point of fact, there are uniquely 12 and 12 only metacharacters in Perl regexes, the dirty dozen of:

    \ | ( ) [ { ^ $ * + ? .

"-" and "^" are meta in certain positions.

    /[^a]/ vs /[\^a]/
    /[a-c]/ vs /[a\-c]/
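The position-dependent behaviour of "-" and "^" inside a bracketed character class can be checked with a few matches. This is only a small sketch; the sample strings are arbitrary.

```perl
use strict;
use warnings;

# Unescaped: "-" forms a range and a leading "^" negates the class.
print "range matches b\n"       if 'b' =~ /[a-c]/;
print "negated class skips a\n" unless 'a' =~ /[^a]/;

# Escaped: both are literal members of the class.
print "literal - does not match b\n" unless 'b' =~ /[a\-c]/;
print "literal - matches -\n"        if '-' =~ /[a\-c]/;
print "literal ^ matches ^\n"        if '^' =~ /[\^a]/;

# quotemeta backslashes both, so interpolated text stays literal in a class.
my $chars = quotemeta('^a-c');
print "quoted: $chars\n";
print "b not in quoted class\n" unless 'b' =~ /[$chars]/;
```

Because quotemeta backslashes every ASCII \W character, its output is safe both inside and outside a character class, even though the two contexts have different metacharacters.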

p5pRT commented 13 years ago

From tchrist@perl.com

"-" and "^" are meta in certain positions.

    /[^a]/ vs /[\^a]/
    /[a-c]/ vs /[a\-c]/

I elsewhere wrote that charclasses operate under different rules.

--tom

p5pRT commented 13 years ago

From @abigail

On Thu, Dec 16, 2010 at 12:47:21PM -0700, Tom Christiansen wrote:

In point of fact, there are uniquely 12 and 12 only metacharacters
in Perl regexes, the dirty dozen of:

    \ | ( ) [ { ^ $ * + ? .

I've always wondered why a lone } or ] does not need escaping (they're only special after an opening { or [ has been seen), but a lone ) does.

The question becomes whether we want the flexibility to someday extend
our set of metacharacters beyond those 12.  The quotemeta behavior of
backslashing any and all \W characters no matter what, while always
leaving inviolate all \w characters, was designed to provide for that.

We've never drawn upon our \W reservoir for other pattern matching
operations in Perl5, but Perl6 has.

And I don't think Perl5 ever will. There's so much code out there that doesn't escape \W characters outside of the dozen mentioned above (and if we see a newbie escaping a \W outside of the dozen, we pick on him), that it seems unlikely p5p will ever judge the advantages of using a new \W character for metapurposes to outweigh the downside of breaking code.

Abigail

p5pRT commented 13 years ago

From tchrist@perl.com

I've always wondered why a lone } or ] does not need escaping (they're only special after an opening { or [ has been seen)\, but a lone ) does.

So have I. It could be worse​: things like quantifiers still need escaping to be made literals even if they couldn't quantify something\, such as at the beginning of a string. A (poor) argument could be made that in such a position\, escaping isn't necessary to infer function\, and it seems to me some nasty regex dialects do just that. I certainly don't care for it.

And I don't think Perl5 ever will. There's so much code out there that doesn't escape \W characters outside of the dozen mentioned above (and if we see a newbie escaping a \W outside of the dozen, we pick on him),

Now that you mention it\, you're right\, we do. Hadn't thought of that.

--tom

p5pRT commented 13 years ago

From @iabyn

On Fri\, Dec 17\, 2010 at 08​:11​:15AM -0700\, Tom Christiansen wrote​:

I've always wondered why a lone } or ] does not need escaping (they're only special after an opening { or [ has been seen)\, but a lone ) does.

So have I. It could be worse​: things like quantifiers still need escaping to be made literals even if they couldn't quantify something\, such as at the beginning of a string. A (poor) argument could be made that in such a position\, escaping isn't necessary to infer function\, and it seems to me some nasty regex dialects do just that. I certainly don't care for it.

And I don't think Perl5 ever will. There's so much code out there that doesn't escape \W characters outside of the dozen mentioned above (and if we see a newbie escaping a \W outside of the dozen, we pick on him),

Now that you mention it\, you're right\, we do. Hadn't thought of that.

Ok. How about the following resolution​: we change it so that utf8 strings get chr(128)-chr(255) escaped\, so that it matches the non-utf8 case\, and leave chars > 255 unescaped. In some future world if chars > 255 start having special meaning to the regex engine\, then we start escaping them too.

-- Technology is dominated by two types of people​: those who understand what they do not manage\, and those who manage what they do not understand.

p5pRT commented 12 years ago

From @khwilliamson

On 12/29/2010 03​:57 AM\, Dave Mitchell wrote​:

On Fri\, Dec 17\, 2010 at 08​:11​:15AM -0700\, Tom Christiansen wrote​:

I've always wondered why a lone } or ] does not need escaping (they're only special after an opening { or [ has been seen)\, but a lone ) does.

So have I. It could be worse​: things like quantifiers still need escaping to be made literals even if they couldn't quantify something\, such as at the beginning of a string. A (poor) argument could be made that in such a position\, escaping isn't necessary to infer function\, and it seems to me some nasty regex dialects do just that. I certainly don't care for it.

And I don't think Perl5 ever will. There's so much code out there that doesn't escape \W characters outside of the dozen mentioned above (and if we see a newbie escaping a \W outside of the dozen, we pick on him),

Now that you mention it, you're right, we do. Hadn't thought of that.

Ok. How about the following resolution: we change it so that utf8 strings get chr(128)-chr(255) escaped, so that it matches the non-utf8 case, and leave chars > 255 unescaped. In some future world if chars > 255 start having special meaning to the regex engine, then we start escaping them too.

This proposal and all others died in 5.14 for lack of consensus. This leaves the Unicode bug extant for quotemeta\, and I would like to get it fixed. Tom has told me privately that he's ok with changing things to get consistent rules for UTF8- vs non-UTF8 encoded strings.

I'm thinking we should just do what the original trouble ticket asks for\, and what the documentation has always said\, and that is to quote everything that matches [^a-zA-Z0-9_]. This agrees with the first part of Dave's proposal\, but makes all above Latin1 chars also escaped.

I'm reopening this publicly now\, in order to try to get resolution in the next week or so\, so that we can do something for 5.16. Either proposal is easy to implement\, and fast in cpu cycles.

If we do this\, does that close the door on later changing to use the pattern syntax should it ever become necessary? I think that it doesn't. This thread included extensive discussion on that.
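For illustration only, the proposal to quote everything matching [^a-zA-Z0-9_] can be modelled in pure Perl. The sub name `quote_non_ascii_word` is hypothetical, and this is a sketch of the rule rather than the builtin's implementation; what the builtin quotemeta itself returns for non-ASCII input depends on the Perl version, so nothing is asserted about it here.

```perl
use strict;
use warnings;

# Hypothetical pure-Perl model of the proposal: backslash every character
# outside [A-Za-z0-9_], whatever its code point or internal encoding.
sub quote_non_ascii_word {
    my ($str) = @_;
    (my $quoted = $str) =~ s/([^A-Za-z0-9_])/\\$1/g;
    return $quoted;
}

my $native   = "\xA2";       # CENT SIGN, stored as a single byte
my $upgraded = "\xA2";
utf8::upgrade($upgraded);    # same character, UTF-8 internal encoding

print "same string\n" if $native eq $upgraded;
print "same quoting\n"
    if quote_non_ascii_word($native) eq quote_non_ascii_word($upgraded);
print "cent sign quoted\n" if quote_non_ascii_word($native) eq "\\\xA2";
```

The point of the rule is that two strings which compare equal get the same quoting, which is what the original report asked for.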

p5pRT commented 12 years ago

From @demerphq

On 29 December 2010 11:57, Dave Mitchell <davem@iabyn.com> wrote:

On Fri, Dec 17, 2010 at 08:11:15AM -0700, Tom Christiansen wrote:

I've always wondered why a lone } or ] does not need escaping (they're only special after an opening { or [ has been seen), but a lone ) does.

So have I. It could be worse: things like quantifiers still need escaping to be made literals even if they couldn't quantify something, such as at the beginning of a string. A (poor) argument could be made that in such a position, escaping isn't necessary to infer function, and it seems to me some nasty regex dialects do just that. I certainly don't care for it.

And I don't think Perl5 ever will. There's so much code out there that doesn't escape \W characters outside of the dozen mentioned above (and if we see a newbie escaping a \W outside of the dozen, we pick on him),

Now that you mention it, you're right, we do. Hadn't thought of that.

Ok. How about the following resolution​: we change it so that utf8 strings get chr(128)-chr(255) escaped\, so that it matches the non-utf8 case\, and leave chars > 255 unescaped. In some future world if chars > 255 start having special meaning to the regex engine\, then we start escaping them too.

I think it depends on what we want to do. If quotemeta() is intended for escaping content in a regex\, then we could make it escape ONLY known regex metacharacters\, which would mean very little gets escaped at all.

Some of the options are​:

1) make quotemeta() *not* escape codepoints>127 regardless
2) make quotemeta() escape codepoints>127 regardless
3) make quotemeta() only escape codepoints that are known meta characters.

In terms of back-compat your suggestion (2) or my suggestion (1) are the only viable choices...

BUT\, option 3 has some things to be said for it. Specifically\, its output would be parsed by the regex engine much more efficiently\, which is also why I think that option 1 has a slight edge over option 2.

The efficiency point is also why I think that escaping codepoints we know will never be part of Perl 5's internal regex engine syntax is a bad idea. So for me escaping ALL codepoints larger than 255 is a mistake.

Also, as an aside to the cc list: I do not think that what Unicode considers to be pattern syntax is particularly relevant to Perl. While it is something we should consider, just as we consider precedent by other regex engines, it is much like a judge in one jurisdiction encountering an unusual case using precedent from another jurisdiction: it may be useful advice, but it is not at all binding or authoritative. And lastly, Unicode is a moving target, I personally would have big reservations in using its definitions for something like this. We have, apparently, wasted a LOT of time trying to be compliant with Unicode, only to learn that the Unicode proposals don't make sense and then see them deprecated or changed over time. Case folding rules are a particular example that really makes me disinclined to treat Unicode as an authority on what Perl should do in the regex engine.

cheers\, Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 12 years ago

From @demerphq

On 6 February 2012 14:41, demerphq <demerphq@gmail.com> wrote:

1) make quotemeta() *not* escape codepoints>127 regardless
2) make quotemeta() escape codepoints>127 regardless

To be clear I meant codepoints where: 127 < codepoint < 256

Sorry for the extra mail...

Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 12 years ago

From tchrist@perl.com

Also, as an aside to the cc list: I do not think that what Unicode considers to be pattern syntax is particularly relevant to Perl. While it is something we should consider, just as we consider precedent by other regex engines, it is much like a judge in one jurisdiction encountering an unusual case using precedent from another jurisdiction: it may be useful advice, but it is not at all binding or authoritative. And lastly, Unicode is a moving target, I personally would have big reservations in using its definitions for something like this. We have, apparently, wasted a LOT of time trying to be compliant with Unicode, only to learn that the Unicode proposals don't make sense and then see them deprecated or changed over time. Case folding rules are a particular example that really makes me disinclined to treat Unicode as an authority on what Perl should do in the regex engine.

Disagree, several times over.

First of all, the Pattern_Syntax Unicode character property is *not* defined by UTS#18 on Unicode Regular Expressions, a document that is by virtue of being a UTS/UTR inherently informative in nature.

Rather, it is defined in UAX#44, the Unicode Character Database. That means it is part of the Unicode Standard itself. It is also a *normative* property, not an informative, contributory, or provisional one. So is Pattern_White_Space.

Lastly, both those properties are, like the names of the characters themselves, *immutable*. They are guaranteed to be closed sets that will never change. No new character will ever gain either of those two properties, nor shall any old character that has one of those immutable normative properties ever lose that property.

Now please stop repeating this nonsense about Unicode being a moving target. Things that can change are clearly marked as such, and things that cannot change similarly. You need to understand which is which, and why. Unicode has a very clear stability policy. Please familiarize yourself with it.

As for casefolding\, we have *not* "wasted a lot of time". But I am not free to waste my time explaining this just right now.

--tom

p5pRT commented 12 years ago

From @demerphq

On 6 February 2012 15:13, Tom Christiansen <tchrist@perl.com> wrote:

Also, as an aside to the cc list: I do not think that what Unicode considers to be pattern syntax is particularly relevant to Perl. While it is something we should consider, just as we consider precedent by other regex engines, it is much like a judge in one jurisdiction encountering an unusual case using precedent from another jurisdiction: it may be useful advice, but it is not at all binding or authoritative. And lastly, Unicode is a moving target, I personally would have big reservations in using its definitions for something like this. We have, apparently, wasted a LOT of time trying to be compliant with Unicode, only to learn that the Unicode proposals don't make sense and then see them deprecated or changed over time. Case folding rules are a particular example that really makes me disinclined to treat Unicode as an authority on what Perl should do in the regex engine.

Disagree, several times over.

First of all, the Pattern_Syntax Unicode character property is *not* defined by UTS#18 on Unicode Regular Expressions, a document that is by virtue of being a UTS/UTR inherently informative in nature.

Not sure how this is relevant.

Rather, it is defined in UAX#44, the Unicode Character Database. That means it is part of the Unicode Standard itself. It is also a *normative* property, not an informative, contributory, or provisional one. So is Pattern_White_Space.

Lastly, both those properties are, like the names of the characters themselves, *immutable*. They are guaranteed to be closed sets that will never change. No new character will ever gain either of those two properties, nor shall any old character that has one of those immutable normative properties ever lose that property.

Given that we reserve the right to add new regex meta characters if we wish, it seems like this supports my position. Or am I missing something here? (Probably)

Now please stop repeating this nonsense about Unicode being a moving target. Things that can change are clearly marked as such, and things that cannot change similarly. You need to understand which is which, and why. Unicode has a very clear stability policy. Please familiarize yourself with it.

I have personal experience with Unicode being a moving target. For instance the introduction of an upper case sharp-ess. Perhaps this is not relevant to the instant case, but that is not clear to me.

As for casefolding, we have *not* "wasted a lot of time". But I am not free to waste my time explaining this just right now.

It is entirely possible I am misinformed\, but this is my impression of what I recall of Karl's comments on this subject. Some of what I have heard on the subject makes me think I wasted some of my time trying to make it work properly.

Anyway\, as and when you have time I would like to hear more of your thoughts on this. No rush tho\, I am available sporadically this week due to family reasons.

cheers\, Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 12 years ago

From @nwc10

On Mon\, Feb 06\, 2012 at 05​:13​:47PM +0100\, demerphq wrote​:

On 6 February 2012 15:13, Tom Christiansen <tchrist@perl.com> wrote:

I have personal experience with Unicode being a moving target. For instance the introduction of an upper case sharp-ess. Perhaps this is not relevant to the instant case, but that is not clear to me.

Aspects of Unicode aren't fixed yet. Being on the bleeding edge of implementation means that one discovers things like:

  "ß" =~ /ss/i
  "ß" =~ /(s)(s)/i

um, has "issues" about what exactly $1 and $2 should be for the capturing variant, and that these may turn out not to be intuitive:

  "s"  =~ /^[^ß]/
  "ss" =~ /^[^ß]/
  "s"  =~ /^[^ß]/i
  "ss" =~ /^[^ß]/i

As for casefolding, we have *not* "wasted a lot of time". But I am not free to waste my time explaining this just right now.

It is entirely possible I am misinformed\, but this is my impression of what I recall of Karl's comments on this subject. Some of what I have heard on the subject makes me think I wasted some of my time trying to make it work properly.

Anyway\, as and when you have time I would like to hear more of your thoughts on this. No rush tho\, I am available sporadically this week due to family reasons.

In the passing mailing list traffic I didn't spot anything that made me think that any decisions the Unicode consortium took wasted anyone here's time as far as casefolding went.

Mainly it's that a lot of what they define is fundamentally *hard* to implement\, at least in any scalable performant fashion\, and as it's new\, we don't have a choice of existing implementations to steal from.

Nicholas Clark

p5pRT commented 12 years ago

From @khwilliamson

On 02/06/2012 10​:03 AM\, Nicholas Clark wrote​:

On Mon\, Feb 06\, 2012 at 05​:13​:47PM +0100\, demerphq wrote​:

On 6 February 2012 15:13, Tom Christiansen <tchrist@perl.com> wrote:

I have personal experience with Unicode being a moving target. For instance the introduction of an upper case sharp-ess. Perhaps this is not relevant to the instant case, but that is not clear to me.

Aspects of Unicode aren't fixed yet. Being on the bleeding edge of implementation means that one discovers things like:

 "ß" =~ /ss/i
 "ß" =~ /(s)(s)/i

um, has "issues" about what exactly $1 and $2 should be for the capturing variant, and that these may turn out not to be intuitive:

 "s"  =~ /^[^ß]/
 "ss" =~ /^[^ß]/
 "s"  =~ /^[^ß]/i
 "ss" =~ /^[^ß]/i

As for casefolding\, we have *not* "wasted a lot of time". But I am not free to waste my time explaining this just right now.

It is entirely possible I am misinformed\, but this is my impression of what I recall of Karl's comments on this subject. Some of what I have heard on the subject makes me think I wasted some of my time trying to make it work properly.

Anyway\, as and when you have time I would like to hear more of your thoughts on this. No rush tho\, I am available sporadically this week due to family reasons.

In the passing mailing list traffic I didn't spot anything that made me think that any decisions the Unicode consortium took wasted anyone here's time as far as casefolding went.

Mainly it's that a lot of what they define is fundamentally *hard* to implement\, at least in any scalable performant fashion\, and as it's new\, we don't have a choice of existing implementations to steal from.

Nicholas Clark

I believe it's decisions they haven't finalized yet. Indications are that they are backing away from suggesting that regexes use full case folding, because of things like the "ß" =~ /(s)(s)/i anomaly. What Yves may be referring to is that I've mentioned this several times on the list. But Unicode hasn't updated their TR18. I don't know what the hold up is. And what Tom meant is that TR18 is not a part of the Standard, but merely recommendations. (If it had been part of the Standard, they would be in a world of hurt with their ill-advised encoding of BELL to mean something other than what it has always meant; perhaps they would have taken better care to not break TR18 if it had been part of the standard. I note that it does introduce a bug into their own CLDR POSIX locales, as they have to use the term BELL there to mean U+0007)

If they do back away\, then perhaps we will have made wasted effort.

p5pRT commented 12 years ago

From @ikegami

On Mon, Feb 6, 2012 at 8:41 AM, demerphq <demerphq@gmail.com> wrote:

I think it depends on what we want to do. If quotemeta() is intended for escaping content in a regex\, then we could make it escape ONLY known regex metacharacters\, which would mean very little gets escaped at all.

That's not very safe. It prevents storing the escaped pattern and using it with a different version of Perl. It is not forward-compatible.

Some of the options are​:

1) make quotemeta() *not* escape codepoints>127 regardless
2) make quotemeta() escape codepoints>127 regardless
3) make quotemeta() only escape codepoints that are known meta characters.

Also mentioned was:

4) make quotemeta() escape some code-points above 127. (\W, \p{Pattern_syntax} or some other group to be determined).

Analysis: (worst-to-best)

(3) is the least forward-compatible.
(2) is forward-compatible for as long as we don't start using characters above 127 as "special escapes".
(1) is forward-compatible for as long as we don't start using characters above 127 as meta characters.
(4) is the most forward-compatible.

(3) is the least backward-compatible (e.g. it would no longer escape "&").
(2) and (4) are backward-compatible with characters below 127.
(1) is backward-compatible with characters below 127 and above 255.

(3) is the most dangerous, affecting characters below 127 (e.g. some might expect "&" to be escaped by quotemeta).
(2) and (4) only affect characters above 127.
(1) only affects characters for which behaviour was "undefined" (for lack of a better word).

(3) is faster than (1), (2) and (4) if you think the time spent parsing "\" is noticeable.

- Eric
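To make the difference between options (1) to (3) concrete, here is a rough pure-Perl model of each. The sub names `option1`, `option2`, `option3` and `show` are hypothetical, the 128..255 reading of options (1) and (2) follows Yves's clarification earlier in the thread, and option (4) is left out because its character set was still undecided. None of this is the builtin's actual code.

```perl
use strict;
use warnings;

# Hypothetical models of options (1)-(3); none of these is the real builtin.

# (1) escape ASCII non-word characters only; leave 128 and above alone
sub option1 {
    my ($s) = @_;
    $s =~ s/((?![A-Za-z0-9_])[\x00-\x7F])/\\$1/g;
    return $s;
}

# (2) as (1), but also escape code points 128..255
sub option2 {
    my ($s) = @_;
    $s =~ s/((?![A-Za-z0-9_])[\x00-\xFF])/\\$1/g;
    return $s;
}

# (3) escape only the twelve known metacharacters
sub option3 {
    my ($s) = @_;
    $s =~ s/([\\|()\[\{^\$*+?.])/\\$1/g;
    return $s;
}

# Render non-ASCII characters as \x{...} so the output is unambiguous.
sub show {
    my ($s) = @_;
    return join '',
        map { ord($_) > 126 ? sprintf('\x{%X}', ord $_) : $_ } split //, $s;
}

my $input = "a&b.\xA2";
print "input:   ", show($input), "\n";
print "option1: ", show(option1($input)), "\n";
print "option2: ", show(option2($input)), "\n";
print "option3: ", show(option3($input)), "\n";
```

On this input, only option (3) leaves "&" alone, and only option (2) backslashes the cent sign, which is the backward-compatibility trade-off described in the analysis.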

p5pRT commented 12 years ago

From @khwilliamson

On 02/06/2012 01​:19 PM\, Eric Brine wrote​:

On Mon, Feb 6, 2012 at 8:41 AM, demerphq <demerphq@gmail.com> wrote:

I think it depends on what we want to do. If quotemeta() is intended
for escaping content in a regex, then we could make it escape ONLY
known regex metacharacters, which would mean very little gets escaped
at all.

That's not very safe. It prevents storing the escaped pattern and using it with a different version of Perl. It is not forward-compatible.

Some of the options are:

1) make quotemeta() *not* escape codepoints>127 regardless
2) make quotemeta() escape codepoints>127 regardless
3) make quotemeta() only escape codepoints that are known meta
characters.

Also mentioned was:

4) make quotemeta() escape some code-points above 127. (\W, \p{Pattern_syntax} or some other group to be determined).

Analysis: (worst-to-best)

(3) is the least forward-compatible.
(2) is forward-compatible for as long as we don't start using characters above 127 as "special escapes".
(1) is forward-compatible for as long as we don't start using characters above 127 as meta characters.
(4) is the most forward-compatible.

(3) is the least backward-compatible (e.g. it would no longer escape "&").
(2) and (4) are backward-compatible with characters below 127.
(1) is backward-compatible with characters below 127 and above 255.

(3) is the most dangerous, affecting characters below 127 (e.g. some might expect "&" to be escaped by quotemeta).
(2) and (4) only affect characters above 127.
(1) only affects characters for which behaviour was "undefined" (for lack of a better word).

(3) is faster than (1), (2) and (4) if you think the time spent parsing "\" is noticeable.

- Eric

Thanks for the analysis. I'd like to throw this comment in from this thread last year from Abigail\, and Dave Mitchell's response​:

There's so much code out there that doesn't escape \W characters outside of the dozen mentioned above (and if we see a newbie escaping a \W outside of the dozen\, we pick on him)\, Now that you mention it\, you're right\, we do. Hadn't thought of that.

p5pRT commented 12 years ago

From @ikegami

On Mon, Feb 6, 2012 at 6:37 PM, Karl Williamson <public@khwilliamson.com> wrote:

Thanks for the analysis. I'd like to throw this comment in from this thread last year from Abigail\, and Dave Mitchell's response​:

There's so much code out there that doesn't escape \W characters outside of the dozen mentioned above (and if we see a newbie escaping a \W outside of the dozen\, we pick on him)\, Now that you mention it\, you're right\, we do. Hadn't thought of that.

Good point. From that\, one could conclude that being forward-compatible is not an important factor since there's so much existing regex code that isn't forward-compatible.

p5pRT commented 12 years ago

From @demerphq

On 6 February 2012 21:19, Eric Brine <ikegami@adaelis.com> wrote:

(3) is faster than (1)\, (2) and (4) if you think the time spent parsing "\" is noticeable.

I do not have stats to back me up\, but knowing how the code handles escapes I am pretty confident that a string with lots of unnecessarily escaped characters will be visibly slower than one without.

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 12 years ago

From @demerphq

On 7 February 2012 01:34, Eric Brine <ikegami@adaelis.com> wrote:

On Mon, Feb 6, 2012 at 6:37 PM, Karl Williamson <public@khwilliamson.com> wrote:

Thanks for the analysis. I'd like to throw this comment in from this thread last year from Abigail, and Dave Mitchell's response:

There's so much code out there that doesn't escape \W characters outside of the dozen mentioned above (and if we see a newbie escaping a \W outside of the dozen, we pick on him), Now that you mention it, you're right, we do. Hadn't thought of that.

Good point. From that\, one could conclude that being forward-compatible is not an important factor since there's so much existing regex code that isn't forward-compatible.

devils advocate​: Or is that it actually is forward compatible because it leaves escaped \W chars available for use by the regex engine?

:-)

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 12 years ago

From @abigail

On Tue\, Feb 07\, 2012 at 02​:22​:34AM +0100\, demerphq wrote​:

On 7 February 2012 01:34, Eric Brine <ikegami@adaelis.com> wrote:

On Mon, Feb 6, 2012 at 6:37 PM, Karl Williamson <public@khwilliamson.com> wrote:

Thanks for the analysis. I'd like to throw this comment in from this thread last year from Abigail, and Dave Mitchell's response:

There's so much code out there that doesn't escape \W characters outside of the dozen mentioned above (and if we see a newbie escaping a \W outside of the dozen, we pick on him), Now that you mention it, you're right, we do. Hadn't thought of that.

Good point. From that\, one could conclude that being forward-compatible is not an important factor since there's so much existing regex code that isn't forward-compatible.

devils advocate​: Or is that it actually is forward compatible because it leaves escaped \W chars available for use by the regex engine?

And break an ancient promise? [1] ;-)

  Unlike some other regular expression languages, there are no backslashed
  symbols that aren't alphanumeric. So anything that looks like \\,
  \(, \), \<, \>, \{, or \} is always interpreted as a literal character,
  not a metacharacter.

This is in the current manual page\, but the exact same phrasing already appears in the manual pages of perl-3.000.

[1] 22 years counts as ancient.

Abigail
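The promise quoted above is easy to check for the symbols it names. A minimal sketch follows; the loop and sample characters are only for illustration.

```perl
use strict;
use warnings;

# A backslashed non-alphanumeric is always the literal character.
for my $ch ('\\', '(', ')', '<', '>', '{', '}') {
    my $pattern = "\\$ch";    # a backslash followed by the symbol
    print "\\$ch matches literal $ch\n" if $ch =~ /^$pattern$/;
}

# A backslashed alphanumeric, by contrast, may be special.
print "\\d matches a digit, not the letter d\n"
    if '7' =~ /^\d$/ and 'd' !~ /^\d$/;
```

This is the guarantee that lets quotemeta backslash any non-word character without changing what the pattern matches.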

p5pRT commented 12 years ago

From @demerphq

On 7 February 2012 03:06, Abigail <abigail@abigail.be> wrote:

On Tue, Feb 07, 2012 at 02:22:34AM +0100, demerphq wrote:

On 7 February 2012 01:34, Eric Brine <ikegami@adaelis.com> wrote:

On Mon, Feb 6, 2012 at 6:37 PM, Karl Williamson <public@khwilliamson.com> wrote:

Thanks for the analysis. I'd like to throw this comment in from this thread last year from Abigail, and Dave Mitchell's response:

There's so much code out there that doesn't escape \W characters outside of the dozen mentioned above (and if we see a newbie escaping a \W outside of the dozen, we pick on him), Now that you mention it, you're right, we do. Hadn't thought of that.

Good point. From that, one could conclude that being forward-compatible is not an important factor since there's so much existing regex code that isn't forward-compatible.

devils advocate: Or is that it actually is forward compatible because it leaves escaped \W chars available for use by the regex engine?

And break an ancient promise? [1] ;-)

  Unlike some other regular expression languages, there are no backslashed
  symbols that aren't alphanumeric. So anything that looks like \\,
  \(, \), \<, \>, \{, or \} is always interpreted as a literal character,
  not a metacharacter.

This is in the current manual page\, but the exact same phrasing already appears in the manual pages of perl-3.000.

Yes right\, my bad. Did not think my post through before I sent it.

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 12 years ago

From @khwilliamson

On 02/06/2012 05​:34 PM\, Eric Brine wrote​:

On Mon\, Feb 6\, 2012 at 6:37 PM\, Karl Williamson \<public@khwilliamson.com> wrote:

Thanks for the analysis. I'd like to throw this comment in from this thread last year from Abigail\, and Dave Mitchell's response:

 >> There's so much code out there that doesn't escape \W characters outside of the dozen mentioned above (and if we see a newbie escaping a \W outside of the dozen\, we pick on him)\,
 > Now that you mention it\, you're right\, we do. Hadn't thought of that.

Good point. From that\, one could conclude that being forward-compatible is not an important factor since there's so much existing regex code that isn't forward-compatible.

I've looked over this thread now several times and re-read Unicode's UAX 31. I'll try to succinctly summarize the relevant portions of it.

It essentially suggests that characters that are \p{Pattern_Syntax} are the only ones that could ever have metacharacter meaning. A complete list of these is attached. This list is claimed to be absolutely stable. But you may note that there are several hundred unassigned code points in it. I expect and hope that this means that Unicode will only ever use those code points for characters that it thinks would be appropriate for use as metacharacters.

Unicode also defines a few characters (also attached\, and also completely stable) as \p{Pattern_White_Space}. It says that these should not appear in a pattern as literals unless escaped. Perl\, however\, allows some of these as unescaped literals\, except under /x. For the purposes of quotemeta\, which is what is being discussed here\, these should be escaped as well.

UAX 31 also suggests that for readability all other white space (6.1 list also attached) be escaped\, as well as all characters matching \p{Default_Ignorable_Code_Point} (6.1 list attached). Note that these are not stable\, and may grow over time\, and\, much less likely\, shrink. Many of the default ignorables (DI for short) are usually invisible in output\, so it is a good idea to escape them. They don't include the controls.
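To make the four properties in this summary concrete, here is a small sketch that reports which of them a given code point has. It assumes a perl whose Unicode tables know these properties (5.16 or later is a safe assumption); the names `@UAX31_PROPS` and `uax31_classes` are invented for this illustration.

```perl
use strict;
use warnings;

# The four properties discussed above, each paired with a precompiled test.
my @UAX31_PROPS = (
    [ Pattern_Syntax               => qr/\p{Pattern_Syntax}/ ],
    [ Pattern_White_Space          => qr/\p{Pattern_White_Space}/ ],
    [ White_Space                  => qr/\p{White_Space}/ ],
    [ Default_Ignorable_Code_Point => qr/\p{Default_Ignorable_Code_Point}/ ],
);

# Returns the names of those properties that the code point $cp has.
sub uax31_classes {
    my ($cp) = @_;
    my $ch = chr $cp;
    return map { $_->[0] } grep { $ch =~ $_->[1] } @UAX31_PROPS;
}

for my $cp (ord('+'), ord('a'), 0x00AD, 0x200E, 0x2028) {
    my @classes = uax31_classes($cp);
    printf "U+%04X: %s\n", $cp, @classes ? join(', ', @classes) : '(none)';
}
```

Because each test uses `\p{...}`, the match follows Unicode rules whether or not the string is UTF-8 encoded internally.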

If Perl is willing never to use anything other than a pattern syntax character as a metacharacter\, then we can reasonably use UAX 31 as a basis for quoting. If we decide\, as some suggest\, that we will never use anything other than what we have already used\, then it really doesn't matter\, but we do need to fix things so that non-utf8 encoded strings and utf8-encoded strings behave the same on the same code points.

Another reasonable basis is to use \W\, which\, as Tom pointed out earlier in this thread\, follows from first principles of Perl's own definition. Tom however also pointed out that there is a single character that matches \w yet is also pattern syntax\, and so could cause problems under a \W rule\, U+2E2F VERTICAL TILDE. When I emailed Unicode a year ago about it\, they said they would look into it; but nothing happened. I just reminded them. But regardless\, one of their responses indicated that they did not see any anomaly here (I refer you to Tom's post for details)\, so that even if they change this one\, they could encode new such characters in the future.
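The U+2E2F case can be examined with a short check. This is only a sketch, assuming the Unicode data shipped with the running perl still classifies VERTICAL TILDE as described in this thread; `compare_rules` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Reports how a \W-based rule and a Pattern_Syntax-based rule see a character.
sub compare_rules {
    my ($ch) = @_;
    return {
        word           => ($ch =~ /\w/                 ? 1 : 0),
        pattern_syntax => ($ch =~ /\p{Pattern_Syntax}/ ? 1 : 0),
    };
}

my $result = compare_rules("\x{2E2F}");    # VERTICAL TILDE
printf "U+2E2F: word=%d pattern_syntax=%d\n",
    $result->{word}, $result->{pattern_syntax};
```

A character for which both values are 1 would be left unquoted by a \W rule even though Unicode reserves it as possible pattern syntax.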

Thus\, I'm coming down to Tom's conclusion that if we do quoting based on code point properties\, that it would be better to use UAX 31 instead of \W.

People have talked about the speed of parsing quoted characters. But there is a cost that hasn't been mentioned\, which is the speed of figuring out by quotemeta if a code point should be quoted or not. It's much faster to just quote all code points above 127 or 255 than to have to parse and go out to disk to compute a swash. However\, I have code that is next in line to be fully smoked (having passed the quick smoke-me's) that allows for compile time inclusion of property definitions. I believe that using it would lower this cost to an acceptable level. There still would be a swash\, but its contents would be known at compile time.

I have now formulated the following proposal​:

Non-utf8 string\, not feature unicode_strings:
  quote \W ASCII range\, plus all code points 128-255;
  nothing else quoted.

Otherwise\,
  quote \W ASCII range\, plus all pattern syntax\, pattern white space\, regular white space\, and default ignorable code points;
  nothing else quoted.
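The two-branch rule above can be sketched in pure Perl. This is an illustration of the proposal only, not the code that went into perl; the function name `proposed_quotemeta` and its boolean `$unicode_rules` argument (standing in for "utf8 string or unicode_strings in effect") are assumptions made for the example.

```perl
use strict;
use warnings;

# Sketch of the proposed quoting rule.
# $unicode_rules false: quote every non-word code point in 0-255.
# $unicode_rules true:  quote ASCII non-word characters plus anything with
#                       one of the four Unicode properties.
sub proposed_quotemeta {
    my ($str, $unicode_rules) = @_;
    if ($unicode_rules) {
        $str =~ s/( (?![A-Za-z0-9_]) [\x00-\x7F]
                  | [\p{Pattern_Syntax}\p{Pattern_White_Space}\p{White_Space}\p{Default_Ignorable_Code_Point}]
                  )/\\$1/gx;
    }
    else {
        $str =~ s/( (?![A-Za-z0-9_]) [\x00-\xFF] )/\\$1/gx;
    }
    return $str;
}

for my $rules (0, 1) {
    for my $cp (ord('+'), 0xA2, 0xE9, 0x2028) {
        my $quoted = proposed_quotemeta(chr $cp, $rules);
        printf "rules=%d U+%04X: %s\n", $rules, $cp,
            length($quoted) == 2 ? "quoted" : "not quoted";
    }
}
```

Under the second branch an accented letter such as U+00E9 is left alone, which is the visible difference from the legacy branch in the Latin1 range.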

It may be that we decide we will never use anything outside the dozen we already do; but it seems to me to be prudent to not box ourselves in forever to this stance. Hence\, I think we should do some quoting of characters above ASCII.

This solution is completely backwards compatible in the ASCII range. It is completely backwards compatible in the Latin1 range provided you aren't using unicode_strings. unicode_strings was never advertised as applying to quotemeta\, but it seems like a reasonable extension of its use to me; another alternative would be to come up with yet another feature\, say 'quote_unicode_strings'.

The solution isn't backwards compatible above Latin1; nothing we do is\, unless we create a new feature.

p5pRT commented 12 years ago

From @khwilliamson

Pat_Syn

p5pRT commented 12 years ago

From @khwilliamson

# !!!!!!! DO NOT EDIT THIS FILE !!!!!!!
# This file is machine-generated by lib/unicore/mktables from the Unicode
# database\, Version 6.1.0. Any changes made here will be lost!

# !!!!!!! INTERNAL PERL USE ONLY !!!!!!!
# This file is for internal use by core Perl only. The format and even the
# name or existence of this file are subject to change without notice. Don't
# use it directly.

# Use Unicode::UCD::prop_invlist() to access the contents of this file.
#
# This file returns the 11 code points in Unicode Version 6.1.0 that match
# any of the following regular expression constructs:
#
#         \p{Pattern_White_Space=Yes}
#         \p{Pat_WS=Y}
#         \p{Pattern_White_Space=T}
#         \p{Pat_WS=True}
#
#         \p{Pattern_White_Space}
#         \p{Is_Pattern_White_Space}
#         \p{Pat_WS}
#         \p{Is_Pat_WS}
#
# perluniprops.pod should be consulted for the syntax rules for any of these\,
# including if adding or subtracting white space\, underscore\, and hyphen
# characters matters or doesn't matter\, and other permissible syntactic
# variants. Upper/lower case distinctions never matter.
#
# A colon can be substituted for the equals sign\, and anything to the left of
# the equals (or colon) can be combined with anything to the right. Thus\,
# for example\,
#         \p{Pat_WS: Yes}
# is also valid.
#
# The format of the lines of this file is: START\tSTOP\twhere START is the
# starting code point of the range\, in hex; STOP is the ending point\, or if
# omitted\, the range has just one code point. Numbers in comments in
# [brackets] indicate how many code points are in the range.

return \<\<'END' =~ s/\s*#.*//mgr;
0009 # CHARACTER TABULATION
000A # LINE FEED (LF)
000B # LINE TABULATION
000C # FORM FEED (FF)
000D # CARRIAGE RETURN (CR)
0020 # ' ' SPACE
0085 # NEXT LINE (NEL)
200E # LEFT-TO-RIGHT MARK
200F # RIGHT-TO-LEFT MARK
2028 # LINE SEPARATOR
2029 # PARAGRAPH SEPARATOR
END
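The file header points to `Unicode::UCD::prop_invlist()` as the supported way to read these tables. The following sketch counts the code points in a property from its inversion list; it assumes a perl new enough to ship `prop_invlist` (5.16 or later is a safe assumption), and `count_code_points` is a name invented for the example.

```perl
use strict;
use warnings;
use Unicode::UCD qw(prop_invlist);

# Counts the code points matched by a property, using its inversion list.
# Even-indexed elements start a matching range; odd-indexed elements are the
# first code point after that range. An unpaired final start runs to the end
# of the Unicode code space.
sub count_code_points {
    my ($property) = @_;
    my @invlist = prop_invlist($property);
    my $count = 0;
    for (my $i = 0; $i < @invlist; $i += 2) {
        my $start = $invlist[$i];
        my $end   = $i + 1 < @invlist ? $invlist[ $i + 1 ] : 0x110000;
        $count += $end - $start;
    }
    return $count;
}

printf "%s: %d code points\n", $_, count_code_points($_)
    for qw(Pattern_White_Space Pattern_Syntax);
```

Since Unicode declares Pattern_White_Space stable, its count should agree with the 11 entries listed in the attachment regardless of the Unicode version in use.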

p5pRT commented 12 years ago

From @khwilliamson

# !!!!!!! DO NOT EDIT THIS FILE !!!!!!!
# This file is machine-generated by lib/unicore/mktables from the Unicode
# database\, Version 6.1.0. Any changes made here will be lost!

# !!!!!!! INTERNAL PERL USE ONLY !!!!!!!
# This file is for internal use by core Perl only. The format and even the
# name or existence of this file are subject to change without notice. Don't
# use it directly.

# Use Unicode::UCD::prop_invlist() to access the contents of this file.
#
# This file returns the 26 code points in Unicode Version 6.1.0 that match
# any of the following regular expression constructs:
#
#         \p{White_Space=Yes}
#         \p{WSpace=Y}
#         \p{Space=T}
#         \p{White_Space=True}
#
#         \p{White_Space}
#         \p{Is_White_Space}
#         \p{WSpace}
#         \p{Is_WSpace}
#
#         \p{Space}
#         \p{XPosixSpace}
#         \p{Is_Space}
#         \p{Is_XPosixSpace}
#
# Meaning: \s including beyond ASCII plus vertical tab
#
# perluniprops.pod should be consulted for the syntax rules for any of these\,
# including if adding or subtracting white space\, underscore\, and hyphen
# characters matters or doesn't matter\, and other permissible syntactic
# variants. Upper/lower case distinctions never matter.
#
# A colon can be substituted for the equals sign\, and anything to the left of
# the equals (or colon) can be combined with anything to the right. Thus\,
# for example\,
#         \p{Space: Yes}
# is also valid.
#
# The format of the lines of this file is: START\tSTOP\twhere START is the
# starting code point of the range\, in hex; STOP is the ending point\, or if
# omitted\, the range has just one code point. Numbers in comments in
# [brackets] indicate how many code points are in the range.

return \<\<'END' =~ s/\s*#.*//mgr;
0009 # CHARACTER TABULATION
000A # LINE FEED (LF)
000B # LINE TABULATION
000C # FORM FEED (FF)
000D # CARRIAGE RETURN (CR)
0020 # ' ' SPACE
0085 # NEXT LINE (NEL)
00A0 # NO-BREAK SPACE
1680 # OGHAM SPACE MARK
180E # MONGOLIAN VOWEL SEPARATOR
2000 # EN QUAD
2001 # EM QUAD
2002 # EN SPACE
2003 # EM SPACE
2004 # THREE-PER-EM SPACE
2005 # FOUR-PER-EM SPACE
2006 # SIX-PER-EM SPACE
2007 # FIGURE SPACE
2008 # PUNCTUATION SPACE
2009 # THIN SPACE
200A # HAIR SPACE
2028 # LINE SEPARATOR
2029 # PARAGRAPH SEPARATOR
202F # NARROW NO-BREAK SPACE
205F # MEDIUM MATHEMATICAL SPACE
3000 # IDEOGRAPHIC SPACE
END

p5pRT commented 12 years ago

From @khwilliamson

# !!!!!!! DO NOT EDIT THIS FILE !!!!!!!
# This file is machine-generated by lib/unicore/mktables from the Unicode
# database\, Version 6.1.0. Any changes made here will be lost!

# !!!!!!! INTERNAL PERL USE ONLY !!!!!!!
# This file is for internal use by core Perl only. The format and even the
# name or existence of this file are subject to change without notice. Don't
# use it directly.

# Use Unicode::UCD::prop_invlist() to access the contents of this file.
#
# This file returns the 4167 code points in Unicode Version 6.1.0 that match
# any of the following regular expression constructs:
#
#         \p{Default_Ignorable_Code_Point=Yes}
#         \p{DI=Y}
#         \p{Default_Ignorable_Code_Point=T}
#         \p{DI=True}
#
#         \p{Default_Ignorable_Code_Point}
#         \p{Is_Default_Ignorable_Code_Point}
#         \p{DI}
#         \p{Is_DI}
#
# perluniprops.pod should be consulted for the syntax rules for any of these\,
# including if adding or subtracting white space\, underscore\, and hyphen
# characters matters or doesn't matter\, and other permissible syntactic
# variants. Upper/lower case distinctions never matter.
#
# A colon can be substituted for the equals sign\, and anything to the left of
# the equals (or colon) can be combined with anything to the right. Thus\,
# for example\,
#         \p{DI: Yes}
# is also valid.
#
# The format of the lines of this file is: START\tSTOP\twhere START is the
# starting code point of the range\, in hex; STOP is the ending point\, or if
# omitted\, the range has just one code point. Numbers in comments in
# [brackets] indicate how many code points are in the range.

return \<\<'END' =~ s/\s*#.*//mgr;
00AD # SOFT HYPHEN
034F # COMBINING GRAPHEME JOINER
115F # HANGUL CHOSEONG FILLER
1160 # HANGUL JUNGSEONG FILLER
17B4 # KHMER VOWEL INHERENT AQ
17B5 # KHMER VOWEL INHERENT AA
180B # MONGOLIAN FREE VARIATION SELECTOR ONE
180C # MONGOLIAN FREE VARIATION SELECTOR TWO
180D # MONGOLIAN FREE VARIATION SELECTOR THREE
200B # ZERO WIDTH SPACE
200C # ZERO WIDTH NON-JOINER
200D # ZERO WIDTH JOINER
200E # LEFT-TO-RIGHT MARK
200F # RIGHT-TO-LEFT MARK
202A # LEFT-TO-RIGHT EMBEDDING
202B # RIGHT-TO-LEFT EMBEDDING
202C # POP DIRECTIONAL FORMATTING
202D # LEFT-TO-RIGHT OVERRIDE
202E # RIGHT-TO-LEFT OVERRIDE
2060 # WORD JOINER
2061 # FUNCTION APPLICATION
2062 # INVISIBLE TIMES
2063 # INVISIBLE SEPARATOR
2064 # INVISIBLE PLUS
2065 2069 # Unassigned\, block=General_Punctuation [5]
206A # INHIBIT SYMMETRIC SWAPPING
206B # ACTIVATE SYMMETRIC SWAPPING
206C # INHIBIT ARABIC FORM SHAPING
206D # ACTIVATE ARABIC FORM SHAPING
206E # NATIONAL DIGIT SHAPES
206F # NOMINAL DIGIT SHAPES
3164 # HANGUL FILLER
FE00 # VARIATION SELECTOR-1
FE01 # VARIATION SELECTOR-2
FE02 # VARIATION SELECTOR-3
FE03 # VARIATION SELECTOR-4
FE04 # VARIATION SELECTOR-5
FE05 # VARIATION SELECTOR-6
FE06 # VARIATION SELECTOR-7
FE07 # VARIATION SELECTOR-8
FE08 # VARIATION SELECTOR-9
FE09 # VARIATION SELECTOR-10
FE0A # VARIATION SELECTOR-11
FE0B # VARIATION SELECTOR-12
FE0C # VARIATION SELECTOR-13
FE0D # VARIATION SELECTOR-14
FE0E # VARIATION SELECTOR-15
FE0F # VARIATION SELECTOR-16
FEFF # ZERO WIDTH NO-BREAK SPACE
FFA0 # HALFWIDTH HANGUL FILLER
FFF0 FFF8 # Unassigned\, block=Specials [9]
1D173 # MUSICAL SYMBOL BEGIN BEAM
1D174 # MUSICAL SYMBOL END BEAM
1D175 # MUSICAL SYMBOL BEGIN TIE
1D176 # MUSICAL SYMBOL END TIE
1D177 # MUSICAL SYMBOL BEGIN SLUR
1D178 # MUSICAL SYMBOL END SLUR
1D179 # MUSICAL SYMBOL BEGIN PHRASE
1D17A # MUSICAL SYMBOL END PHRASE
E0000 # Unassigned\, block=Tags
E0001 # LANGUAGE TAG
E0002 E001F # Unassigned\, block=Tags [30]
E0020 E007F # TAG SPACE..CANCEL TAG [96]
E0080 E00FF # Unassigned\, block=No_Block [128]
E0100 E01EF # VARIATION SELECTOR-17..VARIATION SELECTOR-256 [240]
E01F0 E0FFF # Unassigned\, block=No_Block [3600]
END

p5pRT commented 12 years ago

From tchrist@perl.com

I never thought to check unassigned code points for properties. Hadn't realized there were 308 unassigned code points that already counted as PatSyn even though we don't know what they are yet. That now makes more sense as to how they can have an immutable set​: they carved out a fixed place to grow into.

No room in PatWS\, but LRM and RLM are \S.

(Well\, so is \cK\, but that's only because we haven't fixed that yet to make it white space in Perl the way it is in Unicode. Larry said he thought we should\, because it seemed like a bug that Perl's WS != Unicode's WS.)

Not sure what all the unassigned DI code points up in E0080–E00FF or E01F0–E0FFF are meant to be used for someday; more variation selectors\, maybe?

--tom

p5pRT commented 12 years ago

From @nwc10

On Tue\, Feb 07\, 2012 at 12​:22​:30PM -0700\, Karl Williamson wrote​:

This solution is completely backwards compatible in the ASCII range. It is completely backwards compatible in the Latin1 range provided you aren't using unicode_strings. unicode_strings was never advertised as applying to quotemeta\, but it seems like a reasonable extension of its use to me; another alternative would be to come up with yet another feature\, say 'quote_unicode_strings'.

I don't see the approach of "yet another feature" as scaling. We'd likely as not be adding one new feature each year (per major release) as we find another small thing we'd like to regularise the behaviour of.

The solution isn't backwards compatible above Latin1; nothing we do is\, unless we create a new feature.

"backwards" compatible or "bugwards" compatible? I'm finding it hard to think of a use case where it's going to make a difference whether quotemeta("£") is "£" or "\£"\, other than golden results in tests.

Nicholas Clark
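The point about "£" versus "\£" can be illustrated with a short check that both forms behave the same when used as a pattern. This is a sketch only; `matches_pound` is a hypothetical helper, and the pound sign is written as `\xA3` to avoid depending on the source file's encoding.

```perl
use strict;
use warnings;

my $pound = "\xA3";    # POUND SIGN

# Returns 1 if the given pattern text matches exactly one pound sign.
sub matches_pound {
    my ($pattern_text) = @_;
    return $pound =~ /\A$pattern_text\z/ ? 1 : 0;
}

my $unquoted = "\xA3";        # what quotemeta would return without quoting
my $quoted   = "\\\xA3";      # what quotemeta would return with quoting

printf "unquoted form matches: %d\n", matches_pound($unquoted);
printf "quoted form matches:   %d\n", matches_pound($quoted);
```

Both forms match because a backslash before a non-alphanumeric character always yields that literal character, so the difference shows up only when the quoted string itself is compared, as in test expectations.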

p5pRT commented 12 years ago

From @khwilliamson

On 02/08/2012 04​:36 AM\, Nicholas Clark wrote​:

On Tue\, Feb 07\, 2012 at 12​:22​:30PM -0700\, Karl Williamson wrote​:

This solution is completely backwards compatible in the ASCII range. It is completely backwards compatible in the Latin1 range provided you aren't using unicode_strings. unicode_strings was never advertised as applying to quotemeta\, but it seems like a reasonable extension of its use to me; another alternative would be to come up with yet another feature\, say 'quote_unicode_strings'.

I don't see the approach of "yet another feature" as scaling. We'd likely as not be adding one new feature each year (per major release) as we find another small thing we'd like to regularise the behaviour of.

I was hoping that would be people's sentiment about this. :)

The solution isn't backwards compatible above Latin1; nothing we do is\, unless we create a new feature.

"backwards" compatible or "bugwards" compatible? I'm finding it hard to think of a use case where it's going to make a difference whether quotemeta("£") is "£" or "\£"\, other than golden results in tests.

Totally agree.

So yet another option is to just fix the Unicode bug portion of this for now.

We could use unicode_strings as a flag for the upper Latin1 range characters. If it is off\, we treat them as we've always treated them​: quote them.

If it is on\, we treat them as we've always treated above-Latin1 range characters​: don't quote them.

Thus the only inconsistency is between non-unicode_strings and unicode_strings\, and we could leave for another time worrying about which of these we really want to quote going forwards.
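The narrower alternative described here can also be sketched in a few lines. As before this is an illustration under stated assumptions, not perl's implementation: `interim_quotemeta` is an invented name and its second argument stands in for whether `unicode_strings` is in effect.

```perl
use strict;
use warnings;

# Sketch of the interim rule: ASCII non-word characters are always quoted,
# upper Latin1 (128-255) is quoted only when unicode_strings is off, and
# nothing above Latin1 is ever quoted.
sub interim_quotemeta {
    my ($str, $unicode_strings) = @_;
    if ($unicode_strings) {
        $str =~ s/((?![A-Za-z0-9_])[\x00-\x7F])/\\$1/g;
    }
    else {
        $str =~ s/((?![A-Za-z0-9_])[\x00-\xFF])/\\$1/g;
    }
    return $str;
}

for my $flag (0, 1) {
    for my $cp (ord('.'), 0xA3, 0x263A) {
        my $quoted = interim_quotemeta(chr $cp, $flag);
        printf "unicode_strings=%d U+%04X: %s\n", $flag, $cp,
            length($quoted) == 2 ? "quoted" : "not quoted";
    }
}
```

The result depends only on the flag and the code point, never on how the string happens to be encoded internally, which is the inconsistency the original report complained about.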

p5pRT commented 12 years ago

From @khwilliamson

On 02/08/2012 10​:23 AM\, Karl Williamson wrote​:

On 02/08/2012 04​:36 AM\, Nicholas Clark wrote​:

On Tue\, Feb 07\, 2012 at 12​:22​:30PM -0700\, Karl Williamson wrote​:

This solution is completely backwards compatible in the ASCII range. It is completely backwards compatible in the Latin1 range provided you aren't using unicode_strings. unicode_strings was never advertised as applying to quotemeta\, but it seems like a reasonable extension of its use to me; another alternative would be to come up with yet another feature\, say 'quote_unicode_strings'.

I don't see the approach of "yet another feature" as scaling. We'd likely as not be adding one new feature each year (per major release) as we find another small thing we'd like to regularise the behaviour of.

I was hoping that would be people's sentiment about this. :)

The solution isn't backwards compatible above Latin1; nothing we do is\, unless we create a new feature.

"backwards" compatible or "bugwards" compatible? I'm finding it hard to think of a use case where it's going to make a difference whether quotemeta("£") is "£" or "\£"\, other than golden results in tests.

Totally agree.

So yet another option is to just fix the Unicode bug portion of this for now.

We could use unicode_strings as a flag for the upper Latin1 range characters. If it is off\, we treat them as we've always treated them​: quote them.

If it is on\, we treat them as we've always treated above-Latin1 range characters​: don't quote them.

Thus the only inconsistency is between non-unicode_strings and unicode_strings\, and we could leave for another time worrying about which of these we really want to quote going forwards.

If we go the pattern syntax route\, I think we should also quote the controls we wouldn't otherwise quote. This is the set of C1 controls (except NEL\, which is already quoted).
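Which controls fall through the four properties can be listed mechanically. The sketch below assumes the C1 range is U+0080 to U+009F and uses an invented helper name, `extra_c1_controls`.

```perl
use strict;
use warnings;

# Matches anything the pattern-syntax proposal would already quote.
my $ALREADY_COVERED = qr/[\p{Pattern_Syntax}\p{Pattern_White_Space}\p{White_Space}\p{Default_Ignorable_Code_Point}]/;

# Returns the C1 control code points that none of the four properties cover.
sub extra_c1_controls {
    return grep {
        my $ch = chr $_;
        $ch =~ /\p{General_Category=Control}/ && $ch !~ $ALREADY_COVERED
    } 0x80 .. 0x9F;
}

my @extra = extra_c1_controls();
printf "%d C1 controls need an explicit rule\n", scalar @extra;
printf "U+%04X\n", $_ for @extra;
```

NEL (U+0085) is absent from the result because it is both White_Space and Pattern_White_Space.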

p5pRT commented 12 years ago

From @khwilliamson

I have mostly implemented what I last proposed, but attached is a doc patch for comment on how it actually plays out, to verify that this seems like an acceptable approach.

I'm also thinking that under locale, quotemeta should just quote \W for code points < 256. I don't think it should be immune from locale, or perhaps it doesn't much matter.

p5pRT commented 12 years ago

From @khwilliamson

0002-temp-for-comment.patch
```diff
From 1607ec47dcb28ecd2687333d4f0d759eb0479312 Mon Sep 17 00:00:00 2001
From: Karl Williamson
Date: Sun, 12 Feb 2012 09:41:25 -0700
Subject: [PATCH 2/2] temp for comment

---
 pod/perlfunc.pod |   48 ++++++++++++++++++++++++++++++++++++++++++++++--
 1 files changed, 46 insertions(+), 2 deletions(-)

diff --git a/pod/perlfunc.pod b/pod/perlfunc.pod
index 591fa0d..ad8b7b5 100644
--- a/pod/perlfunc.pod
+++ b/pod/perlfunc.pod
@@ -4953,8 +4953,52 @@ input from the user, quotemeta() or C<\Q> must be used.
 
 In Perl v5.14, all non-ASCII characters are quoted in non-UTF-8-encoded
 strings, but not quoted in UTF-8 strings.
-It is planned to change this behavior in v5.16, but the exact rules
-haven't been determined yet.
+
+Starting in Perl v5.16, Perl adopted a Unicode-defined strategy
+for quoting non-ASCII characters; the quoting of ASCII characters is
+unchanged.
+
+Also unchanged is the quoting for non-UTF-8 strings when outside the
+scope of a C<use feature 'unicode_strings'>, which is to quote all
+characters in the upper Latin1 range. This provides complete backwards
+compatibility for old programs which do not use Unicode (but note that
+C<unicode_strings> is automatically enabled within the scope of a
+S<C<use v5.12>> or greater).
+
+Otherwise, Perl quotes non-ASCII characters using an adaptation from
+Unicode (see L<http://www.unicode.org/reports/tr31/>.)
+The only code points that are quoted are those that have any of the
+Unicode properties Pattern_Syntax, Pattern_White_Space, White_Space,
+Default_Ignorable_Code_Point, or General_Category=Control.
+
+Of these properties, the two important ones are Pattern_Syntax and
+Pattern_White_Space. They have been set up by Unicode for exactly this
+purpose of deciding which characters in a regular expression pattern
+should be quoted. No character that can be in an identifier has these
+properties.
+
+Perl promises, that if we ever add regular expression pattern
+metacharacters to the dozen already defined
+(C<\ E<verbar> ( ) [ { ^ $ * + ? .>), that we will only use ones that have the
+Pattern_Syntax property. Perl also promises, that if we ever add
+characters that are considered to be white space in regular expressions
+(currently mostly affected by C</x>), they will all have the
+Pattern_White_Space property.
+
+Unicode promises that the set of code points that have these two
+properties will never change, so something that is not quoted in v5.16
+will never need to be quoted in any future Perl release. (Not all the
+code points that match Pattern_Syntax have actually had characters
+assigned to them; so there is room to grow, but they are quoted
+whether assigned or not. Perl, of course, would never use an
+unassigned code point as an actual metacharacter.)
+
+Quoting characters that have the other 3 properties is done to enhance
+the readability of the regular expression and not because they actually
+need to be quoted (characters with the White_Space property are likely
+to be indistinguishable on the page or screen from those with the
+Pattern_White_Space property; and the other two properties contain
+non-printing characters).
 
 =item rand EXPR
 X<rand> X<random>
-- 
1.7.7.1
```
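The rules described in the doc patch can be checked with a short script. This is a sketch that assumes perl 5.16 or later and deliberately avoids `use v5.16`, since that would switch on `unicode_strings` for the whole file; `show` is a helper name made up for this example.

```perl
use strict;
use warnings;

# Render non-ASCII characters as <U+XXXX> so the output does not depend on
# the terminal encoding. (show() is a helper invented for this example.)
sub show {
    return join '', map {
        my $ord = ord;
        $ord > 0x7E ? sprintf('<U+%04X>', $ord) : $_;
    } split //, shift;
}

my $e_acute = "\xE9";    # LATIN SMALL LETTER E WITH ACUTE: none of the listed properties
my $cent    = "\xA2";    # CENT SIGN: has Pattern_Syntax

# Non-UTF-8 strings outside the scope of 'unicode_strings':
# every upper Latin1 character is quoted, as in older perls.
print show(quotemeta $e_acute), "\n";    # \<U+00E9>
print show(quotemeta $cent),    "\n";    # \<U+00A2>

{
    use feature 'unicode_strings';
    # Same strings, now under the Unicode-derived rules.
    print show(quotemeta $e_acute), "\n";    # <U+00E9>
    print show(quotemeta $cent),    "\n";    # \<U+00A2>
}

# UTF-8-encoded copies follow the Unicode-derived rules with or without the feature.
utf8::upgrade(my $e_utf8    = $e_acute);
utf8::upgrade(my $cent_utf8 = $cent);
print show(quotemeta $e_utf8),    "\n";    # <U+00E9>
print show(quotemeta $cent_utf8), "\n";    # \<U+00A2>
```

The CENT SIGN from the original report is quoted in all three cases, so `quotemeta("¢")` and `quotemeta("\xA2")` now agree.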
p5pRT commented 12 years ago

From @rjbs

* Karl Williamson <public@khwilliamson.com> [2012-02-12T11:47:28]

+Otherwise, Perl quotes non-ASCII characters using an adaptation from
+Unicode (see L<http://www.unicode.org/reports/tr31/>.)
+The only code points that are quoted are those that have any of the
+Unicode properties Pattern_Syntax, Pattern_White_Space, White_Space,
+Default_Ignorable_Code_Point, or General_Category=Control.
[...]
+Perl promises, that if we ever add regular expression pattern
+metacharacters to the dozen already defined
+(C<\ E<verbar> ( ) [ { ^ $ * + ? .>), that we will only use ones that have the
+Pattern_Syntax property. Perl also promises, that if we ever add

...and I see that all characters that are ASCII and Pattern_Syntax are already quoted by quotemeta. That comforts my initially-raised eyebrow.
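That observation is easy to re-check. The snippet below is a small sketch that assumes the running perl recognises `\p{Pattern_Syntax}`; it looks for ASCII characters that have the property but are left unescaped by `quotemeta`.

```perl
use strict;
use warnings;

# Collect ASCII code points that have Pattern_Syntax but that quotemeta()
# does NOT escape with a backslash.
my @unquoted = grep {
    my $char = chr;
    $char =~ /\p{Pattern_Syntax}/ && quotemeta($char) ne "\\$char";
} 0 .. 0x7F;

printf "ASCII Pattern_Syntax characters left unquoted: %d\n", scalar @unquoted;
```

The count should be zero, because every ASCII Pattern_Syntax character is punctuation and `quotemeta` has always escaped every ASCII non-word character.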

Cool.

-- rjbs

p5pRT commented 12 years ago

From @khwilliamson

Now fixed by commit 2e2b25717dbde8d9ce48b4b8dc443e1d08166347 -- Karl Williamson

p5pRT commented 12 years ago

@khwilliamson - Status changed from 'open' to 'resolved'