Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/

eval() of non-ASCII bytes under unicode_eval and unicode_strings doesn't give them Latin1 meanings #9040

Open p5pRT opened 16 years ago

p5pRT commented 16 years ago

Migrated from rt.perl.org#45673 (status was 'open')

Searchable as RT45673$

p5pRT commented 16 years ago

From zefram@fysh.org

Created by zefram@fysh.org

    $ perl -we '$a="require x\x{f1}y::z"; eval $a; print $@'
    Warning: Use of "require" without parentheses is ambiguous at (eval 1) line 1.
    Unrecognized character \xF1 at (eval 1) line 1.
    $ perl -we '$a="require x\x{f1}y::z"; utf8::upgrade($a); eval $a; print $@'
    Can't locate xZZy/z.pm in @INC (@INC contains: /etc/perl /usr/local/lib/perl/5.8.8 /usr/local/share/perl/5.8.8 /usr/lib/perl5 /usr/share/perl5 /usr/lib/perl/5.8 /usr/share/perl/5.8 /usr/local/lib/site_perl /usr/local/lib/perl/5.8.4 /usr/local/share/perl/5.8.4 .) at (eval 1) line 3.
    $

What I show above as "ZZ" was originally a sequence of two non-ASCII characters: U+00c3 (Latin capital letter A with tilde) and U+00b1 (plus-minus sign). I've replaced them with ASCII characters to avoid unpredictable manglement.

The phenomenon we see here is that the syntax of Perl, as judged by eval(), varies according to whether the input string is physically encoded in UTF8. If it is so encoded then U+00f1, Latin small letter N with tilde, is an acceptable identifier character, and so can be part of a module name. If not, then the very same character is invalid in that context and causes a syntax error.

What, exactly, is Perl's identifier syntax? Is U+00f1 a valid identifier character?

Perl Info

```
Flags:
    category=core
    severity=low

Site configuration information for perl v5.8.8:

Configured by Debian Project at Wed Dec 6 23:17:41 UTC 2006.

Summary of my perl5 (revision 5 version 8 subversion 8) configuration:
  Platform:
    osname=linux, osvers=2.6.18.3, archname=i486-linux-gnu-thread-multi
    uname='linux saens 2.6.18.3 #1 smp sat nov 25 13:39:52 est 2006 i686 gnulinux '
    config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -Dcccdlflags=-fPIC -Darchname=i486-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.8 -Darchlib=/usr/lib/perl/5.8 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.8.8 -Dsitearch=/usr/local/lib/perl/5.8.8 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dman1ext=1 -Dman3ext=3perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Uusesfio -Uusenm -Duseshrplib -Dlibperl=libperl.so.5.8.8 -Dd_dosuid -des'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBIAN -fno-strict-aliasing -pipe -I/usr/local/include'
    ccversion='', gccversion='4.1.2 20061115 (prerelease) (Debian 4.1.1-20)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt
    perllibs=-ldl -lm -lpthread -lc -lcrypt
    libc=/lib/libc-2.3.6.so, so=so, useshrplib=true, libperl=libperl.so.5.8.8
    gnulibc_version='2.3.6'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:

@INC for perl v5.8.8:
    /etc/perl
    /usr/local/lib/perl/5.8.8
    /usr/local/share/perl/5.8.8
    /usr/lib/perl5
    /usr/share/perl5
    /usr/lib/perl/5.8
    /usr/share/perl/5.8
    /usr/local/lib/site_perl
    /usr/local/lib/perl/5.8.4
    /usr/local/share/perl/5.8.4
    .

Environment for perl v5.8.8:
    HOME=/home/zefram
    LANG (unset)
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/zefram/pub/i686-pc-linux-gnu/bin:/home/zefram/pub/common/bin:/usr/bin:/usr/X11R6/bin:/bin:/usr/local/bin:/usr/games
    PERL_BADLANG (unset)
    SHELL=/usr/bin/zsh
```

p5pRT commented 16 years ago

From nospam-abuse@bloodgate.com

Moin,

On Saturday 22 September 2007 23:55:20 Zefram wrote:

# New Ticket Created by Zefram [snip]

    $ perl -we '$a="require x\x{f1}y::z"; eval $a; print $@'
    Warning: Use of "require" without parentheses is ambiguous at (eval 1) line 1.
    Unrecognized character \xF1 at (eval 1) line 1.
    $ perl -we '$a="require x\x{f1}y::z"; utf8::upgrade($a); eval $a; print $@'
    Can't locate xZZy/z.pm in @INC (@INC contains: /etc/perl /usr/local/lib/perl/5.8.8 /usr/local/share/perl/5.8.8 /usr/lib/perl5 /usr/share/perl5 /usr/lib/perl/5.8 /usr/share/perl/5.8 /usr/local/lib/site_perl /usr/local/lib/perl/5.8.4 /usr/local/share/perl/5.8.4 .) at (eval 1) line 3.
    $

What I show above as "ZZ" was originally a sequence of two non-ASCII characters: U+00c3 (Latin capital letter A with tilde) and U+00b1 (plus-minus sign). I've replaced them with ASCII characters to avoid unpredictable manglement.

The sequence C3B1 is UTF-8 for "character 0xf1" so that is right.
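The byte sequence mentioned here can be checked with a minimal sketch, assuming only the built-in `utf8::encode` function; the variable names are illustrative and not taken from the ticket:

```perl
use strict;
use warnings;

# One character: U+00F1, Latin small letter N with tilde.
my $char = "\x{f1}";

# utf8::encode works in place, turning characters into UTF-8 octets,
# so operate on a copy.
my $octets = $char;
utf8::encode($octets);

# Render each octet as two hex digits.
my $hex = join ' ', map { sprintf '%02X', ord } split //, $octets;
print "$hex\n";    # C3 B1
```

Read as Latin-1, those two octets are U+00C3 and U+00B1, which is the "ZZ" pair described in the original report.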

The phenomenon we see here is that the syntax of Perl, as judged by eval(), varies according to whether the input string is physically encoded in UTF8. If it is so encoded then U+00f1, Latin small letter N with tilde, is an acceptable identifier character, and so can be part of a module name. If not, then the very same character is invalid in that context and causes a syntax error.

What, exactly, is Perl's identifier syntax? Is U+00f1 a valid identifier character?

When you don't do "use utf8;" your script is expected to be in latin1 (ISO-8859-1). (We leave "use locale" out of this for now.) Under use utf8, it can contain any UTF-8.

However, it seems eval() (or require?) doesn't know about this. Plus, I am not entirely sure how much Unicode you can use in identifiers as something like this:

    #!perl
    use utf8;
    my $€ = 1;

still fails to compile with​:

  Unrecognized character \x82 at t.pl line 5.

perldoc perlsyn (in 5.8.8) doesn't seem to say anything about identifiers.

perldoc utf8 says​:

  Enabling the "utf8" pragma has the following effect:

  Bytes in the source text that have their high-bit set will be
  treated as being part of a literal UTF-8 character. This
  includes most literals such as identifier names, string
  constants, and constant regular expression patterns.

But it doesn't seem to work in v5.8.8 at least.

All the best,

Tels

-- 
Signed on Sun Sep 23 18:05:15 2007 with key 0x93B84C15.
Get one of my photo posters: http://bloodgate.com/posters
PGP key on http://bloodgate.com/tels.asc or per email.

"Spammed if you do, spammed if you don't."

  -- Murphy's Law

p5pRT commented 16 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 16 years ago

From @rgs

On 23/09/2007, Tels <nospam-abuse@bloodgate.com> wrote:

When you don't do "use utf8;" your script is expected to be in latin1 (ISO-8859-1). (We leave "use locale" out of this for now.) Under use utf8, it can contain any UTF-8.

However, it seems eval() (or require?) doesn't know about this.

Right, there can be double encoding. That will need to be fixed.

Plus, I am not entirely sure how much Unicode you can use in identifiers as something like this:

    #!perl
    use utf8;
    my $€ = 1;

still fails to compile with:

    Unrecognized character \x82 at t.pl line 5.

perldoc perlsyn (in 5.8.8) doesn't seem to say anything about identifiers.

Identifiers must start with letters; € isn't one.

    [rafael@stcosmo ~]$ bleadperl -Mutf8 -le '$à=42;print $à'
    42
    [rafael@stcosmo ~]$ bleadperl -le '$à=42;print $à'
    Unrecognized character \xA0 in column 3 at -e line 1.

p5pRT commented 16 years ago

From nospam-abuse@bloodgate.com

Moin,

On Monday 24 September 2007 10:42:37 Rafael Garcia-Suarez wrote:

On 23/09/2007, Tels <nospam-abuse@bloodgate.com> wrote:

When you don't do "use utf8;" your script is expected to be in latin1 (ISO-8859-1). (We leave "use locale" out of this for now.) Under use utf8, it can contain any UTF-8.

However, it seems eval() (or require?) doesn't know about this.

Right, there can be double encoding. That will need to be fixed.

Ok.

Plus, I am not entirely sure how much Unicode you can use in identifiers as something like this:

    #!perl
    use utf8;
    my $€ = 1;

still fails to compile with:

    Unrecognized character \x82 at t.pl line 5.

perldoc perlsyn (in 5.8.8) doesn't seem to say anything about identifiers.

Identifiers must start with letters; € isn't one.

Wouldn't perlsyn be a good place to document this tidbit, then?

And, of course, I tried that with "$a€", too, see below :P

    [rafael@stcosmo ~]$ bleadperl -Mutf8 -le '$à=42;print $à'
    42
    [rafael@stcosmo ~]$ bleadperl -le '$à=42;print $à'
    Unrecognized character \xA0 in column 3 at -e line 1.

v5.8.8:

    # perl -Mutf8 -le '$à=42;print $à'
    42
    # perl -Mutf8 -le '$aà=42;print $aà'
    42
    # perl -Mutf8 -le '$a€=42;print $a€'
    Unrecognized character \xE2 at -e line 1.
    # perl -Mutf8 -le '$€=42;print $€'
    Unrecognized character \x82 at -e line 1.

That mighty Euro seems to be special: it is not allowed even after a letter, and it's sometimes recognized as \x82 and sometimes as \xE2. Huh?

All the best,

Tels

-- 
Signed on Mon Sep 24 14:54:04 2007 with key 0x93B84C15.
View my photo gallery: http://bloodgate.com/photos
PGP key on http://bloodgate.com/tels.asc or per email.

"Not King yet."

p5pRT commented 16 years ago

From @Juerd

Rafael Garcia-Suarez skribis 2007-09-24 10:42 (+0200):

    use utf8;
    my $€ = 1;

still fails to compile with:

    Unrecognized character \x82 at t.pl line 5.

Identifiers must start with letters; € isn't one.

Still, the character is not \x82 but \x{20ac}, so the error message is incorrect.

\x82 isn't even the first byte of the UTF-8 encoding of \x{20ac}. It's the second. Perhaps the first byte (\xe2) is accepted as latin1, even though utf8.pm is in effect.

-- 
Met vriendelijke groet, Kind regards, Korajn salutojn,

  Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig>
  Convolution: ICT solutions and consultancy <sales@convolution.nl>
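The byte positions discussed in this message can be illustrated with a short sketch. It assumes only the built-in `utf8::encode`; the variable names are made up for the example:

```perl
use strict;
use warnings;

# U+20AC EURO SIGN as a single character.
my $euro = "\x{20ac}";

# Encode a copy into UTF-8 octets (utf8::encode modifies in place).
my $octets = $euro;
utf8::encode($octets);

# List the octets in order, as two-digit hex.
my @hex = map { sprintf '%02X', ord } split //, $octets;
print "@hex\n";    # E2 82 AC
```

The first octet is E2 and the second is 82, matching the two different values seen in the error messages quoted in this thread.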

p5pRT commented 16 years ago

From @Juerd

Tels skribis 2007-09-24 14:58 (+0200):

Identifiers must start with letters; € isn't one.

Wouldn't perlsyn be a good place to document this tidbit, then?

And, of course, I tried that with "$a€", too, see below :P

There's more to it than just the first character.

IIRC, identifiers are [[:alpha:]_]\w+

Euro isn't in there.

-- 
Met vriendelijke groet, Kind regards, Korajn salutojn,

  Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig>
  Convolution: ICT solutions and consultancy <sales@convolution.nl>
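As a rough illustration of that recollection, the pattern can be tried as an ordinary regex. This is only a sketch: it uses the pattern exactly as recalled above (which requires at least two characters), applies the `/u` modifier (Perl 5.14 or later) to force Unicode rules, and says nothing definitive about what the tokenizer itself does. The labels and variable names are hypothetical.

```perl
use strict;
use warnings;

# Pattern as recalled in the thread, with /u so that \w and [[:alpha:]]
# follow Unicode rules regardless of the string's internal encoding.
my $ident = qr/\A[[:alpha:]_]\w+\z/u;

my @cases = (
    [ 'a + U+00E0' => "a\x{e0}"   ],   # letter then letter
    [ 'a + U+20AC' => "a\x{20ac}" ],   # letter then currency symbol
    [ 'U+20AC'     => "\x{20ac}"  ],   # currency symbol alone
    [ '_x'         => "_x"        ],   # underscore then letter
);

my %matches;
for my $case (@cases) {
    my ($label, $name) = @$case;
    $matches{$label} = ($name =~ $ident) ? 1 : 0;
    print "$label: $matches{$label}\n";
}
```

The euro sign is a currency symbol rather than a word character, so neither form containing it matches, while the accented letter does.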

p5pRT commented 14 years ago

From @chipdude

On Mon, Sep 24, 2007 at 10:42:37AM +0200, Rafael Garcia-Suarez wrote:

On 23/09/2007, Tels <nospam-abuse@bloodgate.com> wrote:

When you don't do "use utf8;" your script is expected to be in latin1 (ISO-8859-1). (We leave "use locale" out of this for now.) Under use utf8, it can contain any UTF-8.

However, it seems eval() (or require?) doesn't know about this.

Right, there can be double encoding. That will need to be fixed.

I disagree. The only extra encoding is manual here. The call to utf8​::upgrade is performing that encoding step explicitly at the user's request.

I call no bug.

Plus, I am not entirely sure how much Unicode you can use in identifiers as something like this:

    #!perl
    use utf8;
    my $€ = 1;

still fails to compile with:

    Unrecognized character \x82 at t.pl line 5.

perldoc perlsyn (in 5.8.8) doesn't seem to say anything about identifiers.

Identifiers must start with letters; € isn't one.

    [rafael@stcosmo ~]$ bleadperl -Mutf8 -le '$à=42;print $à'
    42
    [rafael@stcosmo ~]$ bleadperl -le '$à=42;print $à'
    Unrecognized character \xA0 in column 3 at -e line 1.

-- Chip Salzenberg

p5pRT commented 14 years ago

From zefram@fysh.org

Chip Salzenberg wrote:

I disagree. The only extra encoding is manual here. The call to utf8::upgrade is performing that encoding step explicitly at the user's request.

utf8::upgrade isn't a user-visible encoding step. It changes how the string is represented internally, but leaves the string containing the same characters as before. Later operations on the string ought to be responding to the same character sequence in the same way. Now, if I'd done Encode::encode("UTF-8", ...), *that* would be a manual, explicitly-requested, encoding step, and I'd expect that to produce a string with different behaviour from the input.

-zefram
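The distinction drawn in this message between `utf8::upgrade` and `Encode::encode` can be sketched as follows. This is a minimal illustration with made-up variable names; it uses only the built-in `utf8::` functions and the core `Encode` module:

```perl
use strict;
use warnings;
use Encode ();

my $plain = "x\x{f1}y";          # three characters, stored as native bytes

# utf8::upgrade changes only the internal representation (in place).
my $upgraded = $plain;
utf8::upgrade($upgraded);

# Encode::encode returns a new string whose characters are UTF-8 octets.
my $encoded = Encode::encode("UTF-8", $plain);

my $same_after_upgrade = ($plain eq $upgraded) ? 1 : 0;    # same characters
my $len_upgraded       = length $upgraded;                 # still 3
my $flag_plain         = utf8::is_utf8($plain)    ? 1 : 0; # internal flag off
my $flag_upgraded      = utf8::is_utf8($upgraded) ? 1 : 0; # internal flag on
my $same_after_encode  = ($plain eq $encoded) ? 1 : 0;     # different string
my $len_encoded        = length $encoded;                  # 4 octets

print "upgrade: eq=$same_after_upgrade len=$len_upgraded flag=$flag_upgraded\n";
print "encode:  eq=$same_after_encode len=$len_encoded\n";
```

At the string level the upgraded copy compares equal to the original and has the same length; only the internal flag differs. The encoded copy is a different, longer string.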

p5pRT commented 14 years ago

From @chipdude

On Wed, Aug 26, 2009 at 11:19:39PM +0100, Zefram wrote:

Chip Salzenberg wrote:

I disagree. The only extra encoding is manual here. The call to utf8::upgrade is performing that encoding step explicitly at the user's request.

utf8::upgrade isn't a user-visible encoding step. It changes how the string is represented internally [...]

You have just agreed with me. "Change of representation" = "encoding".

Perl's parser takes bytes and gives them meaning. If you change the bytes, you can't expect Perl's parser to ignore that.

-- Chip Salzenberg

p5pRT commented 14 years ago

From zefram@fysh.org

Chip Salzenberg wrote:

You have just agreed with me. "Change of representation" = "encoding".

utf8::upgrade affects *internal* encoding. Not the user-visible content of the string.

Perl's parser takes bytes and gives them meaning. If you change the bytes, you can't expect Perl's parser to ignore that.

String eval is an operation on a string. A string of *characters*, in current Perl. The Perl parser claims to ascribe meaning to characters, not to bytes per se.

Obviously it's internally working with bytes. A Perl source file on disk is really a sequence of bytes, and the interpretation of those bytes as characters is influenced by the "use utf8" pragma. In the case of string eval, the Perl string object already knows what characters it represents, so without any pragma it already knows whether the internal byte sequence needs to be interpreted via Latin-1 or UTF-8. ("use utf8" in a string eval seems meaningless.)

I believe the bug here is that the Perl parser is not consistently responding to the character sequence. This is presumably due to it being implemented at the byte level, with insufficient abstraction.

-zefram

p5pRT commented 14 years ago

From @chipdude

On Wed, Aug 26, 2009 at 11:47:11PM +0100, Zefram wrote:

Chip Salzenberg wrote:

You have just agreed with me. "Change of representation" = "encoding".

utf8::upgrade affects *internal* encoding. Not the user-visible content of the string.

"User-visible" is a vague term, because the utf8 flag *is* visible.

The Perl parser claims to ascribe meaning to characters [...]

Does it? If so, then it's a documentation bug.

I believe the bug here is that the Perl parser is not consistently responding to the character sequence. This is presumably due to it being implemented at the byte level, with insufficient abstraction.

That's not a bug, it's a feature. (I'm mostly serious about that.) And it's not worth "fixing". (I'm entirely serious about that.)

-- Chip Salzenberg

p5pRT commented 14 years ago

From zefram@fysh.org

Chip Salzenberg wrote:

The Perl parser claims to ascribe meaning to characters [...]

Does it? If so, then it's a documentation bug.

I refer to, for example, perldata(1):

  Values are usually referred to by name, or through a named reference.
  [...] Usually this name is a single identifier, that is, a string
  beginning with a letter or underscore, and containing letters,
  underscores, and digits.

Clearly referring to characters there, not bytes. It's not so clear about what qualifies as a "letter". perlunicode(1) expounds a bit more:

  If an appropriate encoding is specified, identifiers within the
  Perl script may contain Unicode alphanumeric characters, including
  ideographs. Perl does not currently attempt to canonicalize
  variable names.

The internal Latin-1 encoding of a downgraded string seems an appropriate encoding for the representation of U+f1, a Unicode letter character.

That's not a bug, it's a feature. (I'm mostly serious about that.)

I don't see how the inconsistency can ever be a good thing.

-zefram

p5pRT commented 14 years ago

From @demerphq

2009/8/27 Chip Salzenberg <chip@pobox.com>:

On Wed, Aug 26, 2009 at 11:47:11PM +0100, Zefram wrote:

Chip Salzenberg wrote:

You have just agreed with me. "Change of representation" = "encoding".

utf8::upgrade affects *internal* encoding. Not the user-visible content of the string.

"User-visible" is a vague term, because the utf8 flag *is* visible.

The Perl parser claims to ascribe meaning to characters [...]

Does it? If so, then it's a documentation bug.

No. It is *not*. We operate on characters, not bytes. Bytes are meaningless. Characters are not. We go to *great* trouble to operate on characters, not on bytes. Reverting to bytes is a HUGE step backwards and contradicts MANY MANY things that happened in perl 5.10 and were planned for perl 5.12 and were in core long before either. Some things to consider: the regex engine operates on characters. The behaviour of Perl on EBCDIC machines should be the same as it is on latin-1 or unicode machines. Thus to claim that perl operates on bytes contradicts MASSIVE amounts of code in the core.

I believe the bug here is that the Perl parser is not consistently responding to the character sequence. This is presumably due to it being implemented at the byte level, with insufficient abstraction.

That's not a bug, it's a feature. (I'm mostly serious about that.)

No it is not. It is a bug.

And it's not worth "fixing". (I'm entirely serious about that.)

I don't agree. It *is* worth fixing.

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 14 years ago

From @chipdude

On Thu, Aug 27, 2009 at 12:19:26AM +0100, Zefram wrote:

Chip Salzenberg wrote:

The Perl parser claims to ascribe meaning to characters [...]

Does it? If so, then it's a documentation bug.

I refer to, for example, perldata(1) [and] perlunicode(1)

It does appear that L<perlfunc/eval> needs a note.

That's not a bug, it's a feature. (I'm mostly serious about that.)

I don't see how the inconsistency can ever be a good thing.

Calling it "inconsistency" misses the point. The C<eval> operator simply never got a byte->character behavioral change when much (not all!) of the rest of Perl did. Consider this, from perlfunc:

  do EXPR Uses the value of EXPR as a filename and executes the
      contents of the file as a Perl script.
          do 'stat.pl';
      is just like
          eval `cat stat.pl`;
      except that it's more efficient and concise, keeps track of the
      current filename for error messages, searches the @INC
      directories, and updates %INC if the file is found.

There's no way ever to fully lift parsing out of the world of bytes. I think that's OK.

-- Chip Salzenberg

p5pRT commented 14 years ago

From @demerphq

2009/8/27 Chip Salzenberg <chip@pobox.com>:

On Thu, Aug 27, 2009 at 12:19:26AM +0100, Zefram wrote:

Chip Salzenberg wrote:

The Perl parser claims to ascribe meaning to characters [...]

Does it? If so, then it's a documentation bug.

I refer to, for example, perldata(1) [and] perlunicode(1)

It does appear that L<perlfunc/eval> needs a note.

That's not a bug, it's a feature. (I'm mostly serious about that.)

I don't see how the inconsistency can ever be a good thing.

Calling it "inconsistency" misses the point. The C<eval> operator simply never got a byte->character behavioral change when much (not all!) of the rest of Perl did. Consider this, from perlfunc:

  do EXPR Uses the value of EXPR as a filename and executes the
      contents of the file as a Perl script.
          do 'stat.pl';
      is just like
          eval `cat stat.pl`;
      except that it's more efficient and concise, keeps track of the
      current filename for error messages, searches the @INC
      directories, and updates %INC if the file is found.

There's no way ever to fully lift parsing out of the world of bytes. I think that's OK.

I think *this* documentation is buggy. Not the other way around.

The exact same file /in terms of bytes/ will NOT do the same thing on EBCDIC as it will on non-EBCDIC. The exact same sequence of bytes will not match the same way if it is "unicode" or "non-unicode". If we stop paying attention to encoding we will end up in a very very serious world of pain. The plan /was/ to revert to unicode semantics in ALL respects. This means that bytes are irrelevant; characters are what matter. Not following through on this plan would be IMO a huge step backwards.

We have debated on p5p the subtleties of encoding, characters, semantics, etc in the last few years, and came to some kind of general consensus that the way forward was to assume full unicode semantics at every level, as every other option sucks much much worse. Perhaps you missed these debates, or their conclusions. I for one would not welcome reopening these debates.

As I said earlier, bytes are meaningless. They are numbers. We don't code in numbers, we code in characters. To go back to thinking of code as numbers would be like returning to the dark ages from the age of enlightenment.

cheers\, Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 14 years ago

From @jandubois

On Wed, 26 Aug 2009, demerphq wrote:

We have debated on p5p the subtleties of encoding, characters, semantics, etc in the last few years, and came to some kind of general consensus that the way forward was to assume full unicode semantics at every level, as every other option sucks much much worse. Perhaps you missed these debates, or their conclusions. I for one would not welcome reopening these debates.

Did anybody summarize these conclusions somewhere? Or can you at least point to the key list messages that give an overview on what was agreed to before?

Getting to "full Unicode semantics at every level" sounds like a huge undertaking. Unless we get rid of the SvUTF8 flag and indiscriminately store all strings internally as UTF8, we would have to modify virtually *all* APIs that currently take char* arguments and replace them with SV*s, including all the OS level wrappings, like access to the environment and file system.

Cheers,
-Jan

p5pRT commented 14 years ago

From @demerphq

2009/8/27 Jan Dubois <jand@activestate.com>:

On Wed, 26 Aug 2009, demerphq wrote:

We have debated on p5p the subtleties of encoding, characters, semantics, etc in the last few years, and came to some kind of general consensus that the way forward was to assume full unicode semantics at every level, as every other option sucks much much worse. Perhaps you missed these debates, or their conclusions. I for one would not welcome reopening these debates.

Did anybody summarize these conclusions somewhere? Or can you at least point to the key list messages that give an overview on what was agreed to before?

I'll try to put together a summary. The general agreement concerned using unicode case folding rules everywhere and eliminating the nasty "latin1" versus "unicode" difference in behaviour in various subsystems. Perhaps "everywhere" is too general.

Getting to "full Unicode semantics at every level" sounds like a huge undertaking. Unless we get rid of the SvUTF8 flag and indiscriminately store all strings internally as UTF8, we would have to modify virtually *all* APIs that currently take char* arguments and replace them with SV*s, including all the OS level wrappings, like access to the environment and file system.

Is that not a good thing? Forget the amount of work for a moment. What is the right design decision?

cheers,
Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 14 years ago

From @rgs

2009/8/27 Chip Salzenberg <chip@pobox.com>:

On Wed, Aug 26, 2009 at 11:19:39PM +0100, Zefram wrote:

Chip Salzenberg wrote:

I disagree. The only extra encoding is manual here. The call to utf8::upgrade is performing that encoding step explicitly at the user's request.

utf8::upgrade isn't a user-visible encoding step. It changes how the string is represented internally [...]

You have just agreed with me. "Change of representation" = "encoding".

Well no it's not. The UTF8 flag shouldn't have any effect on anything at perl level. It does, currently, and that can't be changed without breaking backwards compatibility (notably in uc, lc, //i...), but I think it's important enough to be changed for 5.12.

Perl's parser takes bytes and gives them meaning. If you change the bytes, you can't expect Perl's parser to ignore that.

That's a bug in my book. Perl's parser (or to be more precise, tokenizer) should take characters, not bytes; it doesn't always.

perltodo states:

=head2 Properly Unicode safe tokeniser and pads.

The tokeniser isn't actually very UTF-8 clean. C<use utf8;> is a hack - variable names are stored in stashes as raw bytes, without the utf-8 flag set. The pad API only takes a C<char *> pointer, so that's all bytes too. The tokeniser ignores the UTF-8-ness of C<PL_rsfp>, or any SVs returned from source filters. All this could be fixed.

p5pRT commented 14 years ago

From @davidnicol

On Wed, Aug 26, 2009 at 7:16 PM, Jan Dubois <jand@activestate.com> wrote:

Getting to "full Unicode semantics at every level" sounds like a huge undertaking. Unless we get rid of the SvUTF8 flag and indiscriminately store all strings internally as UTF8, we would have to modify virtually *all* APIs that currently take char* arguments and replace them with SV*s, including all the OS level wrappings, like access to the environment and file system.

Cheers,
-Jan

Except that the competition, by which I mean at least Python and e262, are already there.

Yes, major reengineering is required; nobody wants to try to cross an ocean in a knarr made entirely of band-aids.

Completely decoupled byte and character storage implementations means rethinking how scalar values are represented.

It seems like a good way to get there is by, for instance, perl to e262 translation software, taking v8+k7 as a compilation target. With perl 5.14 as a front-end to a different abstraction, further debate could be limited to feature design and implementation.

p5pRT commented 14 years ago

From @chipdude

On Thu, Aug 27, 2009 at 02:43:51AM +0200, demerphq wrote:

The general agreement concerned using unicode case folding rules everywhere and eliminating the nasty "latin1" versus "unicode" difference in behaviour in various subsystems.

The robot devil is, as always, in the details.

Perhaps "everywhere" is too general.

Indeed. In the specific case of eval, for example, idealism and the gains (whatever they may be) from Unicode-aware parsing should be measured against the bedrock fact that Perl assumes, deeply, that source code is a sequence of bytes.

If the limited goal is to make utf8::upgrade a NOP -- basically to make our lexer work with a series of characters, none of which falls outside the range 0-255 -- nothing of any real value is gained.

If the ambitious goal is to allow the lexer to identify and use arbitrary Unicode characters, well, first that's a big job (not that I can tell anyone how to spend their time); but second, it *still* gains us nothing of any real value. Our lexer and parser are entirely happy with multi-byte operators like "cmp". No significant work is required to allow them to work with multi-byte operators that happen to be the UTF-8 sequence for \N{OPEN SMILEY FACE}.

Finally, and most importantly, consider:

What does C\ *mean* inside an eval of a utf8 string? How about C\? If you're about to tell me that the rules will be X if it's utf8 string and Y if it isn't\, then you've broken the very identity that you wanted to retain!

In short: Forget big fish vs. small fish. This isn't even a fish, it's just a painting of a fish. Ceci n'est pas un fish. Let's go find real fish to fry.

-- Chip Salzenberg

p5pRT commented 14 years ago

From @khwilliamson

demerphq wrote:

2009/8/27 Jan Dubois <jand@activestate.com>:

On Wed, 26 Aug 2009, demerphq wrote:

We have debated on p5p the subtleties of encoding, characters, semantics, etc in the last few years, and came to some kind of general consensus that the way forward was to assume full unicode semantics at every level, as every other option sucks much much worse. Perhaps you missed these debates, or their conclusions. I for one would not welcome reopening these debates.

Did anybody summarize these conclusions somewhere? Or can you at least point to the key list messages that give an overview on what was agreed to before?

I'll try to put together a summary. The general agreement concerned using unicode case folding rules everywhere and eliminating the nasty "latin1" versus "unicode" difference in behaviour in various subsystems. Perhaps "everywhere" is too general.

One principle that I think there was consensus on (and I certainly hope no one disputes it) is that the way Perl stores something internally should have no effect on the user-level semantics (unless one is really digging, like in 'use bytes'). This is sadly not currently the case.

p5pRT commented 14 years ago

From marvin@rectangular.com

On Fri, Aug 28, 2009 at 07:52:51AM -0600, karl williamson wrote:

One principle that I think there was consensus on (and I certainly hope no one disputes it) is that the way Perl stores something internally should have no effect on the user-level semantics (unless one is really digging, like in 'use bytes'). This is sadly not currently the case.

Practically speaking, I think it's unrealistic to do anything ambitious with UTF-8 without understanding the role of the SVf_UTF8 flag and being able to troubleshoot using Devel::Peek, etc. There are too many opportunities to make mistakes.

That said, I'm grateful that for those of us who have that expertise, it *is* possible to do ambitious things with UTF-8. :) I understand the backwards compatibility constraints under which the system was designed, and I'm very impressed by what was achieved.

Looking forward... what if this directive implied a source file encoding of UTF-8? :)

  use 5.012;

Marvin Humphrey

p5pRT commented 14 years ago

From @nwc10

On Fri, Aug 28, 2009 at 08:00:04AM -0700, Marvin Humphrey wrote:

Looking forward... what if this directive implied a source file encoding of UTF-8? :)

    use 5.012;

What would I use, if I had a script written in some other encoding, but needed to enforce a requirement for v5.12.0 or later?

Nicholas Clark

p5pRT commented 14 years ago

From marvin@rectangular.com

On Fri, Aug 28, 2009 at 04:02:19PM +0100, Nicholas Clark wrote:

Looking forward... what if this directive implied a source file encoding of UTF-8? :)

    use 5.012;

What would I use, if I had a script written in some other encoding, but needed to enforce a requirement for v5.12.0 or later?

Obviously, I am implying that such a script would need to be updated. You are already modding it by adding the "use" directive, no?

Marvin Humphrey

p5pRT commented 14 years ago

From @nwc10

On Fri, Aug 28, 2009 at 08:04:53AM -0700, Marvin Humphrey wrote:

On Fri, Aug 28, 2009 at 04:02:19PM +0100, Nicholas Clark wrote:

Looking forward... what if this directive implied a source file encoding of UTF-8? :)

    use 5.012;

What would I use, if I had a script written in some other encoding, but needed to enforce a requirement for v5.12.0 or later?

Obviously, I am implying that such a script would need to be updated. You are already modding it by adding the "use" directive, no?

This would reduce functionality by conflating two orthogonal features. This reduces choice. I think that forcing the conversion is a bad thing.

Particularly as use 5.010 had no such semantic overloading.

Nicholas Clark

p5pRT commented 14 years ago

From zefram@fysh.org

Nicholas Clark wrote:

This would reduce functionality by conflating two orthogonal features. This reduces choice. I think that forcing the conversion is a bad thing.

+1

Particularly as use 5.010 had no such semantic overloading.

    $ perl -e 'say "foo"'
    Unquoted string "say" may clash with future reserved word at -e line 1.
    String found where operator expected at -e line 1, near "say "foo""
      (Do you need to predeclare say?)
    syntax error at -e line 1, near "say "foo""
    Execution of -e aborted due to compilation errors.
    $ perl -e 'use 5.010; say "foo"'
    foo

I'm afraid that boat has already sailed.

-zefram

p5pRT commented 12 years ago

From @rjbs

How does the creation of evalbytes and unicode_eval affect this ticket, if at all?

p5pRT commented 12 years ago

From @cpansprout

On Thu Mar 01 18:46:16 2012, rjbs wrote:

How does the creation of evalbytes and unicode_eval affect this ticket, if at all?

It isn’t enough. I believe the patches for which #107008 exists will fix it, but if I knew for certain that would only be because I had already integrated them (which I haven’t). :-)

--

Father Chrysostomos

p5pRT commented 12 years ago

From @cpansprout

On Fri Mar 02 08:59:58 2012, sprout wrote:

On Thu Mar 01 18:46:16 2012, rjbs wrote:

How does the creation of evalbytes and unicode_eval affect this ticket, if at all?

It isn’t enough. I believe the patches for which #107008 exists will fix it, but if I knew for certain that would only be because I had already integrated them (which I haven’t). :-)

The example shown in the original post still fails the same way. It probably has something to do with require() being a syscall, so it doesn’t respect utf8-ness (see the tickets linked to #105914). However, require() should be able to preserve the utf8ness at least for reporting failure. So this particular example is not resolved yet (nor do I think it pressing enough for 5.16). But the bug described in the original post (ignoring the example given) is fixed.

--

Father Chrysostomos

khwilliamson commented 2 years ago

This is still a problem in 5.35.10. To save having to read through the whole ticket: the real issue is the first example in the original post, though not for the reason I think people have given. The claim is that source without 'use utf8' is presumed to be encoded as Latin1. In fact, non-ASCII bytes are not assumed to be Latin1; they are just anonymous bytes with no properties except for their code points and that they aren't \w, aren't \s, aren't controls, and so on. One would think that unicode_eval or unicode_strings would change these to their Latin1 values, but that doesn't happen.
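
An analogous effect can be seen in the regex engine, which may help make the "anonymous bytes" description concrete. The sketch below is only an analogy, not a reproduction of the tokenizer behaviour in this ticket. It assumes a perl of 5.14 or later (for the `/u` modifier), no `use locale`, and deliberately does not enable `unicode_strings`; the variable names are illustrative.

```perl
use strict;
use warnings;
# Deliberately no "use feature 'unicode_strings'" and no "use v5.12" or later,
# so regexes get the default semantics, which depend on the internal encoding.

my $native   = "\x{f1}";     # U+00F1 stored as a single native byte
my $upgraded = $native;
utf8::upgrade($upgraded);    # same character, UTF-8 internal representation

my $same_string      = ($native eq $upgraded)  ? 1 : 0;
my $native_is_word   = ($native   =~ /\w/)     ? 1 : 0;   # byte has no properties
my $upgraded_is_word = ($upgraded =~ /\w/)     ? 1 : 0;   # Unicode rules apply
my $native_is_word_u = ($native   =~ /\w/u)    ? 1 : 0;   # /u forces Unicode rules

print "eq=$same_string native=$native_is_word ",
      "upgraded=$upgraded_is_word native_u=$native_is_word_u\n";
```

The two strings compare equal, yet only the upgraded one matches `\w` by default; with `/u` the native one matches as well.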