Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/

Inconsistent and wrong handling of 8th bit set chars with no locale #9455

Closed p5pRT closed 13 years ago

p5pRT commented 15 years ago

Migrated from rt.perl.org#58182 (status was 'resolved')

Searchable as RT58182$

p5pRT commented 15 years ago

From @khwilliamson

This is a bug report for perl from corporate@khwilliamson.com, generated with the help of perlbug 1.36 running under perl 5.10.0.


Characters in the range U+0080 through U+00FF behave inconsistently depending on whether or not they are part of a string which also includes a character above that range, and in some cases they behave incorrectly even when part of such a string. The problems I will concentrate on in this report are those involving case.

I presume that they do work properly when a locale is set, but I haven't tested that.

print uc("\x{e0}"), "\n"; # (a with grave accent)

yields itself instead of a capital A with grave accent (U+00C0). This is true whether or not the character is part of a string which includes a character not storable in a single byte. Similarly

print "\x{e0}" =~ /\x{c0}/i, "\n";

will print an empty string on a line, as the match fails.

The same behavior occurs for all characters in this range that are marked in the Unicode standard as lower case and have single letter upper case equivalents.

The behavior that is inconsistent mostly occurs with upper case letters being mapped to lower case.

print lcfirst("\x{c0}aaaaa"), "\n";

doesn't change the first character. But

print lcfirst("\x{c0}aaaaa\x{101}"), "\n";

does change it. There is something seriously wrong when a character separated by an arbitrarily large distance from another one can affect what case the latter is considered to be. Similarly,

print "\x{c0}aaaaaa" =~ /^\x{e0}/i, "\n";

will show the match failing, but

print "\x{c0}aaaaaa\x{101}" =~ /^\x{e0}/i, "\n";

will show the match succeeding. Again, a character perhaps hundreds of positions further along in a string can affect whether the first character in said string matches its lower case equivalent when case is ignored.

The same behavior occurs for all characters in this range that are marked in the Unicode standard as upper case and have lower case equivalents, as well as U+00DF, which is lower case and has an upper case equivalent of the string 'SS'.

Also, the byte character classes inconsistently match characters in this range, again depending on whether or not the character is part of a larger string that contains a character greater than the range. So, for example, for a non-breaking space,

print "\xa0" =~ /^\s/, "\n";

will show that the match returns false but

print "\xa0\x{101}" =~ /^\s/, "\n";

will show that the match returns true. But this behavior is sort-of documented, and there is a work-around, which is to use the '\p{}' classes instead. Note that calling them byte character classes is wrong; they really are 7-bit classes.

From reading the documentation, I presume that the inconsistent behavior is a result of the decision to have perl not switch to wide-character mode in storing its strings unless necessary. I like that decision for efficiency reasons. But what has happened is that the code points in the range 128-255 have been orphaned when they aren't part of strings that force the switch. Again, I presume, but haven't tested, that using a locale causes them to work properly for that locale; but in the absence of a locale they should be treated as Unicode code points (or, equivalently for characters in this range, as ISO-8859-1). Storing as wide characters is supposed to be transparent to users, but this bug belies that and yields very inconsistent and unexpected behavior. (This doesn't explain the lower-to-upper case translation bug, which is wrong even in wide-character mode.)

I am frankly astonished that this bug exists, as I have come to expect perl to "Do the Right Thing" over the course of many years of using it. I did see one bug report of something similar to this when searching, but it apparently was misunderstood and went nowhere, and wasn't in the perl bug database.



Flags:
  category=core
  severity=high


Site configuration information for perl 5.10.0:

Configured by ActiveState at Wed May 14 05:06:16 PDT 2008.

Summary of my perl5 (revision 5 version 10 subversion 0) configuration:
  Platform:
    osname=linux, osvers=2.4.21-297-default, archname=i686-linux-thread-multi
    uname='linux gila 2.4.21-297-default #1 sat jul 23 07:47:39 utc 2005 i686 i686 i386 gnulinux '
    config_args='-ders -Dcc=gcc -Dusethreads -Duseithreads -Ud_sigsetjmp -Uinstallusrbinperl -Ulocincpth= -Uloclibpth= -Accflags=-DUSE_SITECUSTOMIZE -Duselargefiles -Accflags=-DPRIVLIB_LAST_IN_INC -Dprefix=/opt/ActivePerl-5.10 -Dprivlib=/opt/ActivePerl-5.10/lib -Darchlib=/opt/ActivePerl-5.10/lib -Dsiteprefix=/opt/ActivePerl-5.10/site -Dsitelib=/opt/ActivePerl-5.10/site/lib -Dsitearch=/opt/ActivePerl-5.10/site/lib -Dsed=/bin/sed -Duseshrplib -Dcf_by=ActiveState -Dcf_email=support@ActiveState.com'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='gcc', ccflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DUSE_SITECUSTOMIZE -DPRIVLIB_LAST_IN_INC -fno-strict-aliasing -pipe -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64'
    optimize='-O2'
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DUSE_SITECUSTOMIZE -DPRIVLIB_LAST_IN_INC -fno-strict-aliasing -pipe'
    ccversion='', gccversion='3.3.1 (SuSE Linux)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='gcc', ldflags=''
    libpth=/lib /usr/lib /usr/local/lib
    libs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
    libc=, so=so, useshrplib=true, libperl=libperl.so
    gnulibc_version='2.3.2'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E -Wl,-rpath,/opt/ActivePerl-5.10/lib/CORE'
    cccdlflags='-fPIC', lddlflags='-shared -O2'

Locally applied patches:
  ACTIVEPERL_LOCAL_PATCHES_ENTRY
  33741 avoids segfaults invoking S_raise_signal() (on Linux)
  33763 Win32 process ids can have more than 16 bits
  32809 Load 'loadable object' with non-default file extension
  32728 64-bit fix for Time::Local

@INC for perl 5.10.0:
  /opt/ActivePerl-5.10/site/lib
  /opt/ActivePerl-5.10/lib
  .

Environment for perl 5.10.0:
  HOME=/home/khw
  LANG=en_US.UTF-8
  LANGUAGE (unset)
  LD_LIBRARY_PATH (unset)
  LOGDIR (unset)
  PATH=/opt/ActivePerl-5.10/bin:/home/khw/bin:/home/khw/print/bin:/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/usr/games:/home/khw/cxoffice/bin
  PERL_BADLANG (unset)
  SHELL=/bin/ksh

p5pRT commented 15 years ago

From @moritz

karl williamson wrote:

# New Ticket Created by karl williamson
# Please include the string: [perl #58182]
# in the subject line of all future correspondence about this issue.
# <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=58182 >

This is a bug report for perl from corporate@khwilliamson.com, generated with the help of perlbug 1.36 running under perl 5.10.0.

-----------------------------------------------------------------
Characters in the range U+0080 through U+00FF behave inconsistently depending on whether or not they are part of a string which also includes a character above that range, and in some cases they behave incorrectly even when part of such a string. The problems I will concentrate on in this report are those involving case.

I presume that they do work properly when a locale is set, but I haven't tested that.

print uc("\x{e0}"), "\n"; # (a with grave accent)

yields itself instead of a capital A with grave accent (U+00C0). This is true whether or not the character is part of a string which includes a character not storable in a single byte. Similarly

This is a known bug, and probably not fixable, because too much code depends on it. See http://search.cpan.org/perldoc?Unicode::Semantics

A possible workaround:

  my $x = "\x{e0}";
  utf8::upgrade($x);
  say uc($x);   # yields À

Cheers, Moritz

p5pRT commented 15 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 15 years ago

From @druud62

karl williamson wrote:

The behavior that is inconsistent mostly occurs with upper case letters being mapped to lower case.

print lcfirst("\x{c0}aaaaa"), "\n";

doesn't change the first character. But

print lcfirst("\x{c0}aaaaa\x{101}"), "\n";

does change it.

To me that is as expected.

print lcfirst substr "\x{100}\x{c0}aaaaa", 1;

Lowercasing isn't defined for as many characters in ASCII or Latin-1 as it is in Unicode. Unicode semantics get activated when a codepoint above 255 is involved.

-- Affijn, Ruud

"Gewoon is een tijger."

"Gewoon is een tijger."

p5pRT commented 15 years ago

From @nothingmuch

On Thu, Aug 21, 2008 at 13:22:36 +0200, Dr.Ruud wrote:

Unicode semantics get activated when a codepoint above 255 is involved.

Or a code point above 127 with use utf8 or use encoding.

-- Yuval Kogman <nothingmuch@woobling.org> http://nothingmuch.woobling.org 0xEBD27418

p5pRT commented 15 years ago

From @nothingmuch

On Thu, Aug 21, 2008 at 14:31:40 +0300, Yuval Kogman wrote:

Or a code point above 127 with use utf8 or use encoding

I should clarify that this applies only to string constants.

A code point above 127 will be treated as Unicode if the string is properly marked as such, and the way to achieve that for string constants is 'use utf8'.

-- Yuval Kogman <nothingmuch@woobling.org> http://nothingmuch.woobling.org 0xEBD27418

p5pRT commented 15 years ago

From @khwilliamson

I'm the person who submitted this bug report. I think this bug should be fixed in Perl 5, and I'm volunteering to do it. Towards that end, I downloaded the Perl 5.10 source and hacked up an experimental version that seems to fix it. And now I've joined this list to see how to proceed. I don't know the protocol involved, so I'll just jump in, and hopefully that will be all right.

To refresh your memory, the current implementation of perl on non-EBCDIC machines is problematic for characters in the range 128-255 when no locale is set.

The slides from the talk "Working around *the* Unicode bug" during YAPC::Europe 2007 in Vienna: http://juerd.nl/files/slides/2007yapceu/unicodesemantics.html give more cases of problems than were in my bug report.

The crux of the problem is that on non-EBCDIC machines, in the absence of locale, in order to have meaningful semantics, a character (or code point) has to be stored in utf8, except in pattern matching the \h, \H, \v and \V or any of the \p{} patterns. (This leads to an anomaly with the no-break space, which is considered to be horizontal space (\h) but not space (\s).) (The characters also always have base semantics of having an ordinal number, and also of being not-a-anything, meaning that they all pattern match \W, \D, \S, [[:^punct:]], etc.)

Perl stores characters as utf8 automatically if a string contains any code points above 255, and ASCII code points are unaffected, since they are stored identically either way. That leaves a hole-in-the-doughnut of characters between 128 and 255 with behavior that varies depending on whether they are stored as utf8 or not. This is contrary, for example, to the Camel book: "character semantics are preserved at an abstract level regardless of representation" (p. 403). (How they get stored depends on how they were input, or whether or not they are part of a longer string containing code points larger than 255, or if they have been explicitly set by using utf8::upgrade or utf8::downgrade.)

I know of three areas where this leads to problems.

The first is the pattern matching already alluded to. This is at least documented (though somewhat confusingly). And one can use the \p{} constructs to avoid the issue.

The second is case changing functions, like lcfirst() or \U in pattern substitutions.

And the third is ignoring case in pattern matches.

There may be others which I haven't looked for yet. I think, for example, that quotemeta() will escape all these characters, though I don't believe that this causes a real problem.

One response I got to my bug report was that a lot of code depends on things working the way they currently do. I'm wondering if that applies to all three of the areas, or just the first?

Also, from reading the perl source, it appears to me that EBCDIC machines may work differently (and more correctly, to my way of thinking) than ASCII-ish ones.

An idea I've had is to add a pragma like "use latin1", or maybe "use locale unicode", or something else, as a way of not breaking existing application code.

Anyway, I'm hoping to get some sort of fix in for this. In my experimental implementation (which currently doesn't change EBCDIC handling), it is mostly just extending the existing definitions of ascii semantics to include the 128..255 latin1 range. Code logic changes were required only in the uc and ucfirst functions (to accommodate 3 characters which require special handling), and in the regular expression compilation (to accommodate 2 characters which need special handling). Obviously, in my ignorance, I may be missing things that others can enlighten me on.

So I'd like to know how to proceed.

Karl Williamson

p5pRT commented 15 years ago

From perl@nevcal.com

On approximately 9/20/2008 3:52 PM, came the following characters from the keyboard of karl williamson:

[Karl Williamson's message quoted in full; see above.]

I applaud your willingness to dive in.

For compatibility reasons, as has been discussed on this list previously, a pragma of some sort must be used to request the incompatible enhancement (which you call a fix).

N.B. There are lots of discussions about it in the archive, some recent. If you haven't found them, you should; if you find it hard to find them, ask, and I (or someone) will try to find the starting points for you. The summaries would be a good place to look to find the discussions; I participated in most of them.

Those discussions are lengthy reading, unfortunately, but they do point out an extensive list of issues, perhaps approaching completeness.

-- Glenn -- http://nevcal.com/

A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

p5pRT commented 15 years ago

From @andk

On Sat, 20 Sep 2008 16:52:02 -0600, karl williamson <contact@khwilliamson.com> said:

> I'm the person who submitted this bug report. I think this bug should
> be fixed in Perl 5, and I'm volunteering to do it. Towards that end, I
> downloaded the Perl 5.10 source and hacked up an experimental version
> that seems to fix it. And now I've joined this list to see how to
> proceed. I don't know the protocol involved, so I'll just jump in, and
> hopefully that will be all right.

Thank you! As for the protocol: do not patch 5.10, patch bleadperl instead.

-- andreas

p5pRT commented 15 years ago

From @rgs

2008/9/21 karl williamson <contact@khwilliamson.com>:

[Karl's summary of the three problem areas quoted in full; see above.]

This is a good summary of the issues.

One response I got to my bug report was that a lot of code depends on things working the way they currently do. I'm wondering if that applies to all three of the areas, or just the first?

In general, one finds that people write code relying on almost anything...

Also, from reading the perl source, it appears to me that EBCDIC machines may work differently (and more correctly to my way of thinking) than Ascii-ish ones.

That's probable in theory, but we don't have testers on EBCDIC machines these days...

An idea I've had is to add a pragma like "use latin1", or maybe "use locale unicode", or something else as a way of not breaking existing application code.

I think that the current Unicode bugs are annoying enough to deserve an incompatible change in perl 5.12. However, for perl 5.10.x, something could be added to switch to a more correct behaviour, if possible without slowing everything down...

[Karl's description of his experimental implementation quoted in full; see above.]

So I'd like to know how to proceed

If you're a git user, you can work on a branch cloned from git://perl5.git.perl.org/perl.git

Do not hesitate to ask questions here.

p5pRT commented 15 years ago

From @Juerd

Moritz Lenz wrote on 2008-08-21 9:50 (+0200):

This is a known bug, and probably not fixable, because too much code depends on it.

It is fixable, and the backwards incompatibility has already been announced in perl5100delta:

| The handling of Unicode still is unclean in several places, where it's
| dependent on whether a string is internally flagged as UTF-8. This will
| be made more consistent in perl 5.12, but that won't be possible without
| a certain amount of backwards incompatibility.

It will be fixed, and it's wonderful to have a volunteer for that!

-- Met vriendelijke groet, Kind regards, Korajn salutojn,

Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig>
Convolution: ICT solutions and consultancy <sales@convolution.nl>
1;

p5pRT commented 15 years ago

From @Juerd

Dr.Ruud wrote on 2008-08-21 13:22 (+0200):

Unicode semantics get activated when a codepoint above 255 is involved.

No, Unicode semantics get activated when the internal encoding of the string is utf8, even if it contains no character above 255, and even if it only contains ASCII characters.

It's a bug. A known and old bug, but it must be fixed some time.

-- Met vriendelijke groet, Kind regards, Korajn salutojn,

Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig>
Convolution: ICT solutions and consultancy <sales@convolution.nl>
1;

p5pRT commented 15 years ago

From @Juerd

karl williamson wrote on 2008-09-20 16:52 (-0600):

One response I got to my bug report was that a lot of code depends on things working the way they currently do. I'm wondering if that applies to all three of the areas, or just the first?

All three, but rest assured that this has already been discussed in great detail, and that the pumpking's decision was that backwards incompatibility would be better than keeping the bug.

This decision is clearly reflected in perl5100delta:

| The handling of Unicode still is unclean in several places, where it's
| dependent on whether a string is internally flagged as UTF-8. This will
| be made more consistent in perl 5.12, but that won't be possible without
| a certain amount of backwards incompatibility.

Please proceed with fixing the bug. I am very happy with your offer to smash this one.

Also, from reading the perl source, it appears to me that EBCDIC machines may work differently (and more correctly to my way of thinking) than Ascii-ish ones.

As always, I refrain from thinking about EBCDIC. I'd say: keep the current behavior for EBCDIC platforms - there haven't been *any* complaints from them as far as I've heard.

An idea I've had is to add a pragma like "use latin1", or maybe "use locale unicode", or something else as a way of not breaking existing application code.

Please do break existing code, harsh as that may be. It is much more likely that broken code magically starts working correctly, by the way.

Pragmas have problems, especially in regular expressions. And it's very hard to load a pragma conditionally, which makes writing version-portable code hard. Besides that, any pragma affecting regex matches needs to be carried in qr//, which in this case means new regex flags to indicate the behavior for (?i:...). According to dmq, adding flags is hard.

Obviously, in my ignorance, I may be missing things that others can enlighten me on.

Please feel free to copy the unit tests in Unicode::Semantics!

-- Met vriendelijke groet, Kind regards, Korajn salutojn,

Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig>
Convolution: ICT solutions and consultancy <sales@convolution.nl>
1;

p5pRT commented 15 years ago

From @Juerd

Glenn Linderman wrote on 2008-09-20 16:31 (-0700):

For compatibility reasons, as has been discussed on this list previously, a pragma of some sort must be used to request the incompatible enhancement (which you call a fix).

As the current behavior is a bug, the enhancement can rightfully be called a fix.

What's this about the pragma that "must be used"? Yes, it has been discussed, but no consensus has pointed in that direction.

In fact, perl5100delta clearly announces backwards incompatibility.

-- Met vriendelijke groet, Kind regards, Korajn salutojn,

Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig>
Convolution: ICT solutions and consultancy <sales@convolution.nl>
1;

p5pRT commented 15 years ago

From @druud62

Juerd Waalboer wrote:

Dr.Ruud:

Unicode semantics get activated when a codepoint above 255 is involved.

No, unicode semantics get activated when the internal encoding of the string is utf8, even if it contains no character above 255, and even if it only contains ASCII characters.

Yes, Unicode semantics get activated when a codepoint above 255 is involved.

Yes, there are other ways too, like:

  perl -Mstrict -Mwarnings -Mencoding=utf8 -le'
    my $s = chr(65);
    print utf8::is_utf8($s);
  '
  1

-- Affijn, Ruud

"Gewoon is een tijger."

p5pRT commented 15 years ago

From @ikegami

On Sat, Sep 20, 2008 at 6:52 PM, karl williamson <contact@khwilliamson.com> wrote:

There may be others which I haven't looked for yet. I think, for example, that quotemeta() will escape all these characters, though I don't believe that this causes a real problem.

There are inconsistencies with quotemeta (and therefore \Q):

  perl -wle "utf8::downgrade( $x = chr(130) ); print quotemeta $x"
  \é

  perl -wle "utf8::upgrade( $x = chr(130) ); print quotemeta $x"
  é

p5pRT commented 15 years ago

From @iabyn

On Mon, Sep 22, 2008 at 09:55:23PM +0200, Juerd Waalboer wrote:

It's a bug. A known and old bug, but it must be fixed some time.

Here's a general suggestion related to fixing Unicode-related issues.

A well-known issue is that the SVf_UTF8 flag means two different things:

  1) whether the 'sequence of integers' is stored one per byte, or uses the variable-length UTF-8 encoding scheme;

  2) what semantics apply to that sequence of integers.

We also have various bodges, such as attaching magic to cache utf8 indexes.

All this stems from the fact that there's no space in an SV to store all the information we want. So....

How about we remove the SVf_UTF8 flag from SvFLAGS and replace it with an Extended String flag. This flag indicates that prepended to the SvPVX string is an auxiliary structure (cf. the hv_aux struct) that contains all the extra needed unicodish info, such as encoding, charset, locale, cached indexes, etc. This then both allows us to disambiguate the meaning of SVf_UTF8 (in the aux structure there would be two different flags for the two meanings), and would also provide room for future enhancements (e.g. space for a UTF32 flag, should someone wish to implement that storage format).

Just a thought...

-- "I do not resent criticism, even when, for the sake of emphasis, it parts for the time with reality." -- Winston Churchill, House of Commons, 22nd Jan 1941.

p5pRT commented 15 years ago

From @Juerd

Dave Mitchell wrote on 2008-09-23 17:03 (+0100):

How about we remove the SVf_UTF8 flag from SvFLAGS and replace it with an Extended String flag. This flag indicates that prepended to the SvPVX string is an auxiliary structure (cf. the hv_aux struct) that contains all the extra needed unicodish info, such as encoding, charset, locale, cached indexes, etc.

It sounds rather complicated, whereas the current plan would be to continue with the single bit flag, and only remove one of its meanings.

-- Met vriendelijke groet, Kind regards, Korajn salutojn,

Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig>
Convolution: ICT solutions and consultancy <sales@convolution.nl>
1;

p5pRT commented 15 years ago

From perl@nevcal.com

On approximately 9/23/2008 9:58 AM, came the following characters from the keyboard of Juerd Waalboer:

Dave Mitchell wrote on 2008-09-23 17:03 (+0100):

How about we remove the SVf_UTF8 flag from SvFLAGS and replace it with an Extended String flag. This flag indicates that prepended to the SvPVX string is an auxiliary structure (cf. the hv_aux struct) that contains all the extra needed unicodish info, such as encoding, charset, locale, cached indexes, etc.

It is not at all clear to me that encoding, charset, and locale are Unicodish info... Unicode frees us from such stuff, except at boundary conditions, where we must deal with devices or formats that have limitations. This extra information seems more appropriately bound to file/device handles than to strings.

Cached indexes are a nice performance help; I don't know enough about the internals to know whether reworking them from being done as magic to being done in some frightfully new structure (thinking of XS) would be an overall win or loss.

It sounds rather complicated, whereas the current plan would be to continue with the single bit flag, and only remove one of its meanings.

I guess Juerd is referring to removing any semantic meaning of the flag, and leaving it to simply be a representational flag? That representational flag would indicate that the structure of the string is single-byte oriented (no individual characters exceed a numeric value of 255), or multi-byte oriented (characters may exceed a numeric value of 255, and characters greater than a numeric value of 127 will be stored in multiple, sequential bytes).

After such a removal\, present-perl would reach the idyllic state (idyllic-perl) of implementing only Unicode semantics for all string operations. (Even the EBCDIC port should reach that idyllic state\, although it would use a different encoding of numbers to characters\, UTF-EBCDIC instead of UTF-8.) If other encodings are desired/used\, there would be two application approaches to dealing with it​:

1) convert all other encodings to Unicode\, perform semantic operations as needed\, convert the results to some other encoding. This is already the recommended approach\, although present-perl's attempt to retain the single-byte oriented representational format as much as possible presently makes this a bit tricky.

2) leave data in other encodings\, but avoid the use of Perl operations that apply Unicode semantics in manners that are inconsistent with the semantics of the other encoding. Write specific code to implement the semantics of the other encoding as needed\, without doing the re-coding.   This could be somewhat error prone\, but could be achieved\, since\, after all\, strings are simply an ordered list of numbers\, to which any application semantics that are desired can be applied. Idyllic-perl simply provides a fairly large collection of string operations that have Unicode semantics\, which are inappropriate for use with strings having other semantics.

Note that binary data in strings is simply a special case of strings with non-Unicode semantics... In present-perl\, there are three sets of string semantics selected by the representation\, ASCII (operations like character classes and case shifting)\, Latin-1 (the only operation that supports Latin-1 semantics is the conversion from single-byte representation to multi-byte representation)\, and Unicode (operations like character classes and case shifting). It is already inappropriate to apply operations that imply ASCII or Unicode semantics to binary strings of either representation. Applying the representation conversion operation to binary data is perfectly legal\, and doesn't change the binary values in any way... but is generally not a mental shift that most programmers wish to make in dealing with binary data--most prefer their binary data to remain in the single-byte oriented representation\, and they are welcome to code in such a manner that they do.
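The representation-dependent split described above can be demonstrated directly. A minimal sketch of the default (pre-`unicode_strings`) behavior:

```perl
my $s = "\xe0";          # LATIN SMALL LETTER A WITH GRAVE, single-byte form
my $t = $s;
utf8::upgrade($t);       # same character, multi-byte (utf8) representation

my $same = ($s eq $t);   # true: eq compares characters, not representations
my $uc_s = uc($s);       # "\xe0": ASCII semantics, casing does nothing
my $uc_t = uc($t);       # "\xc0": Unicode semantics apply
```

The two scalars hold the identical one-character string, yet uc gives different answers depending on the internal storage format — which is exactly the bug this ticket reports.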

-- Glenn -- http://nevcal.com/

A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

p5pRT commented 15 years ago

From @khwilliamson

Glenn,

The reason I called it a bug is that I, an experienced Perl programmer, attempted to enhance an application to understand Unicode. I read the Camel book and the on-line documentation and came to a very different expectation of how it worked than how it works in reality. I then thought I was scouring the documentation when things went wrong, and still didn't get it. It was only after a lot of experimentation and some internet searches that I started to see the cause of the problem. I was using 5.8.8; perhaps the documentation has changed in 5.10. And perhaps my own expectations of how I thought it should work caused me to be blind to things in the documentation that were contrary to my preconceived notions.

Whatever one calls it, there does seem to be some support for changing the behavior. After reading your response and further reflection, I think that Goal #1 of not breaking old programs is contradictory to the other ones. Indeed, a few regression tests fail with my experimental implementation. Some of them are commented that they are taking advantage of the anomaly to verify that the operation they test doesn't change the utf8-ness of the data. Others explicitly test that, for example, taking lc(E with an accent) returns itself unless an appropriate locale is specified. I doubt that the code that test was for really cares, but if so, why put in the test? There are a couple of failures which are obtuse, and uncommented, so I haven't yet tried to figure out what was going on. I wanted to see if I should proceed at all before doing so.

I have looked in the archive and found some discussions about this problem, but certainly not a lot. Please let me know of ones you think important that I read.

Karl Williamson

Glenn Linderman wrote:

For compatibility reasons, as has been discussed on this list previously, a pragma of some sort must be used to request the incompatible enhancement (which you call a fix).

N.B. There are lots of discussions about it in the archive, some recent; if you haven't found them, you should. If you find it hard to find them, ask, and I (or someone) will try to find the starting points for you; perhaps the summaries would be a good place to look to find the discussions. I participated in most of them.

Those discussions are lengthy reading, unfortunately, but they do point out an extensive list of issues, perhaps approaching completeness.

p5pRT commented 15 years ago

From perl@nevcal.com

On approximately 9/23/2008 10:33 AM, came the following characters from the keyboard of karl williamson:

Glenn,

The reason I called it a bug is that I, an experienced Perl programmer, attempted to enhance an application to understand Unicode. I read the Camel book and the on-line documentation and came to a very different expectation of how it worked than how it works in reality.

The behavior is non-obvious. I may be blind to the deficiencies of the documentation, because of knowing roughly how it works, due to hanging out on p5p too long :) It has been an open discussion whether it is working as designed (with lots of gotchas for the application programmer), or whether the design, in fact, is the bug. It seems Rafael has declared it to be a bug in the 5.10 release notes, and something that can/should be incompatibly changed/fixed for 5.12, but I missed that declaration. Any solution for 5.8.x or 5.10.x, though, would have to be treated as an enhancement, turned on by a pragma, because the current design, buggy or not, is the current design for which applications are coded.

I then thought I was scouring the documentation when things went wrong, and still didn't get it. It was only after a lot of experimentation and some internet searches that I started to see the cause of the problem. I was using 5.8.8; perhaps the documentation has changed in 5.10. And perhaps my own expectations of how I thought it should work caused me to be blind to things in the documentation that were contrary to my preconceived notions.

The documentation has been in as much flux as the code, from 5.6.x to 5.8.x to 5.10.x. Unfortunately, there are enough warts in the design that it is hard to find all the places where the documentation should be clarified. My most recent message to p5p clarifies what I think is the idyllic state that I hope is the one that you share, and will achieve for 5.12 (or for a pragma-enabled 5.10) redesign/bug-fix.

Whatever one calls it, there does seem to be some support for changing the behavior. After reading your response and further reflection, I think that Goal #1 of not breaking old programs is contradictory to the other ones.

Yes, there is a definite conflict between those goals, and from that conflict arise many of the behaviours that are not expected by reasonable programmers when designing their application code.

Indeed, a few regression tests fail with my experimental implementation. Some of them are commented that they are taking advantage of the anomaly to verify that the operation they test doesn't change the utf8-ness of the data. Others explicitly test that, for example, taking lc(E with an accent) returns itself unless an appropriate locale is specified. I doubt that the code that test was for really cares, but if so, why put in the test? There are a couple of failures which are obtuse, and uncommented, so I haven't yet tried to figure out what was going on. I wanted to see if I should proceed at all before doing so.

Sure. Please proceed. Especially with Rafael's openness to incompatible changes in this area for 5.12, it would be possible to remove all of the warts, conflicts, and unexpected behaviours. Of course, incompatible changes are always considered for major releases, but not always accepted. But this area seems to have a green light.

The current situation is very painful, compared to other languages that implement Unicode. The compatibility issue was very real, however, when the original design was done, no doubt partly due to Perl's extensive CPAN collection, particularly the XS part of that CPAN collection. Some of that concern has been alleviated due to enhancements to the XS code in the intervening years, although no doubt you may encounter bugs in some of those enhancements, also.

I have looked in the archive and found some discussions about this problem, but certainly not a lot. Please let me know of ones you think important that I read.

The discussions are more lengthy (per post, and per number of posts) than numerous (by thread count)... and often contain more heat than light. Perhaps you've found them all.

Given Rafael's green light, and if you are aiming at changes for Perl 5.12, the most important thing is to cover all the relevant operations, so that all string operations apply Unicode semantics to all their operands, regardless of their representational format.

Here is one thread:

Subject: "on the almost impossibility to write correct XS modules", started by Marc Lehmann on April 25, 2008, and lasting with that subject line until at least May 22! So almost a whole month!

demerphq spawned a related thread, subject "On the problem of strings and binary data in Perl.", on May 20, 2008. This attempted to deal with multi-lingual strings; there is more to the issue of proper handling of multi-lingual strings than being able to represent all the characters that each one uses, but that is a very specialized type of program. At least being able to represent the characters is a good start; being able to pass "language" as an operand to certain semantic operations would be good (implicitly, via locale, or explicitly, via a parameter).

Another related issue is that various operations that attempt to implement Unicode semantics don't go the whole way, and have interesting semantics when strings (even strings represented in multi-byte format) don't actually contain Unicode. Idyllic-perl should have chr/ord as simple ways to convert between numbers and characters, and not burden them with any sort of Unicode semantics. See bug #51936, and the p5p thread it spawned (search for the bug number in the archives). See also bug #51710 and the threads it spawned, about utf8_valid. While utf8_valid probably should be enhanced, its existence is probably reasonable justification not to burden chr/ord with Unicode validity checks.

Let's not forget pack and unpack. There's one thread about that with Subject "Perl 5.8 and perl 5.10 differences on UTF/Pack things", started June 18, 2008... and a much older one started by Marc Lehmann (not sure what that subject line was, but it resulted in a fairly recent change to pack, by Marc).

Other related threads have the following subject lines:

use encoding 'utf8' bug for Latin-1 range
proposed change in utf8 filename semantics
Compress::Zlib, pack "C" and utf-8
Smack! (this spawned some other threads that kept Smack! in their subjects, but which were added to)
perl, the data, and the tf8 flag
the utf8 flag
encoding neutral unpack

The philosophy should be that no Perl operation should have different semantics based on the representation of the string being in single-byte or multi-byte format. Operations and places to watch out for include (one of these threads attempted a complete list of operations that had different semantics; this is my memory of some of them):

String constant metacharacters such as \u \U \l \L
case shifting code such as uc & lc
regexp case insensitivity and character classes
chr/ord
utf8::is_valid

pack/unpack -- packing should always produce a single-byte string, and unpack should generally expect a single-byte string... but if, for some reason, unpack is handed a multi-byte string, it should not pretend it really should have been a single-byte string; instead, it should interpret the string as input characters. If there are any input characters actually greater than 255, this should probably be considered a bug, because pack doesn't produce such. Perhaps Marc's fix was the last issue along that line for unpack...
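The claim that packing produces a single-byte string is easy to check for the common numeric and string templates (pack "U" is the historical exception, which is part of what the threads above argued about). A sketch, with the "N" template chosen arbitrarily:

```perl
# pack builds its result byte by byte; for ordinary templates the output
# is a plain byte string, never UTF-8-flagged.
my $packed = pack("N", 0x10000);        # 32-bit big-endian unsigned
my $bytes  = length($packed);           # 4
my $flag   = utf8::is_utf8($packed);    # false
```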

-- Glenn -- http://nevcal.com/

A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

p5pRT commented 15 years ago

From @rgs

2008/9/23 Dave Mitchell <davem@iabyn.com>:

How about we remove the SVf_UTF8 flag from SvFLAGS and replace it with an Extended String flag. This flag indicates that prepended to the SvPVX string is an auxiliary structure (cf the hv_aux struct) that contains all the extra needed unicodish info, such as encoding, charset, locale, cached indexes etc etc.

I don't think we want to store the charset/locale with the string.

Consider the string "istanbul". If you're treating this string as English, you'll capitalize it as "ISTANBUL", but if you want to follow the Stambouliot spelling, it's "İSTANBUL".

Now consider the string "Consider the string "istanbul"". Shall we capitalize it as "CONSİDER THE STRİNG "İSTANBUL""? Obviously, attaching a language to a string is going to be a problem when you have to handle multi-language strings.

So the place that makes sense to provide this information is, in my opinion, in lc and uc (and derivatives): in the code, not the data. (So a pragma can be used, too.)
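The point can be seen in code: core uc applies language-independent Unicode casing, so Turkish-specific behavior has to be requested explicitly by the caller. A sketch, where the s/// mapping stands in for a real locale-aware casing layer:

```perl
# Default uc knows nothing about Turkish: "i" maps to plain "I".
my $english = uc("istanbul");                 # "ISTANBUL"

# Stambouliot casing must be spelled out explicitly, e.g. by mapping i to
# U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE) before uppercasing.
(my $turkish = "istanbul") =~ s/i/\x{130}/g;
$turkish = uc($turkish);                      # "\x{130}STANBUL"
```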

p5pRT commented 15 years ago

From @khwilliamson

I have been studying some of the discussions in this group about this problem, and find them overwhelming. So I'm going to just put forth a simple straw proposal that doesn't address a number of the things that people were talking about, but does solve a lot of things.

This is a very concrete proposal, and I would like to get agreement on the semantics involved: There will be a new mode of operation, which will be enabled or disabled by means yet to be decided. When enabled, the new behavior will be that a character in a scalar or pattern will have the same semantics whether or not it is stored as utf8. The operations that are affected are lc(), lcfirst(), uc(), ucfirst(), quotemeta() and pattern matching (including \U, \u, \L, and \l, and matching of things like \w, [[:punct:]]). This is effectively what would happen if we were operating under an iso-8859-1 locale, with the following modifications to get full Unicode semantics:

1) uc(LATIN SMALL LETTER Y WITH DIAERESIS) will be LATIN CAPITAL LETTER Y WITH DIAERESIS. Same for ucfirst, and \U and \u in pattern substitutions. The result will be in utf8, since the capital letter is above 0xff.
2) uc(MICRO SIGN) will be GREEK CAPITAL LETTER MU. Same for ucfirst, and \U and \u in pattern substitutions. The result will be in utf8, since the capital letter is above 0xff.
3) uc(LATIN SMALL LETTER SHARP S) will be a string consisting of LATIN CAPITAL LETTER S followed by itself; i.e., 'SS'. Same for \U in pattern substitutions. The result will have the same utf8-ness as the original.
4) ucfirst(LATIN SMALL LETTER SHARP S) will be a string consisting of the two characters LATIN CAPITAL LETTER S followed by LATIN SMALL LETTER S; i.e., 'Ss'. Same for \u in pattern substitutions. The result will have the same utf8-ness as the original.
5) If the MICRO SIGN is in a pattern with case ignored, it will match itself and both GREEK CAPITAL LETTER MU and GREEK SMALL LETTER MU.
6) If the LATIN SMALL LETTER SHARP S is in a pattern with case ignored, it will match itself and any of 'SS', 'Ss', 'ss'.
7) If the LATIN SMALL LETTER Y WITH DIAERESIS is in a pattern with case ignored, it will match itself and LATIN CAPITAL LETTER Y WITH DIAERESIS.
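For reference, the first four rules can be exercised on a modern perl, where use feature 'unicode_strings' (introduced in 5.12, essentially the mode proposed here) is available. A sketch:

```perl
use feature 'unicode_strings';   # Unicode semantics regardless of storage

my $y  = uc("\x{ff}");           # rule 1: "\x{178}", Y WITH DIAERESIS
my $mu = uc("\x{b5}");           # rule 2: "\x{39c}", GREEK CAPITAL LETTER MU
my $ss = uc("\x{df}");           # rule 3: "SS"
my $Ss = ucfirst("\x{df}");      # rule 4: "Ss" (the titlecase mapping)
```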

This mode would not impose a compile-time latin1-like locale on the perl program. For example, whether perl identifiers could have a LATIN SMALL LETTER Y WITH DIAERESIS in them or not would not be affected by this mode.

I do not propose to automatically convert ("downgrade") strings from utf8 to latin1 when utf8 is not needed. For example, lc(LATIN CAPITAL LETTER Y WITH DIAERESIS) would return a string still in utf8 encoding.

I don't know what to do about EBCDIC machines. I propose leaving them to work the way they currently do.

I don't know what to do about interacting with "use bytes". One option is for them to be mutually incompatible; that is, if you turn one on, it turns the other off. Another option is, if both are in effect, that it would be exactly the same as if a latin1 run-time locale were set, without any of the modifications listed above.

Are there other interactions that we need to worry about?

I would like to defer how this mode gets enabled or disabled until we agree on the semantics of what happens when it is enabled.

I think that a number of the issues that have been raised in the past are in some way independent of this proposal. We may want to do some of them, but should we do at least this much, or not?

p5pRT commented 15 years ago

From vadim@vkonovalov.ru

karl williamson wrote:

I have been studying some of the discussions in this group about this problem, and find them overwhelming. So I'm going to just put forth a simple straw proposal that doesn't address a number of the things that people were talking about, but does solve a lot of things.

This is a very concrete proposal, and I would like to get agreement on the semantics involved: There will be a new mode of operation which will be enabled or disabled by means yet to be decided. When enabled, the new behavior will be that a character in a scalar or pattern will have the same semantics whether or not it is stored as utf8. The operations that are affected are lc(), lcfirst(), uc(), ucfirst(), quotemeta() and pattern matching (including \U, \u, \L, and \l, and matching of things like \w, [[:punct:]]). This is effectively what would happen if we were operating under an iso-8859-1 locale

What is "under an iso-8859-1 locale", exactly?

Reading perllocale gives me:

USING LOCALES
  The use locale pragma

  By default, Perl ignores the current locale. The "use locale" pragma tells Perl to use the current locale for some operations:

Do I understand correctly that your proposal will never touch me, provided that I never do "use locale;"? You do not mean the POSIX locale, do you?

Do I remember correctly that using locales is not recommended in Perl?

with the following modifications to get full Unicode semantics: 1) uc(LATIN SMALL LETTER Y WITH DIAERESIS) will be LATIN CAPITAL LETTER Y WITH DIAERESIS. Same for ucfirst, and \U and \u in pattern substitutions. The result will be in utf8, since the capital letter is above 0xff.

Could you please be more precise with uc(blablabal)?

What you currently wrote is a syntax error.

....

Best regards, Vadim.

p5pRT commented 15 years ago

From @khwilliamson

What I meant is not a literal locale, but that the semantics would be the same as for iso-8859-1 characters, with the listed modifications. I was trying to avoid listing all the 8859-1 semantics. But in brief, there are 128 characters above ASCII in 8859-1, and they each have semantics. 0xC0, for example, is a Latin capital letter A with a grave accent. Its lower case is 0xE0. If you are on a Un*x-like system, you can type 'man latin1' at a command line prompt to get the entire list. It doesn't, however, say which things are punctuation, which are word characters, etc. But they are the same in Unicode, so the Unicode standard lists all of them. Characters that are listed in the man page as capital all have corresponding lower case versions that are easy to figure out by their names. The three characters I mentioned as modifications to get Unicode are considered lower case and have either multiple-character upper case versions, or their upper case version is not in latin1.

My proposal would touch you UNLESS you do have a 'use locale'. Your locale would override my proposal. In other words, by specifying "use locale", my proposal would not touch your program. The documentation does say not to use locales, but in looking at the code, it appears to me that a locale takes precedence, and does work OK. I believe that you can get many of the Perl glitches to go away by having a locale which specifies iso-8859-1. But I haven't actually tried it.
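The capital/lowercase correspondence described above is mechanical: in Latin-1, as in ASCII, each capital letter's lowercase form is 0x20 higher. A sketch that enumerates the pairs (note 0xD7, the multiplication sign, is not a letter, and 0xDF and 0xFF fall outside this simple pattern, which is exactly why they needed the listed modifications):

```perl
# Enumerate the Latin-1 case pairs over the capital range 0xC0..0xDE.
my @pairs;
for my $cap (0xC0 .. 0xDE) {
    next if $cap == 0xD7;             # MULTIPLICATION SIGN: not a letter
    push @pairs, [$cap, $cap + 0x20]; # lowercase is 0x20 above uppercase
}
printf "%02X <-> %02X\n", @$_ for @pairs;   # C0 <-> E0, C1 <-> E1, ...
```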

Vadim Konovalov wrote:

karl williamson wrote:

I have been studying some of the discussions in this group about this problem, and find them overwhelming. So I'm going to just put forth a simple straw proposal that doesn't address a number of the things that people were talking about, but does solve a lot of things.

This is a very concrete proposal, and I would like to get agreement on the semantics involved: There will be a new mode of operation which will be enabled or disabled by means yet to be decided. When enabled, the new behavior will be that a character in a scalar or pattern will have the same semantics whether or not it is stored as utf8. The operations that are affected are lc(), lcfirst(), uc(), ucfirst(), quotemeta() and pattern matching (including \U, \u, \L, and \l, and matching of things like \w, [[:punct:]]). This is effectively what would happen if we were operating under an iso-8859-1 locale

What is "under an iso-8859-1 locale", exactly?

Reading perllocale gives me:

USING LOCALES
  The use locale pragma

  By default, Perl ignores the current locale. The "use locale" pragma tells Perl to use the current locale for some operations:

Do I understand correctly that your proposal will never touch me, provided that I never do "use locale;"? You do not mean the POSIX locale, do you?

Do I remember correctly that using locales is not recommended in Perl?

with the following modifications to get full Unicode semantics: 1) uc(LATIN SMALL LETTER Y WITH DIAERESIS) will be LATIN CAPITAL LETTER Y WITH DIAERESIS. Same for ucfirst, and \U and \u in pattern substitutions. The result will be in utf8, since the capital letter is above 0xff.

Could you please be more precise with uc(blablabal)?

What you currently wrote is a syntax error.

....

Best regards, Vadim.

p5pRT commented 15 years ago

From perl@nevcal.com

On approximately 9/26/2008 11:44 AM, came the following characters from the keyboard of karl williamson:

I have been studying some of the discussions in this group about this problem, and find them overwhelming. So I'm going to just put forth a simple straw proposal that doesn't address a number of the things that people were talking about, but does solve a lot of things.

Yeah, I gave you a lot of reading material. I hoped not to scare you off, but I didn't want you to be ignorant of the previous discussions, do a bunch of work that only solved part of the problems, and have it rejected because it wasn't a complete solution.

This is a very concrete proposal, and I would like to get agreement on the semantics involved: There will be a new mode of operation which will be enabled or disabled by means yet to be decided.

This makes it sound like you are targeting 5.10.x, since you are talking about modes of operation. On the other hand, if the implementation isn't significantly more complex than the current code, keeping the current behavior might be a safe approach, even if somewhere, somehow, the new behavior decides to become the default behavior.

When enabled, the new behavior will be that a character in a scalar or pattern will have the same semantics whether or not it is stored as utf8. The operations that are affected are lc(), lcfirst(), uc(), ucfirst(), quotemeta() and pattern matching (including \U, \u, \L, and \l, and matching of things like \w, [[:punct:]]).

This sounds like it might be a complete list of operations. I think \u, \U, \l, and \L are string interpolation operators rather than pattern matching operators, but that is just terminology.

This is effectively what would happen if we were operating under an iso-8859-1 locale with the following modifications to get full Unicode semantics:

1) uc(LATIN SMALL LETTER Y WITH DIAERESIS) will be LATIN CAPITAL LETTER Y WITH DIAERESIS. Same for ucfirst, and \U and \u in pattern substitutions. The result will be in utf8, since the capital letter is above 0xff.
2) uc(MICRO SIGN) will be GREEK CAPITAL LETTER MU. Same for ucfirst, and \U and \u in pattern substitutions. The result will be in utf8, since the capital letter is above 0xff.
3) uc(LATIN SMALL LETTER SHARP S) will be a string consisting of LATIN CAPITAL LETTER S followed by itself; i.e., 'SS'. Same for \U in pattern substitutions. The result will have the same utf8-ness as the original.
4) ucfirst(LATIN SMALL LETTER SHARP S) will be a string consisting of the two characters LATIN CAPITAL LETTER S followed by LATIN SMALL LETTER S; i.e., 'Ss'. Same for \u in pattern substitutions. The result will have the same utf8-ness as the original.
5) If the MICRO SIGN is in a pattern with case ignored, it will match itself and both GREEK CAPITAL LETTER MU and GREEK SMALL LETTER MU.
6) If the LATIN SMALL LETTER SHARP S is in a pattern with case ignored, it will match itself and any of 'SS', 'Ss', 'ss'.
7) If the LATIN SMALL LETTER Y WITH DIAERESIS is in a pattern with case ignored, it will match itself and LATIN CAPITAL LETTER Y WITH DIAERESIS.

This mode would not impose a compile-time latin1-like locale on the perl program. For example, whether perl identifiers could have a LATIN SMALL LETTER Y WITH DIAERESIS in them or not would not be affected by this mode.

These all sound like appropriate behaviors to implement for a Unicode semantics mode. However, I wouldn't know (or particularly care) whether it is a complete list of differences between Latin-1 and Unicode semantics. I'm not at all interested in Latin-1 semantics. Today, the operators you list all have ASCII semantics; most everyone seems to agree that Unicode semantics would be preferred. Latin-1 semantics are only used in upgrade/downgrade operations, at present. (Unless someone says use locale; which, as you say, is not recommended.)

I do not propose to automatically convert ("downgrade") strings from utf8 to latin1 when utf8 is not needed. For example, lc(LATIN CAPITAL LETTER Y WITH DIAERESIS) would return a string still in utf8 encoding.

Fine. All else being equal (utf8 just being a representation), it shouldn't make any difference.

I don't know what to do about EBCDIC machines. I propose leaving them to work the way they currently do.

Best-effort non-breakage seems to be the best we can currently expect...

I don't know what to do about interacting with "use bytes". One option is for them to be mutually incompatible, that is, if you turn one on, it turns the other off. Another option is, if both are in effect, that it would be exactly the same as if a latin1 run-time locale were set, without any of the modifications listed above.

Another possibility would be that all the above listed operations would be no-ops or produce errors, because they all imply Unicode character semantics, whereas use bytes declares that the data is binary.

"\U\x45\x23\x37" should just be "\x45\x23\x37", for example of a no-op.

Are there other interactions that we need to worry about?

Probably. Every XS writer under the sun has assumed different things about utf8 flag semantics, I'm sure. So you should worry about handling the flak.

I would like to defer how this mode gets enabled or disabled until we agree on the semantics of what happens when it is enabled.

Sure, but if you target 5.10.x you need some way of enabling or disabling. If you target 5.12, enabling may happen simply because it is 5.12.

I think that a number of the issues that have been raised in the past are in some way independent of this proposal. We may want to do some of them, but should we do at least this much, or not?

It might be nice to recap anything that isn't being addressed, at least in general terms, so that someone doesn't "remember" it at the last minute and claim that your proposal is worthless without a solution in that area.

Unicode filename handling, especially on Windows, might be a contentious point, as it is also basically broken. In fact, once Perl has Unicode semantics for all strings, it would be basically appropriate for the Windows port to start using the "wide" UTF-16 APIs, instead of the "byte" APIs, for all OS API calls. This might be a fair-size bullet to chew on, but it would be extremely useful; today, it is extremely difficult to write multilingual programs using Perl on Windows, and the biggest culprit is the use of the 8-bit APIs, with the _UNICODE (I think) define not being turned on when compiling perl and extensions. Enough that I have had to learn Python for a recent project.

In large part, one could claim that this is a Windows port issue, not a core perl issue, of course... there is no reason that the Windows port couldn't have already started using wide APIs, even with the limited Unicode support in perl proper... everyone (here at least) knows the kludges to use to get perl proper to use Unicode consistently enough to get work done, but the I/O boundary on Windows is a real problem.

You'll need to give this proposal a week or so of discussion time before you can be sure that everyone who cares has commented, or longer (perhaps much longer) if there is dissension. However, I think a lot of the dissension has been beaten out in earlier discussions, so perhaps the time is ripe that a fresh voice with motivation to make some fixes can actually make progress on this topic.

-- Glenn -- http://nevcal.com/

A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

p5pRT commented 15 years ago

From @Juerd

Hello Karl,

I strongly agree with your proposed solutions. (I'm ambivalent only about the 4th: ucfirst "ß".)

Thank you for the summary.

karl williamson skribis 2008-09-26 12:44 (-0600):

1) uc(LATIN SMALL LETTER Y WITH DIAERESIS) will be LATIN CAPITAL LETTER Y WITH DIAERESIS. Same for ucfirst, and \U and \u in pattern substitutions. The result will be in utf8, since the capital letter is above 0xff.

"in utf8" is ambiguous. It can mean either length(uc($y_umlaut)) == 2 or is_utf8(uc($y_umlaut)). The former would be wrong; the latter would be correct.

May I suggest including the words "upgrade" and "internal"?

  The resulting string will be upgraded to utf8 internally, ...
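The two readings of "in utf8" can be distinguished in code. A sketch, assuming a perl new enough for use feature 'unicode_strings':

```perl
use feature 'unicode_strings';

my $up    = uc("\x{ff}");        # LATIN CAPITAL LETTER Y WITH DIAERESIS, U+0178
my $chars = length($up);         # 1: length counts characters, not bytes
my $flag  = utf8::is_utf8($up);  # true: stored in the multi-byte representation
```

The result is one character long (the wrong reading would have expected 2), while the internal UTF-8 flag is set (the correct reading).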

I don't know what to do about interacting with "use bytes". One option is for them to be mutually incompatible\, that is\, if you turn one on\, it turns the other off. (...) I would like to defer how this mode gets enabled or disabled until we
agree on the semantics of what happens when it is enabled.

Turning your solutions on explicitly is probably wrong\, at least for 5.12.

Using a pragma is problematic because of qr//\, and because it cannot be enabled conditionally (in any reasonably easy way).

I'd prefer to skip any discussion about how to enable or disable this: enable it by default and don't provide any way to disable it.

-- Met vriendelijke groet, Kind regards, Korajn salutojn,

  Juerd Waalboer: Perl hacker <#####@juerd.nl> <http://juerd.nl/sig>
  Convolution: ICT solutions and consultancy <sales@convolution.nl>

p5pRT commented 15 years ago

From @druud62

karl williamson schreef​:

I would like to defer how this mode gets enabled or disabled until we agree on the semantics of what happens when it is enabled.

  use kurila; # ;-)

-- Affijn\, Ruud

"Gewoon is een tijger."

p5pRT commented 15 years ago

From vadim@vkonovalov.ru

Dr.Ruud wrote​:

karl williamson schreef​:

I would like to defer how this mode gets enabled or disabled until we agree on the semantics of what happens when it is enabled.

use kurila; # ;-)

kurila is so largely incompatible\, it is even off-topicable!

(initially I thought it was on-topic, but responders convinced me it isn't, and looking at the direction it is going, it really is not on-topic on p5p)

BR\, Vadim.

p5pRT commented 15 years ago

From @khwilliamson

My proposal from a week and a half ago hasn't spawned much dissension--yet. I'll take that as a good sign\, and proceed.

Here's a hodge-podge of my thoughts about it\, but most important\, I am concerned about the enabling and disabling of this. I think there has to be some way to disable it in case current code has come to rely on what I call broken behavior.

It looks like in 5.12\, Rafael wants the new mode to be default behavior.   But he also said that a switch could be added in 5.10.x to turn it on\, as long as performance doesn't suffer.

Glenn\, "use bytes" doesn't mean necessarily binary. For example\, use bytes; print lc('A')\, "\n";

prints 'a'. It does mean ASCII semantics even for utf8​::upgraded strings.
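For comparison, the byte-level view that use bytes provides can be seen with a utf8::upgraded string; inside the pragma, the two bytes of the UTF-8 encoding become visible (a sketch against a recent perl):

```perl
my $s = "\xe0";         # LATIN SMALL LETTER A WITH GRAVE, stored as one byte
utf8::upgrade($s);      # same one-character string, now stored as UTF-8 (0xC3 0xA0)

{
    use bytes;
    print length($s), "\n";   # 2 -- the raw bytes of the encoding
}
print length($s), "\n";       # 1 -- one character again outside the pragma
```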

If there is a way to en/dis-able this mode\, doesn't that have to be a pragma? Doesn't it have to be lexically scoped? And if the answers to these are yes\, what do we do with things that are created under one mode and then executed in the other?

Juerd wrote​:

Pragmas have problems\, especially in regular expressions. And it's very hard to load a pragma conditionally\, which makes writing version portable code hard. Besides that\, any pragma affecting regex matches needs to be carried in qr//\, which in this case means new regex flags to indicate the behavior for (?i​:...). According to dmq\, adding flags is hard.

I don't understand what you mean that pragmas have problems\, esp in re's. Please explain.

I had thought I had this solved for qr//i. The way I was planning to implement this for pattern matching is quite simple. First\, by changing the existing fold table definitions to include the Unicode semantics\, the pattern matching magically starts working without any code logic changes for all but two characters​: the German sharp ss\, and the micron symbol. For these\, I was planning to use the existing mechanisms to compile the re as utf8\, so it wouldn't require any new flags. Thus qr// would be utf8 if it contained these two characters. And it works today to match such a pattern against both non-utf8 and utf8 strings. I haven't tested to see what happens when such a pattern is executed under use bytes. I was presuming it did something reasonable. But now I'm not so sure\, as I've found a number of bugs in the re code in my testing\, and some are of a nature that I don't feel comfortable with my level of knowledge about how it works to dive in and fix them. They should be fixed anyway\, and I'm hoping some expert will undertake that.   I think that once they're fixed\, that I could extend them to work in the latin1 range quite easily. So the bottom line is that qr//i may or may not be a problem.
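For reference, the two problem characters are LATIN SMALL LETTER SHARP S and MICRO SIGN, the only Latin-1 characters whose case folds leave the Latin-1 range. In perls that later gained full casefolding (5.16+, which provides the fc builtin), the folds behave as sketched here:

```perl
use v5.16;    # 'fc' and 'unicode_strings' features

say fc("\x{df}") eq "ss"      ? "sharp s folds to 'ss'"         : "oops";
say fc("\xb5")   eq "\x{3bc}" ? "micro sign folds to Greek mu"  : "oops";
say "ss" =~ /\x{df}/i         ? "/i matching crosses the fold"  : "oops";
```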

For the other interactions\, I'm not sure there is a problem. If one creates a string whether or not this mechanism is on\, it remains 8 bits\, unless it has a code point above 255. If one operates on it while this mechanism is on\, it gets unicode semantics\, which in a few cases irretrievably convert it to utf8 because the result is above 255. If one operates on it while this mechanism is off\, you get ASCII semantics.   I don't really see a problem with that.

I think it would be easy to extend this to EBCDIC\, at least the three encodings perl has compiled-in tables for. The problem is that Rafael said that there's no one testing on EBCDIC machines\, so I couldn't know if it worked or not before releasing it.

I'm also thinking that the Windows file name problems can be considered independent of this\, and addressed at a later time.

I also agree with Glenn's and Juerd's wording changes.

I saw nothing in my reading of the code that would lead me to touch the utf8 flag's meaning. But I am finding weird bugs in which Perl apparently gets mixed up about the flag. These vanish if I rearrange the order of supposedly independent lines in the program. It looks like it could be a wild write. I wrote a bug report [perl #59378]\, but I think that the description of that is wrong.

So the bottom line for now\, is I'd like to get some consensus about how to turn it on and off (and whether to\, which I think the answer is there has to be a way to turn it off.) I guess I would claim that in 5.12\, "use bytes" could be used to turn it off. But that may be controversial\, and doesn't address backporting it.

p5pRT commented 15 years ago

From perl@nevcal.com

On approximately 10/6/2008 8​:02 PM\, came the following characters from the keyboard of karl williamson​:

My proposal from a week and a half ago hasn't spawned much dissension--yet. I'll take that as a good sign\, and proceed.

Here's a hodge-podge of my thoughts about it\, but most important\, I am concerned about the enabling and disabling of this. I think there has to be some way to disable it in case current code has come to rely on what I call broken behavior.

It looks like in 5.12\, Rafael wants the new mode to be default behavior. But he also said that a switch could be added in 5.10.x to turn it on\, as long as performance doesn't suffer.

Glenn\, "use bytes" doesn't mean necessarily binary. For example\, use bytes; print lc('A')\, "\n";

prints 'a'. It does mean ASCII semantics even for utf8​::upgraded strings.

That interpretation could work\, however\, it is in conflict with the documented behavior of use bytes... use bytes is explicitly documented to work on the bytes of utf8 strings (thus making visible the individual bytes of the UTF8 encoding).

While as you demonstrate\, lc('A') is applied\, that seems like a bug to me; the documentation says "The use bytes pragma disables character semantics". On the other hand\, this documentation may simply be confusing -- it may actually mean only that utf8 strings are to be treated as bytes\, like other byte strings\, which may have binary or character semantics applied\, depending on the operator invoked.

I think it would be much more useful to prohibit operations that apply character semantics while in "use bytes" mode. chr should restrict input values to 0..255\, ord will only produce such. It is already documented that substr\, index\, and rindex work as byte operators. regexps compiled while use bytes is in effect should not support character sorts of operations. \w is meaningless on binary data\, for example\, although character classes (could be called byte classes) could still be useful without character semantics. I waffle on the regexp operations...

I doubt you'll find strong support for this position\, due to compatibility reasons\, but if we are going incompatible for Unicode support\, it seems that going incompatible on bytes support isn't much harder\, and could help find bugs. Of course\, if you can turn it off\, and regain compatibility...

If there is a way to en/dis-able this mode\, doesn't that have to be a pragma? Doesn't it have to be lexically scoped? And if the answers to these are yes\, what do we do with things that are created under one mode and then executed in the other?

For the strings themselves\, I think it is reasonable to apply the semantics in which they are executed. Regexps are a harder call. Your analysis below is interesting.

Juerd wrote​:

Pragmas have problems\, especially in regular expressions. And it's very hard to load a pragma conditionally\, which makes writing version portable code hard. Besides that\, any pragma affecting regex matches needs to be carried in qr//\, which in this case means new regex flags to indicate the behavior for (?i​:...). According to dmq\, adding flags is hard.

I don't understand what you mean that pragmas have problems\, esp in re's. Please explain.

The compilation of a regexp may be optimized based on the semantics then in place, and may need to preserve the semantics from the point of compilation to the point of use. If the regexp is optimized based on compilation semantics, then some definition must be made of what it means to compile it under one semantics and use it under another; there are three choices: (1) error, (2) recompile to use the current semantics, (3) apply the semantics from the time of compilation. I think Juerd probably assumed (3), and thus assumes that the flags need to be preserved within the regexp.

I had thought I had this solved for qr//i. The way I was planning to implement this for pattern matching is quite simple. First\, by changing the existing fold table definitions to include the Unicode semantics\, the pattern matching magically starts working without any code logic changes for all but two characters​: the German sharp ss\, and the micron symbol. For these\, I was planning to use the existing mechanisms to compile the re as utf8\, so it wouldn't require any new flags. Thus qr// would be utf8 if it contained these two characters. And it works today to match such a pattern against both non-utf8 and utf8 strings. I haven't tested to see what happens when such a pattern is executed under use bytes. I was presuming it did something reasonable. But now I'm not so sure\, as I've found a number of bugs in the re code in my testing\, and some are of a nature that I don't feel comfortable with my level of knowledge about how it works to dive in and fix them. They should be fixed anyway\, and I'm hoping some expert will undertake that. I think that once they're fixed\, that I could extend them to work in the latin1 range quite easily. So the bottom line is that qr//i may or may not be a problem.

Clever\, and maybe it works\, or could be fixed to work. I can't say otherwise.

For the other interactions\, I'm not sure there is a problem. If one creates a string whether or not this mechanism is on\, it remains 8 bits\, unless it has a code point above 255. If one operates on it while this mechanism is on\, it gets unicode semantics\, which in a few cases irretrievably convert it to utf8 because the result is above 255. If one operates on it while this mechanism is off\, you get ASCII semantics. I don't really see a problem with that.

I think it would be easy to extend this to EBCDIC\, at least the three encodings perl has compiled-in tables for. The problem is that Rafael said that there's no one testing on EBCDIC machines\, so I couldn't know if it worked or not before releasing it.

No comment.

I'm also thinking that the Windows file name problems can be considered independent of this\, and addressed at a later time.

File names are currently well defined to be bytes\, in the documentation.   This is\, of course\, extremely wrong and limiting on Windows. There are no good solutions; there is a solution of using special APIs Jan Dubois has written (thanks Jan)\, but it likely can be considered independently\, as much as it would be nice to solve it soon.

I also agree with Glenn's and Juerd's wording changes.

I saw nothing in my reading of the code that would lead me to touch the utf8 flag's meaning. But I am finding weird bugs in which Perl apparently gets mixed up about the flag. These vanish if I rearrange the order of supposedly independent lines in the program. It looks like it could be a wild write. I wrote a bug report [perl #59378]\, but I think that the description of that is wrong.

Could well be some bugs in the edge cases here. I doubt the tests provide full coverage. Plan on writing more tests\, if possible\, at least where you find bugs either in test code or by reading code that you are learning or changing.

So the bottom line for now\, is I'd like to get some consensus about how to turn it on and off (and whether to\, which I think the answer is there has to be a way to turn it off.) I guess I would claim that in 5.12\, "use bytes" could be used to turn it off. But that may be controversial\, and doesn't address backporting it.

For the moment\, let's call "this feature" "enhanced Unicode semantics"\, and the Unicode semantics we have today "today's Unicode semantics".

"use bytes" can't turn off "enhanced Unicode semantics"\, because it implements its own semantics that are different that "today's Unicode semantics".

In addition to the 2 Unicode semantics\, one could consider that there are two other sets of semantics... "today's bytes semantics"\, and "Glenn's proposed bytes semantics" (which eliminate character operations during use bytes sections). If my proposal is refined/further defined/accepted\, it may also have to be turned off. Doing both with one flag is probably OK\, as "use bytes" and "no bytes" differentiate sections that have bytes vs Unicode semantics.

So if it is a pragma\, I think it has to be a different one than "use bytes"\, or an extension to "use bytes" (but the name "bytes" is wrong for the switch between various Unicode semantics).

If it is "simply" done one way in 5.10 and the other way in 5.12\, there is no need for a pragma\, and also no way to disable it short of switching versions of Perl.

-- Glenn -- http​://nevcal.com/


p5pRT commented 15 years ago

From @davidnicol

On Mon\, Oct 6\, 2008 at 11​:04 PM\, Glenn Linderman \perl@&#8203;nevcal\.com wrote​:

\w is meaningless on binary data\, for example\, although character classes (could be called byte classes) could still be useful without character semantics.

Let's say one is faced with a legacy delimited file that uses 0xFF as a separator. Running

  use bytes;
  @strings = $data =~ /(\w+)/g;

could be handy.

p5pRT commented 15 years ago

From perl@nevcal.com

On approximately 10/7/2008 7​:05 AM\, came the following characters from the keyboard of David Nicol​:

On Mon\, Oct 6\, 2008 at 11​:04 PM\, Glenn Linderman \perl@&#8203;nevcal\.com wrote​:

\w is meaningless on binary data\, for example\, although character classes (could be called byte classes) could still be useful without character semantics.

Lets say one is faced with a legacy delimited file that uses 0xFF for a separator. Running

use bytes; @​strings = $data =~ /(\w+)/g;

could be handy.

I guess your legacy delimited file is intended to be ASCII text\, with each string delimited by 0xFF\, but that is only a guess\, since you didn't make it clear.

  @strings = split /\xFF/, $data;

would do the same job\, be independent of "use bytes;"\, and allow for punctuation and control characters in the @​strings. You didn't state that the @​strings should contain only alphanumerics\, but your code does.   Of course\, even if the @​strings are supposed to only contain alphanumerics\, your code would treat punctuation and control characters as additional delimiters and not only ignore the error case\, but make it impossible to detect without reexamining $data. My code would treat only \xFF as delimiters (per your specification)\, and then additional code could be written to check the resulting @​strings for validity as appropriate.

You'll need to contrive a more useful\, and more completely specified example to be convincing.

-- Glenn -- http​://nevcal.com/


p5pRT commented 15 years ago

From @khwilliamson

From the little feedback I got on this issue and my own thoughts\, I've developed a straw proposal for comment.

I propose a global flag that says whether or not the mode previously outlined (to give full Unicode semantics to characters in the full latin1 range even when not stored as utf8) is in effect or not. This flag will be turned on or off through a lexically scoped pragma. The default for 5.12 will be on. If this gets put into 5.10.x\, the mode will be off.

This mode will be subservient to "use bytes". That is\, whenever the bytes mode is in effect\, this new mode will not be. This is in part to preserve compatibility with existing programs that explicitly use the bytes pragma.

If a string is defined under one mode but looked at under the other\, the mode in effect at the time of interpretation will be the one used.

A pattern\, however\, is compiled\, and that compilation will remain in effect even if the mode changes.

One could argue about whether the last two paragraphs are the best choices or not, but doing them otherwise is a lot harder, and it is my belief that it would be the very rare program that would want to toggle between these modes, so that in practice it doesn't matter.

Comments?
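For historical context, this is essentially the shape the feature took in 5.12: a lexically scoped unicode_strings feature, on by default inside the use v5.12 and later feature bundles. A sketch of the scoping:

```perl
use feature 'unicode_strings';

my $s = "\xe0";                  # LATIN SMALL LETTER A WITH GRAVE
print uc($s) eq "\xc0" ? "unicode semantics\n" : "byte semantics\n";   # unicode

{
    no feature 'unicode_strings';    # lexically revert to the old behavior
    print uc($s) eq "\xc0" ? "unicode semantics\n" : "byte semantics\n";   # bytes
}
```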

p5pRT commented 15 years ago

From @rgs

2008/10/15 karl williamson \public@&#8203;khwilliamson\.com​:

From the little feedback I got on this issue and my own thoughts\, I've developed a straw proposal for comment.

I propose a global flag that says whether or not the mode previously outlined (to give full Unicode semantics to characters in the full latin1 range even when not stored as utf8) is in effect or not. This flag will be turned on or off through a lexically scoped pragma. The default for 5.12 will be on. If this gets put into 5.10.x\, the mode will be off.

This mode will be subservient to "use bytes". That is\, whenever the bytes mode is in effect\, this new mode will not be. This is in part to preserve compatibility with existing programs that explicitly use the bytes pragma.

If a string is defined under one mode but looked at under the other\, the mode in effect at the time of interpretation will be the one used.

A pattern\, however\, is compiled\, and that compilation will remain in effect even if the mode changes.

One could argue about whether the last two paragraphs are the best or not\, but doing them otherwise is a lot harder\, and it is my that it would be the very rare program that would want to toggle between these modes\, so that in practice it doesn't matter.

I think they're sensible.

We also need to specify what will happen when we combine two qr// patterns into a larger one, where one was compiled under one mode and the other under another. The simplest thing would be to have one of the modes (the full Unicode one) take precedence, I think.

p5pRT commented 15 years ago

From @khwilliamson

I agree with that\, and we also have to say that this is subservient as well to any locale in effect\, again for backwards compatibility when this mode becomes the default.


p5pRT commented 15 years ago

From @khwilliamson

I'm ready to start hardening my preliminary experiments into production code for this problem. I'm thinking about doing it in at least three separable patches\, each dealing with a different portion of it.

And I have some questions about coding. I am finding, as a newcomer, that there is a tremendous amount to learn, and I don't want to waste my time (and yours) by going down dead ends. I don't understand a lot of the nuances of the macros and functions, and I'm afraid some aren't documented. For example, I don't know what it means when "const" is added to a macro or function name; perhaps that the result is promised not to be modified by the caller?

I have tried to conform to the current style with two exceptions. 1) I write a lot more comments than I see in the code. I have cut down on these a lot, but it's still more than you are used to. These can easily be removed. 2) Is there a reason that many compound statements begin on one line and are continued on another without braces, like

  if (a)
      b;

? I learned long ago not to do that, as it's too easy when modifying code to forget the braces are missing and to insert a statement between the "if" and its body, causing, for example, b to no longer be dependent on the if. This may show up in testing, or it may end up a bug. Unless there's some reason like machine parsing of the code, I would like to be free to use braces in my code under these circumstances.

What about time-space tradeoffs? For time efficiency, it would be good to implement some of the operations with table lookups. I could use 3 or 4 tables, each with 256 entries of U8. Is this ok, or would you rather I have extra tests in the code instead?

And if I do use the tables, is the U8 typedef guaranteed to be unsigned and 8 bits, so that I can index into these tables safely? I see some existing code that appears to assume this, but it may not be a good example to follow, or I may be missing some place where it got tested first. I can always mask to make sure it's 8 bits, but if some compilers/architectures don't really allow unsigned, then that complicates things and makes the tables bigger.

The uc() function now tries to convert in place if possible. I would like to not bother with that\, but to always allocate a new scalar to return the result in. The alternative is that under some conditions\, the code would get partly through the conversion and discover that it needed more space than was available\, and have to abandon the conversion and start over (or else do an extra pass first to figure this out\, which I'm sure no one would advocate). Is it ok for me to make this change?

If I have to grow a string, is it better to grow just enough to get by, or should I add some extra to minimize the possibility of having to grow again? I don't know how the memory pool is handled. I presume that eventually some malloc gets done, and it's probably not for just 1 or 2 bytes. The code in the areas I've looked at currently asks for precisely the amount it needs at the moment for that string, and there is a comment about maybe having to do it a million times, but that's life. It would seem to me that if you need 3 bytes, you should grow by 6, which isn't a big waste if no more are needed, and would halve the number of growths required in the worst case. But what is the accepted method?

I have to convert a string to utf8. There is a convenient function to do so\, bytes_to_utf8()\, but the documentation warns it is experimental.   Is it ok to use this function\, and if not\, what should I use?

And when I'm through with my first batch of changes, what should I do? I'd like to post it for code reading before submitting it as a patch. I've gotten quite a ways into the changes needed to pp.c, for example, and I have specific questions about why some things are done the way they are, etc., which I would put in comments at that place in the code. For example, I suspect that lcfirst and ucfirst have a bug that was fixed for lc and uc in an earlier patch, but the writer didn't think to apply it to the other functions; I don't know enough to be sure.

My experimental changes for uc, lc, ucfirst, lcfirst, \U, \u, \L, and \l cause one existing test case group to fail. This is in t/uni/overload.t. It is testing that toggling the utf8 flag causes the case-changing functions to work or not work depending on the flag's state. My changes cause the case functions to work no matter what that bit says, so these tests fail. Is there some other point to these tests that I should be aware of, so I can revise them appropriately?
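The flag-dependent behavior those tests rely on is exactly the inconsistency this thread set out to fix; it can still be reproduced in later perls by disabling the feature (a sketch):

```perl
no feature 'unicode_strings';   # the pre-5.12 default semantics

my $s = "\xe0";                 # LATIN SMALL LETTER A WITH GRAVE, no utf8 flag
print uc($s) eq "\xe0" ? "unchanged\n" : "uppercased\n";   # unchanged

utf8::upgrade($s);              # same character, but the utf8 flag is now set
print uc($s) eq "\xc0" ? "uppercased\n" : "unchanged\n";   # uppercased
```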

Thanks


p5pRT commented 15 years ago

From @rgs

2008/10/17 karl williamson \public@&#8203;khwilliamson\.com​:

I'm ready to start hardening my preliminary experiments into production code for this problem. I'm thinking about doing it in at least three separable patches\, each dealing with a different portion of it.

And I have some questions about coding. I am finding as a newcomer that there is a tremendous amount to learn and I don't want to waste my time (and yours) by going down dead-ends. A don't understand a lot of the nuances of the macros and functions\, and I'm afraid some aren't documented. For example\, I don't know what it means when a const is added to a macro or function name\, perhaps that the result is promised to not be modified by the caller?

The "const" name in a macro name is (IIRC) always related to the use of the "const" type modifier to qualify its return value. That is\, you can't assign it to a non-const variable.

I have tried to conform to the current style with two exceptions. 1) I write a lot more comments than I see in the code. I have cut down on these a lot\, but it's still more than you are used to. These can easily be

I have absolutely no problem with comments! Especially in code that hairy.

I should write more\, too.

removed. 2) Is there a reason that many compound statements begin on one line and are continued on another without braces\, like if (a) b; ? I learned long ago not to do that\, as it's too easy when modifying code to forget the braces are missing and to insert a statement between them that causes for example b to not be dependent on the if. This may show up in testing\, or it may be end up a bug. Unless there's some reason like machine parsing of the code\, I would like to be free to use braces in my code under these circumstances.

Please do. Another advantage of using braces is that you can add statements in the "then" clause without modifying the "if" line, if you write your ifs like this:

  if (condition) {
      ...
  }

Fewer formatting changes, better history.

What about time-space tradeoffs? For time efficiency\, it would be good to implement some of the operations with table look ups. I could use 3 or 4 tables\, each with 256 entries of U8. It this ok\, or would you rather I have extra tests in code instead?

Perl is usually more optimized for speed than for memory.

And if I do use the tables\, is the U8 typedef guaranteed to be unsigned and 8 bits\, so that I can index into these tables safely? I see some existing code that appears to assume this\, but it may not be a good example to follow\, or I may be missing some place where it got tested first. I can always mask to make sure its 8 bits\, but if some compilers/architectures don't really allow unsigned\, then that complicates things and makes the tables bigger.

I think it's guaranteed to be unsigned\, but not 8 bits. The U8SIZE symbol gives the size of an U8 in bytes. I'm not aware of any platform where it's not 1 byte\, though. Any C portability expert on this?

The uc() function now tries to convert in place if possible. I would like to not bother with that\, but to always allocate a new scalar to return the result in. The alternative is that under some conditions\, the code would get partly through the conversion and discover that it needed more space than was available\, and have to abandon the conversion and start over (or else do an extra pass first to figure this out\, which I'm sure no one would advocate). Is it ok for me to make this change?

I think it is. The old PV will be collected. And we can reoptimize it later.

If I have to grow a string\, is it better to grow just enough to get by\, or should I add some extra to minimize the possibility of having to grow again? I don't know how the memory pool is handled. I presume that eventually some malloc gets done\, and its probably not for just 1 or 2 bytes. The code in the areas I've looked at currently asks for precisely the amount it needs at the moment for that string\, and there is a comment about maybe having to do it a million times\, but that's life. It would seem to me that if you need 3 bytes\, you should grow by 6\, which isn't a big waste if none more are needed\, and would halve the number of growths required in the worst case. But what is the accepted method?

I'm not much familiar with perl's internal memory pools. And that kind of behaviour is difficult to choose without benchmarks. I would say\, at first\, grow by just what is needed.

I have to convert a string to utf8. There is a convenient function to do so\, bytes_to_utf8()\, but the documentation warns it is experimental. Is it ok to use this function\, and if not\, what should I use?

It has been experimental for many years now. I think we could remove the "experimental" label.

And when I'm through with my first batch of changes, what should I do? I'd like to post it for code reading before submitting it as a patch. I've gotten quite a ways into the changes needed to pp.c, for example, and I have specific questions about why some things are done the way they are, etc., which I would put in comments in that place in the code. For example, I suspect that lcfirst and ucfirst have a bug that was fixed for lc and uc in an earlier patch but the writer didn't think to apply it to the other functions, but I don't know enough to be sure.

Please ask here. Are you familiar with git already, by the way?

My experimental changes for uc, lc, ucfirst, lcfirst, \U, \u, \L, and \l cause one existing test case group to fail. This is in t/uni/overload.t. It is testing that toggling the utf8 flag causes the case changing functions to work or not work depending on the flag's state. My changes cause the case functions to work no matter what that bit says, so these tests fail. Is there some other point to these tests that I should be aware of, so I can revise them appropriately?

You can revise them.

p5pRT commented 15 years ago

From @nwc10

On Sat, Oct 18, 2008 at 09:35:01AM +0200, Rafael Garcia-Suarez wrote:

2008/10/17 karl williamson <public@khwilliamson.com>:

I have tried to conform to the current style with two exceptions. 1) I

"style" singular? :-)

write a lot more comments than I see in the code. I have cut down on these a lot, but it's still more than you are used to. These can easily be

I have absolutely no problem with comments! Especially in code that hairy.

I should write more\, too.

More comments good. Please don't cut down\, if writing more comes naturally to you.

And if I do use the tables, is the U8 typedef guaranteed to be unsigned and 8 bits, so that I can index into these tables safely? I see some existing code that appears to assume this, but it may not be a good example to follow, or I may be missing some place where it got tested first. I can always mask to make sure it's 8 bits, but if some compilers/architectures don't really allow unsigned, then that complicates things and makes the tables bigger.

I think it's guaranteed to be unsigned, but not 8 bits. The U8SIZE symbol gives the size of a U8 in bytes. I'm not aware of any platform where it's not 1 byte, though. Any C portability expert on this?

It's always going to be an unsigned char, it's always going to be the smallest type on the platform, and it's always going to be at least 8 bits.

I'm not sure if anyone has access to anything esoteric with 32 (or 9?) bit chars, on which they could try compiling perl. Can things go wrong with your code's assumptions if it's more than 8 bits?

My experimental changes for uc, lc, ucfirst, lcfirst, \U, \u, \L, and \l cause one existing test case group to fail. This is in t/uni/overload.t. It is testing that toggling the utf8 flag causes the case changing functions to work or not work depending on the flag's state. My changes cause the case functions to work no matter what that bit says, so these tests fail. Is there some other point to these tests that I should be aware of, so I can revise them appropriately?

You can revise them.

They were probably written by me, to ensure that the current behaviour is consistent with or without overloading. In particular that overloading couldn't break if the subroutine implementing it was inconsistent (or malicious) in what it returned. What had been happening was that the UTF-8 flag was getting set on the (outer) scalar, and then if the implementation on a subsequent call returned something that was not UTF-8 (and marked as not UTF-8) then the byte sequence was propagated, but not the change to the UTF-8 flag, resulting in a corrupt scalar.

Annotated history:

  http://public.activestate.com/cgi-bin/perlbrowse/b/t/uni/overload.t

So yes, revise them to behave correctly as per the new world order.

Nicholas Clark

p5pRT commented 15 years ago

From @khwilliamson

Nicholas Clark wrote​:

...

And if I do use the tables, is the U8 typedef guaranteed to be unsigned and 8 bits, so that I can index into these tables safely? I see some existing code that appears to assume this, but it may not be a good example to follow, or I may be missing some place where it got tested first. I can always mask to make sure it's 8 bits, but if some compilers/architectures don't really allow unsigned, then that complicates things and makes the tables bigger.

I think it's guaranteed to be unsigned, but not 8 bits. The U8SIZE symbol gives the size of a U8 in bytes. I'm not aware of any platform where it's not 1 byte, though. Any C portability expert on this?

It's always going to be an unsigned char, it's always going to be the smallest type on the platform, and it's always going to be at least 8 bits.

I'm not sure if anyone has access to anything esoteric with 32 (or 9?) bit chars, on which they could try compiling perl. Can things go wrong with your code's assumptions if it's more than 8 bits?

I only wanted to know if I have to think about exceeding array bounds. If it's exactly 8 bits and unsigned, there's no way for it to reference outside a 256 element array.

p5pRT commented 15 years ago

From @nwc10

On Sun, Oct 19, 2008 at 08:59:57PM -0600, karl williamson wrote:

I only wanted to know if I have to think about exceeding array bounds. If it's exactly 8 bits and unsigned, there's no way for it to reference outside a 256 element array.

I think that the perl source code is already making that assumption in places.

But I doubt that there's a size or speed penalty in masking it with 0xFF, as any sane optimiser would spot that it's a no-op and eliminate the code. I tried:

$ cat index.c
#include <stdlib.h>
#include <stdio.h>

#ifndef MASK
# define MASK
#endif

static unsigned char buffer[256];

int main (int argc, char **argv) {
  unsigned int count = sizeof(buffer);

  while (count--)
    buffer[count MASK] = ~count;

  while (*++argv) {
    const unsigned char i = (unsigned char) atoi(*argv);
    printf("%s: %d\n", *argv, buffer[i MASK]);
  }

  return 0;
}

$ gcc -Wall -c -O index.c
$ ls -l index.o
-rw-r--r-- 1 nick nick 1752 Oct 20 19:26 index.o
$ gcc -Wall -o index -O -DMASK='& 255' index.c
$ ls -l index.o
-rw-r--r-- 1 nick nick 1752 Oct 20 19:26 index.o

Nicholas Clark

p5pRT commented 15 years ago

From @mhx

On 2008-10-20, at 19:27:43 +0100, Nicholas Clark wrote:

$ gcc -Wall -c -O index.c
$ ls -l index.o
-rw-r--r-- 1 nick nick 1752 Oct 20 19:26 index.o
$ gcc -Wall -o index -O -DMASK='& 255' index.c
$ ls -l index.o
-rw-r--r-- 1 nick nick 1752 Oct 20 19:26 index.o

Despite the fact that you're most probably right, I don't think your second run of gcc actually updated index.o... ;)

Marcus

-- The only difference between a car salesman and a computer salesman is that the car salesman knows he's lying.

p5pRT commented 15 years ago

From @nwc10

On Tue, Oct 21, 2008 at 01:07:13AM +0200, Marcus Holland-Moritz wrote:

On 2008-10-20, at 19:27:43 +0100, Nicholas Clark wrote:

$ gcc -Wall -c -O index.c
$ ls -l index.o
-rw-r--r-- 1 nick nick 1752 Oct 20 19:26 index.o
$ gcc -Wall -o index -O -DMASK='& 255' index.c
$ ls -l index.o
-rw-r--r-- 1 nick nick 1752 Oct 20 19:26 index.o

Despite the fact that you're most probably right, I don't think your second run of gcc actually updated index.o... ;)

Well spotted. This one did :-)

$ gcc -Wall -c -O index.c
$ ls -l index.o
-rw-r--r-- 1 nick nick 1752 Oct 21 04:06 index.o
$ gcc -Wall -c -O -DMASK='& 255' index.c
$ ls -l index.o
-rw-r--r-- 1 nick nick 1752 Oct 21 04:07 index.o

Show your data as well as your conclusions. It lets other people verify your conclusions. (And, sometimes, more importantly, draw different conclusions if your conclusions are wrong, but your observations are valid.)

Nicholas Clark

p5pRT commented 15 years ago

From blgl@hagernas.com

In article <20081021030911.GL49335@plum.flirble.org>, nick@ccl4.org (Nicholas Clark) wrote:

On Tue, Oct 21, 2008 at 01:07:13AM +0200, Marcus Holland-Moritz wrote:

On 2008-10-20, at 19:27:43 +0100, Nicholas Clark wrote:

$ gcc -Wall -c -O index.c
$ ls -l index.o
-rw-r--r-- 1 nick nick 1752 Oct 20 19:26 index.o
$ gcc -Wall -o index -O -DMASK='& 255' index.c
$ ls -l index.o
-rw-r--r-- 1 nick nick 1752 Oct 20 19:26 index.o

Despite the fact that you're most probably right, I don't think your second run of gcc actually updated index.o... ;)

Well spotted. This one did :-)

$ gcc -Wall -c -O index.c
$ ls -l index.o
-rw-r--r-- 1 nick nick 1752 Oct 21 04:06 index.o
$ gcc -Wall -c -O -DMASK='& 255' index.c
$ ls -l index.o
-rw-r--r-- 1 nick nick 1752 Oct 21 04:07 index.o

Show your data as well as your conclusions. It lets other people verify your conclusions. (And, sometimes, more importantly, draw different conclusions if your conclusions are wrong, but your observations are valid.)

A small change in instruction count can be hidden by alignment padding. Try comparing the actual assembly output instead.

$ gcc -Wall -O -o index1.s -S index.c
$ gcc -DMASK='& 255' -Wall -O -o index2.s -S index.c
$ diff index1.s index2.s
21c21
<         la r2,lo16(_buffer-"L00000000001$pb")(r2)
---
>         la r11,lo16(_buffer-"L00000000001$pb")(r2)
24a25
>         rlwinm r2,r9,0,24,31
26c27
<         stbx r0,r2,r9
---
>         stbx r0,r11,r2

This is an Apple-built gcc 4.0.1 for 32-bit PowerPC, and a rotate-and-mask instruction _is_ generated.

/Bo Lindbergh

p5pRT commented 15 years ago

From @khwilliamson

I'm implementing some changes to fix this, which Rafael has indicated could be a new mode of operation, default off in 5.10 and on in 5.12, selected by use of a lexically-scoped pragma.

The problem is the name to use in the pragma. In my experimental version, I'm using

use latin1;
no latin1;

But maybe someone has a better idea. I thought someone had suggested use unicode; but I can't find the email that said that, and I don't think no unicode; gives the right idea, as we are still using unicode semantics for characters outside the range of 128-255.

use unicode_semantics; is too long, and again no unicode_semantics overstates what is turned off.

What is really meant is "use unicode semantics for latin1 non-ascii characters" and "no unicode semantics for latin1 non-ascii characters".

Another way of looking at it might be no C_locale; and use C_locale; but I don't like that as well, for several reasons, one of which is that people may not know what the C locale is.

So, any better ideas?

p5pRT commented 15 years ago

From perl@nevcal.com

On approximately 10/23/2008 12:10 PM, came the following characters from the keyboard of karl williamson:

we are still using unicode semantics for characters outside the range of 128-255.

So, any better ideas?

Better? Well\, at least more ideas...

These are all to enable your fixes; swap use/no to disable.

use fix_c128_255; # :)

use pure_uni;
use uni_pure;
use unipure;
use codemode;
use all_uni;
use clean_uni;
use real_uni;
no uni_compat;
no broken_unicode;
no broken_uni;
no buggy_unicode;
no buggy_uni;

I extremely dislike use latin1; because implicit conversions would happen even with no latin1; (same problem you have with no unicode;).

no C_locale; has the problem that locale implies things like number, money, and date and time formatting, and character classes and collation, in addition to (perhaps) a default character encoding, and there is already a (broken?) locale module, I believe, which would be confusing, and C_locale wouldn't mean any of these things.

Of my ideas above, I sort of prefer the last 4... but maybe someone else will suggest the best name.

-- Glenn -- http://nevcal.com/

A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

p5pRT commented 15 years ago

From @davidnicol

you could go pure-historical and set a naming precedent for such things with

use fix58182;

p5pRT commented 15 years ago

From kevinw@activestate.com

David Nicol wrote​:

you could go pure-historical and set a naming precedent for such things with

use fix58182;

You forgot to use a smilie. Don't scare me like that! ;)

Seems to me that something that mentions the version of Unicode that the behaviour conforms to might be worth a shot. If I remember the details of the underlying problem correctly, then:

use unicode_31;

May be a good way to go. This would at least give someone unfamiliar with the whole issue a place to start looking.

Cheers,

kjw

p5pRT commented 15 years ago

From @moritz

karl williamson wrote​:

I'm implementing some changes to fix this, which Rafael has indicated could be a new mode of operation, default off in 5.10 and on in 5.12, selected by use of a lexically-scoped pragma.

The problem is the name to use in the pragma. In my experimental version\, I'm using

use latin1;
no latin1;

I don't think that quite cuts it.

But\, maybe someone has a better idea.

I propose use unisane;

"uni" is already used in some places as an abbreviation of "Unicode" (like in the names of perluniintro and perlunitut man pages)\, and IMHO the old behaviour is quite insane. So if you say "no unisane" you'll clearly stating that you want insane behaviour\, and you'll get it.

I thought someone had suggested use unicode; but I can't find the email that said that, and I don't think no unicode; gives the right idea, as we are still using unicode semantics for characters outside the range of 128-255.

Aye.

use unicode_semantics; is too long, and again no unicode_semantics overstates what is turned off.

What is really meant is "use unicode semantics for latin1 non-ascii characters" and "no unicode semantics for latin1 non-ascii characters".

Another way of looking at it might be no C_locale; and use C_locale; but I don't like that as well, for several reasons, one of which is that people may not know what the C locale is.

Locales and Unicode are somewhat orthogonal concepts, and there's already too much confusion about their interaction without you adding even more to it ;-)

Cheers, Moritz

p5pRT commented 15 years ago

From @tux

On Thu, 23 Oct 2008 13:29:32 -0700, "Kevin J. Woolley" <kevinw@activestate.com> wrote:

David Nicol wrote​:

you could go pure-historical and set a naming precedent for such things with

use fix58182;

You forgot to use a smilie. Don't scare me like that! ;)

Seems to me that something that mentions the version of Unicode that the behaviour conforms to might be worth a shot. If I remember the details of the underlying problem correctly, then:

use unicode_31;

May be a good way to go. This would at least give someone unfamiliar with the whole issue a place to start looking.

I agree with Kevin.

uni isn't descriptive enough.

use unicode_<version>;

seems a much more sane approach.

-- H.Merijn Brand Amsterdam Perl Mongers http://amsterdam.pm.org/ using & porting perl 5.6.2, 5.8.x, 5.10.x, 5.11.x on HP-UX 10.20, 11.00, 11.11, 11.23, and 11.31, SuSE 10.1, 10.2, and 10.3, AIX 5.2, and Cygwin. http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/ http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/