Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/

implicit utf8ification causes action-at-distance bugs #6199

Closed: p5pRT closed this issue 21 years ago

p5pRT commented 21 years ago

Migrated from rt.perl.org#19743 (status was 'resolved')

Searchable as RT19743$

p5pRT commented 21 years ago

From @jhi

Created by jhi@ugli.hut.fi

As discussed in the perl-unicode@perl.org mailing list (Subject: CGI and UTF, see http://archive.develooper.com/perl-unicode@perl.org/) and argued by Benjamin Franz <snowhare@nihongo.org>, the implicit turning on of UTF-8-ness on filehandles based on locale setup can cause nasty action-at-distance messups.

The crux of the matter seems to be that reading in illegal UTF-8 data does not trigger any warnings, but only later trying to use the data does. In this example (should work in Linuxes with utf8 locales installed) it's the ord() that gets punished, not the <>. Now imagine a few hundred lines of code between the <> and the ord() and you'll see the "distance-in-action".

    $ ./perl -e 'print chr(255)' | env LC_ALL=en_US.utf8 ./perl -le '$a=<>;print ord($a)'
    Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xff) in ord at -e line 1, <> line 1.
    0

I don't know yet what's the best way to solve this without slowing down all I/O (well, just "I") too much. Benjamin suggests either a (mandatory?) warning when the UTF-8-ness is switched on filehandles because of locale setting, or some explicit switch to enable the feature. But in any case, the issue has now been recorded.
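For readers who want the failure to surface where the data enters the program, here is a minimal sketch that decodes explicitly at the input boundary. It assumes a current Encode with strict `UTF-8` decoding and the `FB_CROAK` check; `read_utf8_or_die` is a hypothetical helper name, not something proposed in this ticket.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode a byte string as strict UTF-8 and die
# right here, at the input boundary, instead of leaving malformed data
# around for some later ord() to trip over.
sub read_utf8_or_die {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
    # the source string unmodified.
    return Encode::decode('UTF-8', $bytes,
                          Encode::FB_CROAK() | Encode::LEAVE_SRC());
}

# Valid input: "\xC3\xA9" is the UTF-8 encoding of U+00E9.
my $ok = read_utf8_or_die("caf\xC3\xA9");
printf "decoded %d characters\n", length $ok;

# The byte 0xFF can never appear in well-formed UTF-8, so this dies.
my $bad = eval { read_utf8_or_die("\xFF") };
print "0xFF rejected at read time: ", (defined $bad ? "no" : "yes"), "\n";
```

The trade-off is the one raised above: checking every read costs time, which is why doing it implicitly for all input is contentious.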

Perl Info

```
Flags:
    category=core
    severity=medium

Site configuration information for perl v5.8.0:

Configured by jhi at Sat Jan 4 02:35:53 EET 2003.

Summary of my perl5 (revision 5.0 version 8 subversion 0) configuration:
  Platform:
    osname=linux, osvers=2.4.20-xfs, archname=i686-linux-thread-multi
    uname='linux ugli.hut.fi 2.4.20-xfs #2 wed dec 4 11:01:03 eet 2002 i686 unknown '
    config_args='-des -Duseithreads'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -fno-strict-aliasing -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O3',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -fno-strict-aliasing'
    ccversion='', gccversion='3.2.1 20020922 (prerelease)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -ldb -ldl -lm -lpthread -lc -lcrypt -lutil
    perllibs=-lnsl -ldl -lm -lpthread -lc -lcrypt -lutil
    libc=/lib/libc-2.2.5.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.2.5'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
    cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:
    MAINT18379

@INC for perl v5.8.0:
    lib
    /usr/local/lib/perl5/5.8.0/i686-linux-thread-multi
    /usr/local/lib/perl5/5.8.0
    /usr/local/lib/perl5/site_perl/5.8.0/i686-linux-thread-multi
    /usr/local/lib/perl5/site_perl/5.8.0
    /usr/local/lib/perl5/site_perl
    .

Environment for perl v5.8.0:
    HOME=/u/vieraat/vieraat/jhi
    LANG=C
    LANGUAGE (unset)
    LC_ALL=fi_FI.ISO8859-1
    LC_CTYPE=fi_FI.ISO8859-1
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/u/vieraat/vieraat/jhi/Perl/bin:/u/vieraat/vieraat/jhi/.s:/u/vieraat/vieraat/jhi/.b/Linux:/c/bin:/p/bin:/p/adm/bin:/usr/local/bin:/usr/bin:/usr/sbin:/sbin:/bin:/usr/lib:/etc:/lib:/p/X6/bin:/usr/bin/X11:/u/vieraat/vieraat/jhi
    PERL_BADLANG (unset)
    SHELL=/bin/zsh
```

p5pRT commented 21 years ago

From @jhi

Another way of looking at this is that the behaviour of switching on UTF-8-ness based on locale *WITHOUT* the user having said 'use locale' is unprecedented. Maybe 'use locale' should be required, or maybe 'use locale "utf8"', or something completely different, like some PERL_FOOBAR environment variable?

p5pRT commented 21 years ago

From dankogai@dan.co.jp

jhi and Porters,

  A happy new year.

On Monday, Jan 6, 2003, at 02:12 Asia/Tokyo, Jarkko Hietaniemi (via RT) wrote:

> # New Ticket Created by Jarkko Hietaniemi
> # Please include the string: [perl #19743]
> # in the subject line of all future correspondence about this issue.
> # <URL: http://rt.perl.org/rt2/Ticket/Display.html?id=19743 >
>
> This is a bug report for perl from jhi@ugli.hut.fi, generated with the help of perlbug 1.34 running under perl v5.8.0.
>
> -----------------------------------------------------------------
> [Please enter your report here]
>
> As discussed in the perl-unicode@perl.org mailing list (Subject: CGI and UTF, see http://archive.develooper.com/perl-unicode@perl.org/) and argued by Benjamin Franz <snowhare@nihongo.org>, the implicit turning on of UTF-8-ness on filehandles based on locale setup can cause nasty action-at-distance messups.
>
> The crux of the matter seems to be that reading in illegal UTF-8 data does not trigger any warnings, but only later trying to use the data does. In this example (should work in Linuxes with utf8 locales installed) it's the ord() that gets punished, not the <>. Now imagine a few hundred lines of code between the <> and the ord() and you'll see the "distance-in-action".
>
>     $ ./perl -e 'print chr(255)' | env LC_ALL=en_US.utf8 ./perl -le '$a=<>;print ord($a)'
>     Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xff) in ord at -e line 1, <> line 1.
>     0

I am still away from home where I can test various perl builds further so this reply is not definitive. But the quick test shows that upgrading Encode should solve this problem. See this.

    % perl -MEncode -e 'print Encode->VERSION, "\n"'
    1.83

    perl -e 'print chr(255)' | env LC_ALL=en_US.utf8 perl -le '$a=<>;print ord($a)'
    perl: warning: Setting locale failed.
    perl: warning: Please check that your locale settings:
            LC_ALL = "en_US.utf8",
            LANG = (unset)
        are supported and installed on your system.
    perl: warning: Falling back to the standard locale ("C").
    255

Since this is on MacOS X 10.2.3, setting the locale fails as expected, hence the warning, but ${^ENCODING} is successfully set to utf8 and you get the correct result.

Encode prior to version 1.80 had a problem with 'use encoding "utf8"', and that, methinks, is the cause of the problem.

Dan the Encode Maintainer

p5pRT commented 21 years ago

From @jhi

> I am still away from home where I can test various perl builds further so this reply is not definitive. But the quick test shows that upgrading Encode should solve this problem. See this.
>
>     % perl -MEncode -e 'print Encode->VERSION, "\n"'
>     1.83

I don't think this is the issue here at all. People are complaining basically about two things​:

(1) They don't like the feature of /utf-?8/i in the locale setup turning on silently the utf8ness of filehandles (thus effectively breaking any "binary" filehandles on existing code), ESPECIALLY because they never said 'use locale'

(2) That reading in illegal UTF-8 (like the byte 255) won't barf when the input happens, only later if the malformed data is being used.

This:

    $ ./perl -e 'print chr(255)' | env LC_ALL=en_US.utf8 ./perl -Ilib -le '$a=<>;print ord($a)'
    Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xff) in ord at -e line 1, <> line 1.
    0

happens *with* Encode 1.83 in Linux.

>     perl -e 'print chr(255)' | env LC_ALL=en_US.utf8 perl -le '$a=<>;print ord($a)'
>     perl: warning: Setting locale failed.
>     perl: warning: Please check that your locale settings:
>             LC_ALL = "en_US.utf8",
>             LANG = (unset)
>         are supported and installed on your system.
>     perl: warning: Falling back to the standard locale ("C").
>     255
>
> Since this is on MacOS X 10.2.3, Setting locale fails as expected hence warning but ${^ENCODING} is successfully set to utf8 and you get the correct result.
>
> Encode prior to version 1.80 had problem with 'use encoding "utf8"' and that methinks is the cause of problem
>
> Dan the Encode Maintainer

-- Jarkko Hietaniemi <jhi@iki.fi> http://www.iki.fi/jhi/ "There is this special biologist word we use for 'stable'. It is 'dead'." -- Jack Cohen
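Complaint (1) above is about "binary" filehandles silently picking up a UTF-8 layer. As a defensive sketch (not part of any fix proposed here), code that really means bytes can say so explicitly with the `:raw` layer and `binmode`, which should not depend on what the locale suggests. `slurp_raw` is a hypothetical helper name; File::Temp is only used to make the example self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read a whole file as raw bytes, with an explicit
# layer on the open so that default layers do not apply to this handle.
sub slurp_raw {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "open $path: $!";
    local $/;                       # slurp mode
    my $bytes = <$fh>;
    close $fh;
    return $bytes;
}

my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;                       # raw bytes on the writing side too
print {$out} chr(255);
close $out or die "close: $!";

# The single byte 0xFF comes back untouched.
print ord(slurp_raw($path)), "\n";
```

This only helps code that is under your control; it does nothing for existing programs or modules that open files without naming a layer, which is the case the complaint is really about.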

p5pRT commented 21 years ago

From dankogai@dan.co.jp

On Tuesday, Jan 7, 2003, at 13:13 Asia/Tokyo, Jarkko Hietaniemi wrote:

> I don't think this is the issue here at all. People are complaining basically about two things:
>
> (1) They don't like the feature of /utf-?8/i in the locale setup turning on silently the utf8ness of filehandles (thus effectively breaking any "binary" filehandles on existing code), ESPECIALLY because they never said 'use locale'

Okay, this one is beyond the responsibility of (Encode|encoding).pm since ${^ENCODING} is set by perl core. But I don't like making Linux or other locale-savvy environments an exceptional case....

> (2) That reading in illegal UTF-8 (like the byte 255) won't barf when the input happens, only later if the malformed data is being used.

Sounds like a request that conflicts with (1). To meet (1) you have to turn ${^ENCODING} off, but to meet (2) you have to turn ${^ENCODING} on to have Encode detect malformed data.

> This:
>
>     $ ./perl -e 'print chr(255)' | env LC_ALL=en_US.utf8 ./perl -Ilib -le '$a=<>;print ord($a)'
>     Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xff) in ord at -e line 1, <> line 1.
>     0
>
> happens *with* Encode 1.83 in Linux.

Sounds like you need a working locale to duplicate the problem.

Anyway, have a careful look at the error message: "at -e line 1, <> line 1." Does that mean perl barfs when the input happens? If perl barfs only when the malformed data is being used, it isn't supposed to barf on <>. It looks to me that (2) is already solved.

IMHO, the easiest way to solve the problem is to set ${^ENCODING} when and only when

0. locale is set in the environment
1. "use locale" is EXPLICITLY set.

So far "use locale" is implicitly done. What does

  env LC_ALL=en_US.utf8 ./perl -Ilib -e 'print ${^ENCODING}->name'

say? It should print 'utf8' where en_US.utf8 works and error where not.

Dan the Encode Maintainer

p5pRT commented 21 years ago

From @tamias

On Tue, Jan 07, 2003 at 01:42:39PM +0900, Dan Kogai wrote:

> On Tuesday, Jan 7, 2003, at 13:13 Asia/Tokyo, Jarkko Hietaniemi wrote:
>
> >     $ ./perl -e 'print chr(255)' | env LC_ALL=en_US.utf8 ./perl -Ilib -le '$a=<>;print ord($a)'
> >     Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xff) in ord at -e line 1, <> line 1.
> >     0
> >
> > happens *with* Encode 1.83 in Linux.
>
> Sounds like you need a working locale to duplicate the problem.
>
> Anyway, have a careful look at the error message; "at -e line 1, <> line 1."; does that mean perl barfs whenever the input happens? If perl barfs when the malformed data is being used it isn't supposed to barf on <>. It looks to me that (2) is already solved.

Look closer​:

"Malformed UTF-8 character ... in ord"

The error occurs when the data is used in ord(), not when it is read in.

"... <> line 1." gives the most recently read input handle, and the number of times it's been read from, just in case that information is useful in debugging the problem. However, it is not necessarily relevant to the specific warning or error:

    % echo 1 | perl -we '$_ = <>; $b = undef; $b = $b + 2;'
    Use of uninitialized value at -e line 1, <> chunk 1.

Ronald
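Ronald's point can be checked without a terminal or a pipe. The sketch below captures the warning text and uses an in-memory filehandle named `IN` as a stand-in for the piped input; both the handle name and the variable names are illustrative assumptions. On a reasonably recent perl the captured message should end with the last-read handle and its line count even though the read itself was perfectly fine.

```perl
use strict;
use warnings;

# Collect warnings instead of printing them to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

my $data = "1\n";
open IN, '<', \$data or die "open: $!";   # in-memory handle standing in for the pipe
my $line = <IN>;                           # one successful, unremarkable read

my $undef;
my $sum = $undef + 2;                      # the warning is raised here, not by the read

# The message names the operation that failed ("in addition") and then
# appends the most recently read handle purely as debugging context.
print $warnings[0];
close IN;
```

So the trailing handle information says where the program last read from, not that the read was the culprit.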

p5pRT commented 21 years ago

From @jhi

    $ env LC_ALL=en_US.utf8 ./perl -Ilib -e 'print ${^ENCODING}->name'
    Can't call method "name" on an undefined value at -e line 1.


p5pRT commented 21 years ago

From @jhi

> Sounds like you need a working locale to duplicate the problem.

Yes, a working UTF-8 locale.

> IMHO, the easiest way to solve the problem is to set ${^ENCODING} when and only when
>
> 0. locale is set on environmnet
> 1. "use locale" is EXPLICITY set.

It would seem that currently ${^ENCODING} is not set at all by this UTF-8 locale thing (only ${^OPEN} is, see perl.c at about line 1520 or so). That would explain the problem #2.

> So far "use locale" is implicitly done. What does

I think arguably people could say that by 'use locale' they meant just the old byte-based locale, none of this fancy Unicode stuff. Maybe we should go e.g. 'use locale "utf8"' to enable this new feature.


p5pRT commented 21 years ago

From perl5-porters@ton.iguana.be

In article <20030107134029.GH285996@lyta.hut._i>, Jarkko Hietaniemi <jhi@iki.fi> writes:

> I think arguably people could say that by 'use locale' they meant just the old byte-based locale, none of this fancy Unicode stuff. Maybe we should go e.g. 'use locale "utf8"' to enable this new feature.

(I'm coming into this in the middle and my utf-8 knowledge is pretty shaky, so forgive me if I'm missing philosophies already in place. I'll boldly go where angels fear to tread anyway.)

Mm, is that needed? If you run a "use locale" program, you're already supposed to care a lot about what's in the locale environment variables. If these contain utf-8, that's hardly by accident. Basically it's not the programmer but the user running the program who knows whether his files are in utf8. So it seems to me that the way to read files should by default come from the environment, not the code.

So I suppose what you're discussing here is the conundrum for the *programmer*. He wants to allow the user to control how his text files are read, but if he just writes "use locale", he not only gets potential utf8 chars, he also changes the semantics of string ops to use all these fancy region features. And not only that, he may also get currency symbols, date formats, etc. But is it so terrible to force a programmer who cares enough to allow his users the full Unicode charset to also allow the user to specify the region semantics? It hardly makes sense to, e.g., have utf-8 files and still use /[a-zA-Z_]/ instead of /\w/ in your regexes. I think that in general users who care enough about international characters to set their locale also want to get their preferred region semantics (at least for the character-related part of the locale), so we should encourage programmers to give it to them.

Still, perl not being about bondage, it of course still makes sense to allow the programmer to set ONLY "I want to read files as utf-8 IF the user specified a utf-8 style locale", but that should be an exceptional thing and get a long name to stress that.

So it makes sense that normally you're supposed to use 'use locale' in programs that will do "utf-8 on filehandles by default if so specified by the user", and, for example, 'use locale "utf8ness_only"' if the programmer absolutely does not want to think about region semantics. And maybe something like 'use locale "string_stuff_only"' if you want user-specified utf8-ness, the right collation, \w and the like, but no mangled dates.

PS: I don't notice any discussion about the scope of whatever solution is chosen. Lexically scoped seems most sane, but what if you want to use some module that opens a file for you by proxy (e.g. File::Tail)? Especially if that module up to now worked for both text and binary files.

p5pRT commented 21 years ago

From @jhi

One half of this problem has now been fixed (by Encode 1.84), illegal UTF-8 will now warn (-w) immediately when read in.

For the second half, the implicit UTF-8-ification, Sarathy suggests extending the semantics of the -C switch (currently only meaningful in Win32 platforms). (Sarathy also points out that the lexical semantics of 'use locale' really wouldn't work out that well with the global effects of the UTF-8-ification.) In other words, one would need to use -C to get the utf-8-fy locale settings affecting the I/O layers.

p5pRT commented 21 years ago

From @jhi

In our previous episode we found out that there were two problems inherent in the implicit UTF-8-ification:

(1) The UTF-8 kicked in even when the user didn't ask for it. Lots of people using RH 8.0 have been bitten by this because the default locales are UTF-8.

(2) Even when and if the user wanted it, reading in malformed UTF-8 didn't do anything *immediately*. It was only later when and if further operations were attempted on the malformed data that the sad state was detected.

The issue (2) was fixed by Encode 1.84, now the <> (et alia) detect the evil data. (Though some further hacking may be required, a single UTF-8 tr/// test was broken by the Encode 1.84.)

So the issue (1) still would remain but the following patch attempts to rectify the situation, by making the UTF-8-ification explicit instead of implicit.

This patch (inlined since last time something ate my attachment) hijacks the -C switch (as suggested by Sarathy) to do the enabling of UTF-8-fied I/O. So no more implicit UTF-8 based on locale settings. (Use of the locale pragma wouldn't have worked that well since it is lexical in scope, while the UTF-8 decision is rather global in scope.) I added also an alternative way of enabling this feature: setting the $ENV{PERL_UTF8_LOCALE} to true (the -C, if present, wins).

In a perverse way going explicit is bad news since the implicit UTF-8-ification has certainly shaken many evil bugs out of the 5.8.0 tree (the B0B bug comes to mind, for example). Maybe for those platforms that have UTF-8 locales a new column of smoke testing (with env PERL_UTF8_LOCALE=1 LC_ALL=xx_YY.UTF-8) would be in order.
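The decision the patch describes can be modelled in a few lines of Perl. This is only a sketch of the rule as stated above (UTF-8 I/O needs both a UTF-8 locale and an explicit opt-in via -C or PERL_UTF8_LOCALE); the function names are hypothetical, and the real logic lives in C in locale.c and perl.c. The patch reads the environment variable with atoi(), which the regex below approximates.

```perl
use strict;
use warnings;

# Hypothetical model, not perl internals: does the locale environment
# point at UTF-8? Uses the usual precedence LC_ALL, LC_CTYPE, LANG.
sub locale_says_utf8 {
    my ($env) = @_;
    for my $var (qw(LC_ALL LC_CTYPE LANG)) {
        my $value = $env->{$var};
        next unless defined $value && length $value;
        return $value =~ /utf-?8/i ? 1 : 0;
    }
    return 0;
}

# Hypothetical model of the combined test: the user must opt in AND the
# locale must indicate UTF-8.
sub utf8_io_enabled {
    my ($env, $dash_c) = @_;
    # Approximates atoi(): only a leading non-zero number counts as true.
    my $env_flag = ($env->{PERL_UTF8_LOCALE} // '') =~ /^\s*[-+]?\d*[1-9]/ ? 1 : 0;
    my $wanted   = ($dash_c || $env_flag) ? 1 : 0;
    return ($wanted && locale_says_utf8($env)) ? 1 : 0;
}

my %rh8 = (LANG => 'en_US.UTF-8');   # a UTF-8 default locale, as on RH 8.0
print utf8_io_enabled(\%rh8, 0), "\n";                                # no opt-in
print utf8_io_enabled({ %rh8, PERL_UTF8_LOCALE => 1 }, 0), "\n";      # env opt-in
print utf8_io_enabled({ LC_ALL => 'fi_FI.ISO8859-1' }, 1), "\n";      # -C, non-UTF-8 locale
```

Under this model a UTF-8 default locale alone no longer changes I/O behaviour, which is the point of making the feature explicit.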

==== //depot/perl/embedvar.h#156 - /u/vieraat/vieraat/jhi/pp4/perl/embedvar.h ====
Index: perl/embedvar.h

Inline Patch

```diff
--- perl/embedvar.h.~1~ Tue Jan 14 17:29:10 2003
+++ perl/embedvar.h Tue Jan 14 17:29:10 2003
@@ -413,10 +413,10 @@
 #define PL_utf8_toupper (vTHX->Iutf8_toupper)
 #define PL_utf8_upper (vTHX->Iutf8_upper)
 #define PL_utf8_xdigit (vTHX->Iutf8_xdigit)
+#define PL_utf8locale (vTHX->Iutf8locale)
 #define PL_uudmap (vTHX->Iuudmap)
 #define PL_wantutf8 (vTHX->Iwantutf8)
 #define PL_warnhook (vTHX->Iwarnhook)
-#define PL_widesyscalls (vTHX->Iwidesyscalls)
 #define PL_xiv_arenaroot (vTHX->Ixiv_arenaroot)
 #define PL_xiv_root (vTHX->Ixiv_root)
 #define PL_xnv_arenaroot (vTHX->Ixnv_arenaroot)
@@ -702,10 +702,10 @@
 #define PL_Iutf8_toupper PL_utf8_toupper
 #define PL_Iutf8_upper PL_utf8_upper
 #define PL_Iutf8_xdigit PL_utf8_xdigit
+#define PL_Iutf8locale PL_utf8locale
 #define PL_Iuudmap PL_uudmap
 #define PL_Iwantutf8 PL_wantutf8
 #define PL_Iwarnhook PL_warnhook
-#define PL_Iwidesyscalls PL_widesyscalls
 #define PL_Ixiv_arenaroot PL_xiv_arenaroot
 #define PL_Ixiv_root PL_xiv_root
 #define PL_Ixnv_arenaroot PL_xnv_arenaroot
```

==== //depot/perl/gv.c#178 - /u/vieraat/vieraat/jhi/pp4/perl/gv.c ====
Index: perl/gv.c

Inline Patch

```diff
--- perl/gv.c.~1~ Tue Jan 14 17:29:10 2003
+++ perl/gv.c Tue Jan 14 17:29:10 2003
@@ -974,9 +974,15 @@
             goto ro_magicalize;
         else
             break;
+    case '\025':
+        if (len > 1 && strNE(name, "\025TF8_LOCALE"))
+            break;
+        goto ro_magicalize;
+
     case '\027': /* $^W & $^WARNING_BITS */
-        if (len > 1 && strNE(name, "\027ARNING_BITS")
-            && strNE(name, "\027IDE_SYSTEM_CALLS"))
+        if (len > 1
+            && strNE(name, "\027ARNING_BITS")
+            )
             break;
         goto magicalize;

@@ -1793,10 +1799,13 @@
             goto yes;
         }
         break;
+    case '\025':
+        if (len > 1 && strEQ(name, "\025TF8_LOCALE"))
+            goto yes;
     case '\027': /* $^W & $^WARNING_BITS */
         if (len == 1
             || (len == 12 && strEQ(name, "\027ARNING_BITS"))
-            || (len == 17 && strEQ(name, "\027IDE_SYSTEM_CALLS")))
+            )
         {
             goto yes;
         }
```

==== //depot/perl/intrpvar.h#112 - /u/vieraat/vieraat/jhi/pp4/perl/intrpvar.h ====
Index: perl/intrpvar.h

Inline Patch

```diff
--- perl/intrpvar.h.~1~ Tue Jan 14 17:29:10 2003
+++ perl/intrpvar.h Tue Jan 14 17:29:10 2003
@@ -48,7 +48,7 @@
  */

 PERLVAR(Idowarn, U8)
-PERLVAR(Iwidesyscalls, bool) /* wide system calls */
+PERLVAR(Iutf8locale, bool) /* utf8 locale detected */
 PERLVAR(Idoextract, bool)
 PERLVAR(Isawampersand, bool) /* must save all match strings */
 PERLVAR(Iunsafe, bool)
```

==== //depot/perl/locale.c#10 - /u/vieraat/vieraat/jhi/pp4/perl/locale.c ====
Index: perl/locale.c

Inline Patch

```diff
--- perl/locale.c.~1~ Tue Jan 14 17:29:10 2003
+++ perl/locale.c Tue Jan 14 17:29:10 2003
@@ -475,7 +475,7 @@

 #ifdef USE_PERLIO
     {
-      /* Set PL_wantutf8 to TRUE if using PerlIO _and_
+      /* Set PL_utf8locale to TRUE if using PerlIO _and_
          any of the following are true:
          - nl_langinfo(CODESET) contains /^utf-?8/i
          - $ENV{LC_ALL} contains /^utf-?8/i
@@ -487,37 +487,44 @@
          it overrides LC_MESSAGES for GNU gettext, and it also
          can have more than one locale, separated by spaces,
          in case you need to know.)
-         If PL_wantutf8 is true, perl.c:S_parse_body()
-         will turn on the PerlIO :utf8 discipline on STDIN, STDOUT,
-         STDERR, _and_ the default open discipline.
+         If PL_utf8locale and PL_wantutf8 (set by -C) are true,
+         perl.c:S_parse_body() will turn on the PerlIO :utf8 layer
+         on STDIN, STDOUT, STDERR, _and_ the default open discipline.
       */
-         bool wantutf8 = FALSE;
+         bool utf8locale = FALSE;
          char *codeset = NULL;
 #if defined(HAS_NL_LANGINFO) && defined(CODESET)
          codeset = nl_langinfo(CODESET);
 #endif
          if (codeset)
-              wantutf8 = (ibcmp(codeset, "UTF-8", 5) == 0 ||
-                          ibcmp(codeset, "UTF8", 4) == 0);
+              utf8locale = (ibcmp(codeset, "UTF-8", 5) == 0 ||
+                            ibcmp(codeset, "UTF8", 4) == 0);
 #if defined(USE_LOCALE)
          else { /* nl_langinfo(CODESET) is supposed to correctly
                  * interpret the locale environment variables,
                  * but just in case it fails, let's do this manually.
                  */
               if (lang)
-                   wantutf8 = (ibcmp(lang, "UTF-8", 5) == 0 ||
-                               ibcmp(lang, "UTF8", 4) == 0);
+                   utf8locale = (ibcmp(lang, "UTF-8", 5) == 0 ||
+                                 ibcmp(lang, "UTF8", 4) == 0);
 #ifdef USE_LOCALE_CTYPE
               if (curctype)
-                   wantutf8 = (ibcmp(curctype, "UTF-8", 5) == 0 ||
-                               ibcmp(curctype, "UTF8", 4) == 0);
+                   utf8locale = (ibcmp(curctype, "UTF-8", 5) == 0 ||
+                                 ibcmp(curctype, "UTF8", 4) == 0);
 #endif
               if (lc_all)
-                   wantutf8 = (ibcmp(lc_all, "UTF-8", 5) == 0 ||
-                               ibcmp(lc_all, "UTF8", 4) == 0);
+                   utf8locale = (ibcmp(lc_all, "UTF-8", 5) == 0 ||
+                                 ibcmp(lc_all, "UTF8", 4) == 0);
 #endif /* USE_LOCALE */
          }
-         if (wantutf8)
-              PL_wantutf8 = TRUE;
+         if (utf8locale)
+              PL_utf8locale = TRUE;
+    }
+    /* Set PL_wantutf8 to $ENV{PERL_UTF8_LOCALE} if using PerlIO.
+       This is an alternative to using the -C command line switch
+       (the -C if present will override this). */
+    {
+         char *p = PerlEnv_getenv("PERL_UTF8_LOCALE");
+         PL_wantutf8 = p ? (bool) atoi(p) : FALSE;
     }
 #endif
```

==== //depot/perl/mg.c#246 - /u/vieraat/vieraat/jhi/pp4/perl/mg.c ====
Index: perl/mg.c

Inline Patch

```diff
--- perl/mg.c.~1~ Tue Jan 14 17:29:10 2003
+++ perl/mg.c Tue Jan 14 17:29:10 2003
@@ -662,7 +662,11 @@
                  ? (PL_taint_warn || PL_unsafe ? -1 : 1)
                  : 0);
         break;
-    case '\027': /* ^W & $^WARNING_BITS & ^WIDE_SYSTEM_CALLS */
+    case '\025': /* $^UTF8_LOCALE */
+        if (strEQ(mg->mg_ptr, "\025TF8_LOCALE"))
+            sv_setiv(sv, (IV) (PL_wantutf8 && PL_utf8locale));
+        break;
+    case '\027': /* ^W & $^WARNING_BITS */
         if (*(mg->mg_ptr+1) == '\0')
             sv_setiv(sv, (IV)((PL_dowarn & G_WARN_ON) ? TRUE : FALSE));
         else if (strEQ(mg->mg_ptr+1, "ARNING_BITS")) {
@@ -679,8 +683,6 @@
             }
             SvPOK_only(sv);
         }
-        else if (strEQ(mg->mg_ptr+1, "IDE_SYSTEM_CALLS"))
-            sv_setiv(sv, (IV)PL_widesyscalls);
         break;
     case '1': case '2': case '3': case '4':
     case '5': case '6': case '7': case '8': case '9': case '&':
@@ -1925,7 +1927,13 @@
         PL_basetime = (Time_t)(SvIOK(sv) ? SvIVX(sv) : sv_2iv(sv));
 #endif
         break;
-    case '\027': /* ^W & $^WARNING_BITS & ^WIDE_SYSTEM_CALLS */
+    case '\025': /* $^UTF8_LOCALE */
+        if (SvIOK(sv) ? SvIVX(sv) : sv_2iv(sv))
+            PL_wantutf8 = PL_utf8locale;
+        else
+            PL_wantutf8 = FALSE;
+        break;
+    case '\027': /* ^W & $^WARNING_BITS */
         if (*(mg->mg_ptr+1) == '\0') {
             if ( ! (PL_dowarn & G_WARN_ALL_MASK)) {
                 i = SvIOK(sv) ? SvIVX(sv) : sv_2iv(sv);
@@ -1967,8 +1975,6 @@
                 }
             }
         }
-        else if (strEQ(mg->mg_ptr+1, "IDE_SYSTEM_CALLS"))
-            PL_widesyscalls = (bool)SvTRUE(sv);
         break;
     case '.':
         if (PL_localizing) {
```

==== //depot/perl/perl.c#461 - /u/vieraat/vieraat/jhi/pp4/perl/perl.c ====
Index: perl/perl.c

Inline Patch

```diff
--- perl/perl.c.~1~ Tue Jan 14 17:29:10 2003
+++ perl/perl.c Tue Jan 14 17:29:10 2003
@@ -1355,10 +1355,11 @@
     if (!PL_do_undump)
         init_postdump_symbols(argc,argv,env);

-    /* PL_wantutf8 is conditionally turned on by
+    /* PL_utf8locale is conditionally turned on by
      * locale.c:Perl_init_i18nl10n() if the environment
-     * look like the user wants to use UTF-8. */
-    if (PL_wantutf8) { /* Requires init_predump_symbols(). */
+     * look like the user wants to use UTF-8.
+     * PL_wantutf8 is turned on by -C or by $ENV{PERL_UTF8_LOCALE}. */
+    if (PL_utf8locale && PL_wantutf8) { /* Requires init_predump_symbols(). */
          IO* io;
          PerlIO* fp;
          SV* sv;
@@ -2156,7 +2157,7 @@
         return s + numlen;
     }
     case 'C':
-        PL_widesyscalls = TRUE;
+        PL_wantutf8 = TRUE; /* Can be set earlier by $ENV{PERL_UTF8_LOCALE}. */
         s++;
         return s;
     case 'F':
@@ -3397,7 +3398,7 @@
         for (; argc > 0; argc--,argv++) {
             SV *sv = newSVpv(argv[0],0);
             av_push(GvAVn(PL_argvgv),sv);
-            if (PL_widesyscalls)
+            if (PL_wantutf8)
                 (void)sv_utf8_decode(sv);
         }
     }
```

==== //depot/perl/perlapi.h#78 - /u/vieraat/vieraat/jhi/pp4/perl/perlapi.h ====
Index: perl/perlapi.h

Inline Patch

```diff
--- perl/perlapi.h.~1~ Tue Jan 14 17:29:10 2003
+++ perl/perlapi.h Tue Jan 14 17:29:10 2003
@@ -584,14 +584,14 @@
 #define PL_utf8_upper (*Perl_Iutf8_upper_ptr(aTHX))
 #undef PL_utf8_xdigit
 #define PL_utf8_xdigit (*Perl_Iutf8_xdigit_ptr(aTHX))
+#undef PL_utf8locale
+#define PL_utf8locale (*Perl_Iutf8locale_ptr(aTHX))
 #undef PL_uudmap
 #define PL_uudmap (*Perl_Iuudmap_ptr(aTHX))
 #undef PL_wantutf8
 #define PL_wantutf8 (*Perl_Iwantutf8_ptr(aTHX))
 #undef PL_warnhook
 #define PL_warnhook (*Perl_Iwarnhook_ptr(aTHX))
-#undef PL_widesyscalls
-#define PL_widesyscalls (*Perl_Iwidesyscalls_ptr(aTHX))
 #undef PL_xiv_arenaroot
 #define PL_xiv_arenaroot (*Perl_Ixiv_arenaroot_ptr(aTHX))
 #undef PL_xiv_root
```

==== //depot/perl/pod/perlrun.pod#67 - /u/vieraat/vieraat/jhi/pp4/perl/pod/perlrun.pod ====
Index: perl/pod/perlrun.pod

Inline Patch

```diff
--- perl/pod/perlrun.pod.~1~ Tue Jan 14 17:29:10 2003
+++ perl/pod/perlrun.pod Tue Jan 14 17:29:10 2003
@@ -266,11 +266,21 @@

 =item B<-C>

-enables Perl to use the native wide character APIs on the target system.
-The magic variable C<${^WIDE_SYSTEM_CALLS}> reflects the state of
-this switch. See L.
+enables Perl to use the Unicode APIs on the target system.

-This feature is currently only implemented on the Win32 platform.
+As of Perl 5.8.1, if C<-C> is used and the locale settings (the LC_ALL,
+LC_CTYPE, and LANG environment variables) indicate a UTF-8 locale,
+the STDIN is expected to be in UTF-8, the STDOUT and STDERR are
+expected to be in UTF-8, and C<:utf8> is the default file open layer.
+See L, L, and L for more information.
+The magic variable C<${^UTF8_LOCALE}> reflects this state,
+see L. (Another way of setting this
+variable is to set the environment variable PERL_UTF8_LOCALE.)
+
+(In Perls earlier than 5.8.1 the C<-C> switch was a Win32-only switch
+that enabled the use of Unicode-aware "wide system call" Win32 APIs.
+This feature was practically unused, however, and the command line
+switch was therefore "recycled".)

 =item B<-c>

```

==== //depot/perl/pod/perlunicode.pod#113 - /u/vieraat/vieraat/jhi/pp4/perl/pod/perlunicode.pod ====
Index: perl/pod/perlunicode.pod

Inline Patch

```diff
--- perl/pod/perlunicode.pod.~1~ Tue Jan 14 17:29:10 2003
+++ perl/pod/perlunicode.pod Tue Jan 14 17:29:10 2003
@@ -67,13 +67,6 @@
 external programs, from information provided by the system (such as %ENV),
 or from literals and constants in the source text.

-On Windows platforms, if the C<-C> command line switch is used or the
-${^WIDE_SYSTEM_CALLS} global flag is set to C<1>, all system calls
-will use the corresponding wide-character APIs. This feature is
-available only on Windows to conform to the API standard already
-established for that platform--and there are very few non-Windows
-platforms that have Unicode-aware APIs.
-
 The C pragma will always, regardless of platform, force byte
 semantics in a particular lexical scope. See L.

@@ -1050,10 +1043,14 @@

 =item *

-If your locale environment variables (LANGUAGE, LC_ALL, LC_CTYPE, LANG)
-contain the strings 'UTF-8' or 'UTF8' (case-insensitive matching),
-the default encodings of your STDIN, STDOUT, and STDERR, and of
-B, are considered to be UTF-8.
+If your locale environment variables (LC_ALL, LC_CTYPE, LANG)
+contain the strings 'UTF-8' or 'UTF8' (matched case-insensitively)
+B you enable using UTF-8 either by using the C<-C> command line
+switch or setting the PERL_UTF8_LOCALE environment variable to a true
+value, then the default encodings of your STDIN, STDOUT, and STDERR,
+and of B, are considered to be UTF-8.
+See L, L, and L for more
+information. The magic variable C<${^UTF8_LOCALE}> will also be set.

 =item *

@@ -1410,6 +1407,6 @@
 =head1 SEE ALSO

 L, L, L, L, L, L,
-L, L
+L, L

 =cut
```

==== //depot/perl/pod/perluniintro.pod#44 - /u/vieraat/vieraat/jhi/pp4/perl/pod/perluniintro.pod ====
Index: perl/pod/perluniintro.pod

Inline Patch

```diff
--- perl/pod/perluniintro.pod.~1~ Tue Jan 14 17:29:10 2003
+++ perl/pod/perluniintro.pod Tue Jan 14 17:29:10 2003
@@ -172,13 +172,15 @@
 to this sample program ensures that the output is completely UTF-8,
 and removes the program's warning.

-If your locale environment variables (C, C,
-C, C) contain the strings 'UTF-8' or 'UTF8',
-regardless of case, then the default encoding of your STDIN, STDOUT,
-and STDERR and of B, is UTF-8. Note that
-this means that Perl expects other software to work, too: if Perl has
-been led to believe that STDIN should be UTF-8, but then STDIN coming
-in from another command is not UTF-8, Perl will complain about the
+If your locale environment variables (C, C, C)
+contain the strings 'UTF-8' or 'UTF8' (matched case-insensitively)
+B you enable using UTF-8 either by using the C<-C> command line
+switch or by setting the PERL_UTF8_LOCALE environment variable to
+a true value, then the default encoding of your STDIN, STDOUT, and
+STDERR, and of B, is UTF-8. Note that this
+means that Perl expects other software to work, too: if Perl has been
+led to believe that STDIN should be UTF-8, but then STDIN coming in
+from another command is not UTF-8, Perl will complain about the
 malformed UTF-8.

 All features that combine Unicode and I/O also require using the new
```

==== //depot/perl/pod/perlvar.pod#111 - /u/vieraat/vieraat/jhi/pp4/perl/pod/perlvar.pod ====
Index: perl/pod/perlvar.pod

Inline Patch

```diff
--- perl/pod/perlvar.pod.~1~ Tue Jan 14 17:29:10 2003
+++ perl/pod/perlvar.pod Tue Jan 14 17:29:10 2003
@@ -1109,6 +1109,16 @@
 B<-T>), 0 for off, -1 when only taint warnings are enabled (i.e. with
 B<-t> or B<-TU>). This variable is read-only.

+=item ${^UTF8_LOCALE}
+
+Reflects whether the locale settings indicated the use of UTF-8 and that
+the use of UTF-8 was enabled either by the C<-C> command line switch or
+by setting the PERL_UTF8_LOCALE environment variable to a true value.
+This variable is read-only. If true, the STDIN is expected to be in
+UTF-8, the STDOUT and STDERR are in UTF-8, and C<:utf8> is the default
+file open layer. See L, L, and L
+for more information.
+
 =item $PERL_VERSION

 =item $^V
@@ -1148,21 +1158,6 @@
 The current set of warning checks enabled by the C pragma.
 See the documentation of C for more details.

-=item ${^WIDE_SYSTEM_CALLS}
-
-Global flag that enables system calls made by Perl to use wide character
-APIs native to the system, if available. This is currently only implemented
-on the Windows platform.
-
-This can also be enabled from the command line using the C<-C> switch.
-
-The initial value is typically C<0> for compatibility with Perl versions
-earlier than 5.6, but may be automatically set to C<1> by Perl if the system
-provides a user-settable default (e.g., C<$ENV{LC_CTYPE}>).
-
-The C pragma always overrides the effect of this flag in the current
-lexical scope. See L.
-
 =item $EXECUTABLE_NAME

 =item $^X
```

End of Patch.

-- Jarkko Hietaniemi <jhi@iki.fi> http://www.iki.fi/jhi/ "There is this special biologist word we use for 'stable'. It is 'dead'." -- Jack Cohen

p5pRT commented 21 years ago

From @rgs

Jarkko Hietaniemi <jhi@iki.fi> wrote:

> ==== //depot/perl/locale.c#10 - /u/vieraat/vieraat/jhi/pp4/perl/locale.c ====
> Index: perl/locale.c
>
>     --- perl/locale.c.~1~ Tue Jan 14 17:29:10 2003
>     +++ perl/locale.c Tue Jan 14 17:29:10 2003
>     ...
>     @@ -487,37 +487,44 @@
>              it overrides LC_MESSAGES for GNU gettext,
>              and it also can have more than one locale,
>              separated by spaces, in case you need to know.)
>     -        If PL_wantutf8 is true, perl.c:S_parse_body()
>     -        will turn on the PerlIO :utf8 discipline on STDIN, STDOUT,
>     -        STDERR, _and_ the default open discipline.
>     +        If PL_utf8locale and PL_wantutf8 (set by -C) are true,
>     +        perl.c:S_parse_body() will turn on the PerlIO :utf8 layer
>     +        on STDIN, STDOUT, STDERR, _and_ the default open discipline.
>           */
>     -     bool wantutf8 = FALSE;
>     +     bool utf8locale = FALSE;
>           char *codeset = NULL;
>      #if defined(HAS_NL_LANGINFO) && defined(CODESET)
>           codeset = nl_langinfo(CODESET);
>      #endif
>           if (codeset)
>     -          wantutf8 = (ibcmp(codeset, "UTF-8", 5) == 0 ||
>     -                      ibcmp(codeset, "UTF8", 4) == 0);
>     +          utf8locale = (ibcmp(codeset, "UTF-8", 5) == 0 ||
>     +                        ibcmp(codeset, "UTF8", 4) == 0);
>      #if defined(USE_LOCALE)
>           else { /* nl_langinfo(CODESET) is supposed to correctly
>                   * interpret the locale environment variables,
>                   * but just in case it fails, let's do this manually.
>                   */
>                if (lang)
>     -               wantutf8 = (ibcmp(lang, "UTF-8", 5) == 0 ||
>     -                           ibcmp(lang, "UTF8", 4) == 0);
>     +               utf8locale = (ibcmp(lang, "UTF-8", 5) == 0 ||
>     +                             ibcmp(lang, "UTF8", 4) == 0);
>      #ifdef USE_LOCALE_CTYPE
>                if (curctype)
>     -               wantutf8 = (ibcmp(curctype, "UTF-8", 5) == 0 ||
>     -                           ibcmp(curctype, "UTF8", 4) == 0);
>     +               utf8locale = (ibcmp(curctype, "UTF-8", 5) == 0 ||
>     +                             ibcmp(curctype, "UTF8", 4) == 0);
>      #endif
>                if (lc_all)
>     -               wantutf8 = (ibcmp(lc_all, "UTF-8", 5) == 0 ||
>     -                           ibcmp(lc_all, "UTF8", 4) == 0);
>     +               utf8locale = (ibcmp(lc_all, "UTF-8", 5) == 0 ||
>     +                             ibcmp(lc_all, "UTF8", 4) == 0);
>      #endif /* USE_LOCALE */
>           }

I suggest moving this closing bracket (the } right after #endif /* USE_LOCALE */) one line up, to just before the #endif. (I suspect that building bleadperl with -DNO_LOCALE is currently broken.)

>     -     if (wantutf8)
>     -          PL_wantutf8 = TRUE;
>     +     if (utf8locale)
>     +          PL_utf8locale = TRUE;
>     +    }
>     +    /* Set PL_wantutf8 to $ENV{PERL_UTF8_LOCALE} if using PerlIO.
>     +       This is an alternative to using the -C command line switch
>     +       (the -C if present will override this). */
>     +    {
>     +     char *p = PerlEnv_getenv("PERL_UTF8_LOCALE");
>     +     PL_wantutf8 = p ? (bool) atoi(p) : FALSE;
>          }
>      #endif
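The two-flag logic in the quoted patch can be read as "the locale says UTF-8" AND "the user asked for it". The following is a minimal standalone sketch of that reading, not the code in locale.c: the function names are hypothetical, POSIX `strncasecmp` stands in for Perl's internal `ibcmp`, and the `-C` switch is modelled as a plain boolean that simply turns the request on.

```c
#include <stdbool.h>
#include <stdlib.h>    /* atoi */
#include <strings.h>   /* strncasecmp (POSIX) */

/* True if a codeset name such as "UTF-8" or "utf8" denotes UTF-8.
 * Like the patch, this is a case-insensitive prefix comparison. */
bool codeset_is_utf8(const char *codeset)
{
    if (codeset == NULL)
        return false;
    return strncasecmp(codeset, "UTF-8", 5) == 0 ||
           strncasecmp(codeset, "UTF8", 4) == 0;
}

/* True if a PERL_UTF8_LOCALE-style value is numerically nonzero.
 * The patch uses atoi(), so "1" is true while "0", "" and "yes" are false. */
bool env_flag_is_true(const char *value)
{
    return value != NULL && atoi(value) != 0;
}

/* Assumed combination rule: UTF-8 I/O is enabled only when the locale
 * indicates UTF-8 AND the user opted in, either via -C (dash_c) or via
 * the environment variable. */
bool utf8_io_enabled(const char *codeset, const char *env_value, bool dash_c)
{
    bool wanted = dash_c || env_flag_is_true(env_value);
    return codeset_is_utf8(codeset) && wanted;
}
```

Under these assumptions, `utf8_io_enabled(nl_langinfo(CODESET), getenv("PERL_UTF8_LOCALE"), false)` stays false in a UTF-8 locale unless the variable is set to a nonzero number, which is the opt-in behaviour the patch is after.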

p5pRT commented 21 years ago

From goldbb2@earthlink.net

Jarkko Hietaniemi wrote:

> In our previous episode we found out that there were two problems inherent in the implicit UTF-8-ification:
>
> (1) The UTF-8 kicked in even when the user didn't ask for it. Lots of people using RH 8.0 have been bitten by this because the default locales are UTF-8.
> [snip]
> So the issue (1) still would remain but the following patch attempts to rectify the situation, by making the UTF-8-ification explicit instead of implicit.

I've a foolish question -- if, on *nix, we're making handles binary by default, and text mode only when asked, then ought we do the same on windows? No, I don't think we actually should, but it would be logically consistent :-).

Perhaps more important -- on redhat8, if we write a file in latin1, or any other non-utf8 mode, how will that file be treated by other utilities?

On windows, for example, if we write a file in binary mode, and print out mere "\n" chars between lines, then some utilities will break, due to them expecting CRLF.

I fear (but don't know for certain) that if we produce files with bytes whose high bits are set, and which aren't properly formed utf8, then at least some text processing utilities will whine about malformed utf8 characters.

-- $..='(?:(?{local$^C=$^C|'.(1<<$_).'})|)'for+a..4; $..='(?{print+substr"\n !,$^C,1 if $^C<26})(?!)'; $.=~s'!'haktrsreltanPJ,r coeueh"';BEGIN{${"\cH"} |=(1<<21)}""=~$.;qw(Just another Perl hacker,\n);

p5pRT commented 21 years ago

From @jhi

> I've a foolish question -- if, on *nix, we're making handles binary by default, and text mode only when asked, then ought we do the same on

We're *returning* handles binary by default on *Nix.

> windows? No, I don't think we actually should, but it would be logically consistent :-).

The Win32 way is not to be logically consistent :-)

> Perhaps more important -- on redhat8, if we write a file in latin1, or any other non-utf8 mode, how will that file be treated by other utilities?

Judging by users' experience I would say they are not yet expecting magical UTF-8-ification.


p5pRT commented 21 years ago

From @jhi

>     #endif /* USE_LOCALE */
>     }
>
> I suggest moving this closing bracket one line up, to just before the #endif. (I suspect that building bleadperl with -DNO_LOCALE is currently broken.)

Ahhh, good catch.


p5pRT commented 21 years ago

From @jhi

I have now applied the patch (with the } moved as noted by rgs) to bleadperl as #18490. As noted earlier, this means that UTF-8 locales no longer automagically cause all I/O to be in UTF-8.

(I've got a patch coming up which will hopefully address the UTF-8 tr/// test breakage caused by Encode 1.84 -- which fixed the other half of the problem, namely that illegal UTF-8 wasn't detected immediately when read in.)
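To make "detected immediately when read in" concrete, here is a minimal sketch of validating a freshly read buffer before any later code (such as the ord() in the original report) touches it. This is not Perl's or Encode's actual check: the function name is hypothetical, and the rules are strict modern UTF-8 (no overlong forms, no surrogates, nothing above U+10FFFF), which is an assumption and stricter than what Perl accepts internally.

```c
#include <stddef.h>

/* Returns the offset of the first byte that starts a malformed or
 * truncated sequence, or len if the whole buffer is well-formed UTF-8. */
size_t first_malformed_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char lead = buf[i];
        size_t need, k;          /* need = number of continuation bytes */
        unsigned long cp, min;   /* decoded code point, smallest legal value */

        if (lead < 0x80) {       /* plain ASCII */
            i++;
            continue;
        } else if (lead >= 0xC2 && lead <= 0xDF) {
            need = 1; cp = lead & 0x1F; min = 0x80;
        } else if (lead >= 0xE0 && lead <= 0xEF) {
            need = 2; cp = lead & 0x0F; min = 0x800;
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            need = 3; cp = lead & 0x07; min = 0x10000;
        } else {
            return i;            /* 0x80..0xC1 and 0xF5..0xFF never start a sequence */
        }

        if (len - i <= need)     /* sequence runs past the end of the buffer */
            return i;

        for (k = 1; k <= need; k++) {
            unsigned char c = buf[i + k];
            if ((c & 0xC0) != 0x80)   /* not a continuation byte */
                return i;
            cp = (cp << 6) | (c & 0x3F);
        }

        /* Reject overlong encodings, surrogates, and out-of-range values. */
        if (cp < min || cp > 0x10FFFFUL || (cp >= 0xD800 && cp <= 0xDFFF))
            return i;

        i += need + 1;
    }
    return len;
}
```

With the single byte 0xFF from the original report, this returns offset 0, so a reader could warn at the point of input instead of several hundred lines later.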


p5pRT commented 21 years ago

From abe@ztreet.demon.nl

On a sunny winter's day (Tuesday 14 January 2003 16:49), Jarkko Hietaniemi wrote:

> [snip]
>
> In a perverse way going explicit is bad news since the implicit UTF-8-ification has certainly shaken many evil bugs out of the 5.8.0 tree (the B0B bug comes to mind, for example). Maybe for those platforms that have UTF-8 locales a new column of smoke testing (with env PERL_UTF8_LOCALE=1 LC_ALL=xx_YY.UTF-8) would be in order.

I've been thinking about this and did some testing (on SuSE 8.0 with heaps of locales).

1) perl-5.8.x will not build for me with PERL_UTF8_LOCALE=1 LC_ALL=en_US.utf8

    $ cd perl-5.8.x
    $ PERL_UTF8_LOCALE=1 LC_ALL=en_US.utf8 ./Configure -des -Dusedevel
    $ PERL_UTF8_LOCALE=1 LC_ALL=en_US.utf8 make

Dies on building Digest::MD5

2) perl-current with those locale settings gives:

    $ cd t
    $ PERL_UTF8_LOCALE=1 LC_ALL=en_US.utf8 ./perl harness
    Failed Test            Stat Wstat Total Fail  Failed  List of Failed
    -------------------------------------------------------------------------------
    ../ext/Encode/t/CJKT.t   21  5376    42   21  50.00%  1-3 7-9 13-15 19-21
                                                          25-27 31-33 37-39
    49 tests and 413 subtests skipped.
    Failed 1/754 test scripts, 99.87% okay. 21/70272 subtests failed, 99.97% okay.

3) Do we have enough data if we *only* run 'make test' with locale settings?

4) Shouldn't we test both ':stdio' and ':perlio' layers under locale?

5) If not, is ':perlio' the one to choose or should we just use an empty $ENV{PERLIO}?

I'm ready to hack on the Test::Smoke suite; I would like to have this in 1.17.

Good luck,

Abe

-- I think this requires more thought, therefore I'm excising the "promise" from perldelta and replacing it with more non-committal mumbling. -- Jarkko Hietaniemi on p5p @ 2002-05-27

p5pRT commented 21 years ago

From @jhi

> I've been thinking about this and did some testing (on SuSE 8.0 with heaps of locales).
>
> 1) perl-5.8.x will not build for me with PERL_UTF8_LOCALE=1 LC_ALL=en_US.utf8
>
>     $ cd perl-5.8.x
>     $ PERL_UTF8_LOCALE=1 LC_ALL=en_US.utf8 ./Configure -des -Dusedevel

I don't think the PERL_UTF8_LOCALE=1 will be of much use here, since there is no Perl yet.

>     $ PERL_UTF8_LOCALE=1 LC_ALL=en_US.utf8 make

Note that the names of the UTF-8 locales vary from OS to OS, so don't hardwire anything into the smoke scripts.

> Dies on building Digest::MD5

Hmmm, will take a look.

> 2) perl-current with those locale settings gives:

Ummm, what do you mean by 'perl-5.8.x' as opposed to 'perl-current'?

>     $ cd t
>     $ PERL_UTF8_LOCALE=1 LC_ALL=en_US.utf8 ./perl harness
>     Failed Test            Stat Wstat Total Fail  Failed  List of Failed
>     -------------------------------------------------------------------------------
>     ../ext/Encode/t/CJKT.t   21  5376    42   21  50.00%  1-3 7-9 13-15 19-21
>                                                           25-27 31-33 37-39
>     49 tests and 413 subtests skipped.
>     Failed 1/754 test scripts, 99.87% okay. 21/70272 subtests failed, 99.97% okay.
>
> 3) Do we have enough data if we *only* run 'make test' with locale settings?

'make test' as opposed to 'harness'? Of course, in general, the harness gives more info.

> 4) Shouldn't we test both ':stdio' and ':perlio' layers under locale?

I would have thought that ':stdio' has no chance of doing much of the Unicode I/O tests anyway, so in the best case we'll just get a lot of skipped tests (in the worst case we'll get a lot of tests failing to run).

> 5) If not, is ':perlio' the one to choose or should we just use an empty $ENV{PERLIO}?

In other but related news, I have now if(0)'ed the "widesyscalls" code, so that Win32 builds should work again.


p5pRT commented 21 years ago

From abe@ztreet.demon.nl

On a sunny winter's day (Thursday 16 January 2003 21:40), Jarkko Hietaniemi wrote:

> > I've been thinking about this and did some testing (on SuSE 8.0 with heaps of locales).
> >
> > 1) perl-5.8.x will not build for me with PERL_UTF8_LOCALE=1 LC_ALL=en_US.utf8
> >
> >     $ cd perl-5.8.x
> >     $ PERL_UTF8_LOCALE=1 LC_ALL=en_US.utf8 ./Configure -des -Dusedevel
>
> I don't think here the PERL_UTF8_LOCALE=1 will be of much use since there is no Perl, yet.

Okay, that is silly then.

> >     $ PERL_UTF8_LOCALE=1 LC_ALL=en_US.utf8 make
>
> Note that the name of the UTF-8 locales varies from OS to OS, so don't hardwire anything to the smoke scripts.

Nope, it will be configurable, as long as we can stick to $ENV{LC_ALL}.

> > Dies on building Digest::MD5
>
> Hmmm, will take a look.
>
> > 2) perl-current with those locale settings gives:
>
> Ummm, what do you mean by 'perl-5.8.x' as opposed to 'perl-current'?

I thought that

    rsync ftp.linux.activestate.com::perl-5.8.x

was 5.8-maint (your tree, 5.8.1 to be)

and that

    rsync ftp.linux.activestate.com::perl-current

was blead (Hugo's tree, 5.10.0 to be)

(That is how I did the [5.8.0] smoke and manual testing)

> >     $ cd t
> >     $ PERL_UTF8_LOCALE=1 LC_ALL=en_US.utf8 ./perl harness
> >     Failed Test            Stat Wstat Total Fail  Failed  List of Failed
> >     -------------------------------------------------------------------------------
> >     ../ext/Encode/t/CJKT.t   21  5376    42   21  50.00%  1-3 7-9 13-15 19-21
> >                                                           25-27 31-33 37-39
> >     49 tests and 413 subtests skipped.
> >     Failed 1/754 test scripts, 99.87% okay. 21/70272 subtests failed, 99.97% okay.
> >
> > 3) Do we have enough data if we *only* run 'make test' with locale settings?
>
> 'make test' as opposed to 'harness'? Of course, in general, the harness gives more info.

No, I actually meant 'make test' as opposed to 'make' *and* 'make test', since there seems to be a difference (for 5.8.x).

> > 4) Shouldn't we test both ':stdio' and ':perlio' layers under locale?
>
> I would have thought that the ':stdio' has no chance of doing much of the Unicode I/O tests anyway, so in the best case we'll just get a lot of skipped tests (in the worse cases we'll get a lot of tests failing to run).
>
> > 5) If not, is ':perlio' the one to choose or should we just use an empty $ENV{PERLIO}?

Hmmm..., I'll try with locale settings and $ENV{PERLIO} unset.

You might see a smoke report with this over the weekend. Just let me know what you want and I'll try to implement it.

> In other but related news, I now if(0)'ed the "widesyscalls" code, so that Win32 builds should work again.

Good luck,

Abe

-- Tim Bunce> Here's an easy fix: deprecate them.

*All* of them? \<finger wavering over the red big button INITIATE VSTRINGS ANNIHILATION> -- Jarkko Hietaniemi on p5p @ 2001-11-14

p5pRT commented 21 years ago

From @jhi

> > Ummm, what do you mean by 'perl-5.8.x' as opposed to 'perl-current'?
>
> I thought that
>
>     rsync ftp.linux.activestate.com::perl-5.8.x
>
> was 5.8-maint (your tree, 5.8.1 to be)
>
> and that
>
>     rsync ftp.linux.activestate.com::perl-current
>
> was blead (Hugo's tree, 5.10.0 to be)

That's right. But note that for the moment I'm working the Unicode changes first into the bleadperl, the maintperl is lagging a bit. I'll try syncing them before Saturday.


p5pRT commented 21 years ago

From @jhi

I think this can now be considered resolved since there's no more implicit UTF-8-ification.

p5pRT commented 21 years ago

@jhi - Status changed from 'new' to 'resolved'