As discussed on the perl-unicode@perl.org mailing list (Subject: CGI and UTF, see http://archive.develooper.com/perl-unicode@perl.org/) and argued by Benjamin Franz <snowhare@nihongo.org>, the implicit turning on of UTF-8-ness on filehandles based on locale setup can cause nasty action-at-a-distance mess-ups.
The crux of the matter seems to be that reading in illegal UTF-8 data does not trigger any warnings, but only later trying to use the data does. In this example (should work on Linux systems with UTF-8 locales installed) it's the ord() that gets punished, not the <>. Now imagine a few hundred lines of code between the <> and the ord() and you'll see the "distance-in-action".
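$ ./perl -e 'print chr(255)' | env LC_ALL=en_US.utf8 ./perl -le '$a=<STDIN>;print ord($a)'
Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xff) in ord at -e line 1, <STDIN> line 1.
0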
I don't yet know the best way to solve this without slowing down all I/O (well, just the "I") too much. Benjamin suggests either a (mandatory?) warning when UTF-8-ness is switched on for filehandles because of the locale setting, or some explicit switch to enable the feature. But in any case, the issue has now been recorded.
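Until the core behaviour is settled, a program can guard itself by decoding input explicitly at the point where it is read, so that bad bytes fail there and not a few hundred lines later. The following is a minimal sketch, assuming the core Encode module is available; decode_utf8_strict is a hypothetical helper name, not something Perl or Encode provides.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper (not part of Perl or Encode): decode raw bytes as
# UTF-8 and die immediately if they are malformed.
sub decode_utf8_strict {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() may modify its input when a CHECK is given
    return decode('UTF-8', $copy, FB_CROAK);
}

# "caf\xC3\xA9" is well-formed UTF-8; the lone byte 255 is not.
my $good = eval { decode_utf8_strict("caf\xC3\xA9") };
my $bad  = eval { decode_utf8_strict("\xFF") };

print defined $good ? "good: decoded\n"  : "good: rejected\n";
print defined $bad  ? "bad: accepted\n"  : "bad: rejected at decode time\n";
```

The cost is one explicit decode per read, which is exactly the kind of overhead the report worries about for the general case, so this is a per-program workaround and not a proposal for the core.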
Another way of looking at this is that the behaviour of switching on UTF-8-ness based on locale *WITHOUT* the user having said 'use locale' is unprecedented. Maybe 'use locale' should be required, or maybe 'use locale "utf8"', or something completely different, like some PERL_FOOBAR environment variable?
jhi and Porters,
A happy new year.
On Monday, Jan 6, 2003, at 02:12 Asia/Tokyo, Jarkko Hietaniemi (via RT) wrote:
# New Ticket Created by Jarkko Hietaniemi
# Please include the string: [perl #19743]
# in the subject line of all future correspondence about this issue.
# <URL: http://rt.perl.org/rt2/Ticket/Display.html?id=19743 >
This is a bug report for perl from jhi@ugli.hut.fi, generated with the help of perlbug 1.34 running under perl v5.8.0.
-----------------------------------------------------------------
[Please enter your report here]
$ ./perl -e 'print chr(255)' | env LC_ALL=en_US.utf8 ./perl -le '$a=<STDIN>;print ord($a)'
Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xff) in ord at -e line 1, <STDIN> line 1.
0
I am still away from home where I can test various perl builds further so this reply is not definitive. But the quick test shows that upgrading Encode should solve this problem. See this.
% perl -MEncode -e 'print Encode->VERSION, "\n"'
1.83
perl -e 'print chr(255)' | env LC_ALL=en_US.utf8 perl -le '$a=<STDIN>;print ord($a)'
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
    LC_ALL = "en_US.utf8",
    LANG = (unset)
are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
255
Since this is on Mac OS X 10.2.3, setting the locale fails as expected, hence the warning, but ${^ENCODING} is successfully set to utf8 and you get the correct result.
Encode prior to version 1.80 had a problem with 'use encoding "utf8"', and that, methinks, is the cause of the problem.
Dan the Encode Maintainer
I am still away from home where I can test various perl builds further so this reply is not definitive. But the quick test shows that upgrading Encode should solve this problem. See this.
% perl -MEncode -e 'print Encode->VERSION, "\n"'
1.83
I don't think this is the issue here at all. People are complaining basically about two things:
(1) They don't like the feature of /utf-?8/i in the locale setup silently turning on the utf8ness of filehandles (thus effectively breaking any "binary" filehandles in existing code), ESPECIALLY because they never said 'use locale'.
(2) That reading in illegal UTF-8 (like the byte 255) won't barf when the input happens, only later if the malformed data is being used.
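For complaint (1), existing code can protect a "binary" filehandle by stating its layer explicitly, so the locale no longer has a say. A minimal sketch, assuming a PerlIO-enabled perl and the core File::Temp module; the temporary file stands in for whatever binary data the program really reads.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write the single byte 255 to a scratch file, with no layers applied.
my ($out, $path) = tempfile(UNLINK => 1);
binmode($out);
print {$out} chr(255);
close($out) or die "close: $!";

# Open with an explicit :raw layer so that no default layer, whether set
# from the locale or by any other means, can reinterpret the bytes.
open(my $in, '<:raw', $path) or die "open: $!";
my $byte = <$in>;
close($in);

print ord($byte), "\n";
```

This only helps code that the author controls; a module that opens files on the caller's behalf still gets whatever default is in effect.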
This:
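$ ./perl -e 'print chr(255)' | env LC_ALL=en_US.utf8 ./perl -Ilib -le '$a=<STDIN>;print ord($a)'
Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xff) in ord at -e line 1, <STDIN> line 1.
0
happens *with* Encode 1.83 in Linux.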
-- Jarkko Hietaniemi <jhi@iki.fi> http://www.iki.fi/jhi/ "There is this special biologist word we use for 'stable'. It is 'dead'." -- Jack Cohen
On Tuesday, Jan 7, 2003, at 13:13 Asia/Tokyo, Jarkko Hietaniemi wrote:
I don't think this is the issue here at all. People are complaining basically about two things:
(1) They don't like the feature of /utf-?8/i in the locale setup silently turning on the utf8ness of filehandles (thus effectively breaking any "binary" filehandles in existing code), ESPECIALLY because they never said 'use locale'
Okay, this one is beyond the responsibility of (Encode|encoding).pm since ${^ENCODING} is set by perl core. But I don't like making Linux or other locale-savvy environments an exceptional case....
(2) That reading in illegal UTF-8 (like the byte 255) won't barf when the input happens, only later if the malformed data is being used.
Sounds like a request that conflicts with (1). To meet (1) you have to turn ${^ENCODING} off, but to meet (2) you have to turn ${^ENCODING} on to have Encode detect malformed data.
This:
$ ./perl -e 'print chr(255)' | env LC_ALL=en_US.utf8 ./perl -Ilib -le '$a=<STDIN>;print ord($a)'
Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xff) in ord at -e line 1, <STDIN> line 1.
0
happens *with* Encode 1.83 in Linux.
Sounds like you need a working locale to duplicate the problem.
Anyway, have a careful look at the error message; "at -e line 1, <STDIN> line 1."; does that mean perl barfs whenever the input happens? If perl barfs when the malformed data is being used it isn't supposed to barf on <STDIN>. It looks to me that (2) is already solved.
IMHO, the easiest way to solve the problem is to set ${^ENCODING} when and only when
0. the locale is set in the environment
1. "use locale" is EXPLICITLY set.
So far "use locale" is implicitly done. What does
env LC_ALL=en_US.utf8 ./perl -Ilib -e 'print ${^ENCODING}->name'
say? It should print 'utf8' where en_US.utf8 works and error where not.
Dan the Encode Maintainer
On Tue, Jan 07, 2003 at 01:42:39PM +0900, Dan Kogai wrote:
On Tuesday, Jan 7, 2003, at 13:13 Asia/Tokyo, Jarkko Hietaniemi wrote:
$ ./perl -e 'print chr(255)' | env LC_ALL=en_US.utf8 ./perl -Ilib -le '$a=<STDIN>;print ord($a)'
Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xff) in ord at -e line 1, <STDIN> line 1.
0
happens *with* Encode 1.83 in Linux.
Sounds like you need a working locale to duplicate the problem.
Anyway, have a careful look at the error message; "at -e line 1, <STDIN> line 1."; does that mean perl barfs whenever the input happens? If perl barfs when the malformed data is being used it isn't supposed to barf on <STDIN>. It looks to me that (2) is already solved.
Look closer:
"Malformed UTF-8 character ... in ord"
The error occurs when the data is used in ord(), not when it is read in.
"... \
% echo 1 | perl -we '$_ = \
Ronald
$ env LC_ALL=en_US.utf8 ./perl -Ilib -e 'print ${^ENCODING}->name'
Can't call method "name" on an undefined value at -e line 1.
-- Jarkko Hietaniemi
Sounds like you need a working locale to duplicate the problem.
Yes, a working UTF-8 locale.
IMHO, the easiest way to solve the problem is to set ${^ENCODING} when and only when
0. the locale is set in the environment
1. "use locale" is EXPLICITLY set.
It would seem that currently ${^ENCODING} is not set at all by this UTF-8 locale thing (only ${^OPEN} is, see perl.c at about line 1520 or so). That would explain problem #2.
So far "use locale" is implicitly done. What does
I think arguably people could say that by 'use locale' they meant just the old byte-based locale, none of this fancy Unicode stuff. Maybe we should go with e.g. 'use locale "utf8"' to enable this new feature.
-- Jarkko Hietaniemi
In article <20030107134029.GH285996@lyta.hut._i>, Jarkko Hietaniemi <jhi@iki.fi> writes:
I think arguably people could say that by 'use locale' they meant just the old byte-based locale, none of this fancy Unicode stuff. Maybe we should go with e.g. 'use locale "utf8"' to enable this new feature.
(I'm coming into this in the middle and my utf-8 knowledge is pretty shaky, so forgive me if I'm missing philosophies already in place. I'll boldly go where angels fear to tread anyways)
mm, is that needed? If you run a "use locale" program, you're already supposed to care a lot about what's in the locale environment vars. If these contain utf-8, that's hardly by accident. Basically it's not the programmer but the user who runs the program that knows if his files are in utf8. So it seems to me that the way to read files should by default come from the environment, not the code.
So I suppose what you're discussing here is the conundrum for the *programmer*. He wants to allow the user to indeed control how his text files are read, but if he just writes "use locale", he not only gets potential utf8 chars, he also changes the semantics of string ops to use all these fancy region features. And not only that, he may also get currency symbols and date formats etc. But is it so terrible to force a programmer who cares enough to allow his users the full unicode charset to also allow the user to specify the region semantics? It hardly makes sense to e.g. have utf-8 files and still use /[a-zA-Z_]/ instead of /\w/ in your regexes. I think that in general users who care enough about international characters to set their locale also want to get their preferred region semantics (at least for the character-related part of the locale), so we should encourage programmers to give it to them.
Still, perl not being about bondage, it of course still makes sense to allow the programmer to ONLY set "I want to read files as utf-8 IF the user specified a utf-8 style locale", but that should be an exceptional thing and get a long name to stress that.
So it makes sense that normally you're supposed to use 'use locale' in programs that will do "utf-8 on filehandles by default if so specified by the user". And for example use 'use locale "utf8ness_only"' if the programmer absolutely does not want to think about region semantics. And maybe something like 'use locale "string_stuff_only"' if you want user-specified utf8-ness, right collation and \w and the like, but no mangled dates.
PS: I don't notice any discussion about the scope of whatever solution is chosen. Lexically scoped seems most sane\, but what if you want to use some module that opens a file for you by proxy (e.g. File::Tail) ? Especially if that module up to now worked for both text and binary files.
One half of this problem has now been fixed (by Encode 1.84): illegal UTF-8 will now warn (-w) immediately when read in.
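The read-time detection can be observed from user code with an explicit encoding layer. This is a rough sketch that assumes a perl with PerlIO and the Encode-backed :encoding layer; the exact warning text and how the offending byte is represented in the returned string may differ between Encode versions, so the example only counts warnings.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file containing one line with a stray byte 255 in the middle.
my ($out, $path) = tempfile(UNLINK => 1);
binmode($out);
print {$out} "abc\xFFdef\n";
close($out) or die "close: $!";

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# The :encoding(UTF-8) layer decodes while the line is being read.
open(my $in, '<:encoding(UTF-8)', $path) or die "open: $!";
my $line = <$in>;
close($in);

print "warnings raised while reading: ", scalar(@warnings), "\n";
```

Nothing after the readline touches the data, so any warning collected here was raised at input time, which is the behaviour the fix is meant to provide.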
For the second half, the implicit UTF-8-ification, Sarathy suggests extending the semantics of the -C switch (currently only meaningful on Win32 platforms). (Sarathy also points out that the lexical semantics of 'use locale' really wouldn't work out that well with the global effects of the UTF-8-ification.) In other words, one would need to use -C to have UTF-8 locale settings affect the I/O layers.
In our previous episode we found out that there were two problems inherent in the implicit UTF-8-ification:
(1) The UTF-8 kicked in even when the user didn't ask for it. Lots of people using RH 8.0 have been bitten by this because the default locales are UTF-8.
(2) Even when and if the user wanted it, reading in malformed UTF-8 didn't do anything *immediately*. It was only later when and if further operations were attempted on the malformed data that the sad state was detected.
The issue (2) was fixed by Encode 1.84, now the <> (et alia) detect the evil data. (Though some further hacking may be required, a single UTF-8 tr/// test was broken by the Encode 1.84.)
So the issue (1) still would remain but the following patch attempts to rectify the situation, by making the UTF-8-ification explicit instead of implicit.
This patch (inlined since last time something ate my attachment) hijacks the -C switch (as suggested by Sarathy) to do the enabling of UTF-8-fied I/O. So no more implicit UTF-8 based on locale settings. (Use of the locale pragma wouldn't have worked that well since it is lexical in scope, while the UTF-8 decision is rather global in scope.) I also added an alternative way of enabling this feature: setting $ENV{PERL_UTF8_LOCALE} to true (the -C, if present, wins).
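The resulting decision can be modelled in a few lines. This is only an illustration of the rule as described, not the C code from the patch: utf8_io_enabled and its named arguments are hypothetical, and it assumes that -C can only switch the feature on.

```perl
use strict;
use warnings;

# Hypothetical model of the patched decision; all names are illustrative.
#   utf8_locale : true when the locale's codeset is UTF-8
#   dash_c      : true when -C was given on the command line
#   env         : value of $ENV{PERL_UTF8_LOCALE}, or undef when unset
sub utf8_io_enabled {
    my (%arg) = @_;
    my $want;
    if ($arg{dash_c}) {
        $want = 1;    # -C, if present, wins over the environment
    }
    else {
        my $env = $arg{env};
        no warnings 'numeric';
        # Roughly mirrors the atoi() call in the patch: "0" and
        # non-numeric strings count as false.
        $want = (defined $env && int($env)) ? 1 : 0;
    }
    # UTF-8 I/O layers are enabled only when both conditions hold.
    return ($arg{utf8_locale} && $want) ? 1 : 0;
}

print utf8_io_enabled(utf8_locale => 1, env => $ENV{PERL_UTF8_LOCALE}), "\n";
```

The point of the two-condition rule is that a UTF-8 locale alone no longer changes I/O; the user has to ask for it as well.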
In a perverse way going explicit is bad news since the implicit UTF-8-ification has certainly shaken many evil bugs out of the 5.8.0 tree (the B0B bug comes to mind, for example). Maybe for those platforms that have UTF-8 locales a new column of smoke testing (with env PERL_UTF8_LOCALE=1 LC_ALL=xx_YY.UTF-8) would be in order.
==== //depot/perl/embedvar.h#156 - /u/vieraat/vieraat/jhi/pp4/perl/embedvar.h ==== Index: perl/embedvar.h
==== //depot/perl/gv.c#178 - /u/vieraat/vieraat/jhi/pp4/perl/gv.c ==== Index: perl/gv.c
==== //depot/perl/intrpvar.h#112 - /u/vieraat/vieraat/jhi/pp4/perl/intrpvar.h ==== Index: perl/intrpvar.h
==== //depot/perl/locale.c#10 - /u/vieraat/vieraat/jhi/pp4/perl/locale.c ==== Index: perl/locale.c
==== //depot/perl/mg.c#246 - /u/vieraat/vieraat/jhi/pp4/perl/mg.c ==== Index: perl/mg.c
==== //depot/perl/perl.c#461 - /u/vieraat/vieraat/jhi/pp4/perl/perl.c ==== Index: perl/perl.c
==== //depot/perl/perlapi.h#78 - /u/vieraat/vieraat/jhi/pp4/perl/perlapi.h ==== Index: perl/perlapi.h
==== //depot/perl/pod/perlrun.pod#67 - /u/vieraat/vieraat/jhi/pp4/perl/pod/perlrun.pod ==== Index: perl/pod/perlrun.pod
==== //depot/perl/pod/perlunicode.pod#113 - /u/vieraat/vieraat/jhi/pp4/perl/pod/perlunicode.pod ==== Index: perl/pod/perlunicode.pod
==== //depot/perl/pod/perluniintro.pod#44 - /u/vieraat/vieraat/jhi/pp4/perl/pod/perluniintro.pod ==== Index: perl/pod/perluniintro.pod
==== //depot/perl/pod/perlvar.pod#111 - /u/vieraat/vieraat/jhi/pp4/perl/pod/perlvar.pod ==== Index: perl/pod/perlvar.pod
End of Patch.
-- Jarkko Hietaniemi
Jarkko Hietaniemi <jhi@iki.fi> wrote:
==== //depot/perl/locale.c#10 - /u/vieraat/vieraat/jhi/pp4/perl/locale.c ====
Index: perl/locale.c
--- perl/locale.c.~1~ Tue Jan 14 17:29:10 2003
+++ perl/locale.c Tue Jan 14 17:29:10 2003
...
@@ -487,37 +487,44 @@
        it overrides LC_MESSAGES for GNU gettext, and it also
        can have more than one locale, separated by spaces,
        in case you need to know.)
-       If PL_wantutf8 is true, perl.c:S_parse_body()
-       will turn on the PerlIO :utf8 discipline on STDIN, STDOUT,
-       STDERR, _and_ the default open discipline.
+       If PL_utf8locale and PL_wantutf8 (set by -C) are true,
+       perl.c:S_parse_body() will turn on the PerlIO :utf8 layer
+       on STDIN, STDOUT, STDERR, _and_ the default open discipline.
     */
-    bool wantutf8 = FALSE;
+    bool utf8locale = FALSE;
     char *codeset = NULL;
 #if defined(HAS_NL_LANGINFO) && defined(CODESET)
     codeset = nl_langinfo(CODESET);
 #endif
     if (codeset)
-        wantutf8 = (ibcmp(codeset, "UTF-8", 5) == 0 ||
-                    ibcmp(codeset, "UTF8", 4) == 0);
+        utf8locale = (ibcmp(codeset, "UTF-8", 5) == 0 ||
+                      ibcmp(codeset, "UTF8", 4) == 0);
 #if defined(USE_LOCALE)
     else { /* nl_langinfo(CODESET) is supposed to correctly
             * interpret the locale environment variables,
             * but just in case it fails, let's do this manually. */
         if (lang)
-            wantutf8 = (ibcmp(lang, "UTF-8", 5) == 0 ||
-                        ibcmp(lang, "UTF8", 4) == 0);
+            utf8locale = (ibcmp(lang, "UTF-8", 5) == 0 ||
+                          ibcmp(lang, "UTF8", 4) == 0);
 #ifdef USE_LOCALE_CTYPE
         if (curctype)
-            wantutf8 = (ibcmp(curctype, "UTF-8", 5) == 0 ||
-                        ibcmp(curctype, "UTF8", 4) == 0);
+            utf8locale = (ibcmp(curctype, "UTF-8", 5) == 0 ||
+                          ibcmp(curctype, "UTF8", 4) == 0);
 #endif
         if (lc_all)
-            wantutf8 = (ibcmp(lc_all, "UTF-8", 5) == 0 ||
-                        ibcmp(lc_all, "UTF8", 4) == 0);
+            utf8locale = (ibcmp(lc_all, "UTF-8", 5) == 0 ||
+                          ibcmp(lc_all, "UTF8", 4) == 0);
 #endif /* USE_LOCALE */
     }
    ^
I suggest moving this closing bracket one line up, just before the #endif. (I suspect that building bleadperl with -DNO_LOCALE is currently broken.)
-    if (wantutf8)
-        PL_wantutf8 = TRUE;
+    if (utf8locale)
+        PL_utf8locale = TRUE;
+    }
+    /* Set PL_wantutf8 to $ENV{PERL_UTF8_LOCALE} if using PerlIO.
+       This is an alternative to using the -C command line switch
+       (the -C if present will override this). */
+    {
+        char *p = PerlEnv_getenv("PERL_UTF8_LOCALE");
+        PL_wantutf8 = p ? (bool) atoi(p) : FALSE;
     }
 #endif
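For readers who want to see what the codeset test amounts to from the Perl side, here is a small sketch. It assumes the core POSIX and I18N::Langinfo modules are usable on the platform (the lookup is wrapped in eval in case they are not); is_utf8_codeset is a hypothetical helper that imitates the length-limited, case-insensitive comparisons in the patch.

```perl
use strict;
use warnings;

# Hypothetical helper: true when a codeset name starts with "UTF-8" or
# "UTF8" in any letter case, like the ibcmp() calls in the patch.
sub is_utf8_codeset {
    my ($codeset) = @_;
    return 0 unless defined $codeset;
    return $codeset =~ /\A(?:utf-8|utf8)/i ? 1 : 0;
}

# Ask the C library for the codeset of the current locale, if possible.
my $codeset = eval {
    require POSIX;
    require I18N::Langinfo;
    POSIX::setlocale(POSIX::LC_CTYPE(), '');
    I18N::Langinfo::langinfo(I18N::Langinfo::CODESET());
};

printf "codeset=%s utf8=%d\n",
    defined $codeset ? $codeset : '(unknown)',
    is_utf8_codeset($codeset);
```

With the patch applied, a true result here corresponds to PL_utf8locale only; the I/O layers stay untouched unless -C or PERL_UTF8_LOCALE is also set.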
Jarkko Hietaniemi wrote:
In our previous episode we found out that there were two problems inherent in the implicit UTF-8-ification:
(1) The UTF-8 kicked in even when the user didn't ask for it. Lots of people using RH 8.0 have been bitten by this because the default locales are UTF-8.
[snip]
So the issue (1) still would remain but the following patch attempts to rectify the situation, by making the UTF-8-ification explicit instead of implicit.
I've a foolish question -- if, on *nix, we're making handles binary by default, and text mode only when asked, then ought we do the same on windows? No, I don't think we actually should, but it would be logically consistent :-).
Perhaps more important -- on redhat8, if we write a file in latin1, or any other non-utf8 mode, how will that file be treated by other utilities?
On windows, for example, if we write a file in binary mode, and print out mere "\n" chars between lines, then some utilities will break, due to them expecting CRLF.
I fear (but don't know for certain) that if we produce files with bytes whose high bits are set, and which aren't properly formed utf8, then at least some text processing utilities will whine about malformed utf8 characters.
-- $..='(?:(?{local$^C=$^C|'.(1\<\<$_).'})|)'for+a..4; $..='(?{print+substr"\n !\,$^C\,1 if $^C\<26})(?!)'; $.=~s'!'haktrsreltanPJ\,r coeueh"';BEGIN{${"\cH"} |=(1\<\<21)}""=~$.;qw(Just another Perl hacker\,\n);
I've a foolish question -- if, on *nix, we're making handles binary by default, and text mode only when asked, then ought we do the same on
We're *returning* handles binary by default on *Nix.
windows? No, I don't think we actually should, but it would be logically consistent :-).
The Win32 way is not to be logically consistent :-)
Perhaps more important -- on redhat8, if we write a file in latin1, or any other non-utf8 mode, how will that file be treated by other utilities?
Judging by users' experience I would say they are not yet expecting magical UTF-8-ification.
-- Jarkko Hietaniemi
#endif /* USE_LOCALE */ } ^ I suggest moving this closing bracket one line up, just before the #endif. (I suspect that building bleadperl with -DNO_LOCALE is currently broken.)
Ahhh, good catch.
-- Jarkko Hietaniemi
I have now applied the patch (with the } moved as noted by rgs) to bleadperl as #18490. As noted earlier, this means that UTF-8 locales no longer automagically cause all I/O to be in UTF-8.
(I've got a patch coming up which will hopefully address the UTF-8 tr/// test breakage caused by Encode 1.84 -- which fixed the other half of the problem\, that illegal UTF-8 wasn't detected immediately when read in.)
-- Jarkko Hietaniemi
On a sunny winter's day (Tuesday 14 January 2003 16:49), Jarkko Hietaniemi wrote:
[snip]
In a perverse way going explicit is bad news since the implicit UTF-8-ification has certainly shaken many evil bugs out of the 5.8.0 tree (the B0B bug comes to mind, for example). Maybe for those platforms that have UTF-8 locales a new column of smoke testing (with env PERL_UTF8_LOCALE=1 LC_ALL=xx_YY.UTF-8) would be in order.
I've been thinking about this and did some testing (on SuSE 8.0 with heaps of locales).
1) perl-5.8.x will not build for me with PERL_UTF8_LOCALE=1 LC_ALL=en_US.utf8
$ cd perl-5.8.x
$ PERL_UTF8_LOCALE=1 LC_ALL=en_US.utf8 ./Configure -des -Dusedevel
$ PERL_UTF8_LOCALE=1 LC_ALL=en_US.utf8 make
Dies on building Digest::MD5
2) perl-current with those locale settings gives:
$ cd t
$ PERL_UTF8_LOCALE=1 LC_ALL=en_US.utf8 ./perl harness
Failed Test            Stat Wstat Total Fail  Failed  List of Failed
../ext/Encode/t/CJKT.t   21  5376    42   21  50.00%  1-3 7-9 13-15 19-21 25-27 31-33 37-39
49 tests and 413 subtests skipped.
Failed 1/754 test scripts, 99.87% okay. 21/70272 subtests failed, 99.97% okay.
3) Do we have enough data if we *only* run 'make test' with locale settings?
4) Shouldn't we test both ':stdio' and ':perlio' layers under locale?
5) If not, is ':perlio' the one to choose or should we just use an empty $ENV{PERLIO}?
Ready to hack on the Test::Smoke suite, I would like to have this in 1.17
Good luck,
Abe -- I think this requires more thought, therefore I'm excising the "promise" from perldelta and replacing it with more non-committal mumbling. -- Jarkko Hietaniemi on p5p @ 2002-05-27
I've been thinking about this and did some testing (on SuSE 8.0 with heaps of locales).
1) perl-5.8.x will not build for me with PERL_UTF8_LOCALE=1 LC_ALL=en_US.utf8
$ cd perl-5.8.x
$ PERL_UTF8_LOCALE=1 LC_ALL=en_US.utf8 ./Configure -des -Dusedevel
I don't think PERL_UTF8_LOCALE=1 will be of much use here, since there is no Perl yet.
$ PERL_UTF8_LOCALE=1 LC_ALL=en_US.utf8 make
Note that the names of UTF-8 locales vary from OS to OS, so don't hardwire anything into the smoke scripts.
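One way for a smoke script to avoid hardwiring a locale name is to discover candidates at run time. A minimal sketch, assuming a `locale -a` command exists on the platform (the list is simply empty where it does not); utf8_locales is a hypothetical helper and is not part of Test::Smoke.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only names whose codeset part looks like
# UTF-8, e.g. "en_US.utf8" or "fi_FI.UTF-8", optionally with a modifier.
sub utf8_locales {
    my (@names) = @_;
    return grep { /\.utf-?8(?:\@.*)?\z/i } @names;
}

# Ask the system for installed locales; errors are silenced and simply
# lead to an empty list.
my @installed  = map { chomp; $_ } `locale -a 2>/dev/null`;
my @candidates = utf8_locales(@installed);

print scalar(@candidates), " UTF-8 locale(s) found\n";
print "would smoke with LC_ALL=$candidates[0]\n" if @candidates;
```

Which of the candidates to use would still have to be a configuration choice, since different locales can exercise different collation and character-class behaviour.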
Dies on building Digest::MD5
Hmmm, will take a look.
2) perl-current with those locale settings gives:
Ummm, what do you mean by 'perl-5.8.x' as opposed to 'perl-current'?
$ cd t
$ PERL_UTF8_LOCALE=1 LC_ALL=en_US.utf8 ./perl harness
Failed Test            Stat Wstat Total Fail  Failed  List of Failed
-------------------------------------------------------------------------------
../ext/Encode/t/CJKT.t   21  5376    42   21  50.00%  1-3 7-9 13-15 19-21 25-27 31-33 37-39
49 tests and 413 subtests skipped.
Failed 1/754 test scripts, 99.87% okay. 21/70272 subtests failed, 99.97% okay.
3) Do we have enough data if we *only* run 'make test' with locale settings?
'make test' as opposed to 'harness'? Of course, in general, the harness gives more info.
4) Shouldn't we test both ':stdio' and ':perlio' layers under locale?
I would have thought that the ':stdio' has no chance of doing much of the Unicode I/O tests anyway, so in the best case we'll just get a lot of skipped tests (in the worse cases we'll get a lot of tests failing to run).
5) If not, is ':perlio' the one to choose or should we just use an empty $ENV{PERLIO}?
In other but related news, I have now if(0)'ed the "widesyscalls" code, so that Win32 builds should work again.
-- Jarkko Hietaniemi
On a sunny winter's day (Thursday 16 January 2003 21:40), Jarkko Hietaniemi wrote:
I've been thinking about this and did some testing (on SuSE 8.0 with heaps of locales).
1) perl-5.8.x will not build for me with PERL_UTF8_LOCALE=1 LC_ALL=en_US.utf8
$ cd perl-5.8.x
$ PERL_UTF8_LOCALE=1 LC_ALL=en_US.utf8 ./Configure -des -Dusedevel
I don't think PERL_UTF8_LOCALE=1 will be of much use here, since there is no Perl yet.
Okay, that is silly then.
$ PERL_UTF8_LOCALE=1 LC_ALL=en_US.utf8 make
Note that the names of UTF-8 locales vary from OS to OS, so don't hardwire anything into the smoke scripts.
Nope, it will be configurable, as long as we can stick to $ENV{LC_ALL}.
Dies on building Digest::MD5
Hmmm, will take a look.
2) perl-current with those locale settings gives:
Ummm, what do you mean by 'perl-5.8.x' as opposed to 'perl-current'?
I thought that
rsync ftp.linux.activestate.com::perl-5.8.x
was 5.8-maint (your tree, 5.8.1 to be)
and that
rsync ftp.linux.activestate.com::perl-current
was blead (Hugo's tree, 5.10.0 to be)
(That is how I did the [5.8.0] smoke and manual testing)
$ cd t
$ PERL_UTF8_LOCALE=1 LC_ALL=en_US.utf8 ./perl harness
Failed Test            Stat Wstat Total Fail  Failed  List of Failed
-------------------------------------------------------------------------------
../ext/Encode/t/CJKT.t   21  5376    42   21  50.00%  1-3 7-9 13-15 19-21 25-27 31-33 37-39
49 tests and 413 subtests skipped.
Failed 1/754 test scripts, 99.87% okay. 21/70272 subtests failed, 99.97% okay.
3) Do we have enough data if we *only* run 'make test' with locale settings?
'make test' as opposed to 'harness'? Of course, in general, the harness gives more info.
No, I actually meant 'make test' as opposed to 'make' *and* 'make test', since there seems to be a difference (for 5.8.x)
4) Shouldn't we test both ':stdio' and ':perlio' layers under locale?
I would have thought that the ':stdio' has no chance of doing much of the Unicode I/O tests anyway, so in the best case we'll just get a lot of skipped tests (in the worse cases we'll get a lot of tests failing to run).
5) If not, is ':perlio' the one to choose or should we just use an empty $ENV{PERLIO}?
Hmmm..., I'll try with locale settings and $ENV{PERLIO} unset.
You might see a smoke-report with this over the weekend. Just let me know what you want and I'll try and implement it.
In other but related news, I have now if(0)'ed the "widesyscalls" code, so that Win32 builds should work again.
Good luck,
Abe -- Tim Bunce> Here's an easy fix: deprecate them.
*All* of them? <finger wavering over the red big button INITIATE VSTRINGS ANNIHILATION> -- Jarkko Hietaniemi on p5p @ 2001-11-14
Ummm, what do you mean by 'perl-5.8.x' as opposed to 'perl-current'?
I thought that
rsync ftp.linux.activestate.com::perl-5.8.x
was 5.8-maint (your tree, 5.8.1 to be)
and that
rsync ftp.linux.activestate.com::perl-current
was blead (Hugo's tree, 5.10.0 to be)
That's right. But note that for the moment I'm working the Unicode changes into bleadperl first; maintperl is lagging a bit. I'll try syncing them before Saturday.
-- Jarkko Hietaniemi
I think this can now be considered resolved since there's no more implicit UTF-8-ification.
@jhi - Status changed from 'new' to 'resolved'
Migrated from rt.perl.org#19743 (status was 'resolved')
Searchable as RT19743$