Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/

uc (and lc) of same character differs if it is utf8 encoded #4150

Closed p5pRT closed 20 years ago

p5pRT commented 23 years ago

Migrated from rt.perl.org#7201 (status was 'resolved')

Searchable as RT7201$

p5pRT commented 23 years ago

From @nwc10

I'm assuming it's a bug that uc() for accented characters in the range 196-255 differs depending on whether they happen to be UTF8 encoded or not. I shouldn't be able to detect the internal state of UTF8 encoding in any way from a perl script, should I?

The difference is certainly present in 5.6.1, and I assume it is in everything post 5.005.

Is this the suggested way to supply a "test case" with bug reports?

On 5.6.1 and bleadperl the following gives ok, not ok (i.e. perl reports that the first two scalars are equal, yet uc() gives different results).

5.005_03 reports ok, ok; but uc doesn't change either lower case character, as 5.005_03 isn't assuming that they are e acutes.

I would expect that a Unicode-aware perl should give ok, ok, but I'm not sure how this is reconciled with the desire to have uc() give the same backwards-compatible result as 5.005_03.

#!/usr/local/bin/perl -w

{
  my ($e_accute_utf) = my ($e_accute) = chr 0xE9;
  $e_accute_utf .= chr 300;
  chop $e_accute_utf;
  my $E_accute = uc $e_accute;
  my $E_accute_utf = uc $e_accute_utf;

  if ($e_accute_utf eq $e_accute) {
    print "ok\n";
  } else {
    print "not ok # '$e_accute_utf' ne '$e_accute'\n";
  }
  if ($E_accute_utf eq $E_accute) {
    print "ok # '$E_accute_utf' eq '$E_accute'\n";
  } else {
    print "not ok # '$E_accute_utf' ne '$E_accute'\n";
  }
}

Perl Info

```
Flags:
    category=core
    severity=medium

Site configuration information for perl v5.7.1:

Configured by nclark at Thu Jun 28 09:57:50 BST 2001.

Summary of my perl5 (revision 5.0 version 7 subversion 17) configuration:
  Platform:
    osname=linux, osvers=2.2.19pre17, archname=i686-linux
    uname='linux nclark 2.2.19pre17 #2 wed may 2 13:59:30 gmt 2001 i686 unknown '
    config_args='-Dusedevel -Dcf_email=nick@talking.bollo.cx -Ubincompat5005 -Uinc_version_list -Uversiononly -Uuselongdouble -Uuse64bitint -de -Dcc=gcc-3.0'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
  Compiler:
    cc='gcc-3.0', ccflags ='-Wall -fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2',
    cppflags='-Wall -fno-strict-aliasing -I/usr/local/include'
    ccversion='', gccversion='3.0 20010402 (Debian prerelease)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, usemymalloc=n, prototype=define
  Linker and Libraries:
    ld='gcc-3.0', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -lgdbm -ldbm -ldb -ldl -lm -lc -lcrypt -lutil
    perllibs=-lnsl -ldl -lm -lc -lcrypt -lutil
    libc=/lib/libc-2.2.3.so, so=so, useshrplib=false, libperl=libperl.a
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
    cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:
    DEVEL10995

@INC for perl v5.7.1:
    /usr/local/lib/perl5/5.7.1/i686-linux
    /usr/local/lib/perl5/5.7.1
    /usr/local/lib/perl5/site_perl/5.7.1/i686-linux
    /usr/local/lib/perl5/site_perl/5.7.1
    /usr/local/lib/perl5/site_perl
    .

Environment for perl v5.7.1:
    HOME=/home/nclark
    LANG=C
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/nclark/bin:/usr/local/bin:/bin:/usr/bin:/usr/X11/bin:/usr/bin/X11:/usr/contrib/bin:/usr/games:/usr/sbin:/usr/ucb:/sbin:/usr/etc:/data3/src/emacs/bin/i386-unknown-bsdi2.1/
    PERL_BADLANG (unset)
    SHELL=/bin/bash
```

p5pRT commented 23 years ago

From @jhi

That was the hope, but that was not how it turned out to be. No, that deal will not be torn open again; this decision came finally from Rule #1. One is certainly able to find out the UTF8ness, in various ways.
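
One such way can be sketched as follows. This is only an illustration, assuming a perl new enough to provide the built-in `utf8::upgrade` and `utf8::is_utf8` functions (they are not available in 5.6.1); the variable names are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two scalars holding the same single character, code point 0xE9.
my $native   = chr 0xE9;     # stored internally as one byte
my $upgraded = chr 0xE9;
utf8::upgrade($upgraded);    # same character, now stored as UTF-8 internally

# As far as eq is concerned the two strings are identical...
print "equal\n" if $native eq $upgraded;

# ...yet the internal representation is observable from Perl space.
printf "native is UTF-8 encoded:   %s\n", utf8::is_utf8($native)   ? "yes" : "no";
printf "upgraded is UTF-8 encoded: %s\n", utf8::is_utf8($upgraded) ? "yes" : "no";
```

On an ASCII platform this prints `equal`, then `no` for the first scalar and `yes` for the second.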

Then again, on to your particular bug report: you might be right in your analysis, I haven't looked too closely yet.

p5pRT commented 23 years ago

From @nwc10

> > I'm assuming it's a bug that uc() for accented characters in the range 196-255 differs depending on whether they happen to be UTF8 encoded or not. I shouldn't be able to detect the internal state of UTF8 encoding in any way from a perl script, should I?

> That was the hope, but that was not how it turned out to be. No, that deal will not be torn open again; this decision came finally from Rule #1. One is certainly able to find out the UTF8ness, in various ways.

Oops. I didn't mean to restart a messy non-terminating discussion. [I remember some of these threads. There is no right answer]

> Then again, on to your particular bug report: you might be right in your analysis, I haven't looked too closely yet.

I was thinking about things like

PP(pp_uc)
{
    dSP;
    SV *sv = TOPs;
    register U8 *s;
    STRLEN len;

    if (DO_UTF8(sv)) {
        /* do something */
    } else {
        /* do something subtly different */
    }
}

where the two blocks don't just differ in the encoding in which they do the "thing", but actually implement subtly different things.

And I was assuming that as many as possible of these blocks should be performing the same thing, and that all those that don't are listed.
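
A rough way to build such a list from Perl space might look like the sketch below. It is not how the core is tested; `same_in_both_encodings` is a hypothetical helper invented for this example, and it assumes a perl that provides `utf8::upgrade` and `utf8::downgrade`. What it reports depends on the perl version and on any locale or feature pragmas in effect, so no output is shown.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: apply $op to the same character stored in both
# internal encodings and report whether the results compare equal.
sub same_in_both_encodings {
    my ($op, $char) = @_;
    my $native   = $char;
    my $upgraded = $char;
    utf8::downgrade($native);    # force the one-byte-per-character form
    utf8::upgrade($upgraded);    # force the internal UTF-8 form
    return $op->($native) eq $op->($upgraded);
}

# The case-changing builtins, wrapped so they can be passed around.
my %ops = (
    uc      => sub { uc $_[0] },
    lc      => sub { lc $_[0] },
    ucfirst => sub { ucfirst $_[0] },
    lcfirst => sub { lcfirst $_[0] },
);

# Try each builtin on e acute (0xE9) and E acute (0xC9).
for my $name (sort keys %ops) {
    for my $char (chr 0xE9, chr 0xC9) {
        my $same = same_in_both_encodings($ops{$name}, $char);
        printf "%-7s on 0x%02X: %s\n",
            $name, ord $char, $same ? "same" : "differs";
    }
}
```

Any line reporting `differs` marks a builtin whose two branches are doing different things rather than the same thing in two encodings.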

Nicholas Clark

p5pRT commented 23 years ago

From @jhi

That this doesn't work is locale-dependent: $E_accute is uc $e_accute, and $e_accute is a pure 8-bit character, so whether uc upcases $e_accute to $E_accute depends on the locale settings.

For example, for my Finnish locale, that test fails, since $E_accute stays lowercase. But switching locale helps:

LC_ALL=fr_FR.ISO8859-1 ./perl -Ilib -Mlocale t1
ok
ok # 'É' eq 'É'

The $...utf version works because it obeys the Unicode lower/uppercase rules, but that it got correctly mapped to Unicode in the first place is purely incidental: the 0xE9 happened to be Latin-1, which happens to be the lowest 256-character 'page' of Unicode.
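
The "purely incidental" mapping can be illustrated with a small sketch. It assumes a perl that ships the `utf8::` built-ins and the core `Encode` module; the choice of ISO-8859-5 as the contrasting charset is mine, picked only because byte 0xE9 means something quite different there.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# The byte 0xE9, as produced by chr 0xE9 in the test case.
my $char = chr 0xE9;

# Upgrading to the internal UTF-8 form treats the byte as Latin-1,
# so the code point is still 0xE9 (LATIN SMALL LETTER E WITH ACUTE).
utf8::upgrade($char);
printf "after upgrade: U+%04X\n", ord $char;

# Its UTF-8 serialisation is two octets.
my $octets = $char;
utf8::encode($octets);
print "as UTF-8 octets: ",
    join(' ', map { sprintf '%02X', ord($_) } split //, $octets), "\n";

# Had the byte really come from another 8-bit charset, the correct
# Unicode character would be a different one altogether.
my $byte     = "\xE9";
my $cyrillic = decode('iso-8859-5', $byte);
printf "decoded as ISO-8859-5: U+%04X\n", ord $cyrillic;
```

The upgrade keeps the code point at U+00E9 and the UTF-8 octets are `C3 A9`, whereas decoding the same byte as ISO-8859-5 lands on a Cyrillic letter, so the implicit upgrade is only right when the data was Latin-1 to begin with.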

Summary: the bug cannot be solved without creative application of high-yield explosives to locales.
