I'm assuming it's a bug that uc() for accented characters in the range 196-255 differs depending on whether they happen to be UTF8 encoded or not. I shouldn't be able to detect the internal state of UTF8 encoding in any way from a perl script, should I?
The difference is certainly present in 5.6.1, and I assume it is in everything after 5.005.
Is this the suggested way to supply a "test case" with bug reports?
On 5.6.1 and bleadperl the following gives ok, not ok (i.e. perl reports that the first two scalars are equal, yet uc() gives different results).
5.005_03 reports ok, ok; but uc doesn't change either lower case character, as 5.005_03 isn't assuming that they are e acutes.
I would expect that a Unicode-aware perl should give ok, ok, but I'm not sure how this is reconciled with the desire to have uc() give the same backwards-compatible result as 5.005_03.
#!/usr/local/bin/perl -w
{
    my ($e_accute_utf) = my ($e_accute) = chr 0xE9;
    $e_accute_utf .= chr 300;
    chop $e_accute_utf;
    my $E_accute = uc $e_accute;
    my $E_accute_utf = uc $e_accute_utf;

    if ($e_accute_utf eq $e_accute) {
        print "ok\n";
    } else {
        print "not ok # '$e_accute_utf' ne '$e_accute'\n";
    }
    if ($E_accute_utf eq $E_accute) {
        print "ok # '$E_accute_utf' eq '$E_accute'\n";
    } else {
        print "not ok # '$E_accute_utf' ne '$E_accute'\n";
    }
}
That was the hope, but that is not how it turned out. No, that deal will not be torn open again; this decision came finally from Rule #1. One is certainly able to find out the UTF8ness, in various ways.
Then again, on to your particular bug report: you might be right in your analysis; I haven't looked too closely yet.
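One of the "various ways" mentioned above can be sketched as follows. This is a minimal illustration, not part of the original report: it assumes a perl recent enough to provide the built-in utf8::is_utf8 function, and the helper name is_upgraded is made up for the example.

```perl
use strict;
use warnings;

# Build two strings that both hold the single character U+00E9.
my $bytes = chr 0xE9;      # kept as one byte internally
my $wide  = chr 0xE9;
$wide .= chr 300;          # a code point above 255 forces the UTF-8 representation
chop $wide;                # remove it again; the internal flag stays set

# Hypothetical helper: report the internal representation as 0 or 1.
sub is_upgraded { return utf8::is_utf8($_[0]) ? 1 : 0 }

print "strings compare equal\n" if $bytes eq $wide;
print "bytes upgraded: ", is_upgraded($bytes), "\n";
print "wide upgraded:  ", is_upgraded($wide), "\n";
```

The two scalars compare equal with eq, yet the script can still tell them apart by asking about the internal flag.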
Oops. I didn't mean to restart a messy non-terminating discussion. [I remember some of these threads. There is no right answer]
I was thinking about things like
PP(pp_uc)
{
    dSP;
    SV *sv = TOPs;
    register U8 *s;
    STRLEN len;

    if (DO_UTF8(sv)) {
        /* do something */
    } else {
        /* do something subtly different */
    }
}
where the code in the two blocks doesn't just differ in the encoding used to do the "thing", but actually implements subtly different things.
And I was assuming that as many as possible of these blocks should be performing the same thing, and that all those that don't are listed.
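From the script side, one way to sidestep the two diverging branches is to force the string into the UTF-8 representation before calling uc, so that only the Unicode branch is ever taken. The following is a sketch under assumptions: it relies on the built-in utf8::upgrade function of later perls, and uc_unicode is a hypothetical helper name, not something from the perl core.

```perl
use strict;
use warnings;

# Hypothetical helper: uppercase with Unicode rules regardless of how
# the argument happens to be stored internally.
sub uc_unicode {
    my ($str) = @_;          # work on a copy, leave the caller's scalar alone
    utf8::upgrade($str);     # change the representation; the characters stay the same
    return uc $str;
}

my $e_acute = chr 0xE9;
my $plain   = uc $e_acute;            # result depends on perl version, pragmas and locale
my $forced  = uc_unicode($e_acute);   # always takes the Unicode branch

printf "plain uc:  U+%04X\n", ord $plain;
printf "forced uc: U+%04X\n", ord $forced;
```

This does not fix the inconsistency inside pp_uc; it only makes one script's results independent of it.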
Nicholas Clark
Whether this works is locale-dependent: $E_accute is uc $e_accute, and $e_accute is a pure 8-bit character, so whether uc upcases $e_accute to $E_accute depends on the locale settings.
For example, in my Finnish locale that test fails, since $E_accute stays lowercase. But switching locale helps:
    LC_ALL=fr_FR.ISO8859-1 ./perl -Ilib -Mlocale t1
    ok
    ok # 'É' eq 'É'
The $...utf version works because it obeys the Unicode lower/uppercase rules, but that it got correctly mapped to Unicode in the first place is purely incidental: the 0xE9 happened to be Latin-1, which happens to be the lowest 256-character 'page' of Unicode.
Summary: the bug cannot be solved without creative application of high-yield explosives to locales.
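The locale dependence described above can be probed from within a script as well as from the command line. This is only a sketch: uc_in_locale is a hypothetical helper, the locale names available differ from system to system, and the helper returns undef when a locale is not installed.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);

# Hypothetical helper: uppercase $char under the named LC_CTYPE locale.
# Returns undef if the locale cannot be selected on this system.
sub uc_in_locale {
    my ($locale_name, $char) = @_;
    my $old = setlocale(LC_CTYPE);                 # remember the current setting
    return unless defined setlocale(LC_CTYPE, $locale_name);
    my $result;
    {
        use locale;                                # make uc consult LC_CTYPE
        $result = uc $char;
    }
    setlocale(LC_CTYPE, $old) if defined $old;     # restore the previous locale
    return $result;
}

for my $name ("C", "fr_FR.ISO8859-1") {
    my $up = uc_in_locale($name, chr 0xE9);
    if (defined $up) {
        printf "%s: 0x%02X\n", $name, ord $up;
    } else {
        print "$name: locale not available\n";
    }
}
```

In the "C" locale the 8-bit character is left alone, since only a-z have uppercase forms there; a Latin-1 locale, where installed, is expected to upcase it as in the fr_FR run above.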
Migrated from rt.perl.org#7201 (status was 'resolved')
Searchable as RT7201$