lc() + Latin-1 chars is failing erratically

p5pRT commented 18 years ago

Migrated from rt.perl.org#37999 (status was 'resolved')

Searchable as RT37999$

p5pRT commented 18 years ago

From skunk@iskunk.org

I have a script that is processing a list of words in Latin-1 encoding. It is taking one word from each line, lowercasing it, and writing it out.

I had found that certain accented letters in a word were not being lowercased by lc(), even though other (ASCII) letters in the same word were. At first I thought that an encoding issue was to blame, but after hacking down a minimal bug case, I found the problem:

If I chomp() the string before lc()ing it, everything works fine. If I chop() it first---even though the resulting string is identical---the case transformation fails. (Same result if I do neither, retaining the trailing "\n".) Also, if I don't read the input from a file, but merely place it inline in the program, everything works (with chomp() and chop() alike).

I am attaching both the test script and input file; please review the comments in the script. If the script dies with "aaaaaaaack!" then the bug is present.

This bug has been reproduced with Perl 5.8.x built from development source. Locale settings do not appear to affect it (happens with LANG=C, etc.).

--Daniel

--
NAME = Daniel Richard G. ## Remember, skunks _\|/_ meef?
EMAIL1 = skunk@iskunk.org ## don't smell bad--- (/o|o\) /
EMAIL2 = skunk@alum.mit.edu ## it's the people who < (^),>
WWW = http://www.******.org/ ## annoy them that do! / \
--
(****** = site not yet online)

p5pRT commented 18 years ago

From skunk@iskunk.org

bug.pl

p5pRT commented 18 years ago

From skunk@iskunk.org

Ü-Wagen

p5pRT commented 18 years ago

From @rgs

Daniel Richard G.(via RT) wrote:

If I chomp() the string before lc()ing it, everything works fine. If I chop() it first---even though the resulting string is identical---the case transformation fails. (Same result if I do neither, retaining the trailing "\n".) Also, if I don't read the input from a file, but merely place it inline in the program, everything works (with chomp() and chop() alike).

This seems to be related to encoding. Cargo-culting the following snippet from do_chomp() to do_chop() seems to fix it. Tests running...

--- doop.c (révision 6377)
+++ doop.c (copie de travail)
@@ -967,6 +967,16 @@
             if (SvREADONLY(sv))
                 Perl_croak(aTHX_ PL_no_modify);
         }
+        if (PL_encoding) {
+            if (!SvUTF8(sv)) {
+                /* XXX, here sv is utf8-ized as a side-effect!
+                   If encoding.pm is used properly, almost string-generating
+                   operations, including literal strings, chr(), input data, etc.
+                   should have been utf8-ized already, right?
+                */
+                sv_recode_to_utf8(sv, PL_encoding);
+            }
+        }
         s = SvPV(sv, len);
         if (len && !SvPOK(sv))
             s = SvPV_force(sv, len);

p5pRT commented 18 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 18 years ago

From @rgs

Rafael Garcia-Suarez wrote:

Daniel Richard G.(via RT) wrote:

If I chomp() the string before lc()ing it, everything works fine. If I chop() it first---even though the resulting string is identical---the case transformation fails. (Same result if I do neither, retaining the trailing "\n".) Also, if I don't read the input from a file, but merely place it inline in the program, everything works (with chomp() and chop() alike).

This seems to be related to encoding. Cargo-culting the following snippet from do_chomp() to do_chop() seems to fix it. Tests running...

now committed as #26431. (with prettier comments)

--- doop.c (révision 6377)
+++ doop.c (copie de travail)

p5pRT commented 18 years ago

@rgs - Status changed from 'open' to 'resolved'

p5pRT commented 18 years ago

From skunk@iskunk.org

I've confirmed that the bug no longer occurs when using chop(), or even s/\n// in the same place.

However, if I don't modify the string (no chomp/chop/etc.), remove the "eq" check, and lc() it with newline and all, the accented U again stays as-is. (This is with source from perl-current.)

I'm not familiar with Perl's internals, but if lc() is failing due to its argument not having been previously mirrored in a Perl-internal UTF-8 representation... would it not make sense to have the check-and-reencode bit at the top of lc()'s implementation (and in other functions making use of encoding-dependent semantics), rather than attempt to cover all possible origins of lc()'s argument?

(Quick question, btw: As a workaround for my scripts, is there a concise way of bestowing internal-UTF-8-ness on a string without otherwise modifying it?)
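
One candidate for the workaround asked about here is Perl's built-in utf8::upgrade(), which converts a string's internal representation to UTF-8 without changing the characters it holds. The following is only a minimal sketch: the helper name lc_latin1 is hypothetical, and it assumes the input has already been read as Latin-1 bytes with no "use locale" or "use bytes" in effect.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the attached bug.pl): lowercase a
# Latin-1 string by first switching it to Perl's internal UTF-8 form.
sub lc_latin1 {
    my ($s) = @_;          # work on a copy; the caller's string is untouched
    utf8::upgrade($s);     # changes the internal representation only
    return lc $s;          # lc() now applies Unicode case rules
}

my $word  = "\xDC-Wagen";  # U-umlaut as a single Latin-1 character
my $lower = lc_latin1($word);

# The result still compares equal to the expected Latin-1 string,
# because eq compares characters, not internal bytes.
print $lower eq "\xFC-wagen" ? "lowercased\n" : "unchanged\n";
```

Because the helper upgrades a copy, the original string keeps whatever internal form it had, so output written without an encoding layer should be unaffected.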

p5pRT commented 18 years ago

skunk@iskunk.org - Status changed from 'resolved' to 'open'

p5pRT commented 18 years ago

From @rgarcia

On 12/21/05, Daniel Richard G. <skunk@iskunk.org> wrote:

I've confirmed that the bug no longer occurs when using chop(), or even s/\n// in the same place.

However, if we don't modify the string (no chomp/chop/etc.), remove the "eq" check, and lc() it with newline and all, the accented U again stays as-is. (This is with source from perl-current.)

Well, my understanding is that it's the documented behaviour if you don't use locale. (see perldoc locale)

p5pRT commented 18 years ago

From skunk@iskunk.org

[rgarciasuarez@gmail.com - Wed Dec 21 13:57:15 2005]:

Well, my understanding is that it's the documented behaviour if you don't use locale. (see perldoc locale)

I can add "use locale" to the test script, set LANG=LC_ALL=LC_CTYPE=C, and the behavior is the same as before. Either lc() is wrong to lowercase the accented-U in that instance (assuming the C locale means it shouldn't know how to handle non-ASCII characters), or this behavior where chop/chomp affects lc()'s result on seemingly identical input is wrong.

(For my part, I'd prefer to be able to use "no locale" and have lc() behave according to Unicode semantics, than have to specify a locale that matches Unicode semantics and worry about tainting, etc.)

p5pRT commented 14 years ago

From @khwilliamson

On Wed Dec 21 14:49:31 2005, skunk wrote:

[rgarciasuarez@gmail.com - Wed Dec 21 13:57:15 2005]:

Well, my understanding is that it's the documented behaviour if you don't use locale. (see perldoc locale)

I can add "use locale" to the test script, set LANG=LC_ALL=LC_CTYPE=C, and the behavior is the same as before. Either lc() is wrong to lowercase the accented-U in that instance (assuming the C locale means it shouldn't know how to handle non-ASCII characters), or this behavior where chop/chomp affects lc()'s result on seemingly identical input is wrong.

(For my part, I'd prefer to be able to use "no locale" and have lc() behave according to Unicode semantics, than have to specify a locale that matches Unicode semantics and worry about tainting, etc.)

Perl 5.12 (unless glitches arise) will be released April 5, 2010. It is adding the statement use feature "unicode_strings";

This will cause lc() in the scope of the 'use' statement to behave as you would hope on Latin1 characters. Therefore, I'm closing this ticket. --Karl Williamson
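
A minimal sketch of the difference the feature makes, assuming Perl 5.12 or later and no "use locale" in effect. The two helper names are hypothetical and exist only to give each lc() call its own lexical scope.

```perl
use strict;
use warnings;

# Hypothetical helpers: the same lc() call, compiled with the
# unicode_strings feature explicitly off and explicitly on.
sub lc_native {
    no feature 'unicode_strings';
    return lc $_[0];       # only ASCII letters are case-mapped
}

sub lc_unicode {
    use feature 'unicode_strings';
    return lc $_[0];       # Latin-1 letters are case-mapped too
}

my $word = "\xDC-Wagen";   # U-umlaut stored as one non-UTF-8 byte

printf "native:  accented U %s\n",
    lc_native($word)  eq "\xDC-wagen" ? "kept as-is" : "lowercased";
printf "unicode: accented U %s\n",
    lc_unicode($word) eq "\xFC-wagen" ? "lowercased" : "kept as-is";
```

The feature is lexically scoped, so it can be enabled for a whole file or only around the case-mapping code, without touching how the string was read in.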

p5pRT commented 14 years ago

@khwilliamson - Status changed from 'open' to 'resolved'
