Closed p5pRT closed 18 years ago
While working my way down doop.c, I discovered that chomp completely ignores utf8 flags in both the chomped string and $/
With the following patch to t/op/chop.t there are many test failures. I'm not sure of the most efficient way to patch Perl_do_chomp to cure them. I guess use the existing byte comparison code if utf8 flags are the same on both the target and $/, and do conversion otherwise, but I'm not going to look further until after 5.8.3 is released.
ok 52 - start=78 end=78
ok 53 - start=78 end=163
not ok 54 - start=78 end=163 (end as bytes)
# Failed at t/op/chop.t line 203
# got 'NÂ'
# expected 'N£'
ok 55 - start=78 end=163 ($/ as bytes)
ok 56 - start=78 end=164
not ok 57 - start=78 end=164 (end as bytes)
# Failed at t/op/chop.t line 203
# got 'N'
# expected 'N¤'
not ok 58 - start=78 end=164 ($/ as bytes)
# Failed at t/op/chop.t line 209
# got 'N'
# expected 'N¤'
ok 59 - start=78 end=1296
not ok 60 - start=78 end=1296 (end as bytes)
# Failed at t/op/chop.t line 203
# got 'N'
# expected 'NÔ
not ok 61 - start=78 end=1296 ($/ as bytes)
# Failed at t/op/chop.t line 209
# got 'N'
Wide character in print at ./test.pl line 38.
# expected 'NÔ
ok 62 - start=163 end=78
ok 63 - start=163 end=163
not ok 64 - start=163 end=163 (end as bytes)
# Failed at t/op/chop.t line 203
# got '£Â'
# expected '£Â£'
ok 65 - start=163 end=163 ($/ as bytes)
ok 66 - start=163 end=164
not ok 67 - start=163 end=164 (end as bytes)
# Failed at t/op/chop.t line 203
# got '£'
# expected '£Â¤'
not ok 68 - start=163 end=164 ($/ as bytes)
# Failed at t/op/chop.t line 209
# got '£'
# expected '£¤'
ok 69 - start=163 end=1296
not ok 70 - start=163 end=1296 (end as bytes)
# Failed at t/op/chop.t line 203
# got '£'
# expected '£Ô
not ok 71 - start=163 end=1296 ($/ as bytes)
# Failed at t/op/chop.t line 209
# got '£'
Wide character in print at ./test.pl line 38.
# expected 'Â£Ô
ok 72 - start=164 end=78
Malformed UTF-8 character (unexpected end of string) at ./test.pl line 94.
Malformed UTF-8 character (unexpected end of string) at ./test.pl line 95.
not ok 73 - start=164 end=163
# Failed at t/op/chop.t line 193
Wide character in print at ./test.pl line 38.
# got '¤Â'
# expected '¤'
Malformed UTF-8 character (unexpected end of string) at ./test.pl line 94.
Malformed UTF-8 character (unexpected end of string) at ./test.pl line 95.
not ok 74 - start=164 end=163 (end as bytes)
# Failed at t/op/chop.t line 203
Wide character in print at ./test.pl line 38.
# got '¤ÃÂ'
# expected '¤Â£'
not ok 75 - start=164 end=163 ($/ as bytes)
# Failed at t/op/chop.t line 209
# got '¤'
# expected '¤£'
ok 76 - start=164 end=164
not ok 77 - start=164 end=164 (end as bytes)
# Failed at t/op/chop.t line 203
# got '¤Â'
# expected '¤Â¤'
not ok 78 - start=164 end=164 ($/ as bytes)
# Failed at t/op/chop.t line 209
# got '¤'
# expected '¤¤'
ok 79 - start=164 end=1296
ok 80 - start=164 end=1296 (end as bytes)
not ok 81 - start=164 end=1296 ($/ as bytes)
# Failed at t/op/chop.t line 209
# got '¤'
Wide character in print at ./test.pl line 38.
# expected 'Â¤Ô
ok 82 - start=1296 end=78
Malformed UTF-8 character (unexpected end of string) at ./test.pl line 94.
Malformed UTF-8 character (unexpected end of string) at ./test.pl line 95.
not ok 83 - start=1296 end=163
# Failed at t/op/chop.t line 193
Wide character in print at ./test.pl line 38.
# got 'Ô
Wide character in print at ./test.pl line 38.
# expected 'Ô
Malformed UTF-8 character (unexpected end of string) at ./test.pl line 94.
Malformed UTF-8 character (unexpected end of string) at ./test.pl line 95.
not ok 84 - start=1296 end=163 (end as bytes)
# Failed at t/op/chop.t line 203
Wide character in print at ./test.pl line 38.
# got 'Ô
Wide character in print at ./test.pl line 38.
# expected 'Ô
not ok 85 - start=1296 end=163 ($/ as bytes)
# Failed at t/op/chop.t line 209
Wide character in print at ./test.pl line 38.
# got 'Ô
Wide character in print at ./test.pl line 38.
# expected 'Ô
ok 86 - start=1296 end=164
not ok 87 - start=1296 end=164 (end as bytes)
# Failed at t/op/chop.t line 203
Wide character in print at ./test.pl line 38.
# got 'Ô
Wide character in print at ./test.pl line 38.
# expected 'Ô
not ok 88 - start=1296 end=164 ($/ as bytes)
# Failed at t/op/chop.t line 209
Wide character in print at ./test.pl line 38.
# got 'Ô
Wide character in print at ./test.pl line 38.
# expected 'Ô
ok 89 - start=1296 end=1296
ok 90 - start=1296 end=1296 (end as bytes)
not ok 91 - start=1296 end=1296 ($/ as bytes)
# Failed at t/op/chop.t line 209
Wide character in print at ./test.pl line 38.
# got 'Ô
Wide character in print at ./test.pl line 38.
# expected 'Ô
This is not a new utf8 bug.
On Mon, Jan 12, 2004 at 09:10:21PM -0000, Nicholas Clark <perlbug-followup@perl.org> wrote:
While working my way down doop.c, I discovered that chomp completely ignores utf8 flags in both the chomped string and $/
If you are auditing the code for UTF8 issues, you might take a look for cases where SvUTF8/DO_UTF8 precedes SvPV (or whatever else calls sv_2pv_flags), since this will fail for overloaded stringify that returns UTF8 (since the UTF8 flag isn't set until the stringify.) I noticed one in do_vop, but haven't fixed it yet.
The RT System itself - Status changed from 'new' to 'open'
Moral - don't use a character for a test case which happens to be a substring of its UTF8 representation, unless you specifically need this effect. (ie my testcase was slightly wrong)
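One way to read the moral above: for code points 0x80 to 0xBF (which include 163 and 164 from the test output), the single Latin-1 byte is also the last byte of the character's UTF-8 encoding, so a byte-wise comparison can match by accident. The following is a minimal sketch of that check, not code from the Perl source; the helper names `encode_utf8_small` and `latin1_byte_is_utf8_tail` are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Encode one code point below 0x800 as UTF-8.
 * Returns the number of bytes written to out (1 or 2). */
static size_t encode_utf8_small(unsigned cp, unsigned char out[2])
{
    assert(cp < 0x800);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    out[0] = (unsigned char)(0xC0 | (cp >> 6));
    out[1] = (unsigned char)(0x80 | (cp & 0x3F));
    return 2;
}

/* Returns 1 when cp is a non-ASCII Latin-1 character whose single byte
 * equals the final byte of its own UTF-8 encoding, 0 otherwise. */
static int latin1_byte_is_utf8_tail(unsigned cp)
{
    unsigned char buf[2];
    size_t n;

    if (cp < 0x80 || cp > 0xFF)
        return 0;
    n = encode_utf8_small(cp, buf);
    return buf[n - 1] == (unsigned char)cp;
}
```

Under these assumptions, 163 (0xA3, UTF-8 `C2 A3`) and 164 (0xA4, UTF-8 `C2 A4`) have the property, while a character such as 0xE9 (UTF-8 `C3 A9`) does not and would make a less ambiguous test value.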
On Mon, Jan 12, 2004 at 09:10:21PM -0000, Nicholas Clark wrote:
While working my way down doop.c, I discovered that chomp completely ignores utf8 flags in both the chomped string and $/
Change 22155 fixes this.
I presume that chomp really also ought to pay attention to the encoding pragma in the mixed bytes/utf8 case? [ie more work still needed]
Nicholas Clark
==== //depot/perl/doop.c#140 (text) ====
@@ -1008,6 +1008,7 @@
     STRLEN len;
     STRLEN n_a;
     char *s;
+    char *temp_buffer = NULL;

     if (RsSNARF(PL_rs))
         return 0;
@@ -1059,6 +1060,27 @@
             else {
                 STRLEN rslen;
                 char *rsptr = SvPV(PL_rs, rslen);
+                if (SvUTF8(PL_rs) != SvUTF8(sv)) {
+                    /* Assumption is that rs is shorter than the scalar. */
+                    if (SvUTF8(PL_rs)) {
+                        /* RS is utf8, scalar is 8 bit. */
+                        bool is_utf8 = TRUE;
+                        temp_buffer = (char*)bytes_from_utf8((U8*)rsptr,
+                                                             &rslen, &is_utf8);
+                        if (is_utf8) {
+                            /* Cannot downgrade, therefore cannot possibly match
+                             */
+                            assert (temp_buffer == rsptr);
+                            temp_buffer = NULL;
+                            goto nope;
+                        }
+                        rsptr = temp_buffer;
+                    } else {
+                        /* RS is 8 bit, scalar is utf8. */
+                        temp_buffer = (char*)bytes_to_utf8((U8*)rsptr, &rslen);
+                        rsptr = temp_buffer;
+                    }
+                }
                 if (rslen == 1) {
                     if (*s != *rsptr)
                         goto nope;
@@ -1081,6 +1103,7 @@
             SvSETMAGIC(sv);
         }
   nope:
+    Safefree(temp_buffer);
     return count;
 }
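The "RS is 8 bit, scalar is utf8" branch of the patch can be illustrated outside the Perl core. The sketch below is a standalone approximation under stated assumptions: it uses plain `malloc`/`free` instead of Perl's `Newz`/`Safefree`, and the functions `latin1_to_utf8` and `utf8_suffix_len_for_latin1_rs` are hypothetical stand-ins, not the real `bytes_to_utf8` or `Perl_do_chomp`. It upgrades the separator to UTF-8, compares it against the end of the scalar, and releases the temporary buffer on every path.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Upgrade a Latin-1 byte string to UTF-8 in a freshly allocated buffer.
 * The caller owns the result and must free() it. Returns NULL on
 * allocation failure. */
static unsigned char *latin1_to_utf8(const unsigned char *src, size_t len,
                                     size_t *out_len)
{
    unsigned char *dst = malloc(2 * len + 1); /* worst case: 2 bytes each */
    size_t n = 0;
    size_t i;

    assert(out_len != NULL);
    if (dst == NULL)
        return NULL;
    for (i = 0; i < len; i++) {
        if (src[i] < 0x80) {
            dst[n++] = src[i];
        } else {
            dst[n++] = (unsigned char)(0xC0 | (src[i] >> 6));
            dst[n++] = (unsigned char)(0x80 | (src[i] & 0x3F));
        }
    }
    dst[n] = '\0';
    *out_len = n;
    return dst;
}

/* s is valid UTF-8, rs is a Latin-1 separator. Returns the number of
 * bytes to strip from the end of s when s ends with rs (compared as
 * characters), or 0 when there is no match. */
static size_t utf8_suffix_len_for_latin1_rs(const unsigned char *s, size_t slen,
                                            const unsigned char *rs, size_t rslen)
{
    size_t ulen = 0;
    size_t strip = 0;
    unsigned char *urs = latin1_to_utf8(rs, rslen, &ulen);

    if (urs == NULL)
        return 0;
    if (ulen > 0 && ulen <= slen && memcmp(s + slen - ulen, urs, ulen) == 0)
        strip = ulen;
    free(urs); /* the temporary conversion buffer is always released */
    return strip;
}
```

With a separator of `"\xA3"` (Latin-1 £), the UTF-8 string `"N\xC2\xA3"` matches and two bytes are stripped, while `"\xC3\xA3"` (ã) does not match even though its final byte is 0xA3.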
Nicholas Clark <nick@ccl4.org> wrote
+ /* Assumption is that rs is shorter than the scalar. */
That comment looks scary. Where is the assumption made? What happens if it is false? Do you actually mean
/* if rs is longer than the scalar, these conversions are a waste of time, but the case is rare enough that we don't care */
?
Secondly:
While investigating where the assumption might be made, I peered inside bytes_from_utf8() / bytes_to_utf8(). I note that the converted string is placed in a buffer allocated by Newz(), but can't see anywhere that the buffer is freed. Why isn't this a horrendous memory leak (for existing uses, as well as the new ones)?
Please tell me I'm missing something obvious.
Mike Guy
Mike Guy <mjtg@cam.ac.uk> writes:
Secondly:
While investigating where the assumption might be made, I peered inside bytes_from_utf8() / bytes_to_utf8(). I note that the converted string is placed in a buffer allocated by Newz(), but can't see anywhere that the buffer is freed. Why isn't this a horrendous memory leak (for existing uses, as well as the new ones)?
Please tell me I'm missing something obvious.
You mean this bit:
  nope:
+    Safefree(temp_buffer);
     return count;
 }
On Thu, 15 Jan 2004 00:25:00 +0000 Nicholas Clark <nick@ccl4.org> wrote:
Moral - don't use a character for a test case which happens to be a substring of its UTF8 representation, unless you specifically need this effect. (ie my testcase was slightly wrong)
On Mon, Jan 12, 2004 at 09:10:21PM -0000, Nicholas Clark wrote:
While working my way down doop.c, I discovered that chomp completely ignores utf8 flags in both the chomped string and $/
Change 22155 fixes this.
I presume that chomp really also ought to pay attention to the encoding pragma in the mixed bytes/utf8 case? [ie more work still needed]
Hello.
(1) chomp() returns the number of *characters* removed. So isn't <count += rslen;> (a number of bytes) wrong?
(2) For some multibyte or stateful encodings, when either the string or $/ is in bytes, recoding to utf8 is required. (That must be inefficient for single-byte encodings...)
(but AFAIK, encoding.pm unicodifies strings in many cases, like literals and inputs. So strings with UTF8 off might be rare under the encoding pragma.)
(3) For many CJK encodings, comparison in bytes has a problem that does not arise for single-byte encodings or for Unicode encodings (UTF-X).
Say the encoding is shift-jis. "\x81\x40" is IDSP (U+3000), while "\x40" is '@' (as in ASCII). Then, after <$/ = "\x40"; $a = "\x81\x40"; chomp($a);>, $a should not be chomped ($a eq "\x81" is very bad.)
Encodings that have this problem include big5, euc-jp, GBK, iso-2022-jp, johab, shift-jis, UHC.
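Point (3) can be sketched for Shift-JIS as follows. This is an illustration, not the approach taken in the Perl core: it assumes the common lead-byte ranges 0x81 to 0x9F and 0xE0 to 0xFC, ignores vendor extensions, and the function names are hypothetical. A byte-wise suffix match is accepted only when it starts on a character boundary, which is found by scanning from the start of the string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified Shift-JIS lead-byte test (assumed ranges, see lead-in). */
static int sjis_is_lead(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Returns 1 if byte offset pos is a character boundary in s. */
static int sjis_is_boundary(const unsigned char *s, size_t len, size_t pos)
{
    size_t i = 0;

    assert(pos <= len);
    while (i < pos) {
        /* a lead byte consumes its trail byte as well, when one exists */
        i += (sjis_is_lead(s[i]) && i + 1 < len) ? 2 : 1;
    }
    return i == pos;
}

/* Returns 1 if s ends with rs as whole characters, 0 otherwise. */
static int sjis_ends_with(const unsigned char *s, size_t slen,
                          const unsigned char *rs, size_t rslen)
{
    if (rslen == 0 || rslen > slen)
        return 0;
    if (memcmp(s + slen - rslen, rs, rslen) != 0)
        return 0;
    return sjis_is_boundary(s, slen, slen - rslen);
}
```

For the example in the message, `"\x81\x40"` with separator `"\x40"` is rejected because offset 1 falls inside the two-byte character, whereas `"A\x40"` is accepted.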
### $ patch
I wrote tests for chomp bytes in many CJK encodings as a new file. I'm not sure where this test should be placed (say, perl/t/uni/ ?).
### ^ new test
BEGIN {
    if ($ENV{'PERL_CORE'}){
        chdir 't';
        unshift @INC, '../lib';
    }
    require Config; import Config;
    if ($Config{'extensions'} !~ /\bEncode\b/) {
        print "1..0 # Skip: Encode was not built\n";
        exit 0;
    }
    if (ord("A") == 193) {
        print "1..0 # Skip: EBCDIC\n";
        exit 0;
    }
    unless (PerlIO::Layer->find('perlio')){
        print "1..0 # Skip: PerlIO required\n";
        exit 0;
    }
    eval 'use Encode';
    if ($@ =~ /dynamic loading not available/) {
        print "1..0 # Skip: no dynamic loading, no Encode\n";
        exit 0;
    }
}

use strict;
use Test::More tests => (4 * 4 * 4) * (3); # (@char ** 3) * (keys %mbchars)

# %mbchars = (encoding => { bytes => utf8, ... }, ...);
# * pack('C*') is expected to return bytes even if ${^ENCODING} is true.
our %mbchars = (
    'big-5' => {
        pack('C*', 0x40)       => pack('U*', 0x40), # COMMERCIAL AT
        pack('C*', 0xA4, 0x40) => "\x{4E00}",       # CJK-4E00
    },
    'euc-jp' => {
        pack('C*', 0xB0, 0xA1)       => "\x{4E9C}", # CJK-4E9C
        pack('C*', 0x8F, 0xB0, 0xA1) => "\x{4E02}", # CJK-4E02
    },
    'shift-jis' => {
        pack('C*', 0xA9)       => "\x{FF69}", # halfwidth katakana small U
        pack('C*', 0x82, 0xA9) => "\x{304B}", # hiragana KA
    },
);

for my $enc (sort keys %mbchars) {
    local ${^ENCODING} = find_encoding($enc);
    my @char = (sort(keys %{ $mbchars{$enc} }),
                sort(values %{ $mbchars{$enc} }));

    for my $rs (@char) {
        local $/ = $rs;
        for my $start (@char) {
            for my $end (@char) {
                my $string = $start.$end;
                my $expect = $end eq $rs ? $start : $string;
                chomp $string;
                is($string, $expect);
            }
        }
    }
}
### ^ new test
regards SADAHIRO Tomoyuki
On Fri, Jan 16, 2004 at 04:13:00AM +0900, SADAHIRO Tomoyuki wrote:
(1) chomp() returns the number of *characters* removed. So isn't <count += rslen;> (a number of bytes) wrong?
Yes, bug. Well spotted
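To make the bytes-versus-characters distinction concrete: when the scalar is UTF-8, the removed separator should be counted in characters, not in bytes. The sketch below is one way to count characters in a valid UTF-8 buffer, by counting every byte that is not a continuation byte; it is not the fix that was applied to doop.c, and `utf8_char_count` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* Count the characters in a buffer that is assumed to hold valid UTF-8.
 * Continuation bytes have the bit pattern 10xxxxxx and are skipped. */
static size_t utf8_char_count(const unsigned char *s, size_t len)
{
    size_t count = 0;
    size_t i;

    assert(s != NULL || len == 0);
    for (i = 0; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For a removed separator of `"\xC2\xA3"` the byte length is 2 but the character count is 1, which is the value chomp would be expected to return for that separator.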
(2) For some multibyte or stateful encodings, when either the string or $/ is in bytes, recoding to utf8 is required. (That must be inefficient for single-byte encodings...)
(but AFAIK, encoding.pm unicodifies strings in many cases, like literals and inputs. So strings with UTF8 off might be rare under the encoding pragma.)
I've no idea, but for the moment I'm happy to assume that they are rare, and concentrate on correctness.
(3) For many CJK encodings, comparison in bytes has a problem that does not arise for single-byte encodings or for Unicode encodings (UTF-X).
Say the encoding is shift-jis. "\x81\x40" is IDSP (U+3000), while "\x40" is '@' (as in ASCII). Then, after <$/ = "\x40"; $a = "\x81\x40"; chomp($a);>, $a should not be chomped ($a eq "\x81" is very bad.)
Encodings that have this problem include big5, euc-jp, GBK, iso-2022-jp, johab, shift-jis, UHC.
This is what your new test tests?
I wrote tests for chomp bytes in many CJK encodings as a new file. I'm not sure where this test should be placed (say, perl/t/uni/ ?).
I'm not sure either, but for now it's t/uni/chomp.t
"Thanks, applied"
Thanks for sorting out all the loose ends I left.
Nicholas Clark
Was fixed by change 22155.
@nwc10 - Status changed from 'open' to 'resolved'
Migrated from rt.perl.org#24888 (status was 'resolved')
Searchable as RT24888$