Closed p5pRT closed 16 years ago
The "use bytes" pragma is useful for code which only wants to handle bytes.
substr(), length(), index(), pos() and regex matching all ignore the UTF-8 flag on strings in the scope of this pragma.
However, string concatenation does not take this pragma into account. Just like without the pragma, it upgrades strings to UTF-8 if any of them are UTF-8.
This is quite inconsistent with the algebraic properties expected of byte strings, such as:
    length(substr($a,0,1).substr($a,1)) == length($a)
Here's an example program which illustrates this:
    $x="\x{100}abc";
    $y="\x{80}def";
    use bytes;
    print length($x), ",", length($y), "\n";
    $z = $x.substr($x,0,1).substr($x,1).$y;
    print length($x), ",", length($y), ",", length($z), "\n";
The program prints:
    5,4
    5,5,17
Those numbers make no sense. In bytes, length($x) is 5 and length($y) is 4. After the concatenation, the total is 17, when it should logically be 14.
(This also shows length($y) is modified simply by $y being read, reported as [perl #26901]. In this case, length($y) is 4 before the concatenation but 5 after.)
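One plausible accounting for the 17, assuming each non-UTF-8 operand is upgraded during concatenation so that every byte >= 0x80 grows to two bytes, can be reproduced without relying on how "use bytes" interacts with concatenation. This is only a sketch: the variables $raw, $head and $tail are illustrative stand-ins for the byte-level pieces that substr() returns under the pragma.

```perl
use strict;
use warnings;
use bytes ();    # load bytes::length() without enabling the pragma

my $x = "\x{100}abc";    # UTF-8 flagged: 5 bytes internally (C4 80 61 62 63)
my $y = "\x{80}def";     # not flagged: 4 bytes (80 64 65 66)

# Byte-level view of $x, standing in for what substr() sees under "use bytes".
my $raw = $x;
utf8::encode($raw);               # $raw now holds the 5 raw bytes, flag off
my $head = substr($raw, 0, 1);    # "\xC4"
my $tail = substr($raw, 1);       # "\x80abc"

# What byte arithmetic predicts: 5 + 1 + 4 + 4.
my $expected = 0;
$expected += bytes::length($_) for $x, $head, $tail, $y;

# Ordinary concatenation: $x is flagged, so the other operands are upgraded.
# $head becomes 2 bytes, $tail 5 bytes, $y 5 bytes.
my $z = $x . $head . $tail . $y;

printf "sum of parts: %d, after concatenation: %d\n",
    $expected, bytes::length($z);
```

Under these assumptions the sum of the parts is 14 and the concatenated result occupies 17 bytes, which matches the figures in the report.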
Summary: I think string concatenation should _not_ upgrade non-UTF-8 strings to UTF-8 when they are concatenated inside the scope of "use bytes". A warning or even an exception may be appropriate.
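To make the requested behaviour concrete, here is a minimal sketch of a concatenation that never upgrades. The helper byte_concat is hypothetical (it is not part of Perl, of bytes.pm, or of any patch in this ticket); it simply joins the internal byte representation of each argument so that lengths are additive.

```perl
use strict;
use warnings;

# Hypothetical helper, for illustration only: concatenate the internal
# byte representations of the arguments without upgrading any of them.
sub byte_concat {
    my @parts = @_;    # work on copies so the caller's strings are untouched
    for my $p (@parts) {
        # A flagged string is replaced by its raw UTF-8 bytes;
        # an unflagged string is already a byte string and is left alone.
        utf8::encode($p) if utf8::is_utf8($p);
    }
    return join '', @parts;
}

my $x = "\x{100}abc";    # 5 bytes internally
my $y = "\x{80}def";     # 4 bytes
my $z = byte_concat($x, $y);

print length($z), "\n";  # 9, i.e. 5 + 4
```

The result is an unflagged byte string, so length() reports the same value with or without the pragma in effect.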
SADAHIRO Tomoyuki wrote:
This is because join() internally uses sv_catsv() which considers bytes.pm.
Here is a patch against perl-current. After this patch the above example prints: [snip]
After my patch for pp_hot.c, some tests for Encode fail.
    t/CJKT.t      1   256    60    1   1.67%  22
    t/at-cn.t     2   512    29    2   6.90%  18 20
    t/perlio.t    2   512    38    2   5.26%  7-8
This is unnecessary (I think) declaration of \
In addition, perlio_ok constantly returning true is wrong (it should return false if PerlIO::encoding is not available), so the default method in Encode::Encoding:: should be used.
Thanks, both patches applied to bleadperl as change #22363. Note that I've changed the version number of Encode::CN::HZ to 1.05_01. The change to Encode::CN::HZ should probably be made conditional on perl version >= 5.9.1.
The RT System itself - Status changed from 'new' to 'open'
On Feb 23, 2004, at 01:26, Autrijus Tang wrote:
On Sun, Feb 22, 2004 at 06:41:43PM +0900, SADAHIRO Tomoyuki wrote:
After my patch for pp_hot.c, some tests for Encode fail.
    t/CJKT.t      1   256    60    1   1.67%  22
    t/at-cn.t     2   512    29    2   6.90%  18 20
    t/perlio.t    2   512    38    2   5.26%  7-8
This is unnecessary (I think) declaration of \
As the author of HZ.pm, I think the patch makes perfect sense. :-)
Sorry for my slow response. I was too busy to be online for the last few days.
I just checked the patch on both 5.8.0 and 5.8.3 and it worked fine. So it is backward-compatible. Now there is no reason not to let your patch in. I already did so in my repository.
Pumpking(s), please go ahead and apply his patch.
Dan the Encode Maintainer
p5p@spam.wizbit.be - Status changed from 'open' to 'resolved'
Migrated from rt.perl.org#26905 (status was 'resolved')
Searchable as RT26905$