Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/

~ on wide chars #2640

Closed p5pRT closed 20 years ago

p5pRT commented 24 years ago

Migrated from rt.perl.org#4332 (status was 'resolved')

Searchable as RT4332$

p5pRT commented 24 years ago

From @gisle

Created by @gisle

The ~ operation on UTF8 flagged strings does not do the right thing:

    $ perl -MDevel::Peek -e 'Dump(~v300)'
    SV = PV(0x8160cac) at 0x8160740
      REFCNT = 1
      FLAGS = (PADBUSY,PADTMP,POK,READONLY,pPOK,UTF8)
      PV = 0x816dee0 ";S"\0
      CUR = 2
      LEN = 3

It just flips the bits, but does not even turn off the UTF8 flag.
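The `";S"` in the dump is what falls out of complementing the two UTF-8 bytes of character 300 one byte at a time. A minimal C sketch of that arithmetic follows; `encode_utf8_small` and `complement_bytes` are hypothetical helpers written for illustration, not perl internals.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode a code point below 0x800 as UTF-8 (one or two bytes).
 * Returns the number of bytes written to out. */
static size_t encode_utf8_small(uint32_t cp, uint8_t *out) {
    if (cp < 0x80) {
        out[0] = (uint8_t)cp;
        return 1;
    }
    out[0] = (uint8_t)(0xC0 | (cp >> 6));   /* lead byte: 110xxxxx */
    out[1] = (uint8_t)(0x80 | (cp & 0x3F)); /* continuation: 10xxxxxx */
    return 2;
}

/* Complement every byte of the buffer in place, ignoring character
 * boundaries. This is the behaviour described in the report. */
static void complement_bytes(uint8_t *buf, size_t len) {
    for (size_t i = 0; i < len; i++)
        buf[i] = (uint8_t)~buf[i];
}
```

Character 300 encodes as the bytes 0xC4 0xAC, and their complements are 0x3B and 0x53, which are the ASCII characters `;` and `S`. With the UTF8 flag left on, the result reads as a two-character string rather than one complemented character.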

It is not clear to me what the operation should do. One way is to use 0..10FFFF (the official range of UTF8) and flip bits based on that. That seems kind of wrong. I would suggest that we simply flip bits as if the character was an 'int'. (That would create a fairly long string internally on 64bit machines.)

I would also argue that ~"\0" should evaluate into the same as chr(~0) unless inside 'use bytes' scope. Currently it evaluates to chr(255).

Perl Info

```
Flags:
    category=core
    severity=low

Site configuration information for perl v5.7.0:

Configured by gisle at Tue Sep 5 09:56:22 CEST 2000.

Summary of my perl5 (revision 5.0 version 7 subversion 0) configuration:
  Platform:
    osname=linux, osvers=2.2.14, archname=i686-linux-thread-multi
    uname='linux eik 2.2.14 #1 fri mar 17 11:59:50 gmt 2000 i686 unknown '
    config_args='-Dusedevel -Dprefix=/local/perl/5.7.0_thr -Dusethreads -Doptimize=-g -ders'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
    useperlio=undef d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -DDEBUGGING -fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-g',
    cppflags='-D_REENTRANT -DDEBUGGING -fno-strict-aliasing -I/usr/local/include'
    ccversion='', gccversion='2.95.2 19991024 (release)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, usemymalloc=n, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -lndbm -lgdbm -ldbm -ldb -ldl -lm -lpthread -lc -lposix -lcrypt -lutil
    libc=, so=so, useshrplib=false, libperl=libperl.a
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
    cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:

@INC for perl v5.7.0:
    /local/perl/5.7.0_thr/lib/5.7.0/i686-linux-thread-multi
    /local/perl/5.7.0_thr/lib/5.7.0
    /local/perl/5.7.0_thr/lib/site_perl/5.7.0/i686-linux-thread-multi
    /local/perl/5.7.0_thr/lib/site_perl/5.7.0
    /local/perl/5.7.0_thr/lib/site_perl
    .

Environment for perl v5.7.0:
    HOME=/home/gisle
    LANG=POSIX
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=...
    PERL_BADLANG (unset)
    SHELL=/bin/bash
```

p5pRT commented 24 years ago

From @ysth

In article <20000918174601.2012.qmail@eik.g.aas.no>, gisle@aas.no wrote:

> The ~ operation on UTF8 flagged strings does not do the right thing:
>
>     $ perl -MDevel::Peek -e 'Dump(~v300)'
>     SV = PV(0x8160cac) at 0x8160740
>       REFCNT = 1
>       FLAGS = (PADBUSY,PADTMP,POK,READONLY,pPOK,UTF8)
>       PV = 0x816dee0 ";S"\0
>       CUR = 2
>       LEN = 3
>
> It just flips the bits, but does not even turn off the UTF8 flag.
>
> It is not clear to me what the operation should do. One way is to use 0..10FFFF (the official range of UTF8) and flip bits based on that. That seems kind of wrong. I would suggest that we simply flip bits as if the character was an 'int'. (That would create a fairly long string internally on 64bit machines.)
>
> I would also argue that ~"\0" should evaluate into the same as chr(~0) unless inside 'use bytes' scope. Currently it evaluates to chr(255).

Hmm. I don't really see a reasonable use for this (~ on strings with chars > 255). The others (^, |, &) lend themselves to a convenient definition for what to do with chars > 255. Perhaps then the best thing would be to maintain as much backward-compatibility as possible and truncate each char to 8 bits after ~-ing.

On the other hand, if one is creating a bitmask to later use with ^, &, or |, it would make sense to set the maximum number of bits in a perl-utf8 char. But that produces pretty long strings from e.g. "\0\0\0". As well as the surprise UTF8-encoded string resulting from ~ on a non-UTF8-encoded string.

Either way, I see no reason to limit it to official UTF8 or int size.

p5pRT commented 24 years ago

From @jhi

On Mon, Sep 18, 2000 at 08:45:08PM -0700, Yitzchak Scott-Thoennes wrote:

> In article <20000918174601.2012.qmail@eik.g.aas.no>, gisle@aas.no wrote:
>
>> The ~ operation on UTF8 flagged strings does not do the right thing:
>>
>>     $ perl -MDevel::Peek -e 'Dump(~v300)'
>>     SV = PV(0x8160cac) at 0x8160740
>>       REFCNT = 1
>>       FLAGS = (PADBUSY,PADTMP,POK,READONLY,pPOK,UTF8)
>>       PV = 0x816dee0 ";S"\0
>>       CUR = 2
>>       LEN = 3
>>
>> It just flips the bits, but does not even turn off the UTF8 flag.
>>
>> It is not clear to me what the operation should do. One way is to use 0..10FFFF (the official range of UTF8) and flip bits based on that. That seems kind of wrong. I would suggest that we simply flip bits as if the character was an 'int'. (That would create a fairly long string internally on 64bit machines.)

How about this: if the $string is in utf8:

    ~"$string" eq join("", map { ~ord($_) } split //, $string);

and preserve the utf8ness, because we want ~~X eq X.

>> I would also argue that ~"\0" should evaluate into the same as chr(~0) unless inside 'use bytes' scope.

Sounds like the above.

>> Currently it evaluates to chr(255).

> On the other hand, if one is creating a bitmask to later use with ^, &, or |, it would make sense to set the maximum number of bits in a perl-utf8 char. But that produces pretty long strings from e.g. "\0\0\0". As well as the

So does ~0 (it produces pretty "long integers").

> surprise UTF8-encoded string resulting from ~ on a non-UTF8-encoded string.

p5pRT commented 24 years ago

From @jhi

> How about this: if the $string is in utf8:
>
>     ~"$string" eq join("", map { ~ord($_) } split //, $string);
>
> and preserve the utf8ness, because we want ~~X eq X.

Argblebargle. That didn't come out right. I meant

    ~"$string" eq join("", map { chr(~ord($_)) } split //, $string);

So ~(chr(200).chr(2000)) would be chr(~200).chr(~2000).
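Read as an operation on characters rather than bytes, the corrected definition can be sketched in C. This is a sketch under an assumption: each character is held as a 64-bit unsigned value standing in for perl's UV, whose real width depends on the platform (it was 32 bits on the reporter's machine). `uv_t` and `complement_chars` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for perl's UV; assumed to be 64 bits wide in this sketch. */
typedef uint64_t uv_t;

/* Replace every character with its bitwise complement, i.e. apply the
 * per-character rule ~chr($x) eq chr(~$x) across the whole string. */
static void complement_chars(uv_t *chars, size_t n) {
    for (size_t i = 0; i < n; i++)
        chars[i] = ~chars[i];
}
```

Applying `complement_chars` twice restores the input, which is the ~~X eq X property asked for above. It also leaves the character count unchanged, although each complemented character becomes a very large value.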

p5pRT commented 24 years ago

From @simoncozens

On Sat, Oct 14, 2000 at 01:12:11PM -0500, Jarkko Hietaniemi wrote:

>     ~"$string" eq join("", map { chr(~ord($_)) } split //, $string);
>
> So ~(chr(200).chr(2000)) would be chr(~200).chr(~2000).

  Make UTF8 ~chr($x) == chr(~$x)

==== //depot/bleadperl/pp.c#7 (text) ====
Index: perl/pp.c

Inline Patch

```diff
--- perl/pp.c.~1~ Sat Oct 14 20:50:48 2000
+++ perl/pp.c Sat Oct 14 20:50:48 2000
@@ -1476,6 +1476,38 @@
       SvSetSV(TARG, sv);
       tmps = SvPV_force(TARG, len);
       anum = len;
+      if (SvUTF8(TARG)) {
+        /* Calculate exact length, let's not estimate */
+        STRLEN targlen = 0;
+        U8 *result;
+        char *send;
+
+        send = tmps + len;
+        while (tmps < send) {
+          I32 l;
+          UV c = utf8_to_uv(tmps, &l);
+          c = (UV)~c;
+          tmps += UTF8SKIP(tmps);
+          targlen += UTF8LEN(c);
+        }
+
+        /* Now rewind strings and write them. */
+        tmps -= len;
+        Newz(0, result, targlen + 1, U8);
+        while (tmps < send) {
+          I32 l;
+          UV c = utf8_to_uv(tmps, &l);
+          tmps += UTF8SKIP(tmps);
+          result = uv_to_utf8(result,(UV)~c);
+        }
+        *result = '\0';
+        result -= targlen;
+        sv_setpvn(TARG, result, targlen);
+        SvUTF8_on(TARG);
+        Safefree(result);
+        SETs(TARG);
+        RETURN;
+      }
 #ifdef LIBERAL
       for ( ; anum && (unsigned long)tmps % sizeof(long); anum--, tmps++)
         *tmps = ~*tmps;
```

==== //depot/bleadperl/t/op/bop.t#5 (xtext) ====
Index: perl/t/op/bop.t

Inline Patch

```diff
--- perl/t/op/bop.t.~1~ Sat Oct 14 20:50:48 2000
+++ perl/t/op/bop.t Sat Oct 14 20:50:48 2000
@@ -9,7 +9,7 @@
     @INC = '../lib';
 }
 
-print "1..35\n";
+print "1..37\n";
 
 # numerics
 print ((0xdead & 0xbeef) == 0x9ead ? "ok 1\n" : "not ok 1\n");
@@ -82,9 +82,9 @@
 print "ok 29\n" if sprintf("%vd", v4095.801.4095 | v801.4095) eq '4095.4095.4095';
 print "ok 30\n" if sprintf("%vd", v801.4095 ^ v4095.801.4095) eq '3294.3294.4095';
 #
-print "ok 31\n" if sprintf("%vd", v120.v300 & v200.400) eq '72.256';
-print "ok 32\n" if sprintf("%vd", v120.v300 | v200.400) eq '248.444';
-print "ok 33\n" if sprintf("%vd", v120.v300 ^ v200.400) eq '176.188';
+print "ok 31\n" if sprintf("%vd", v120.300 & v200.400) eq '72.256';
+print "ok 32\n" if sprintf("%vd", v120.300 | v200.400) eq '248.444';
+print "ok 33\n" if sprintf("%vd", v120.300 ^ v200.400) eq '176.188';
 #
 my $a = v120.300;
 my $b = v200.400;
@@ -94,3 +94,20 @@
 my $b = v200.400;
 $a |= $b;
 print "ok 35\n" if sprintf("%vd", $a) eq '248.444';
+#
+# UTF8 ~ behaviour
+for (0x100...0xFFF) {
+  $a = ~(chr $_);
+  print "not" if $a ne chr(~$_) or length($a) != 1 or ~$a ne chr($_);
+}
+print "ok 36\n";
+
+for my $i (0xEEE...0xF00) {
+  for my $j (0x0..0x120) {
+    $a = ~(chr ($i) . chr $j);
+    print "not" if $a ne chr(~$i).chr(~$j)
+                or length($a) != 2
+                or ~$a ne chr($i).chr($j);
+  }
+}
+print "ok 37\n";
```

==== //depot/bleadperl/utf8.h#4 (text) ====
Index: perl/utf8.h

Inline Patch

```diff
--- perl/utf8.h.~1~ Sat Oct 14 20:50:48 2000
+++ perl/utf8.h Sat Oct 14 20:50:48 2000
@@ -35,6 +35,24 @@
 
 #define UTF8SKIP(s) PL_utf8skip[*(U8*)s]
 
+#ifdef HAS_QUAD
+#define UTF8LEN(uv) ( (uv) < 0x80           ? 1 : \
+                      (uv) < 0x800          ? 2 : \
+                      (uv) < 0x10000        ? 3 : \
+                      (uv) < 0x200000       ? 4 : \
+                      (uv) < 0x4000000      ? 5 : \
+                      (uv) < 0x80000000     ? 6 : \
+                      (uv) < 0x1000000000LL ? 7 : 13 )
+#else
+/* No, I'm not even going to *TRY* putting #ifdef inside a #define */
+#define UTF8LEN(uv) ( (uv) < 0x80           ? 1 : \
+                      (uv) < 0x800          ? 2 : \
+                      (uv) < 0x10000        ? 3 : \
+                      (uv) < 0x200000       ? 4 : \
+                      (uv) < 0x4000000      ? 5 : \
+                      (uv) < 0x80000000     ? 6 : 7 )
+#endif
+
 /*
  * Note: we try to be careful never to call the isXXX_utf8() functions
  * unless we're pretty sure we've seen the beginning of a UTF-8 character
```

End of Patch.
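The `UTF8LEN` macro decides how much room the complemented string needs, so it helps to see what it returns for a complemented character. The following sketch restates the `HAS_QUAD` branch as a function, with thresholds copied from the patch; `utf8_len` is a hypothetical name, and a 64-bit argument is assumed.

```c
#include <stdint.h>

/* Number of bytes perl's extended UTF-8 needs for a code point,
 * mirroring the HAS_QUAD branch of the UTF8LEN macro in the patch. */
static int utf8_len(uint64_t uv) {
    if (uv < 0x80)            return 1;
    if (uv < 0x800)           return 2;
    if (uv < 0x10000)         return 3;
    if (uv < 0x200000)        return 4;
    if (uv < 0x4000000)       return 5;
    if (uv < 0x80000000ULL)   return 6;
    if (uv < 0x1000000000ULL) return 7;
    return 13;
}
```

Character 300 takes 2 bytes. Its complement takes 7 bytes when the value is 32 bits wide and 13 bytes when it is 64 bits wide, which is the "fairly long string" concern raised at the top of the thread.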
p5pRT commented 24 years ago

From @jhi

On Sat, Oct 14, 2000 at 08:52:13PM +0100, Simon Cozens wrote:

> On Sat, Oct 14, 2000 at 01:12:11PM -0500, Jarkko Hietaniemi wrote:
>
>>     ~"$string" eq join("", map { chr(~ord($_)) } split //, $string);
>>
>> So ~(chr(200).chr(2000)) would be chr(~200).chr(~2000).
>
>   Make UTF8 ~chr($x) == chr(~$x)
>
> ==== //depot/bleadperl/pp.c#7 (text) ====
> Index: perl/pp.c
> --- perl/pp.c.~1~ Sat Oct 14 20:50:48 2000
> +++ perl/pp.c Sat Oct 14 20:50:48 2000

Does not work on an alpha. Here's the output:

    1..37
    ok 1
    ok 2
    ok 3
    ok 4
    ok 5
    ok 6
    ok 7
    ok 8
    ok 9
    ok 10
    ok 11
    ok 12
    ok 13
    ok 14
    ok 15
    ok 16
    ok 17
    ok 18
    ok 19
    ok 20
    ok 21
    ok 22
    ok 23
    ok 24
    ok 25
    ok 26
    ok 27
    ok 28
    ok 29
    ok 30
    ok 31
    ok 32
    ok 33
    ok 34
    ok 35
    notnotnotnotnotnotnotnot [... the unbroken run of "not" continues for several screens and is shortened here ...] notnotnotok 36  37

--
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen

p5pRT commented 23 years ago

From @ysth

Jarkko Hietaniemi <jhi@iki.fi> wrote:

> On Mon, Sep 18, 2000 at 08:45:08PM -0700, Yitzchak Scott-Thoennes wrote:
>
>> In article <20000918174601.2012.qmail@eik.g.aas.no>, gisle@aas.no wrote:
>>
>>> The ~ operation on UTF8 flagged strings does not do the right thing:
>>>
>>>     $ perl -MDevel::Peek -e 'Dump(~v300)'
>>>     SV = PV(0x8160cac) at 0x8160740
>>>       REFCNT = 1
>>>       FLAGS = (PADBUSY,PADTMP,POK,READONLY,pPOK,UTF8)
>>>       PV = 0x816dee0 ";S"\0
>>>       CUR = 2
>>>       LEN = 3
>>>
>>> It just flips the bits, but does not even turn off the UTF8 flag.
>>>
>>> It is not clear to me what the operation should do. One way is to use 0..10FFFF (the official range of UTF8) and flip bits based on that. That seems kind of wrong. I would suggest that we simply flip bits as if the character was an 'int'. (That would create a fairly long string internally on 64bit machines.)
>
> How about this: if the $string is in utf8:

Sorry for the very late response to this. But doesn't that "if the $string is in utf8" violate our cardinal rule that the encoding shouldn't affect the results (e.g. ~($x=v1) should be the same as ~(chop($x=v1.300),$x))?

>     ~"$string" eq join("", map { ~ord($_) } split //, $string);

Correction in followup email noted. The patch based on this looks ok except that it should be checking !IN_BYTE, not SvUTF8.

> and preserve the utf8ness, because we want ~~X eq X.
>
>>> I would also argue that ~"\0" should evaluate into the same as chr(~0) unless inside 'use bytes' scope.
>
> Sounds like the above.
>
>>> Currently it evaluates to chr(255).
>
>> On the other hand, if one is creating a bitmask to later use with ^, &, or |, it would make sense to set the maximum number of bits in a perl-utf8 char. But that produces pretty long strings from e.g. "\0\0\0". As well as the
>
> So does ~0 (it produces pretty "long integers").
>
>> surprise UTF8-encoded string resulting from ~ on a non-UTF8-encoded string.
>
> --
> $jhi++; # http://www.iki.fi/jhi/
>         # There is this special biologist word we use for 'stable'.
>         # It is 'dead'. -- Jack Cohen

p5pRT commented 23 years ago

From @jhi

>> How about this: if the $string is in utf8:
>
> Sorry for the very late response to this. But doesn't that "if the $string is in utf8" violate our cardinal rule that the encoding shouldn't affect the results (e.g. ~($x=v1) should be the same as ~(chop($x=v1.300),$x))?

I'm sorry but I'm very dense today. Please explain your reasoning. Make certain that your definition of ~ obeys the rules

  (1) ~~x    == x
  (2) ~(x|y) == ~x&~y
  (3) ~(x&y) == ~x|~y
  (4) x|~x   == 1
  (5) x&~x   == 0

or there is not much point in implementing ~ at all...
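For plain fixed-width integers the five rules can be checked mechanically, which gives a baseline for what a string-level ~ would have to reproduce character by character. The sketch below makes one assumption about intent: rule (4) is read as "every bit set" rather than the literal integer 1. `complement_rules_hold` is a hypothetical helper.

```c
#include <stdbool.h>
#include <stdint.h>

/* Check the five identities for a pair of 64-bit values. Rule (4) is
 * interpreted as "all bits set", an assumption about what was meant. */
static bool complement_rules_hold(uint64_t x, uint64_t y) {
    const uint64_t all_ones = ~UINT64_C(0);
    return ~~x == x                  /* (1) involution */
        && ~(x | y) == (~x & ~y)     /* (2) De Morgan */
        && ~(x & y) == (~x | ~y)     /* (3) De Morgan */
        && (x | ~x) == all_ones      /* (4) every bit set */
        && (x & ~x) == 0;            /* (5) no bit set */
}
```

The identities hold here because both operands share one fixed width. The difficulty in the thread is that a byte-wide ~ and a UV-wide ~ use different widths, so mixing them can break rules (2) and (3).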

>>     ~"$string" eq join("", map { ~ord($_) } split //, $string);
>
> Correction in followup email noted. The patch based on this looks ok except that it should be checking !IN_BYTE, not SvUTF8.

Ummm, why should we pay the speed hit of utf* function calls for byte data?

p5pRT commented 23 years ago

From @ysth

Jarkko Hietaniemi <jhi@iki.fi> wrote:

>>> How about this: if the $string is in utf8:
>>
>> Sorry for the very late response to this. But doesn't that "if the $string is in utf8" violate our cardinal rule that the encoding shouldn't affect the results (e.g. ~($x=v1) should be the same as ~(chop($x=v1.300),$x))?
>
> I'm sorry but I'm very dense today. Please explain your reasoning. Make certain that your definition of ~ obeys the rules
>
>   (1) ~~x    == x
>   (2) ~(x|y) == ~x&~y
>   (3) ~(x&y) == ~x|~y
>   (4) x|~x   == 1
>   (5) x&~x   == 0
>
> or there is not much point in implementing ~ at all...

Yes. (Even with s/==/eq/.) Though ~~x will upgrade to utf8 encoding if it wasn't already on.

What's missing is:

  (6) $x eq $y implies ~$x eq ~$y.

    [D:\perl-current].\perl -wlIlib
    $x = v200;
    chop($y = v200.300);
    print "\$x eq \$y" if $x eq $y;
    print "~\$x eq ~\$y" if ~$x eq ~$y;
    __END__
    $x eq $y

The cardinal rule to which I refer was stated as:

  * It doesn't matter if data gets upgraded to UTF8 internally; if
    there is a place where it does matter, that's a bug.

in This Week on perl5-porters (9--23 October 2000).
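The failure of rule (6) can be modelled without perl. This is a deliberately simplified sketch: a one-character "string" is a code point plus a flag recording how it happens to be stored, and the complement is keyed on that flag in the way the SvUTF8-based patch does. `one_char`, `chars_equal` and `complement_by_flag` are hypothetical names, and a 64-bit character width is assumed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical one-character string: the character itself plus an
 * internal flag saying whether the buffer is stored as UTF-8. */
typedef struct {
    uint64_t cp;
    bool utf8;
} one_char;

/* Equality as "eq" sees it: only the character counts, not the storage. */
static bool chars_equal(one_char a, one_char b) {
    return a.cp == b.cp;
}

/* Complement keyed on the storage flag: byte-wide when the flag is off,
 * full-width when it is on. */
static one_char complement_by_flag(one_char c) {
    one_char r;
    r.cp = c.utf8 ? ~c.cp : (uint64_t)(uint8_t)~c.cp;
    r.utf8 = c.utf8;
    return r;
}
```

With character 200 stored both ways, the two inputs compare equal but their complements do not: the unflagged one becomes 55 and the flagged one becomes a 64-bit value. That is the same outcome as the transcript above, where only the first line is printed.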

>>>     ~"$string" eq join("", map { ~ord($_) } split //, $string);
>>
>> Correction in followup email noted. The patch based on this looks ok except that it should be checking !IN_BYTE, not SvUTF8.
>
> Ummm, why should we pay the speed hit of utf* function calls for byte data?

I'm sorry, but I'm very dense today too. What do you mean?

p5pRT commented 23 years ago

From @ysth

In article <20001030181150.A27977@chaos.wustl.edu>, Jarkko Hietaniemi <jhi@iki.fi> wrote:

>>> How about this: if the $string is in utf8:
>>
>> Sorry for the very late response to this. But doesn't that "if the $string is in utf8" violate our cardinal rule that the encoding shouldn't affect the results (e.g. ~($x=v1) should be the same as ~(chop($x=v1.300),$x))?
>
> I'm sorry but I'm very dense today. Please explain your reasoning. Make certain that your definition of ~ obeys the rules
>
>   (1) ~~x    == x
>   (2) ~(x|y) == ~x&~y
>   (3) ~(x&y) == ~x|~y
>   (4) x|~x   == 1
>   (5) x&~x   == 0
>
> or there is not much point in implementing ~ at all...

Well, I was going to demonstrate how things currently fail rules 2 and 3 above:

    #!/usr/bin/perl -w
    $x = v200;
    $y = v300;
    print "1..2\n";
    print 'not ' if ~("$x"|$y) ne (~$x&~$y);
    print "ok 1\n";
    print 'not ' if ~("$x"&$y) ne (~$x|~$y);
    print "ok 2\n";
    __END__

But I quickly discovered that ~$y is pretty useless with utf8 since just about anything you try to do gets you a Malformed utf warning. I think this is a reasonable fix for that (though you might or might not want the pp.c change--it's pp_ord):

Inline Patch

```diff
--- doop.c#7483 Sun Oct 29 12:22:32 2000
+++ doop.c Tue Oct 31 00:28:04 2000
@@ -968,10 +968,10 @@
         switch (optype) {
         case OP_BIT_AND:
             while (lulen && rulen) {
-                luc = utf8_to_uv((U8*)lc, lulen, &ulen, 0);
+                luc = utf8_to_uv((U8*)lc, lulen, &ulen, UTF8_ALLOW_ANY);
                 lc += ulen;
                 lulen -= ulen;
-                ruc = utf8_to_uv((U8*)rc, rulen, &ulen, 0);
+                ruc = utf8_to_uv((U8*)rc, rulen, &ulen, UTF8_ALLOW_ANY);
                 rc += ulen;
                 rulen -= ulen;
                 duc = luc & ruc;
@@ -983,10 +983,10 @@
             break;
         case OP_BIT_XOR:
             while (lulen && rulen) {
-                luc = utf8_to_uv((U8*)lc, lulen, &ulen, 0);
+                luc = utf8_to_uv((U8*)lc, lulen, &ulen, UTF8_ALLOW_ANY);
                 lc += ulen;
                 lulen -= ulen;
-                ruc = utf8_to_uv((U8*)rc, rulen, &ulen, 0);
+                ruc = utf8_to_uv((U8*)rc, rulen, &ulen, UTF8_ALLOW_ANY);
                 rc += ulen;
                 rulen -= ulen;
                 duc = luc ^ ruc;
@@ -995,10 +995,10 @@
             goto mop_up_utf;
         case OP_BIT_OR:
             while (lulen && rulen) {
-                luc = utf8_to_uv((U8*)lc, lulen, &ulen, 0);
+                luc = utf8_to_uv((U8*)lc, lulen, &ulen, UTF8_ALLOW_ANY);
                 lc += ulen;
                 lulen -= ulen;
-                ruc = utf8_to_uv((U8*)rc, rulen, &ulen, 0);
+                ruc = utf8_to_uv((U8*)rc, rulen, &ulen, UTF8_ALLOW_ANY);
                 rc += ulen;
                 rulen -= ulen;
                 duc = luc | ruc;
--- pp.c#7483 Sun Oct 29 12:23:32 2000
+++ pp.c Tue Oct 31 00:32:38 2000
@@ -2240,7 +2240,7 @@
     STRLEN retlen;
 
     if ((*tmps & 0x80) && DO_UTF8(tmpsv))
-        value = utf8_to_uv(tmps, len, &retlen, 0);
+        value = utf8_to_uv(tmps, len, &retlen, UTF8_ALLOW_ANY);
     else
         value = (UV)(*tmps & 255);
     XPUSHu(value);
```

End of Patch.

p5pRT commented 23 years ago

From @jhi

> But I quickly discovered that ~$y is pretty useless with utf8 since just about anything you try to do gets you a Malformed utf warning. I think this is a reasonable fix for that (though you might or might not want the pp.c change--it's pp_ord):

Applied, thanks, sand the pp_ord() change. Further patches welcome, though please consider carefully the UTF8_ALLOW flags. If we allow anything everywhere, the UTF-8 decoding checking might as well be removed.

p5pRT commented 23 years ago

From @jhi

>> How about this: if the $string is in utf8:
>
> Sorry for the very late response to this. But doesn't that "if the $string is in utf8" violate our cardinal rule that the encoding shouldn't affect the results (e.g. ~($x=v1) should be the same as ~(chop($x=v1.300),$x))?

Yes, looks like we need to define the semantics of this more tightly.

Consider

    $a0 = "\0";
    $b0 = substr("\0\x{100}", 0, 1);

    $a1 = ~$a0;
    $b1 = ~$b0;

Yes, I agree it would be nice to have $a1 eq $b1 since $a0 eq $b0. But that's not how it currently goes. $a1 is a pure "byte string", it has never been touched by "a wide character" -- but $b1 is a "character string" since its "parent" was. Bytewise the $a0 and $b0 are identical but $b1 carries the evil UTF8 flag. Ergo, with the current ~ implementation $a1 will be "\xFF" and $b1 will be "\x{ffff...}" (machine-dependent width).

This goes for all the bytes \x00..\x7F since they cannot be told apart from maybe being "in UTF-8".

And remember backward compatibility: we shouldn't break old code that expects the bit string arithmetics to work on bytes, not characters.

I see two ways out of this:

(1) The UV-wide ~ is used only if

    (1a) SvUTF8 is true
    (1b) the whole character string needs to be scanned first
         and if a single character > 0xff is met

    Otherwise, that is, if no SvUTF8, or all the characters in the
    string are <= 0xff, we use byte-wide ~.

(2) We give up completely trying to define *any* bit arithmetics for
    character strings and say that ~ | & ^ always work on bytes.
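The test in option (1) can be sketched as a single scan. This is only an illustration of the rule as stated: the string is modelled as an array of already-decoded characters plus a flag standing in for SvUTF8, and `use_wide_complement` is a hypothetical name.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Decide whether the UV-wide complement applies under option (1): the
 * string must be flagged as UTF-8 and contain a character above 0xff.
 * Every other case falls back to the byte-wide complement. */
static bool use_wide_complement(const uint64_t *chars, size_t n, bool utf8_flag) {
    if (!utf8_flag)
        return false;
    for (size_t i = 0; i < n; i++)
        if (chars[i] > 0xff)
            return true;
    return false;
}
```

Under this rule both $a0 and $b0 from the example take the byte-wide path, so $a1 eq $b1 would hold. The price is a full pass over every flagged string before any bits are flipped.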

p5pRT commented 23 years ago

From @ysth

Jarkko Hietaniemi \jhi@&#8203;iki\.fi wrote​:

How about this​: if the $string is in utf8​:

Sorry for the very late response to this. But doesn't that "if the $string is in utf8" violate our cardinal rule that the encoding shouldn't affect the results (e.g. ~($x=v1) should be the same as ~(chop($x=v1.300)\,$x))?

Yes\, looks like we need to define the semantics of this more tightly.

Consider

$a0 = "\\0";
$b0 = substr\("\\0\\x\{100\}"\, 0\, 1\);

\ $b0=substr(v256.0\,1) :) \

$a1 = ~$a0;
$b1 = ~$b0;

Yes\, I agree it would be nice to have $a1 eq $b1 since $a0 eq $b0. But that's not how it currently goes. $a1 is a pure "byte string"\, it has never been touched by "a wide character" -- but $b1 is a "character string" since its "parent" was. Bytewise the $a0 and $b0 are identical but $b1 carries the evil UTF8 flag. Ergo\, with the current ~ implementation $a1 will be "\xFF" and $b1 will be "\x{ffff...}" (machine-dependent width).

This goes for all the bytes \x00..\x7F since they cannot be told apart from maybe being "in UTF-8".

And remember backward compatibility​: we shouldn't break old code that expects the bit string arithmetics to work on bytes\, not characters.

I see two ways out of this​:

(1) The UV-wide ~ is used only if

\(1a\) SvUTF8 is true 
\(1b\) the whole character string needs to be scanned first
     and if a single character > 0xff is met

Otherwise\, that is\, if no SvUTF8\, or all the characters in the
string are \<= 0xff\, we use byte\-wide ~\.

(2) We give up completely trying to define *any* bit arithmetics for character strings and say that ~ | & ^ always work on bytes.

(2) makes a little more sense to me than (1). (Assuming you mean truncating each character to 8 bits\, not just ignoring the UTF8 flag).

But perhaps you are being too concerned about backward compatibility. What do you imagine they are going to do with the result of ~$x that might cause a problem?

How about​:

(3) Unless IN_BYTE\, do ~ character by character. Note that this will almost certainly produce a string that will only work with the string bitwise operators\, since UTF8_ALLOW_* will be needed [1].

  If the expense of utf8_to_uv function calls is a concern:
  Rename Perl_utf8_to_uv to Perl_utf8_to_uv_hibit.
  Make a macro something like (untested):

    #define Perl_utf8_to_uv(s,curlen,retlen,flags) \
        ((UV)*s < 0x80 ? ((retlen ? (*(STRLEN*)retlen = 1) : 0), (UV)*s) \
        : utf8_to_uv_hibit(s,curlen,retlen,flags))

Note that at least one place (pp_ord) is already doing a \<0x80 check before calling utf8_to_uv. This kind of thing really should be encapsulated with the utf8 decoding code\, not scattered hither and yon.
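
The same encapsulation could also be sketched as an inline function instead of a macro, which avoids evaluating `s` more than once. The block below is a standalone illustration, not Perl core code: `decode_char` and `decode_multibyte` are hypothetical names, and the slow path handles only well-formed 2- to 4-byte sequences without any validation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical slow path: decode a 2..4 byte sequence, no validation. */
static uint32_t decode_multibyte(const unsigned char *s, size_t *retlen)
{
    size_t n;
    uint32_t cp;

    if (s[0] < 0xE0)      { n = 2; cp = s[0] & 0x1F; }
    else if (s[0] < 0xF0) { n = 3; cp = s[0] & 0x0F; }
    else                  { n = 4; cp = s[0] & 0x07; }

    /* Each continuation byte contributes its low six bits. */
    for (size_t i = 1; i < n; i++)
        cp = (cp << 6) | (s[i] & 0x3F);

    if (retlen)
        *retlen = n;
    return cp;
}

/* Single entry point: the < 0x80 test lives here and nowhere else. */
static inline uint32_t decode_char(const unsigned char *s, size_t *retlen)
{
    assert(s != NULL);
    if (*s < 0x80) {                /* ASCII fast path, no function call */
        if (retlen)
            *retlen = 1;
        return *s;
    }
    return decode_multibyte(s, retlen);
}
```

Callers such as pp_ord would then call the single entry point and drop their own \<0x80 checks.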

Let me know if you'd like to at least see a patch for this.

BTW\, I noticed there is a utf8_to_uv_simple that doesn't seem to be used (at least in the core)​:

=for apidoc Am|U8* s|utf8_to_uv_simple|STRLEN *retlen

Returns the character value of the first character in the string C\<s\> which is assumed to be in UTF8 encoding; C\<retlen\> will be set to the length\, in bytes\, of that character\, and the pointer C\<s\> will be advanced to the end of the character.

From this description\, I'd expect it to take a U8**\, not a U8*\,STRLEN*. It certainly would be more useful that way.

[1] Which of the UTF8_ALLOW_* flags are needed to allow characters 0..2^64-1? All of them? Or do some really indicate malformedness even with perl-extended-utf8? If the latter\, should we have a macro UTF8_ALLOW_ANY_UV? And should this differ with uvsize=4 or 8? I'm inclined to say so.

p5pRT commented 23 years ago

From @jhi

(2) We give up completely trying to define *any* bit arithmetics for character strings and say that ~ | & ^ always work on bytes.

(2) makes a little more sense to me than (1). (Assuming you mean truncating each character to 8 bits\, not just ignoring the UTF8 flag).

No\, I didn't mean truncating to 8 bits\, I meant ignoring the UTF8ness. Bytes. Bytes. Bytes.
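
The two readings diverge as soon as a character above 0xFF is involved. A small C sketch of both follows, with hypothetical helper names. The sample input is the UTF-8 encoding of v300 (U+012C, bytes 0xC4 0xAC) from the original report.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* "Bytes. Bytes. Bytes.": complement every byte of the internal buffer,
 * ignoring whether the buffer happens to be UTF-8. */
static void complement_bytes(const unsigned char *in, size_t len, unsigned char *out)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < len; i++)
        out[i] = (unsigned char)~in[i];
}

/* Truncation reading: reduce one code point to its low 8 bits,
 * then complement that single byte. */
static unsigned char complement_truncated(uint32_t code_point)
{
    return (unsigned char)~(code_point & 0xFF);
}
```

For v300 the byte-wise version produces the two bytes `;S` seen in the Devel::Peek dump, while the truncating version produces the single byte 0xD3.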

But perhaps you are being too concerned about backward compatibility. What do you imagine they are going to do with the result of ~$x that might cause a problem?

ord() it\, for example. 255 is mighty different from 4294967295 or 18446744073709551615.
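
These three numbers are the complement of zero at 8, 32 and 64 bits. A short C check, with function names chosen only for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Complement at three fixed widths; the names are illustrative only. */
static uint8_t  not8(uint8_t x)   { return (uint8_t)~x; }
static uint32_t not32(uint32_t x) { return ~x; }
static uint64_t not64(uint64_t x) { return ~x; }

/* The values ord() could report for ~"\0", depending on the width used. */
static void check_widths(void)
{
    assert(not8(0)  == UINT8_MAX);   /* 255 */
    assert(not32(0) == UINT32_MAX);  /* 4294967295 */
    assert(not64(0) == UINT64_MAX);  /* 18446744073709551615 */
}
```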

(3) Unless IN_BYTE\, do ~ character by character. Note that this will almost certainly produce a string that will only work with the string bitwise operators\, since UTF8_ALLOW_* will be needed [1].

Patches welcome.

BTW\, I noticed there is a utf8_to_uv_simple that doesn't seem to be used (at least in the core)​:

Uh? It is used in utf8_to_bytes()\, which is used in sv_utf8_downgrade()\, which is used in e.g. do_vecget().

=for apidoc Am|U8* s|utf8_to_uv_simple|STRLEN *retlen

Returns the character value of the first character in the string C\<s\> which is assumed to be in UTF8 encoding; C\<retlen\> will be set to the length\, in bytes\, of that character\, and the pointer C\<s\> will be advanced to the end of the character.

From this description\, I'd expect it to take a U8**\, not a U8*\,STRLEN*. It certainly would be more useful that way.

[1] Which of the UTF8_ALLOW_* flags are needed to allow characters 0..2^64-1? All of them? Or do some really indicate malformedness even with perl-extended-utf8? If the latter\, should we have a macro UTF8_ALLOW_ANY_UV? And should this differ with uvsize=4 or 8? I'm inclined to say so.

Have to think about this... off-hand\, I do not think all of them\, however\, since at least overlong sequences should still be a no-no\, they serve no useful purpose. See Markus Kuhn's UTF-8 pages​:

  http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
  http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
  ftp://sunsite.doc.ic.ac.uk/packages/rfc/rfc2279.txt

The middle one is more or less the UTF-8 decoding law I try to follow in utf8_to_uv().
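
To illustrate why overlong sequences can be rejected cheaply, here is a minimal C sketch that recognises them from the first two bytes. `is_overlong` is a hypothetical helper, covers only 2- to 4-byte forms, and assumes the byte after the lead byte is a valid continuation byte. It is not how utf8_to_uv() is written.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if the sequence at s uses more bytes than
 * the encoded value needs. */
static bool is_overlong(const unsigned char *s, size_t len)
{
    assert(s != NULL);

    if (len < 2)
        return false;                   /* a single byte cannot be overlong */
    if (s[0] == 0xC0 || s[0] == 0xC1)
        return true;                    /* 2-byte form of a value < 0x80 */
    if (s[0] == 0xE0 && s[1] < 0xA0)
        return true;                    /* 3-byte form of a value < 0x800 */
    if (s[0] == 0xF0 && s[1] < 0x90)
        return true;                    /* 4-byte form of a value < 0x10000 */
    return false;
}
```

For example, the two bytes 0xC0 0x80 are an overlong encoding of NUL, while 0xC4 0xAC (U+012C) is the shortest form and passes.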

p5pRT commented 23 years ago

From @ysth

On Wed\, 1 Nov 2000\, Jarkko Hietaniemi wrote​:

(2) We give up completely trying to define *any* bit arithmetics for character strings and say that ~ | & ^ always work on bytes.

(2) makes a little more sense to me than (1). (Assuming you mean truncating each character to 8 bits\, not just ignoring the UTF8 flag).

No\, I didn't mean truncating to 8 bits\, I meant ignoring the UTF8ness. Bytes. Bytes. Bytes.

Which means unexpected behavior when there is a UTF8 upgrade. I'll try to come up with a summary of the different approaches suggested so far and their problems and compatibility issues.

But perhaps you are being too concerned about backward compatibility. What do you imagine they are going to do with the result of ~$x that might cause a problem?

ord() it\, for example. 255 is mighty different from 4294967295 or 18446744073709551615.

ord()ing it will get a Malformed utf warning. This is probably a Good Thing(TM).

(3) Unless IN_BYTE\, do ~ character by character. Note that this will almost certainly produce a string that will only work with the string bitwise operators\, since UTF8_ALLOW_* will be needed [1].

Patches welcome.

BTW\, I noticed there is a utf8_to_uv_simple that doesn't seem to be used (at least in the core)​:

Uh? It is used in utf8_to_bytes()\, which is used in sv_utf8_downgrade()\, which is used in e.g. do_vecget().

Oops\, I missed that.

=for apidoc Am|U8* s|utf8_to_uv_simple|STRLEN *retlen

Returns the character value of the first character in the string C\<s\> which is assumed to be in UTF8 encoding; C\<retlen\> will be set to the length\, in bytes\, of that character\, and the pointer C\<s\> will be advanced to the end of the character.

From this description\, I'd expect it to take a U8**\, not a U8*\,STRLEN*. It certainly would be more useful that way.

[1] Which of the UTF8_ALLOW_* flags are needed to allow characters 0..2^64-1? All of them? Or do some really indicate malformedness even with perl-extended-utf8? If the latter\, should we have a macro UTF8_ALLOW_ANY_UV? And should this differ with uvsize=4 or 8? I'm inclined to say so.

Have to think about this... off-hand\, I do not think all of them\, however\, since at least overlong sequences should still be a no-no\, they serve no useful purpose. See Markus Kuhn's UTF-8 pages​:

http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
ftp://sunsite.doc.ic.ac.uk/packages/rfc/rfc2279.txt

The middle one is more or less the UTF-8 decoding law I try to follow in utf8_to_uv().

Shouldn't we try to give the warning where the value is created too? (e.g. on the chr\, not just the ord\, of​: ord chr 0xffffffff) Just something to think about...

p5pRT commented 23 years ago

From @jhi

Note that at least one place (pp_ord) is already doing a \<0x80 check before calling utf8_to_uv. This kind of thing really should be encapsulated with the utf8 decoding code\, not scattered hither and yon.

Yes\, I heartily agree. There is testing involving 0x80 and 0xc0 all over the code; all of that needs to be removed and rerouted to use the official utf8 routines\, or macros if speed is the issue.

Let me know if you'd like to at least see a patch for this.

Yes.

p5pRT commented 23 years ago

From [Unknown Contact. See original ticket]

On 1 Nov 2000\, at 10​:03\, Yitzchak Scott-Thoennes wrote​:

Jarkko Hietaniemi \<jhi@iki.fi\> wrote:

(2) We give up completely trying to define *any* bit arithmetics for character strings and say that ~ | & ^ always work on bytes.

(2) makes a little more sense to me than (1). (Assuming you mean truncating each character to 8 bits\, not just ignoring the UTF8 flag).

I'd tend to go with this\, too (and Yitzchak's specification that characters should be truncated rather than operating on the UTF-8 representation makes a lot of sense\, too).

Cheers\, Philip

p5pRT commented 23 years ago

From The RT System itself

I *think* we have reached the only sensible compromise in this.