regex result vars no longer marked as utf8

p5pRT commented 22 years ago

Migrated from rt.perl.org#7943 (status was 'resolved')

Searchable as RT7943$

p5pRT commented 22 years ago

From root@schmorp.de

This was reported around 5.7.2, I think (though I am not very sure), and had been fixed (AFAICR I sent a patch), but current devel still (or again) has this or a very similar bug.

use Encode;

my $utf8 = "\x{1234}";

$utf8 =~ /(.)/;

$str = $1; Encode::is_utf8 $str or die; func($1);

sub func { Encode::is_utf8 $_[0] or die; }

It dies inside func ($1 isn't utf8, while $str is).

Perl Info

```
Flags:
    category=core
    severity=high

Site configuration information for perl v5.7.2:

Configured by root at Sat Nov 24 03:47:08 CET 2001.

Summary of my perl5 (revision 5.0 version 7 subversion 2 patch 13229) configuration:
  Platform:
    osname=linux, osvers=2.4, archname=i686-linux-stdio
    uname='linux cerebro 2.4.8-ac9 #7 smp thu aug 30 00:15:46 cest 2001 i686 unknown '
    config_args=''
    hint=previous, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
    useperlio=undef d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=y, bincompat5005=undef
  Compiler:
    cc='gcc-2.95.4', ccflags ='-fno-strict-aliasing -I/usr/local/include -I/opt/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -g -Os -funroll-loops -march=pentium -mcpu=pentium',
    optimize='-g -Os -march=pentium -mcpu=pentium -funroll-loops',
    cppflags='-fno-strict-aliasing -I/usr/local/include -I/opt/include -fno-strict-aliasing -I/usr/local/include -I/opt/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64'
    ccversion='', gccversion='2.95.4 20010319 (prerelease)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='gcc-2.95.4', ldflags ='-L/usr/local/lib -L/opt/lib'
    libpth=/usr/local/lib /lib /usr/lib /opt/lib
    libs=-ldl -lm -lc -lcrypt
    perllibs=-ldl -lm -lc -lcrypt
    libc=/lib/libc-2.2.4.so, so=so, useshrplib=false, libperl=libperl.a
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
    cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib -L/opt/lib'

Locally applied patches:
    DEVEL13225

@INC for perl v5.7.2:
    /usr/app/lib/perl5
    /usr/app/lib/perl5
    /usr/app/lib/perl5
    /usr/app/lib/perl5
    .

Environment for perl v5.7.2:
    HOME=/root
    LANG (unset)
    LANGUAGE (unset)
    LC_CTYPE=de_DE
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/root/s2:/root/s:/opt/qt/bin:/bin:/usr/bin:/usr/app/bin:/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin:/usr/app/bin:/usr/app/sbin:/usr/X11/bin:/opt/jdk118/bin:/opt/bin:/opt/sbin:.:/root/cc/dejagnu/bin
    PERLDB_OPTS=ornaments=0
    PERL_BADLANG (unset)
    SHELL=/bin/bash
```

p5pRT commented 22 years ago

From @jhi

The problem seems to be in Encode::is_utf8 since the regex results *are* marked UTF8 as you can see from the modified version:

use Devel::Peek;

my $utf8 = "\x{1234}";

$utf8 =~ /(.)/;

$str = $1; printf "%x\n", ord($str), "\n"; Dump($str); func($1);

sub func { printf "%x\n", ord($_[0]), "\n"; Dump($_[0]); }

Just out of curiosity: *why* are you using Encode::is_utf8? I'm not saying the behaviour you are seeing is okay, but I'm just curious as to why you feel you need to use Encode::is_utf8.

-- $jhi++; # http://www.iki.fi/jhi/ # There is this special biologist word we use for 'stable'. # It is 'dead'. -- Jack Cohen

p5pRT commented 22 years ago

From @jhi

-- $jhi++; # http://www.iki.fi/jhi/ # There is this special biologist word we use for 'stable'. # It is 'dead'. -- Jack Cohen

p5pRT commented 22 years ago

From [Unknown Contact. See original ticket]

No, they aren't. They aren't in the much larger version of the example script. I do this:

while ($self->peek =~ /^\s*(?:msgstr\s+)?\"(.*)\"\s*$/) {

$self->{line} =~ /^\s*(?:msgstr\s+)?\"(.*)\"\s*$/; #d# DEVEL9021, redo regex on var

$str .= PApp::I18n::unquote $1;

If I do this instead:

$self->{line} =~ /^\s*(?:msgstr\s+)?\"(.*)\"\s*$/; #d# DEVEL9021, redo regex on var

$str = $1; Convert::Scalar::utf8_on $str;

$str .= PApp::I18n::unquote $1;

Then everything works.

One thing that unquote does is to call utf8_upgrade (from Convert::Scalar):

sub unquote($) {

local $_ = shift;

utf8_upgrade $_; #d# DEVEL7952

...

and this re-encodes $_ to utf8, creating garbage. The DEVELxxxx comments just describe various workarounds for various devel versions of perl. The upgrade should be a NOP under perl; in 7952 it isn't, since regexes only worked on utf-8 strings there. Conceptually, utf8_upgrade doesn't hurt there, so I left it in.

utf8_upgrade is defined as:

void
utf8_upgrade(scalar)
        SV *    scalar
        PROTOTYPE: $
        PPCODE:
        sv_utf8_upgrade(scalar);

And this works fine if scalar is marked as utf8 (it does nothing). So $_ certainly is _not_ marked as utf-8 here.
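As an editorial illustration of the reasoning above, here is a minimal sketch of why an upgrade that is a no-op on flagged strings turns into "garbage" on unflagged ones. It uses the core `utf8::upgrade` as a stand-in for the Convert::Scalar wrapper, on the assumption that the two behave alike, and `bytes::length` to look at the internal buffer size; the variable names are illustrative only.

```perl
use strict;
use warnings;
use bytes ();    # load bytes::length without enabling the bytes pragma

# Already flagged as UTF-8: upgrading again should change nothing.
my $flagged    = "\x{1234}";
my $len_before = bytes::length($flagged);
utf8::upgrade($flagged);
my $len_after  = bytes::length($flagged);

# Unflagged bytes that happen to be the UTF-8 encoding of U+1234.
my $raw        = "\xe1\x88\xb4";
my $raw_before = bytes::length($raw);
utf8::upgrade($raw);    # each byte is treated as a Latin-1 character
my $raw_after  = bytes::length($raw);

printf "flagged: %d -> %d bytes\n", $len_before, $len_after;
printf "raw:     %d -> %d bytes, %d characters\n",
    $raw_before, $raw_after, length $raw;
```

If the capture loses its flag, the second case is what happens: three bytes meant as one character become three separate characters stored in six bytes.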

Just out of curiosity: *why* are you using Encode::is_utf8? I'm not

I usually use Convert::Scalar (older ;), but I thought my bug report would be easier to reproduce if I used Encode instead, a module you wouldn't need to install from CPAN.

saying the behaviour you are seeing is okay, but I'm just curious as to why you feel you need to use Encode::is_utf8.

I know no other way to check for utf-8-ness, please enlighten me ;)

-- Marc Lehmann, pcg@goof.com, XX11-RIPE. The choice of a GNU generation

p5pRT commented 22 years ago

From @jhi

I'm sorry but this is not true. They are marked UTF-8. See the Devel::Peek output in the example script I sent to you. The UTF8 flag is in there.

Whether all code that handles them has been correctly made UTF-8-aware is a different story. The problem you just reported was caused by is_utf8() not being prepared for $1 (or, in general, for scalars having GMAGIC): the scalar did not have POK set, so it wasn't tested for the UTF8 flag.
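For readers who want to re-check this on their own perl, here is a minimal sketch of the original test case using the core `utf8::is_utf8` instead of `Encode::is_utf8` (an assumption made for illustration; the helper name `flag_of` is hypothetical). On a perl where the magic is handled, I would expect both values to be 1; the ticket describes a build where the check on the aliased `$1` came back false.

```perl
use strict;
use warnings;

my $utf8 = "\x{1234}";
$utf8 =~ /(.)/ or die "no match";

my $copy = $1;    # plain scalar copy of the capture

# $_[0] is an alias to the magical $1 here, as in the original report.
sub flag_of { return utf8::is_utf8($_[0]) ? 1 : 0 }

my $flag_copy    = flag_of($copy);
my $flag_capture = flag_of($1);

print "copy=$flag_copy capture=$flag_capture\n";
```

Passing `$1` straight into the sub matters: copying it first triggers the get-magic and hides the difference.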

And this works fine if scalar is marked as utf8 (it does nothing). So $_ certainly is _not_ marked as utf-8 here.

Try Devel::Peek.

Just out of curiosity: *why* are you using Encode::is_utf8? I'm not

I usually use Convert::Scalar (older ;), but I thought my bug report would be easier to reproduce if I used Encode instead, a module you wouldn't need to install from CPAN.

Unless Convert::Scalar is up to date with the core as to what UTF-8-ness means, you may be in for nasty surprises. I don't know how up to date it is, since I have no experience with Convert::Scalar.

saying the behaviour you are seeing is okay, but I'm just curious as to why you feel you need to use Encode::is_utf8.

I know no other way to check for utf-8-ness, please enlighten me ;)

From where I stand, caring about UTF-8-ness is as pointless as if with numbers one would be worrying all the time about whether the number is evenly divisible by 73. It should not matter. The arithmetics still work the same. You should not have to care. The same goes for UTF8. All that magic should be happening automatically. If it doesn't, if there are bugs in the Encode routines, fine, let's fix them, but you should not be calling the utf8 routines directly. I'm sorry but I do not know how to explain this any more explicitly.
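A small sketch of the "it should not matter" position, assuming a perl with the core `utf8::upgrade` and `utf8::is_utf8` functions: two strings with the same characters compare equal at the Perl level even though only one of them carries the internal UTF-8 flag. The variable names are illustrative.

```perl
use strict;
use warnings;

my $native   = "caf\xe9";    # stored as one byte per character
my $upgraded = $native;
utf8::upgrade($upgraded);    # same characters, UTF-8 internally

my $same_string   = $native eq $upgraded                   ? 1 : 0;
my $same_length   = length($native) == length($upgraded)   ? 1 : 0;
my $native_flag   = utf8::is_utf8($native)   ? 1 : 0;
my $upgraded_flag = utf8::is_utf8($upgraded) ? 1 : 0;

print "equal=$same_string same_length=$same_length ",
      "flags=$native_flag/$upgraded_flag\n";
```

The flag only becomes visible once the string leaves pure Perl code, which is the situation the rest of the thread argues about.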

-- $jhi++; # http://www.iki.fi/jhi/ # There is this special biologist word we use for 'stable'. # It is 'dead'. -- Jack Cohen

p5pRT commented 22 years ago

From [Unknown Contact. See original ticket]

I said that this is only the case in the (now known to be flawed) testcase I sent in. It caught a bug, nevertheless.

is_utf8() not being prepared for $1 (or, in general, for scalars having GMAGIC): the scalar did not have POK set, so it wasn't tested for the UTF8 flag.

Yupp.

And this works fine if scalar is marked as utf8 (it does nothing). So $_ certainly is _not_ marked as utf-8 here.

Try Devel::Peek.

Shows the bug, but as soon as I make the testcase smaller, it starts working.

I know no other way to check for utf-8-ness, please enlighten me ;)

From where I stand, caring about UTF-8-ness is as pointless as if with numbers one would be worrying all the time about whether the number is

You seem to live in an ideal world. I have to use perl, though, which, with regards to utf-8, is very very far from ideal. Also, speed matters in the real world, and getting my program to not always convert between utf-8 and latin1 internally shaved off about 5%.

It should not matter. The arithmetics still work the same. You should not have to care. The same goes for UTF8.

You know that I do not agree with you here ;) I simply cannot see how this can work. I need to know the encoding of data in myriads of places (DBI, File-I/O, many hundreds of modules on CPAN...).

there are bugs in the Encode routines, fine, let's fix them, but you should not be calling the utf8 routines directly. I'm sorry but I do not know how to explain this any more explicitly.

If there were no reason to ever use these routines, why provide them at all? Something must be wrong with that picture.

Maybe the need for these functions will be small in a decade or so, but I cannot see how to work without these in the real world.

-- Marc Lehmann, pcg@goof.com, XX11-RIPE. The choice of a GNU generation

p5pRT commented 22 years ago

From [Unknown Contact. See original ticket]

Moin,

Jarkko Hietaniemi <jhi@iki.fi> wrote:

I know no other way to check for utf-8-ness, please enlighten me ;)

From where I stand, caring about UTF-8-ness is as pointless as if with numbers one would be worrying all the time about whether the number is evenly divisible by 73. It should not matter. The arithmetics still work the same. You should not have to care. The same goes for UTF8. All that magic should be happening automatically. If it doesn't, if there are bugs in the Encode routines, fine, let's fix them, but you should not be calling the utf8 routines directly. I'm sorry but I do not know how to explain this any more explicitly.

Since that is the second case of someone using Encode::is_utf8() (apparently for the wrong thing), could it be renamed to _is_utf8() and marked as "Don't use it - for internal usage only"?

Cheers,

-- perl -MDev::Bollocks -e'print Dev::Bollocks->rand(),"\n"' carefully pursue wireless deliverables

http://bloodgate.com/perl My current Perl projects PGP key available on http://bloodgate.com/tels.asc or via email


p5pRT commented 22 years ago

From @jhi

Yes. Or, at least, that's where I'd like to take Perl. I'd like people not to have to play with the UTF-8 encoding directly.

regards to utf-8, is very very far from ideal. Also, speed matters in the real world, and getting my program to not always convert between utf-8 and latin1 internally shaved off about 5%.

It should not matter. The arithmetics still work the same. You should not have to care. The same goes for UTF8.

You know that I do not agree with you here ;) I simply cannot see how this can work. I need to know the encoding of data in myriads of places (DBI, File-I/O, many hundreds of modules on CPAN...).

I still don't understand why you are using the Encode interfaces. See below.

there are bugs in the Encode routines, fine, let's fix them, but you should not be calling the utf8 routines directly. I'm sorry but I do not know how to explain this any more explicitly.

If there were no reason to ever use these routines, why provide them at all? Something must be wrong with that picture.

They are available as the last resort. Unless your applications are explicitly doing character set and encoding conversions, you should care little about whether something is UTF-8 or not, or see below.

Maybe the need for these functions will be small in a decade or so, but I cannot see how to work without these in the real world.

You still haven't explained why you need them. Give us an example application that needs to take a random string and either flip its UTF-8 flag on, or test for it. Because from Perl's viewpoint the question is silly: data gets its UTF-8 flag turned on when it needs it, and the flag is on when it needs to be on. Data needs to be UTF-8 when it's either explicitly created with high code points (>255) or when read in from a filehandle marked to do encoding conversions.
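The two cases named here can be sketched as follows, assuming a perl built with PerlIO so that an in-memory filehandle accepts an `:encoding(UTF-8)` layer. The in-memory file and the variable names are illustrative, not taken from the ticket.

```perl
use strict;
use warnings;

my $ascii = "abc";         # nothing above 255: no flag needed
my $high  = "\x{1234}";    # code point above 255: flag is set

# Bytes on "disk": the UTF-8 encoding of U+1234 followed by a newline.
my $bytes = "\xe1\x88\xb4\n";
open my $fh, '<:encoding(UTF-8)', \$bytes or die "open failed: $!";
my $decoded = <$fh>;
close $fh;
chomp $decoded;

my $ascii_flag   = utf8::is_utf8($ascii)   ? 1 : 0;
my $high_flag    = utf8::is_utf8($high)    ? 1 : 0;
my $decoded_flag = utf8::is_utf8($decoded) ? 1 : 0;

print "ascii=$ascii_flag high=$high_flag decoded=$decoded_flag\n";
print "decoded matches literal\n" if $decoded eq $high;
```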

(...thinking really hard why would anyone...)

Okay, came up with one scenario: if you have data that is "mixed", that is, some bytes of it are valid UTF-8, other bytes aren't (or you don't care whether they are), then you would need the ability to call the Encode::*utf8* explicitly. Something like this:

file.txt:

foobar = "...";

goobar = "...";

where the ... is presumably valid UTF-8, and you read the file in, extract the byte sequence ... (using regular expressions, for example), and then you want to flip the UTF-8 bit on. Is this what you mean by your "real world"? If so, and the "..." some day happens to be INvalid UTF-8, don't complain to me as Perl goes into convulsions :-)
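One hedged way to handle this scenario without flipping the flag by hand is to decode the extracted bytes and let the decoder reject invalid input. The sketch below uses `Encode::decode` with `FB_CROAK`; the helper `extract_text` and the sample lines are hypothetical, not something proposed in the ticket.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Pull the quoted byte sequence out of a line and decode it strictly.
# Returns undef when there is no match or the bytes are not valid UTF-8.
sub extract_text {
    my ($line) = @_;
    my ($raw) = $line =~ /=\s*"(.*)"/ or return;
    my $text = eval {
        decode('UTF-8', $raw, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $text;
}

my $good = extract_text(qq{foobar = "\xe1\x88\xb4";});
my $bad  = extract_text(qq{goobar = "\xff\xfe";});

printf "good: U+%04X\n", ord $good if defined $good;
print  "bad:  rejected\n"          if !defined $bad;
```

Decoding costs a validation pass that a bare flag flip avoids, which is the speed trade-off raised elsewhere in the thread.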

I feel as if, after I tried to build a good bridge over the river, people are asking me for spare planks and nails because they want to build a pontoon bridge from the leftovers underneath the actual bridge.

-- $jhi++; # http://www.iki.fi/jhi/ # There is this special biologist word we use for 'stable'. # It is 'dead'. -- Jack Cohen

p5pRT commented 22 years ago

From [Unknown Contact. See original ticket]

Maybe the need for these functions will be small in a decade or so, but I cannot see how to work without these in the real world.

You still haven't explained why you need them.

Ehrm... we seem to have a big miscommunication here. As I told you before, I only used Encode to show the bug (which I cannot put into a small testcase). I could just as well have used Devel::Peek...

application that needs to take a random string and either flip its UTF-8 flag on, or test for it.

That bug script? ;)

Because from Perl's viewpoint the question is silly: data gets its UTF-8 flag turned on when it needs

Again, this would be true if perl worked reliably. It is far from that (after all, it's a developer version and I was doomed to base my product on utf-8 when perl wasn't ready), and various versions (including the latest development snapshot) simply do not work as you seem to claim they do.

In short: utf-8 simply doesn't work in any perl version I have ever seen. Being able to work around these bugs (for example) is important to me, as is inter-operating with the many CPAN modules that know, well, nothing about utf-8. Think DBI, think Mysql, think Gtk. I don't use Encode to do that, though. It was _just_ an example.

it, and the flag is on when it needs to be on. Data needs to be UTF-8 when it's either explicitly created with high code points (>255) or when read in from a filehandle marked to do encoding

Well, input/output is another example. perl snapshots assume charsets on filehandles when perlio is turned on (a major compatibility problem) and don't do so when it is turned off, etc.

(...thinking really hard why would anyone...)

I simply don't believe it makes sense to tell people: "don't concern yourself with the real world, perl will fix it for you". I simply need to know whether my string is in utf-8 or not, in too many cases.

that is, some bytes of it are valid UTF-8, other bytes aren't (or you don't care whether they are), then you would need the ability to call the Encode::*utf8* explicitly.

Not true. I could always use normal perl functionality for that. It is, however, vastly more convenient ;)

I feel as if, after I tried to build a good bridge over the river, people are asking me for spare planks and nails because they want to build a pontoon bridge from the leftovers underneath the actual bridge.

Well, I like regexes that adapt to utf-8, and I like automatic conversion on the perl level. I just don't believe that trying to abstract characters from an actual internal encoding makes sense in the real world. If I have a JPEG file in $jpg and it gets converted to utf-8, I lose small. If I pass it to some XS function and this XS function has to make an actual copy of that string, I lose big.

I mean, as long as it works out nicely (because I can depend on when and what perl does internally), I don't care. But in that case I know what's going on anyway.

Of course, more important is that bug. Did you get my Devel::Peek output? Does it ring a bell somehow? ;) I am at a loss here... I'll still try to make a testcase, but...

-- Marc Lehmann, pcg@goof.com, XX11-RIPE. The choice of a GNU generation

p5pRT commented 22 years ago

From @jhi

-- $jhi++; # http://www.iki.fi/jhi/ # There is this special biologist word we use for 'stable'. # It is 'dead'. -- Jack Cohen

p5pRT commented 22 years ago

From [Unknown Contact. See original ticket]

I really don't understand your frustration - I just reported a bug (two, actually, as you found out).

If you cannot fix it due to lack of information from my side, that's ok.

I never said anything to the contrary.

I thought I've done a lot, but the current bleadperl is pretty much what you are going to get as Perl 5.8.0. If that's not enough, I'm sorry.

We have a big communication problem here, really. I never argued against perl-5.8.0, or your work. To the contrary, I was very impressed with the speed and quality with which my bug report was answered. What _is_ your problem with me?

-- Marc Lehmann, pcg@goof.com, XX11-RIPE. The choice of a GNU generation

p5pRT commented 22 years ago

From @jhi

I think we had better take this off-list; a mailing list is not a good place to sort out communication mismatches.

-- $jhi++; # http://www.iki.fi/jhi/ # There is this special biologist word we use for 'stable'. # It is 'dead'. -- Jack Cohen

p5pRT commented 22 years ago

From @nwc10

This is where I agree with Marc, I believe.

Until every XS module everywhere (both on CPAN and off it) knows to check the UTF8 flag on any SV it calls SvPV on, it's going to get a nasty surprise if perl internally converts the buffer to utf8.

5.8 cannot fix this. 5.10 cannot fix this. perl 6 will fix this, because all extensions that work with perl6 will be new.

I'm not saying the work on 5.8 is wasted. Or that 5.8's Unicode is broken. Or that extensions that manipulate binary data that are changed to be aware of SvUTF8() won't work. Just that un-vetted XS code that calls SvPV might trip up. Introduction of utf8 has subtly changed the assumptions of the XS interface.
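To make the concern concrete without writing XS, here is a pure-Perl sketch of what byte-oriented code would see. It assumes `bytes::length` is a fair proxy for the buffer length that an unvetted `SvPV` caller reads, and it uses the core `utf8::upgrade` and `utf8::downgrade` functions; the JPEG header bytes are just sample binary data.

```perl
use strict;
use warnings;
use bytes ();    # load bytes::length without enabling the bytes pragma

my $jpg      = "\xff\xd8\xff\xe0";    # first bytes of a JPEG file
my $upgraded = $jpg;
utf8::upgrade($upgraded);             # what an internal conversion does

my $still_equal  = $jpg eq $upgraded ? 1 : 0;    # Perl-level view
my $buf_plain    = bytes::length($jpg);          # buffer as bytes
my $buf_upgraded = bytes::length($upgraded);     # buffer after upgrade

# Restore the byte representation before handing the data to
# byte-oriented code; the second argument means "do not die on failure".
my $ok           = utf8::downgrade($upgraded, 1);
my $buf_restored = bytes::length($upgraded);

print "equal=$still_equal plain=$buf_plain upgraded=$buf_upgraded ",
      "restored=$buf_restored\n";
```

The strings stay equal in Perl while the underlying buffer doubles in size, which is the kind of surprise described above for code that reads the buffer without checking the flag.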

Nicholas Clark

p5pRT commented 22 years ago

From [Unknown Contact. See original ticket]

(Btw, I hope I settled this in private discussion: I also don't think this. I wouldn't complain if I thought this, I'd rather not use perl. I like perl. I use perl-5.7+ exclusively, because I like it so much).

-- Marc Lehmann, pcg@goof.com, XX11-RIPE. The choice of a GNU generation

p5pRT commented 21 years ago

From @jhi

I believe this problem was fixed by 5.8.0. I am marking the ticket as resolved.

p5pRT commented 21 years ago

@jhi - Status changed from 'open' to 'resolved'

Perl / perl5