
Certain regex patterns cause fatal errors with valid UTF-8 #10434

Closed p5pRT closed 13 years ago

p5pRT commented 14 years ago

Migrated from rt.perl.org#75680 (status was 'resolved')

Searchable as RT75680$

p5pRT commented 15 years ago

From @benkasminbullock

This is a bug report for perl from benkasminbullock@gmail.com, generated with the help of perlbug 1.36 running under perl 5.10.0.

The following script, run on Cygwin, prints the error message

Malformed UTF-8 character (fatal) at ./wwwjdicbug.pl line 75.

However, the UTF-8 character that is claimed to be malformed comes from an Encode::decode ('utf8',...) call and is then part of a regular expression match ($3), so this seems to be a bug in Perl.

######### wwwjdicbug.pl

#! perl
use warnings;
use strict;
use URI::Escape 'uri_escape_utf8';
use Encode qw/encode decode/;

package WWWJDIC;
use LWP::UserAgent;
use HTML::TreeBuilder;
use Encode qw/encode decode/;
use URI::Escape;
use utf8;

my %mirrors = (
    japan => 'http://www.aa.tufs.ac.jp/~jwb/cgi-bin/wwwjdic.cgi',
);
my %dictionaries = ();
my %codes = ();

sub new {
    my %options = @_;
    my $wwwjdic = {};
    if ($options{mirror}) {
        my $mirror = lc $options{mirror};
        if ($mirrors{$mirror}) {
            $wwwjdic->{site} = $mirrors{$mirror};
        } else {
            print STDERR __PACKAGE__,": unknown mirror '$options{mirror}': using Australian site\n";
        }
    } else {
        $wwwjdic->{site} = $mirrors{australia};
    }
    $wwwjdic->{user_agent} = LWP::UserAgent->new;
    $wwwjdic->{user_agent}->agent(__PACKAGE__);
    bless $wwwjdic;
    return $wwwjdic;
}

# Parse a page of results from WWWJDIC

sub parse_results {
    my ($wwwjdic, $contents) = @_;
    $contents = decode ('utf8', $contents);
    print $contents;
    my $tree = HTML::TreeBuilder->new();
    $tree->parse ($contents);

    my @labels = $tree->look_down ('_tag', 'label');
    my @inputs = $tree->look_down ('_tag', 'input');
    my %fors;
    my @valid;
    for my $input (@inputs) {
        if ($input->attr('name') && $input->attr('name') eq 'jukugosel'
            && $input->attr('id')) {
            $fors{$input->attr('id')} = $input;
        }
    }
    @valid = grep {$fors{$_->attr('for')}} @labels;
    for my $line (@valid) {
        my %results;
        $results{wwwjdic_id} = $line->attr('id');
        my $text = $line->as_text;
        print $text,"\n";
        $results{text} = $text;
        if ($text =~ /^(.*?)\s*【\s*(.*?)\s*】\s*(.*?)\s*$/) {
            $results{kanji} = $1;
            $results{reading} = $2;
            $results{meaning} = $3;
        } else {
            print "Unreadable line '$text'\n";
        }
        # Get the dictionary from the end of the string.
        if ($results{meaning} &&
            $results{meaning} =~ /(.*?)\s*([A-Z]{2}[12]?)\s*$/s) {
            $results{meaning} = $1;
            $results{dictionary} = $2;
        }
    }
}

sub lookup_url {
    my ($wwwjdic, $search_key, $search_type) = @_;
    my %type;
    for (@$search_type) {
        $type{max} = $_ if /^\d+$/;
    }
    my $url = $wwwjdic->{site};
    $url .= "?MMUJ";
    my $search_key_encoded = URI::Escape::uri_escape_utf8 ($search_key);
    $url .= $search_key_encoded;
    $url .= "_3";
    $url .= '_' . $type{max} if $type{max};
    return $url;
}

sub lookup {
    my ($wwwjdic, $search_key, $search_type) = @_;
    my $search_string = $wwwjdic->lookup_url ($search_key, $search_type);
    return if !$search_string;
    my $response = $wwwjdic->{user_agent}->get ($search_string);
    if ($response->is_success) {
        return $wwwjdic->parse_results ($response->content);
    }
}

sub lookup_kanji {
    my ($wwwjdic, $search_key, $search_type) = @_;
    my $search_string = $wwwjdic->lookup_url ($search_key, $search_type);

}

1;

package main;

my $wwwjdic = WWWJDIC::new(mirror => 'japan');
binmode STDOUT, ":encoding(cp932)";
my $arg = '窓口';
$arg =~ s/^\s+|\s+$//g;
print "Looking up $arg in WWWJDIC:\n";
$wwwjdic->lookup ($arg,[20]);

#### Output of ./wwwjdicbug.pl > bug.txt 2>&1

Looking up 窓口 in WWWJDIC:
\<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> \ \\<META http-equiv="Content-Type" content="text/html; charset=UTF-8">\WWWJDIC: Word Display\ \ \ \<link rel="icon" href="http://www.csse.monash.edu.au/~jwb/wwwjdic.ico" type="image/x-icon"> \\<BODY onLoad=sf() BGCOLOR="ivory" TEXT="black"> \ \

\\ \\\\\\ \\ \\\\\\\
\<img src="http​://www.aa.tufs.ac.jp/~jwb/jim_th.jpg" align="left">\<span style="font-size​: 9pt; font-family​: Helvetica\, sans-serif; color​: #FFFFFF">Jim Breen's \\
\\WWWJDIC\\\
\<a href="http​://www.aa.tufs.ac.jp/~jwb/cgi-bin/wwwjdic.cgi?1C_3_20">Word Search/Home\ \\<a href="http​://www.aa.tufs.ac.jp/~jwb/cgi-bin/wwwjdic.cgi?9T_3_20">Translate Words\ \\<a href="http​://www.aa.tufs.ac.jp/~jwb/cgi-bin/wwwjdic.cgi?1B_3_20">Kanji Lookup\ \\<a href="http​://www.aa.tufs.ac.jp/~jwb/cgi-bin/wwwjdic.cgi?1R_3_20">Multi-Radical Kanji\ \\<a href="http​://www.aa.tufs.ac.jp/~jwb/wwwjdicinf.html">User Guide\ \\<a href="http​://www.aa.tufs.ac.jp/~jwb/wwwjdicinf.html#dicfil_tag">Dictionaries\ \
\<a href="http​://www.aa.tufs.ac.jp/~jwb/cgi-bin/wwwjdic.cgi?10">Example Search\ \\<a href="http​://www.aa.tufs.ac.jp/~jwb/cgi-bin/wwwjdic.cgi?17_3_20">New Entry/Amendment\ \\<a href="http​://www.aa.tufs.ac.jp/~jwb/cgi-bin/wwwjdic.cgi?14">New Examples\ \\<a href="http​://www.aa.tufs.ac.jp/~jwb/cgi-bin/wwwjdic.cgi?19B">Customize\ \\<a href="http​://www.aa.tufs.ac.jp/~jwb/wwwjdicinf.html#code_tag">Dictionary Codes\ \\<a href="http​://www.aa.tufs.ac.jp/~jwb/wwwjdicinf.html#don_tag">Donations\ \
\<FORM NAME="inp" ID="inp" ACTION="http​://www.aa.tufs.ac.jp/~jwb/cgi-bin/wwwjdic.cgi?MF_3_20" METHOD="POST" > \  \\
Search Key: 窓口 Current Dictionary: Combined Jpn-Eng
\Options​:[G]oogle search\, [GI] Google images\, [S]anseido dictionary\, [A]LC dictionary (Eijiro)\, [Ex]ample sentences\, [V]erb conjugations\, [F] Feedback\, Japanese[W]ikipedia. \ \

\<INPUT TYPE="radio" NAME="jukugosel" VALUE="5562616" CHECKED ID="5562616">\\<a href="http​://www.google.com/search?q=%22%C1%EB%B8%FD%22&hl=en&lr=lang_ja&ie=euc-jp">[G]\\<a href="http​://images.google.com/images?q=%22%C1%EB%B8%FD%22&hl=en&ie=euc-jp">[GI]\\<a href="http​://dictionary.goo.ne.jp/search.php?MT=%C1%EB%B8%FD&kind=je&mode=1">[S]\\<a href="http​://eow.alc.co.jp/%C1%EB%B8%FD/EUC-JP/">[A]\ \
\<INPUT TYPE="radio" NAME="jukugosel" VALUE="5562620" ID="5562620">\\[G]\\<a href="http​://images.google.com/images?q=%22%C1%EB%B8%FD%A4%CE%22&hl=en&ie=euc-jp">[GI]\\<a href="http​://dictionary.goo.ne.jp/search.php?MT=%C1%EB%B8%FD%A4%CE&kind=je&mode=1">[S]\\<a href="http​://eow.alc.co.jp/%C1%EB%B8%FD%A4%CE/EUC-JP/">[A]\ \
\<INPUT TYPE="radio" NAME="jukugosel" VALUE="5562621" ID="5562621">\\[G]\\<a href="http​://images.google.com/images?q=%22%C1%EB%B8%FD%A4%CE%B7%B8%B0%F7%22&hl=en&ie=euc-jp">[GI]\\<a href="http​://dictionary.goo.ne.jp/search.php?MT=%C1%EB%B8%FD%A4%CE%B7%B8%B0%F7&kind=je&mode=1">[S]\\<a href="http​://eow.alc.co.jp/%C1%EB%B8%FD%A4%CE%B7%B8%B0%F7/EUC-JP/">[A]\ \
\<INPUT TYPE="radio" NAME="jukugosel" VALUE="5562622" ID="5562622">\\[G]\\<a href="http​://images.google.com/images?q=%22%C1%EB%B8%FD%B1%FC%22&hl=en&ie=euc-jp">[GI]\\<a href="http​://dictionary.goo.ne.jp/search.php?MT=%C1%EB%B8%FD%B1%FC&kind=je&mode=1">[S]\\<a href="http​://eow.alc.co.jp/%C1%EB%B8%FD%B1%FC/EUC-JP/">[A]\ \
\<INPUT TYPE="radio" NAME="jukugosel" VALUE="5562623" ID="5562623">\\[G]\\<a href="http​://images.google.com/images?q=%22%C1%EB%B8%FD%B5%AC%C0%A9%22&hl=en&ie=euc-jp">[GI]\\<a href="http​://dictionary.goo.ne.jp/search.php?MT=%C1%EB%B8%FD%B5%AC%C0%A9&kind=je&mode=1">[S]\\<a href="http​://eow.alc.co.jp/%C1%EB%B8%FD%B5%AC%C0%A9/EUC-JP/">[A]\\<a href="http​://ja.wikipedia.org/wiki/%E7%AA%93%E5%8F%A3%E8%A6%8F%E5%88%B6">[W]\

\
\<INPUT TYPE="radio" NAME="jukugosel" VALUE="5562624" ID="5562624">\\[G]\\<a href="http​://images.google.com/images?q=%22%C1%EB%B8%FD%B6%C8%CC%B3%22&hl=en&ie=euc-jp">[GI]\\<a href="http​://dictionary.goo.ne.jp/search.php?MT=%C1%EB%B8%FD%B6%C8%CC%B3&kind=je&mode=1">[S]\\<a href="http​://eow.alc.co.jp/%C1%EB%B8%FD%B6%C8%CC%B3/EUC-JP/">[A]\ \
\<INPUT TYPE="radio" NAME="jukugosel" VALUE="5562625" ID="5562625">\\[G]\\<a href="http​://images.google.com/images?q=%22%C1%EB%B8%FD%BF%A6%B0%F7%22&hl=en&ie=euc-jp">[GI]\\<a href="http​://dictionary.goo.ne.jp/search.php?MT=%C1%EB%B8%FD%BF%A6%B0%F7&kind=je&mode=1">[S]\\<a href="http​://eow.alc.co.jp/%C1%EB%B8%FD%BF%A6%B0%F7/EUC-JP/">[A]\ \
\<INPUT TYPE="radio" NAME="jukugosel" VALUE="5562626" ID="5562626">\\[G]\\<a href="http​://images.google.com/images?q=%22%C1%EB%B8%FD%BF%F4%22&hl=en&ie=euc-jp">[GI]\\<a href="http​://dictionary.goo.ne.jp/search.php?MT=%C1%EB%B8%FD%BF%F4&kind=je&mode=1">[S]\\<a href="http​://eow.alc.co.jp/%C1%EB%B8%FD%BF%F4/EUC-JP/">[A]\ \
\<INPUT TYPE="radio" NAME="jukugosel" VALUE="5562627" ID="5562627">\\[G]\\<a href="http​://images.google.com/images?q=%22%C1%EB%B8%FD%C1%B0%22&hl=en&ie=euc-jp">[GI]\\<a href="http​://dictionary.goo.ne.jp/search.php?MT=%C1%EB%B8%FD%C1%B0&kind=je&mode=1">[S]\\<a href="http​://eow.alc.co.jp/%C1%EB%B8%FD%C1%B0/EUC-JP/">[A]\ \
\<INPUT TYPE="radio" NAME="jukugosel" VALUE="5562628" ID="5562628">\\[G]\\<a href="http​://images.google.com/images?q=%22%C1%EB%B8%FD%C6%E2%22&hl=en&ie=euc-jp">[GI]\\<a href="http​://dictionary.goo.ne.jp/search.php?MT=%C1%EB%B8%FD%C6%E2&kind=je&mode=1">[S]\\<a href="http​://eow.alc.co.jp/%C1%EB%B8%FD%C6%E2/EUC-JP/">[A]\ \
\<INPUT TYPE="radio" NAME="jukugosel" VALUE="5562629" ID="5562629">\\<a href="http​://www.google.com/search?q=%22%C1%EB%B8%FD%C8%CE%C7%E4%22&hl=en&lr=lang_ja&ie=euc-jp">[G]\\<a href="http​://images.google.com/images?q=%22%C1%EB%B8%FD%C8%CE%C7%E4%22&hl=en&ie=euc-jp">[GI]\\<a href="http​://dictionary.goo.ne.jp/search.php?MT=%C1%EB%B8%FD%C8%CE%C7%E4&kind=je&mode=1">[S]\\<a href="http​://eow.alc.co.jp/%C1%EB%B8%FD%C8%CE%C7%E4/EUC-JP/">[A]\ \
\<INPUT TYPE="radio" NAME="jukugosel" VALUE="5562630" ID="5562630">\\[G]\\<a href="http​://images.google.com/images?q=%22%C1%EB%B8%FD%CC%F2%22&hl=en&ie=euc-jp">[GI]\\<a href="http​://dictionary.goo.ne.jp/search.php?MT=%C1%EB%B8%FD%CC%F2&kind=je&mode=1">[S]\\<a href="http​://eow.alc.co.jp/%C1%EB%B8%FD%CC%F2/EUC-JP/">[A]\ \
\ \


\ \ \\ for \ Dictionary​: \ \ \
\Key Type​:\ \<INPUT TYPE="radio" NAME="dsrchtype" VALUE="E" ID="ftrlabel2" CHECKED> \ \<INPUT TYPE="radio" NAME="dsrchtype" VALUE="J" ID="ftrlabel3"> \   \Options​:\ \<INPUT TYPE="checkbox" NAME="firstkanj" ID="ftrlabel4" VALUE="X"> \ \<INPUT TYPE="checkbox" NAME="engpri" ID="ftrlabel5" VALUE="X"> \ \<INPUT TYPE="checkbox" NAME="exactm" ID="ftrlabel6" VALUE="X"> \\
\
\ the kanji in a selected compound (check the compound you wish to examine)\
\ a new EDICT entry based on the selected entry\
\ this search (choose another Dictionary above)\
\ \
\
\ WWWJDIC site​: Japan [TUFS/RILCAA]     &#169; Copyright 2008\, \<a href="http​://www.edrdg.org/">Electronic Dictionary Research and Development Group\. (\<a href="http​://www.csse.monash.edu.au/~jwb/wwwjdicinf.html#copyr_tag">Details\)\\
窓口 【まどぐち】 (n) (1) ticket window; teller window; counter; (2) contact person; point of contact; (P)
窓口の 【まどぐち】 (?) UNKNOWN; RH
窓口の係員 【まどぐち】 (?) UNKNOWN; RH
窓口奥 【まどぐちおく】 (?) UNKNOWN; RH
窓口規制 【まどぐちきせい】 (?) UNKNOWN; RH
窓口業務 【まどぐちぎょうむ】 (?) UNKNOWN; RH
窓口職員 【まどぐちしょくいん】 (?) UNKNOWN; RH
窓口数 【まどぐちすう】 (?) UNKNOWN; RH
窓口前 【まどぐちまえ】 (?) UNKNOWN; RH
窓口内 【まどぐちない】 (?) UNKNOWN; RH
窓口販売 【まどぐちはMalformed UTF-8 character (fatal) at ./wwwjdicbug.pl line 75. んばい】 (n) (See 窓販) over the counter sales (often of financial packages)

############## End

---
Flags:
    category=core
    severity=low

Site configuration information for perl 5.10.0:

Configured by rurban at Mon Jun 30 16:03:19 GMT 2008.

Summary of my perl5 (revision 5 version 10 subversion 0 patch 34065) configuration:
  Platform:
    osname=cygwin, osvers=1.5.25(0.15642), archname=cygwin-thread-multi-64int
    uname='cygwin_nt-5.1 reini 1.5.25(0.15642) 2008-06-12 19:34 i686 cygwin '
    config_args='-de -Dmksymlinks -Dusethreads -Dmad=y -Dusedevel'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=define, use64bitall=undef, uselongdouble=undef
    usemymalloc=y, bincompat5005=undef
  Compiler:
    cc='gcc', ccflags ='-DPERL_USE_SAFE_PUTENV -U__STRICT_ANSI__ -fno-strict-aliasing -pipe -I/usr/local/include',
    optimize='-O3',
    cppflags='-DPERL_USE_SAFE_PUTENV -U__STRICT_ANSI__ -fno-strict-aliasing -pipe -I/usr/local/include'
    ccversion='', gccversion='3.4.4 (cygming special, gdc 0.12, using dmd 0.125)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long long', ivsize=8, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='g++', ldflags =' -Wl,--enable-auto-import -Wl,--export-all-symbols -Wl,--stack,8388608 -Wl,--enable-auto-image-base -L/usr/local/lib'
    libpth=/usr/local/lib /usr/lib /lib
    libs=-lgdbm -ldb -ldl -lcrypt -lgdbm_compat
    perllibs=-ldl -lcrypt
    libc=/usr/lib/libc.a, so=dll, useshrplib=true, libperl=libperl.a
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=dll, d_dlsymun=undef, ccdlflags=' '
    cccdlflags=' ', lddlflags=' --shared -Wl,--enable-auto-import -Wl,--export-all-symbols -Wl,--stack,8388608 -Wl,--enable-auto-image-base -L/usr/local/lib'

Locally applied patches:
    MAINT34065
    CYG11 no-bs
    CYG12 no archlib in otherlibdirs
    CYG14 Dynaloader
    CYG15 static-Win32CORE
    Bug#55162 File::Spec::case_tolerant performance


@INC for perl 5.10.0:
    /usr/lib/perl5/5.10/i686-cygwin
    /usr/lib/perl5/5.10
    /usr/lib/perl5/site_perl/5.10/i686-cygwin
    /usr/lib/perl5/site_perl/5.10
    /usr/lib/perl5/vendor_perl/5.10/i686-cygwin
    /usr/lib/perl5/vendor_perl/5.10
    /usr/lib/perl5/vendor_perl/5.10
    /usr/lib/perl5/site_perl/5.8
    /usr/lib/perl5/vendor_perl/5.8
    .


Environment for perl 5.10.0:
    HOME=/cygdrive/c/Documents and Settings/bkb
    LANG (unset)
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/cygdrive/c/Program Files/Perl/site/bin:/cygdrive/c/Program Files/Perl/bin:/cygdrive/c/WINDOWS/system32:/cygdrive/c/WINDOWS:/cygdrive/c/WINDOWS/System32/Wbem:/cygdrive/c/Program Files/MySQL/MySQL Server 5.0/bin:/cygdrive/c/Documents and Settings/bkb/My Documents/scripts/bin:
    PERL_BADLANG (unset)
    SHELL (unset)

p5pRT commented 15 years ago

From @benkasminbullock

This is a much-simplified, five-line version of the script that tripped the bug. I've also simplified the regex as far as I can while it still trips the bug: shortening it any further makes the script print "OK", but as it stands the "Malformed UTF-8 character (fatal)" message appears.

p5pRT commented 15 years ago

From @benkasminbullock

tinytest.pl

p5pRT commented 15 years ago

@benkasminbullock - Status changed from 'new' to 'open'

p5pRT commented 14 years ago

From hector@debian.org

Created by hector@debian.org

Executing this script (which works correctly on perl 5.8) gives an error:

#!/usr/bin/perl -w

use utf8; use encoding 'utf8';

my $p = 'á d\

'; #my $p = 'す d\

';

print "$p\n";

if ($p =~ m#(.*?)[-]?EFE\s*\

$#gsm) {   print "yes $1\n"; }else{   print "no\n"; }

hector@​baloo​:/tmp$ ./kk.pl á d\

Malformed UTF-8 character (fatal) at ./kk.pl line 11.

The script fails for any UTF-8 value of $p.

This regression has also been reproduced with a vanilla perl build on another server.

Perl Info ``` Flags: category=core severity=critical Site configuration information for perl 5.10.1: Configured by Debian Project at Sun Feb 7 16:19:05 UTC 2010. Summary of my perl5 (revision 5 version 10 subversion 1) configuration: Platform: osname=linux, osvers=2.6.26-2-amd64, archname=i486-linux-gnu-thread-multi uname='linux biber 2.6.26-2-amd64 #1 smp tue jan 12 22:12:20 utc 2010 i686 gnulinux ' config_args='-Dusethreads -Duselargefiles -Dccflags=-DDEBIAN -Dcccdlflags=-fPIC -Darchname=i486-linux-gnu -Dprefix=/usr -Dprivlib=/usr/share/perl/5.10 -Darchlib=/usr/l ib/perl/5.10 -Dvendorprefix=/usr -Dvendorlib=/usr/share/perl5 -Dvendorarch=/usr/lib/perl5 -Dsiteprefix=/usr/local -Dsitelib=/usr/local/share/perl/5.10.1 -Dsitearch=/usr/lo cal/lib/perl/5.10.1 -Dman1dir=/usr/share/man/man1 -Dman3dir=/usr/share/man/man3 -Dsiteman1dir=/usr/local/man/man1 -Dsiteman3dir=/usr/local/man/man3 -Dman1ext=1 -Dman3ext=3 perl -Dpager=/usr/bin/sensible-pager -Uafs -Ud_csh -Ud_ualarm -Uusesfio -Uusenm -DDEBUGGING=-g -Doptimize=-O2 -Duseshrplib -Dlibperl=libperl.so.5.10.1 -Dd_dosuid -des' hint=recommended, useposix=true, d_sigaction=define useithreads=define, usemultiplicity=define useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef use64bitint=undef, use64bitall=undef, uselongdouble=undef usemymalloc=n, bincompat5005=undef Compiler: cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64', optimize='-O2 -g', cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include' ccversion='', gccversion='4.4.3 20100108 (prerelease)', gccosandvers='' intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234 d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12 ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8 alignbytes=4, prototype=define Linker and Libraries: ld='cc', 
ldflags =' -fstack-protector -L/usr/local/lib' libpth=/usr/local/lib /lib /usr/lib /usr/lib64 libs=-lgdbm -lgdbm_compat -ldb -ldl -lm -lpthread -lc -lcrypt perllibs=-ldl -lm -lpthread -lc -lcrypt libc=/lib/libc-2.10.2.so, so=so, useshrplib=true, libperl=libperl.so.5.10.1 gnulibc_version='2.10.2' Dynamic Linking: dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E' cccdlflags='-fPIC', lddlflags='-shared -O2 -g -L/usr/local/lib -fstack-protector' Locally applied patches: DEBPKG:debian/arm_thread_stress_timeout - http://bugs.debian.org/501970 Raise the timeout of ext/threads/shared/t/stress.t to accommodate slower build hosts DEBPKG:debian/cpan_config_path - Set location of CPAN::Config to /etc/perl as /usr may not be writable. DEBPKG:debian/cpan_definstalldirs - Provide a sensible INSTALLDIRS default for modules installed from CPAN. DEBPKG:debian/db_file_ver - http://bugs.debian.org/340047 Remove overly restrictive DB_File version check. DEBPKG:debian/doc_info - Replace generic man(1) instructions with Debian-specific information. DEBPKG:debian/enc2xs_inc - http://bugs.debian.org/290336 Tweak enc2xs to follow symlinks and ignore missing @INC directories. DEBPKG:debian/errno_ver - http://bugs.debian.org/343351 Remove Errno version check due to upgrade problems with long-running processes. DEBPKG:debian/extutils_hacks - Various debian-specific ExtUtils changes DEBPKG:debian/fakeroot - Postpone LD_LIBRARY_PATH evaluation to the binary targets. DEBPKG:debian/instmodsh_doc - Debian policy doesn't install .packlist files for core or vendor. DEBPKG:debian/ld_run_path - Remove standard libs from LD_RUN_PATH as per Debian policy. DEBPKG:debian/libnet_config_path - Set location of libnet.cfg to /etc/perl/Net as /usr may not be writable. DEBPKG:debian/m68k_thread_stress - http://bugs.debian.org/495826 Disable some threads tests on m68k for now due to missing TLS. 
DEBPKG:debian/mod_paths - Tweak @INC ordering for Debian DEBPKG:debian/module_build_man_extensions - http://bugs.debian.org/479460 Adjust Module::Build manual page extensions for the Debian Perl policy DEBPKG:debian/perl_synopsis - http://bugs.debian.org/278323 Rearrange perl.pod DEBPKG:debian/prune_libs - http://bugs.debian.org/128355 Prune the list of libraries wanted to what we actually need. DEBPKG:debian/use_gdbm - Explicitly link against -lgdbm_compat in ODBM_File/NDBM_File. DEBPKG:fixes/assorted_docs - http://bugs.debian.org/443733 [384f06a] Math::BigInt::CalcEmu documentation grammar fix DEBPKG:fixes/net_smtp_docs - http://bugs.debian.org/100195 [rt.cpan.org #36038] Document the Net::SMTP 'Port' option DEBPKG:fixes/processPL - http://bugs.debian.org/357264 [rt.cpan.org #17224] Always use PERLRUNINST when building perl modules. DEBPKG:debian/perlivp - http://bugs.debian.org/510895 Make perlivp skip include directories in /usr/local DEBPKG:fixes/pod2man-index-backslash - http://bugs.debian.org/521256 Escape backslashes in .IX entries DEBPKG:debian/disable-zlib-bundling - Disable zlib bundling in Compress::Raw::Zlib DEBPKG:fixes/kfreebsd_cppsymbols - http://bugs.debian.org/533098 [3b910a0] Add gcc predefined macros to $Config{cppsymbols} on GNU/kFreeBSD. DEBPKG:debian/cpanplus_definstalldirs - http://bugs.debian.org/533707 Configure CPANPLUS to use the site directories by default. DEBPKG:debian/cpanplus_config_path - Save local versions of CPANPLUS::Config::System into /etc/perl. DEBPKG:fixes/kfreebsd-filecopy-pipes - http://bugs.debian.org/537555 [16f708c] Fix File::Copy::copy with pipes on GNU/kFreeBSD DEBPKG:fixes/anon-tmpfile-dir - http://bugs.debian.org/528544 [perl #66452] Honor TMPDIR when open()ing an anonymous temporary file DEBPKG:fixes/abstract-sockets - http://bugs.debian.org/329291 [89904c0] Add support for Abstract namespace sockets. 
DEBPKG:fixes/hurd_cppsymbols - http://bugs.debian.org/544307 [eeb92b7] Add gcc predefined macros to $Config{cppsymbols} on GNU/Hurd. DEBPKG:fixes/autodie-flock - http://bugs.debian.org/543731 Allow for flock returning EAGAIN instead of EWOULDBLOCK on linux/parisc DEBPKG:fixes/archive-tar-instance-error - http://bugs.debian.org/539355 [rt.cpan.org #48879] Separate Archive::Tar instance error strings from each other DEBPKG:fixes/positive-gpos - http://bugs.debian.org/545234 [perl #69056] [c584a96] Fix \\G crash on first match DEBPKG:debian/devel-ppport-ia64-optim - http://bugs.debian.org/548943 Work around an ICE on ia64 DEBPKG:debian/dynaloader-config - http://bugs.debian.org/549170 Make DynaLoader work without Config_heavy.pl again DEBPKG:fixes/trie-logic-match - http://bugs.debian.org/552291 [perl #69973] [0abd0d7] Fix a DoS in Unicode processing [CVE-2009-3626] DEBPKG:fixes/hppa-thread-eagain - http://bugs.debian.org/554218 make the threads-shared test suite more robust, fixing failures on hppa DEBPKG:fixes/crash-on-undefined-destroy - http://bugs.debian.org/564074 [perl #71952] [1f15e67] Fix a NULL pointer dereference when looking for a DESTROY method DEBPKG:patchlevel - http://bugs.debian.org/567489 List packaged patches for 5.10.1-11 in patchlevel.h @INC for perl 5.10.1: /etc/perl /usr/local/lib/perl/5.10.1 /usr/local/share/perl/5.10.1 /usr/lib/perl5 /usr/share/perl5 /usr/lib/perl/5.10 /usr/share/perl/5.10 /usr/local/lib/site_perl . Environment for perl 5.10.1: HOME=/home/hector LANG=es_ES.UTF-8 LANGUAGE (unset) LD_LIBRARY_PATH (unset) LOGDIR (unset) PATH=/opt/drbl/sbin:/opt/drbl/bin:/home/hector/bin:/opt/drbl/sbin:/opt/drbl/bin:/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/games PERL_BADLANG (unset) SHELL=/bin/bash ```
p5pRT commented 14 years ago

From @ikegami

On Mon, Mar 22, 2010 at 6:13 AM, Hector Garcia <perlbug-followup@perl.org> wrote:

# New Ticket Created by Hector Garcia
# Please include the string: [perl #73732]
# in the subject line of all future correspondence about this issue.
# <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=73732 >

This is a bug report for perl from hector@debian.org, generated with the help of perlbug 1.39 running under perl 5.10.1.


Thanks for the report.

Workaround until this is fixed​:

if ($p =~ m#(?​:|(?!)\x{2660})(.*?)[-]?EFE\s*\

$#sm) {

Note that I removed the /g. "if (/.../g)" rarely makes any sense and can produce undesirable results.
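To illustrate why "if (/.../g)" can produce undesirable results, here is a minimal sketch. The string and patterns are made up for illustration and are not taken from the ticket. In scalar context a /g match records where it stopped in pos(), and the next /g match on the same variable resumes from that position instead of from the start.

```perl
use strict;
use warnings;

my $s = "aXbXc";

# The first scalar-context /g match succeeds and leaves pos($s)
# just past the first "X".
my $first = ($s =~ /X/g) ? 1 : 0;
my $pos_after_first = pos($s);

# The next /g match on the same string resumes at pos($s), so it never
# sees the "a" at offset 0 and fails. A failed /g match resets pos().
my $second = ($s =~ /a/g) ? 1 : 0;
my $pos_after_second = pos($s);

# Without /g the search always starts at the beginning of the string.
my $third = ($s =~ /a/) ? 1 : 0;

print "first=$first pos=$pos_after_first second=$second third=$third\n";
```

The second test fails even though the string plainly contains an "a", which is the kind of surprise the note above warns about.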

p5pRT commented 14 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 14 years ago

From @khwilliamson


I wonder if this is related to #46563 ("g suffix on string search (/.../g) can cause string corruption"), which is a won't-fix.

p5pRT commented 14 years ago

From @ikegami

On Mon, Mar 22, 2010 at 11:47 PM, karl williamson <public@khwilliamson.com> wrote:

I wonder if this is related to #46563​: g suffix on string search (/.../g) can cause string corruption

which is a won't fix

The /g is not germane to the bug. The workaround wasn't the removal of the /g; it's the addition of a >8-bit character to the pattern.
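A minimal sketch of why that prefix does not change what the pattern matches. The test string and the literal "END" are invented for illustration, and whether the prefix avoids the fatal error on an affected perl is not shown here. The group (?:|(?!)\x{2660}) is an alternation whose first branch is empty and whose second branch starts with (?!), which always fails, so the group always matches the empty string. Its only effect is to put a character above 0xFF into the pattern.

```perl
use strict;
use warnings;
use utf8;

my $text = "á d - END";

# Same pattern with and without the never-matching wide-character prefix.
my $plain  = qr/(.*?)-\s*END$/s;
my $forced = qr/(?:|(?!)\x{2660})(.*?)-\s*END$/s;

my ($plain_capture)  = $text =~ $plain;
my ($forced_capture) = $text =~ $forced;

binmode STDOUT, ':encoding(UTF-8)';
print "plain:  '$plain_capture'\n";
print "forced: '$forced_capture'\n";
print "same capture\n" if $plain_capture eq $forced_capture;
```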

p5pRT commented 14 years ago

From @nwc10

On Mon, Mar 22, 2010 at 09:47:07PM -0600, karl williamson wrote:

I wonder if this is related to #46563​: g suffix on string search (/.../g) can cause string corruption

which is a won't fix

http://rt.perl.org/rt3/Ticket/Display.html?id=46563

  For now and for older perls this bug is firmly in the "wont fix" category. Sorry.

It wasn't yet described as a "won't" fix if it's still in current blead. (I couldn't seem to replicate it even on 5.10.0, so I'm not sure what the state of the bug is)

Nicholas Clark

p5pRT commented 14 years ago

From @khwilliamson

Nicholas Clark wrote:

On Mon, Mar 22, 2010 at 09:47:07PM -0600, karl williamson wrote:

I wonder if this is related to #46563: g suffix on string search (/.../g) can cause string corruption

which is a won't fix

http://rt.perl.org/rt3/Ticket/Display.html?id=46563

For now and for older perls this bug is firmly in the "wont fix" category. Sorry.

It wasn't yet described as a "won't" fix if it's still in current blead. (I couldn't seem to replicate it even on 5.10.0, so I'm not sure what the state of the bug is)

Nicholas Clark

I just tried it\, and it is still a bug in 5.12RC0.

p5pRT commented 14 years ago

From rivero@raulrivero.es

On Mon. Mar. 22 20:47:43 2010, public@khwilliamson.com wrote:


I wonder if this is related to #46563: g suffix on string search (/.../g) can cause string corruption

which is a won't fix

The /g isn't the problem​:


#!/usr/bin/perl -w

use utf8; use encoding 'utf8';

my $p = 'á d\

'; #my $p = 'す d\

';

print "$p\n";

if ($p =~ m#(.*?)[-]?EFE\s*\

$#sm) {   print "yes $1\n"; }else{   print "no\n"; }


$ perl problem.pl á d\

Malformed UTF-8 character (fatal) at kk.pl line 11.

And the suggested pattern with the (?:|(?!)\x{2660}) prefix isn't a real workaround for us: the script above is only an example of the problem.

If we replace the (.*) with (\X*), it works. So we think the problem lies in how '.' handles wide characters in regular expressions. Surprisingly, the original pattern works with 5.8.
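For reference, '.' and \X are not interchangeable in general: '.' matches one code point, while \X matches one extended grapheme cluster (a base character plus any combining marks). A small sketch with a made-up string, unrelated to the failing script:

```perl
use strict;
use warnings;

# "e" followed by U+0301 COMBINING ACUTE ACCENT:
# two code points, but a single grapheme cluster.
my $s = "e\x{301}";

# Count matches by assigning the match list to an empty list in scalar context.
my $dot_count = () = $s =~ /./gs;
my $x_count   = () = $s =~ /\X/g;

print "code points matched by '.': $dot_count\n";
print "clusters matched by '\\X':   $x_count\n";
```

So swapping one for the other can change what a capture contains when the input has combining characters, even if it happens to sidestep the error here.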

We could fix it with this patch​:

Inline Patch
```diff
--- regcomp.c.OLD 2010-03-24 10:15:59.381767760 +0100
+++ regcomp.c 2010-03-24 10:17:03.068877134 +0100
@@ -6932,7 +6932,7 @@
     ret = reg_node(pRExC_state, SANY);
  else
     ret = reg_node(pRExC_state, REG_ANY);
- *flagp |= HASWIDTH|SIMPLE;
+ *flagp |= HASWIDTH;
  RExC_naughty++;
  Set_Node_Length(ret, 1); /* MJD */
  break;
```

Any idea?

Cheers\,

p5pRT commented 14 years ago

From hector@debian.org

This bug has nothing to do with bug 46563. If you take out the /g from the example I originally sent, you'll see the bug is still there.

Thanks

p5pRT commented 14 years ago

From @iabyn

On Tue, Mar 23, 2010 at 02:58:58PM -0600, karl williamson wrote:

I just tried it, and it is still a bug in 5.12RC0.

And here is a minimal(ish) case that triggers a 'Malformed UTF-8 character' warning:

  $_ = "\x{e1} d\

\x{100}";   chop $_;   print "match\n" if m{(.*?)-\s\

$};

-- You're only as old as you look.

p5pRT commented 14 years ago

From doug@ablegrape.com

Created by doug@ablegrape.com

This is a bug report for perl from doug@ablegrape.com, generated with the help of perlbug 1.39 running under perl v5.8.9.

My program worked fine under previous versions of Perl on MacOS (prior to Snow Leopard).

Now it dies under 5.8.9, 5.10.0 and 5.12.1, with "Malformed UTF-8 character (fatal)" - but the input data is the same, and is, as far as I can tell, perfectly valid UTF-8.

I've isolated the failure to a test case, included here, which shows a simple expression that works, two (very) slightly more complex expressions that fail, and the original complex expression from my code. As far as I can tell, all of these should work. Oddly, if I add "use encoding 'utf8'" even the simple regex fails.

My best guess is that perhaps for some reason the regex engine is backing up by bytes within my string, and starting in the middle of a character. The string itself is perfectly valid.

#!/usr/bin/perl

use strict vars;
use utf8;
binmode STDOUT, ":utf8";

my $e = "Böck";

if (utf8::is_utf8($e)) {
    print "yep, is UTF8: $e\n";
}

# this succeeds (failed before with use encoding 'utf8', unknown why)
if ($e=~ m/.*?[x]$/) {
    print "matched simple\n";
}
print "success with simple\n";

# these die
if ($e=~ m/.*?\p{Space}$/i) {
    print "matched medium\n";
}
print "success with medium\n";
if ($e=~ m/.*?[xyz]$/) {
    print "matched medium\n";
}
print "success with medium\n";

# the original, full expression.
if ($e =~ m/(.*?)[,\p{isSpace}]+((?:\p{isAlpha}[\p{isSpace}\.]{1,2})+)\p{isSpace}*$/) {
    print "matched complex\n";
}
print "success with complex\n";
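The guess above, that something starts reading in the middle of a character, can be illustrated outside the regex engine. This is only a sketch of the general effect, not a reproduction of the bug: the helper name is_valid_utf8 is hypothetical, and the byte offsets are chosen by hand for the string "Böck". It encodes the string to UTF-8 bytes and strictly decodes it again from a character boundary and from the middle of the two-byte "ö".

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode decode);

# Hypothetical helper: true if the byte string is well-formed UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() with a CHECK value may modify its argument
    return eval { decode('UTF-8', $copy, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

# "Böck" is 5 bytes in UTF-8: "B", 0xC3, 0xB6, "c", "k".
my $bytes = encode('UTF-8', "Böck");

# Starting at offset 1 lands on the lead byte of "ö" (a character boundary).
my $on_boundary = substr($bytes, 1);

# Starting at offset 2 lands on the continuation byte 0xB6 (mid-character).
my $mid_char = substr($bytes, 2);

printf "whole string valid:       %d\n", is_valid_utf8($bytes);
printf "from character boundary:  %d\n", is_valid_utf8($on_boundary);
printf "from middle of character: %d\n", is_valid_utf8($mid_char);
```

Valid data read from a byte offset that is not a character boundary looks malformed, which is consistent with an error appearing on input that is itself well-formed.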

Perl Info

```
Flags:
    category=core
    severity=critical

Site configuration information for perl v5.8.9:

Configured by _postfix at Wed Jun 24 00:32:40 PDT 2009.

Summary of my perl5 (revision 5 version 8 subversion 9) configuration:
  Platform:
    osname=darwin, osvers=10.0, archname=darwin-thread-multi-2level
    uname='darwin neige.apple.com 10.0 darwin kernel version 10.0.0d8: tue may 5 19:29:59 pdt 2009; root:xnu-1437.2~2release_i386 i386 '
    config_args='-ds -e -Dprefix=/usr -Dccflags=-g -pipe -Dldflags= -Dman3ext=3pm -Duseithreads -Duseshrplib -Dinc_version_list=none -Dcc=gcc-4.2'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=define use64bitall=define uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='gcc-4.2', ccflags ='-arch i386 -arch ppc -g -pipe -fno-common -DPERL_DARWIN -fno-strict-aliasing -I/usr/local/include',
    optimize='-Os',
    cppflags='-g -pipe -fno-common -DPERL_DARWIN -fno-strict-aliasing -I/usr/local/include'
    ccversion='', gccversion='4.2.1 (Apple Inc. build 5646)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='gcc-4.2 -mmacosx-version-min=10.6', ldflags ='-arch i386 -arch ppc -L/usr/local/lib'
    libpth=/usr/local/lib /usr/lib
    libs=-ldbm -ldl -lm -lutil -lc
    perllibs=-ldl -lm -lutil -lc
    libc=/usr/lib/libc.dylib, so=dylib, useshrplib=true, libperl=libperl.dylib
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=bundle, d_dlsymun=undef, ccdlflags=' '
    cccdlflags=' ', lddlflags='-arch i386 -arch ppc -bundle -undefined dynamic_lookup -L/usr/local/lib'

Locally applied patches:
    /Library/Perl/Updates/ comes before system perl directories
    installprivlib and installarchlib points to the Updates directory
    6576362: fixed 5.8.9 binary compatibility issue: perlio mutex not initialized

@INC for perl v5.8.9:
    /Library/Perl/Updates/5.8.9
    /System/Library/Perl/5.8.9/darwin-thread-multi-2level
    /System/Library/Perl/5.8.9
    /Library/Perl/5.8.9/darwin-thread-multi-2level
    /Library/Perl/5.8.9
    /Network/Library/Perl/5.8.9/darwin-thread-multi-2level
    /Network/Library/Perl/5.8.9
    /Network/Library/Perl
    /System/Library/Perl/Extras/5.8.9/darwin-thread-multi-2level
    /System/Library/Perl/Extras/5.8.9
    /Library/Perl/5.8.8
    /Library/Perl/5.8.6/darwin-thread-multi-2level
    /Library/Perl/5.8.6
    /Library/Perl/5.8.1
    .

Environment for perl v5.8.9:
    DYLD_LIBRARY_PATH (unset)
    HOME=/Users/cook
    LANG=en_US.UTF-8
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home/bin:/opt/subversion/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin:/usr/local/mysql/bin:/sw/bin:/Volumes/SEA_DISC/NutchStuff/nutch//my_scripts:/opt/local/bin
    PERL_BADLANG (unset)
    SHELL=/bin/bash
```

p5pRT commented 14 years ago

From bitcard@candiru.com

FYI, discussion of this bug on Perlmonks:

http://www.perlmonks.org/?node_id=843208

p5pRT commented 14 years ago

From @cowens

As a workaround, I suggest you use the \x{} literal escape:

my $e = "B\x{f6}ck";

It seems to work on my OS X machines.


-- Chas. Owens wonkden.net The most important skill a programmer can have is the ability to read.

p5pRT commented 14 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 14 years ago

From @khwilliamson

Chas. Owens wrote:

As a workaround, I suggest you use the \x{} literal escape:

my $e = "B\x{f6}ck";

It seems to work on my OS X machines.

Unfortunately, this workaround works only because it avoids upgrading $e to utf8. If you use "B\x{101}ck" instead, the "Malformed UTF-8 character" error remains. Also, because of an unrelated bug, /i matching will not work properly for \x{f6}.
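The distinction can be checked directly with `utf8::is_utf8`. The following is a minimal sketch, not taken from the thread; the variable names are made up for illustration. It only shows which of the two literals ends up stored internally as UTF-8, and that `utf8::upgrade` changes the storage without changing the string's value.

```perl
use strict;
use warnings;

# Illustrative names; neither variable appears in the original report.
my $latin1 = "B\x{f6}ck";     # all code points below 0x100: kept as native bytes
my $wide   = "B\x{101}ck";    # contains a code point above 0xFF: stored as UTF-8

printf "B\\x{f6}ck  stored as UTF-8: %s\n", utf8::is_utf8($latin1) ? "yes" : "no";
printf "B\\x{101}ck stored as UTF-8: %s\n", utf8::is_utf8($wide)   ? "yes" : "no";

# Upgrading changes only the internal representation, not the characters.
my $upgraded = $latin1;
utf8::upgrade($upgraded);
printf "after utf8::upgrade:        %s\n", utf8::is_utf8($upgraded) ? "yes" : "no";
print  "still the same string\n" if $upgraded eq $latin1;
```

On an affected perl, only strings in the upgraded form would be expected to reach the code path that raised the error, which is why the `\x{f6}` literal appeared to help.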


p5pRT commented 14 years ago

From @druud62

Doug Cook wrote:

My program worked fine under previous versions of Perl on MacOS (prior to Snow Leopard).

Now it dies under 5.8.9, 5.10.0 and 5.12.1, with "Malformed UTF-8 character (fatal)" - but the input data is the same, and is, as far as I can tell, perfectly valid UTF-8.

It could well be that your editor saves the source as either UTF-8 or ISO-8859-1. Did you check the input data at the byte level?
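One way to do such a byte-level check is to dump the bytes in hex. The sketch below is an illustration only: `hex_bytes` is a hypothetical helper, and the string is encoded in memory with `Encode::encode` rather than read from the reporter's actual source file, which is not available here.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, $bytes;
}

my $word = "B\x{f6}ck";    # "Böck", written with an escape so this file stays ASCII

# Show what an editor would write to disk under each encoding.
for my $encoding ('UTF-8', 'ISO-8859-1') {
    printf "%-10s %s\n", $encoding, hex_bytes(encode($encoding, $word));
}
# UTF-8      42 c3 b6 63 6b
# ISO-8859-1 42 f6 63 6b
```

If a file saved as ISO-8859-1 is then compiled under `use utf8`, the lone `f6` byte is not valid UTF-8, which is the situation the question above is probing for.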

-- Ruud

p5pRT commented 13 years ago

From @tsee

According to Yves, this was fixed by commit v5.13.4-25-g92f3d48.

--Steffen

p5pRT commented 13 years ago

@tsee - Status changed from 'open' to 'resolved'

p5pRT commented 13 years ago

From @cpansprout

This appears to have been fixed. It may be the same bug as #75680.

p5pRT commented 13 years ago

From @cpansprout

On Sun Sep 05 14:52:42 2010, sprout wrote:

This appears to have been fixed. It may be the same bug as #75680.

Yes, it is the same. I’m marking this as resolved.

p5pRT commented 13 years ago

@cpansprout - Status changed from 'open' to 'resolved'

p5pRT commented 13 years ago

From @cpansprout

On Tue Jul 29 19:46:08 2008, BKB wrote:

This is a very much simplified version of the script which tripped the bug (five lines). I've also simplified the regex drastically until it trips the bug. Shortening the regex from this makes it print "OK" but as it stands the "Malformed UTF-8 character (fatal)" message appears.

Thank you for your report.

You have ‘use utf8’ in your script, which signals to perl that your source code is in UTF-8.

But then you have a string containing the octets 95 B6, which is not valid UTF-8. This results in an invalid scalar, so perl croaks. This behaviour is correct.

You do not need ‘use utf8’ if you are just *using* Unicode strings.
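Whether a given octet sequence is valid UTF-8 can be tested with a strict decode. This is a sketch under stated assumptions: `is_valid_utf8` is a hypothetical helper, and the `C3 B6` sample (the UTF-8 encoding of U+00F6) is added here only for contrast; the `95 B6` pair is the one named above.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: true if the byte string decodes cleanly as UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input;
    # LEAVE_SRC stops decode() from modifying $bytes in place.
    return eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
}

my @samples = (
    [ '95 b6', "\x95\xB6" ],    # starts with a continuation byte, no lead byte
    [ 'c3 b6', "\xC3\xB6" ],    # well-formed two-byte sequence for U+00F6
);

for my $sample (@samples) {
    my ($label, $bytes) = @$sample;
    printf "%s: %s\n", $label, is_valid_utf8($bytes) ? "valid UTF-8" : "malformed";
}
```

The `95 b6` pair is rejected because `0x95` has the bit pattern of a continuation byte and so cannot begin a character.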

p5pRT commented 13 years ago

@cpansprout - Status changed from 'open' to 'rejected'

p5pRT commented 13 years ago

From @benkasminbullock

I'm pretty sure I filed a very much simpler example of this bug after that one (it was more than two years ago).

I don't think there was anything wrong with the utf8 etc., that is just smoke-blowing.


p5pRT commented 13 years ago

From @cpansprout

On Sun Sep 19 21:21:17 2010, BKB wrote:

I'm pretty sure I filed a very much simpler example of this bug after that one (it was more than two years ago).

I don't think there was anything wrong with the utf8 etc., that is just smoke-blowing.

I only looked at your reduced case at first. It was failing for the reason I mentioned.

Your original script can be reduced to​:

perl -le' "(n) (See \x{7a93}\x{8ca9}) over the counter sales (often of financial packages)" =~ /(.*?)\s*([A-Z]{2}[12]?)\s*$/s'

It is the same as 75680 and 73732, which were fixed by 92f3d4829170316374b610b3fc665389803d93f8.

p5pRT commented 13 years ago

@cpansprout - Status changed from 'rejected' to 'resolved'

p5pRT commented 13 years ago

From @khwilliamson

Father Chrysostomos via RT wrote:

On Sun Sep 19 21:21:17 2010, BKB wrote:

I'm pretty sure I filed a very much simpler example of this bug after that one (it was more than two years ago).

I don't think there was anything wrong with the utf8 etc., that is just smoke-blowing.

I only looked at your reduced case at first. It was failing for the reason I mentioned.

Your original script can be reduced to:

perl -le' "(n) (See \x{7a93}\x{8ca9}) over the counter sales (often of financial packages)" =~ /(.*?)\s*([A-Z]{2}[12]?)\s*$/s'

It is the same as 75680 and 73732, which were fixed by 92f3d4829170316374b610b3fc665389803d93f8.

And this fix made it into 5.12.2, which is now an official Perl release available at http://search.cpan.org/~jesse/perl-5.12.2/
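To see whether an installed perl still shows the problem, the patterns from this thread can be run under `eval`. This is only a rough sketch, not an official regression test: the case list is assembled from the reports above, with "B\x{101}ck" standing in for the UTF-8 source literal, and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Each case pairs a subject string with a pattern quoted in this thread.
my @cases = (
    [ "B\x{101}ck", qr/.*?\p{Space}$/i ],
    [ "B\x{101}ck", qr/.*?[xyz]$/ ],
    [ "(n) (See \x{7a93}\x{8ca9}) over the counter sales (often of financial packages)",
      qr/(.*?)\s*([A-Z]{2}[12]?)\s*$/s ],
);

my $failures = 0;
for my $i (0 .. $#cases) {
    my ($subject, $pattern) = @{ $cases[$i] };
    # "Malformed UTF-8 character (fatal)" is raised as an exception,
    # so eval can trap it instead of letting the script die.
    my $survived = eval { my $matched = $subject =~ $pattern; 1 };
    if ($survived) {
        print "case $i: no error\n";
    }
    else {
        $failures++;
        print "case $i: died: $@";
    }
}

printf "perl %vd: %d of %d cases died\n", $^V, $failures, scalar @cases;
```

Whether a match succeeds is not the point here; the check is only whether the match attempt dies.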