Perl / perl5


chr(65535) should be allowed in regexes #8290

Closed p5pRT closed 14 years ago

p5pRT commented 18 years ago

Migrated from rt.perl.org#38293 (status was 'resolved')

Searchable as RT38293$

p5pRT commented 18 years ago

From pcg@goof.com

Created by pcg@goof.com

Currently, perl doesn't like chr(65535) in regexes:

Malformed UTF-8 character (character 0xffff) in regexp compilation at /opt/rxvt/lib/urxvt/perl/readline line 26.

(and some other places).

It does allow chr(65534), though.

Both characters are guaranteed illegal Unicode characters, and are intended for process-internal use.

It is very handy to process data and have an "illegal character" character, so being able to use chr(65535) for process-internal purposes seems fine to me.

Certainly being able to match a wrong byte-order mark is helpful, too, and allowed, even though it's not a legal Unicode character.

So I would expect that either both of these (and other illegal Unicode characters) are disallowed in regexes, or none are.

Besides, the error message is misleading: perl can store Unicode internally either as latin1 or as utf-8, but this shouldn't be exposed to the user. The regex string in question was _not_ UTF-8 encoded in perl, but normal text (i.e. chr(65535) is not utf-8, as utf-8 never has bytes > 255. It is represented as utf-8 in perl internally, but on the perl level, it isn't).

Perl Info

```
Flags:
    category=core
    severity=low

Site configuration information for perl v5.8.6:

Configured by Marc Lehmann at Sat Mar 19 00:58:06 UTC 2005.

Summary of my perl5 (revision 5 version 8 subversion 6) configuration:
  Platform:
    osname=linux, osvers=2.6.10, archname=amd64-linux
    uname='linux cerebro 2.6.10 #1 smp wed jan 26 00:24:47 cet 2005 x86_64 gnulinux '
    config_args='-Duselargefiles -Uuse64bitint -Uuse64bitall -Dusemymalloc=y -Dcc=gcc-3.4 -Dccflags=-ggdb -Dcppflags=-D_GNU_SOURCE -I/opt/include -Doptimize=-O4 -march=pentium3 -mtune=pentium3 -funroll-loops -fno-strict-aliasing -Dcccdlflags=-fPIC -Dldflags=-L/opt/perl/lib -L/opt/lib -Dlibs=-ldl -lm -lcrypt -Darchname=amd64-linux -Dprefix=/opt/perl -Dprivlib=/opt/perl/lib/perl5 -Darchlib=/opt/perl/lib/perl5 -Dvendorprefix=/opt/perl -Dvendorlib=/opt/perl/lib/perl5 -Dvendorarch=/opt/perl/lib/perl5 -Dsiteprefix=/opt/perl -Dsitelib=/opt/perl/lib/perl5 -Dsitearch=/opt/perl/lib/perl5 -Dsitebin=/opt/perl/bin -Dman1dir=/opt/perl/man/man1 -Dman3dir=/opt/perl/man/man3 -Dsiteman1dir=/opt/perl/man/man1 -Dsiteman3dir=/opt/perl/man/man3 -Dman1ext=1 -Dman3ext=3 -Dpager=/usr/bin/less -Uafs -Uusesfio -Uusenm -Uuseshrplib -Dd_dosuid -Dusethreads=undef -Duse5005threads=undef -Duseithreads=undef -Dusemultiplicity=undef -Demail=perl-binary@plan9.de -Dcf_email=perl-binary@plan9.de -Dcf_by=Marc Lehmann -Dlocincpth=/opt/perl/include /opt/include -Dmyhostname=localhost -Dmultiarch=undef -Dbin=/opt/perl/bin -des'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=y, bincompat5005=undef
  Compiler:
    cc='gcc-3.4', ccflags ='-ggdb -fno-strict-aliasing -pipe -I/opt/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O4 -march=pentium3 -mtune=pentium3 -funroll-loops -fno-strict-aliasing',
    cppflags='-D_GNU_SOURCE -I/opt/include -ggdb -fno-strict-aliasing -pipe -I/opt/include'
    ccversion='', gccversion='3.4.4 20050203 (prerelease) (Debian 3.4.3-9)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='gcc-3.4', ldflags ='-L/opt/perl/lib -L/opt/lib -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib /usr/ccs/lib
    libs=-ldl -lm -lcrypt
    perllibs=-ldl -lm -lcrypt
    libc=/lib/libc-2.3.2.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.3.2'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -L/opt/perl/lib -L/opt/lib -L/usr/local/lib'

Locally applied patches:

@INC for perl v5.8.6:
    /root/src/sex
    /opt/perl/lib/perl5
    /opt/perl/lib/perl5
    /opt/perl/lib/perl5
    /opt/perl/lib/perl5
    /opt/perl/lib/perl5
    /opt/perl/lib/perl5
    /opt/perl/lib/perl5
    .

Environment for perl v5.8.6:
    HOME=/root
    LANG (unset)
    LANGUAGE (unset)
    LC_CTYPE=de_DE.UTF-8
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/root/s2:/root/s:/opt/bin:/opt/sbin:/bin:/sbin:/usr/bin:/usr/sbin:/usr/X11/bin:/usr/games:/root/src/uunet:.
    PERL5LIB=/root/src/sex
    PERL5_CPANPLUS_CONFIG=/root/.cpanplus/config
    PERLDB_OPTS=ornaments=0
    PERL_BADLANG (unset)
    PERL_UNICODE=SAL
    SHELL=/bin/bash
```

p5pRT commented 18 years ago

From BQW10602@nifty.com

On Fri, 20 Jan 2006 14:42:00 -0800, Marc Lehmann (via RT) <perlbug-followup@perl.org> wrote

Currently, perl doesn't like chr(65535) in regexes:

Malformed UTF-8 character (character 0xffff) in regexp compilation at /opt/rxvt/lib/urxvt/perl/readline line 26.

(and some other places).

It does allow chr(65534), though.

To allow it, perl requires the statement <no warnings 'utf8';>.

    #!perl
    use warnings 'utf8';
    print chr(65535) =~ /\p{Noncharacter_Code_Point}/ ? "yes" : "boo";
    __END__
    Unicode character 0xffff is illegal at ...
    Malformed UTF-8 character (character 0xffff) in pattern match (m//) at ...
    Malformed UTF-8 character (character 0xffff) in pattern match (m//) at ...
    boo

    #!perl
    no warnings 'utf8';
    print chr(65535) =~ /\p{Noncharacter_Code_Point}/ ? "yes" : "boo";
    __END__
    yes

    [in utf8.h]
    #define UNICODE_ILLEGAL 0xFFFF

U+FFFF is a Unicode scalar value (that means it is valid according to the Unicode standard) and its byte sequence is well-formed. I don't know which *law* its use breaks.
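To make the well-formedness point concrete, here is a minimal C sketch that encodes a code point in the range U+0000..U+FFFF using the generic UTF-8 bit layout. The function name `encode_utf8_bmp` is invented for this illustration and is not part of perl's API. It only shows that U+FFFF is encoded by the same rule as any other three-byte character, giving the bytes `ef bf bf` that appear in the `xxd` output elsewhere in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not perl's API: encode a code point in the range
 * U+0000..U+FFFF as UTF-8 and return the number of bytes written (1 to 3).
 * No surrogate or noncharacter checks are made; this is only the bit layout. */
size_t encode_utf8_bmp(uint32_t cp, uint8_t out[3])
{
    assert(cp <= 0xFFFF);
    if (cp < 0x80) {                       /* 0xxxxxxx */
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    /* 1110xxxx 10xxxxxx 10xxxxxx */
    out[0] = (uint8_t)(0xE0 | (cp >> 12));
    out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (uint8_t)(0x80 | (cp & 0x3F));
    return 3;
}
```

Calling `encode_utf8_bmp(0xFFFF, buf)` fills `buf` with `EF BF BF`, and `0xFFFE` gives `EF BF BE`; nothing in the byte layout distinguishes these from ordinary characters.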

Regards,
SADAHIRO Tomoyuki

p5pRT commented 18 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 18 years ago

From nospam-abuse@bloodgate.com

Moin,

Sadahiro wrote:

On Fri, 20 Jan 2006 14:42:00 -0800, Marc Lehmann (via RT)

Currently, perl doesn't like chr(65535) in regexes:

Malformed UTF-8 character (character 0xffff) in regexp compilation at /opt/rxvt/lib/urxvt/perl/readline line 26.

(and some other places).

It does allow chr(65534), though.

To allow it, perl requires the statement <no warnings 'utf8';>.

The warning stems from this bit of code:

    U8 *
    Perl_uvuni_to_utf8_flags(pTHX_ U8 *d, UV uv, UV flags)
    {
        if (ckWARN(WARN_UTF8)) {
            if (UNICODE_IS_SURROGATE(uv) &&
                !(flags & UNICODE_ALLOW_SURROGATE))
                Perl_warner(aTHX_ packWARN(WARN_UTF8), "UTF-16 surrogate 0x%04"UVxf, uv);
            else if (
                     ((uv >= 0xFDD0 && uv <= 0xFDEF &&
                       !(flags & UNICODE_ALLOW_FDD0))
                      ||
                      ((uv & 0xFFFE) == 0xFFFE && /* Either FFFE or FFFF. */
                       !(flags & UNICODE_ALLOW_FFFF))) &&
                     /* UNICODE_ALLOW_SUPER includes
                      * FFFEs and FFFFs beyond 0x10FFFF. */
                     ((uv <= PERL_UNICODE_MAX) ||
                      !(flags & UNICODE_ALLOW_SUPER))
                     )
                Perl_warner(aTHX_ packWARN(WARN_UTF8),
                            "Unicode character 0x%04"UVxf" is illegal", uv);
        }

Maybe the test is botched up? It also seems to test an awful lot of stuff...
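As a reading aid for the condition quoted above, the following C sketch isolates just the range test. The name `is_noncharacter` and the constant `UNICODE_MAX_CP` are made up for this example; the sketch leaves out the `flags` handling and treats values above U+10FFFF simply as "not a noncharacter", which is a simplification of the quoted code. It suggests that the test itself treats 0xFFFE and 0xFFFF identically, so the different behaviour reported for the two values presumably comes from somewhere else.

```c
#include <stdbool.h>
#include <stdint.h>

#define UNICODE_MAX_CP 0x10FFFFu   /* highest Unicode code point */

/* Hypothetical helper: true for the 66 Unicode noncharacters, i.e.
 * U+FDD0..U+FDEF plus the last two code points of each of the 17 planes. */
bool is_noncharacter(uint32_t cp)
{
    if (cp > UNICODE_MAX_CP)
        return false;                /* outside Unicode altogether */
    if (cp >= 0xFDD0 && cp <= 0xFDEF)
        return true;                 /* the contiguous block of 32 */
    return (cp & 0xFFFE) == 0xFFFE;  /* xxFFFE or xxFFFF in any plane */
}
```

Both `is_noncharacter(0xFFFE)` and `is_noncharacter(0xFFFF)` are true, as is `is_noncharacter(0x1FFFE)`, while `is_noncharacter(0xFFFD)` is false.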

Best wishes,

Tels

-- Signed on Sat Jan 21 13:07:57 2006 with key 0x93B84C15. Visit my photo gallery at http://bloodgate.com/photos/ PGP key on http://bloodgate.com/tels.asc or per email.

I'm a Wei-wei-wei-wei-wei-wow-wow-wow-wow-wow-wow-wizzzaahrd...

p5pRT commented 18 years ago

From nospam-abuse@bloodgate.com

Moin,

grepping the blead source for UNICODE_ALLOW_FFFF:

    utf8.c: !(flags & UNICODE_ALLOW_FFFF))) &&
    utf8.h:#define UNICODE_ALLOW_FFFF 0x0004 /* Allow 0xFFF[EF], 0x1FFF[EF], ... */

So it is defined and checked, but not used elsewhere, anytime. Hm.

Either the constant (0x0004) is used hard-coded somewhere, or it is never set on the flags, and thus can never be true...

A:

    grep "0x[0]*4 " * -r

doesn't yield any results, maybe it is used as flags += 4; or something - though I doubt it.

Disabling the warnings just works around the bug that FFFF is not allowed and there seems to be no way to actually allow it.

Best wishes,

Tels

-- Signed on Sat Jan 21 13:14:41 2006 with key 0x93B84C15. Visit my photo gallery at http://bloodgate.com/photos/ PGP key on http://bloodgate.com/tels.asc or per email.

"Sacrificing minions: Is there any problem it CAN'T solve?" -- Lord Xykon

p5pRT commented 18 years ago

From schmorp@schmorp.de

On Sat, Jan 21, 2006 at 04:18:21AM -0800, Tels via RT <perlbug-followup@perl.org> wrote:

doesn't yield any results, maybe it is used as flags += 4; or something - though I doubt it.

Disabling the warnings just works around the bug that FFFF is not allowed and there seems to be no way to actually allow it.

Well, not allowing FFFF has some merit, too, but either all illegal codepoints should be disallowed or none at all.

The "Malformed UTF-8"... is also not quite a warning, as the resulting regex won't work (it works neither in s/// nor in y///, and it's not related to character constants):

    # perl -e '$c = chr 65535; $c=~s/$c//g; print $c'|xxd
    Malformed UTF-8 character (character 0xffff) in regexp compilation at -e line 1.
    Malformed UTF-8 character (character 0xffff) in regexp compilation at -e line 1.
    0000000: efbf bf ...

--
                The choice of a
      -----==-     _GNU_
      ----==-- _       generation     Marc Lehmann
      ---==---(_)__  __ ____  __      pcg@goof.com
      --==---/ / _ \/ // /\ \/ /      http://schmorp.de/
      -=====/_/_//_/\_,_/ /_/\_\      XX11-RIPE

p5pRT commented 18 years ago

From schmorp@schmorp.de

On Fri, Jan 20, 2006 at 08:00:40PM -0800, SADAHIRO Tomoyuki via RT <perlbug-followup@perl.org> wrote:

Malformed UTF-8 character (character 0xffff) in regexp compilation at /opt/rxvt/lib/urxvt/perl/readline line 26.

(and some other places).

It does allow chr(65534), though.

To allow it, perl requires the statement <no warnings 'utf8';>.

I verified that "no warnings 'utf8'" actually gets rid of the warning. However, the character is still being ignored:

    # perl -e 'no warnings 'utf8'; $c = chr 65535; $c=~s/$c//g; print $c'|xxd
    0000000: efbf bf
    # perl -e 'no warnings 'utf8'; $c = chr 65534; $c=~s/$c//g; print $c'|xxd

Neither command should output anything.

    [in utf8.h]
    #define UNICODE_ILLEGAL 0xFFFF

U+FFFF is a Unicode scalar value (that means it is valid according to the Unicode standard) and its byte sequence is well-formed. I don't know which *law* its use breaks.

http://www.unicode.org/charts/PDF/UFFF0.pdf

Look at the code table for that range. Both U+FFFE and U+FFFF are described as "These codes are intended for process internal uses, but are not permitted for interchange."

To me, it means my use in regexes is fine, as long as I don't write it to a file and claim it's valid Unicode in some encoding.


p5pRT commented 18 years ago

From schmorp@schmorp.de

On Sat, Jan 21, 2006 at 04:11:19AM -0800, Tels via RT <perlbug-followup@perl.org> wrote:

On Fri, 20 Jan 2006 14:42:00 -0800, Marc Lehmann (via RT)

Currently, perl doesn't like chr(65535) in regexes:

It does allow chr(65534), though.

To allow it, perl requires the statement <no warnings 'utf8';>.

The warning stems from this bit of code:

    "Unicode character 0x%04"UVxf" is illegal", uv);

Different message, it seems.

The actual message is (note I get it twice in this example):

    cerebro ~# perl -e 'y/\x{ffff}//'
    Malformed UTF-8 character (character 0xffff) at -e line 1.
    Malformed UTF-8 character (character 0xffff) at -e line 1.

which doesn't match the code quoted above. I would love to be able to control this on a case-by-case basis (chr 65535 is quite common as a record separator, or simply to signify "no character here", for example).

However, the real issue is the confusing message (no UTF-8 anywhere, this is not the perl Unicode model) _and_ the fact that it doesn't treat 65535 and 65534 the same way.

It would be nice to have some control over that, as this improves perl's fitness for all sorts of tasks; for example, it's nice to be able to store code points higher than the highest supported Unicode code point. It's not Unicode anymore, but highly useful. Nobody would limit perl's strings to the characters available in the current locale, either.

However, those are not real issues :)

Maybe the test is botched up? It also seems to test an awful lot of stuff...

There are a lot of non-unicode code-points in the unicode range (the 65534/65535 pair is mirrored in every page), so that part looks fine, in theory.


p5pRT commented 18 years ago

From @ysth

On Sat, Jan 21, 2006 at 01:17:25PM +0100, Tels wrote:

Moin,

grepping the blead source for UNICODE_ALLOW_FFFF:

    utf8.c: !(flags & UNICODE_ALLOW_FFFF))) &&
    utf8.h:#define UNICODE_ALLOW_FFFF 0x0004 /* Allow 0xFFF[EF], 0x1FFF[EF], ... */

So it is defined and checked, but not used elsewhere, anytime. Hm.

Either the constant (0x0004) is used hard-coded somewhere, or it is never set on the flags, and thus can never be true...

UNICODE_ALLOW_ANY (which includes the UNICODE_ALLOW_FFFF flag) is used in several places where we want to allow any arbitrary character without warning.

A:

    grep "0x[0]*4 " * -r

doesn't yield any results, maybe it is used as flags += 4; or something - though I doubt it.

Disabling the warnings just works around the bug that FFFF is not allowed and there seems to be no way to actually allow it.

I don't understand what you are saying. The code in utf8.c only issues a warning; it has no other effect. If the warning is turned off, it does nothing, and the character can be considered allowed.
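The point about UNICODE_ALLOW_ANY can be illustrated with a small C sketch of a composite flag mask. Only the value 0x0004 for the FFFF bit comes from the grep output quoted in this thread; the `DEMO_` names, the other bit values, and the helper `ffff_check_wanted` are assumptions made for this example and are not copied from utf8.h. It shows why searching for a literal 4 finds nothing: callers can set the bit through the combined mask without ever naming it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag values. Only the 0x0004 bit is taken from the thread;
 * everything else here is assumed for the sake of the example. */
enum {
    DEMO_ALLOW_SURROGATE = 0x0001,
    DEMO_ALLOW_FDD0      = 0x0002,
    DEMO_ALLOW_FFFF      = 0x0004,
    DEMO_ALLOW_SUPER     = 0x0008,
    /* Composite mask: sets every bit above, including DEMO_ALLOW_FFFF. */
    DEMO_ALLOW_ANY       = DEMO_ALLOW_SURROGATE | DEMO_ALLOW_FDD0 |
                           DEMO_ALLOW_FFFF | DEMO_ALLOW_SUPER
};

/* Same shape as the `!(flags & UNICODE_ALLOW_FFFF)` test quoted earlier:
 * true when the caller has not opted out of the xxFFFE/xxFFFF check. */
bool ffff_check_wanted(uint32_t flags)
{
    return !(flags & DEMO_ALLOW_FFFF);
}
```

With these definitions `ffff_check_wanted(0)` is true, while `ffff_check_wanted(DEMO_ALLOW_ANY)` is false even though the caller never wrote `DEMO_ALLOW_FFFF` or `4`.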

p5pRT commented 18 years ago

From @ysth

On Sat, Jan 21, 2006 at 06:44:16PM +0100, Marc Lehmann wrote:

On Fri, Jan 20, 2006 at 08:00:40PM -0800, SADAHIRO Tomoyuki via RT <perlbug-followup@perl.org> wrote:

Malformed UTF-8 character (character 0xffff) in regexp compilation at /opt/rxvt/lib/urxvt/perl/readline line 26.

(and some other places).

It does allow chr(65534), though.

To allow it, perl requires the statement <no warnings 'utf8';>.

I verified that "no warnings 'utf8'" actually gets rid of the warning. However, the character is still being ignored:

    # perl -e 'no warnings 'utf8'; $c = chr 65535; $c=~s/$c//g; print $c'|xxd
    0000000: efbf bf
    # perl -e 'no warnings 'utf8'; $c = chr 65534; $c=~s/$c//g; print $c'|xxd

Verified in blead and maint. Note that the problem is just with an interpolated character:

    $ perl5.9.3 -XDr -e'$c="\x{ffff}"; /$c/'
    Omitting $` $& $' support.

    EXECUTING...

    Compiling REx "o??"
    size 3 Got 28 bytes for offset annotations.
    first at 1
    rarest char at 0
       1: EXACT <\x{0}>(3)
       3: END(0)
    anchored utf8 "" at 0 (checking anchored isall) minlen 1
    Offsets: [3]
       1[3] 0[0] 4[0]
    Freeing REx: \x{ffff}

    $ perl5.9.3 -XDr -e'/\x{ffff}/'
    Compiling REx "\x{ffff}"
    size 3 Got 28 bytes for offset annotations.
    first at 1
    rarest char ? at 1
       1: EXACT <\x{ffff}>(3)
       3: END(0)
    anchored utf8 "o??" at 0 (checking anchored isall) minlen 1
    Offsets: [3]
       1[7] 0[0] 9[0]
    Omitting $` $& $' support.

    EXECUTING...

    Freeing REx: \\x{ffff}

I'm thinking all the utf8n_to_uv* calls in regcomp should use UTF8_ALLOW_ANYUV. The only point I see in warning about things like \x{ffff} would be if they were literal utf8 in the source code, and it looks to me like those are caught outside of pregcomp.

p5pRT commented 18 years ago

From nospam-abuse@bloodgate.com

Moin,

On Sunday 22 January 2006 09:05, Yitzchak Scott-Thoennes wrote:

On Sat, Jan 21, 2006 at 01:17:25PM +0100, Tels wrote:

Moin,

grepping the blead source for UNICODE_ALLOW_FFFF:

    utf8.c: !(flags & UNICODE_ALLOW_FFFF))) &&
    utf8.h:#define UNICODE_ALLOW_FFFF 0x0004 /* Allow 0xFFF[EF], 0x1FFF[EF], ... */

So it is defined and checked, but not used elsewhere, anytime. Hm.

Either the constant (0x0004) is used hard-coded somewhere, or it is never set on the flags, and thus can never be true...

UNICODE_ALLOW_ANY (which includes the UNICODE_ALLOW_FFFF flag) is used in several places where we want to allow any arbitrary character without warning.

Ah. Missed that. In any event, there is no direct test for UNICODE_ALLOW_FFFF, only for UNICODE_ALLOW_FFFE, it seems.

A:

    grep "0x[0]*4 " * -r

doesn't yield any results, maybe it is used as flags += 4; or something - though I doubt it.

Disabling the warnings just works around the bug that FFFF is not allowed and there seems to be no way to actually allow it.

I don't understand what you are saying. The code in utf8.c only issues a warning; it has no other effect. If the warning is turned off, it does nothing, and the character can be considered allowed.

But you can only turn off all warnings, not just the one for this particular character. And I think the original question was: why does it warn? Should it warn?

(I cannot answer this, my knowledge of unicode is too limited for that).

Best wishes,

Tels

-- Signed on Sun Jan 22 12:23:52 2006 with key 0x93B84C15. Visit my photo gallery at http://bloodgate.com/photos/ PGP key on http://bloodgate.com/tels.asc or per email.

Firefox: What are you trying to tell me, that I can block pop-ups?
Morpheus: I'm trying to tell you that when you're ready, you won't have to.
  -- Skyshadow (508) on 2004-11-30 at /.

p5pRT commented 18 years ago

From schmorp@schmorp.de

On Sun, Jan 22, 2006 at 03:38:56AM -0800, Tels via RT <perlbug-followup@perl.org> wrote:

Disabling the warnings just works around the bug that FFFF is not allowed and there seems to be no way to actually allow it.

I don't understand what you are saying. The code in utf8.c only issues a warning; it has no other effect. If the warning is turned off, it does nothing, and the character can be considered allowed.

But you can only turn off all warnings, not just the one for this particular character. And I think the original question was: why does it warn? Should it warn?

No, the original question was why 0xffff and 0xfffe are not handled the same way.

However, the issue was deeper: I didn't know that you even could turn this warning off.

Turns out there are multiple issues:

- warnings that mention "Malformed UTF-*" are simply broken unless they are in response to utf8::decode or other such functions, as perl supports octets OR unicode text in scalars, and _both_ can be encoded as latin1 or utf-8 internally, but this is an implementation detail. It should be "illegal (unicode) character point" or something shorter, as utf-8 (or utf-16) is not involved when I use chr 65535 anywhere.

- regexes ignore chr 65535 when interpolating scalars, but not 65534. They should ignore neither, which is independent of the warning.


p5pRT commented 18 years ago

From BQW10602@nifty.com

On Sun, 22 Jan 2006 00:42:32 -0800, Yitzchak Scott-Thoennes <sthoenna@efn.org> wrote

I'm thinking all the utf8n_to_uv* calls in regcomp should use UTF8_ALLOW_ANYUV. The only point I see in warning about things like \x{ffff} would be if they were literal utf8 in the source code, and it looks to me like those are caught outside of pregcomp.

I had assumed that (ckWARN(WARN_UTF8) ? 0 : UTF8_ALLOW_ANY) is the default flags value for utf8n_to_uvuni and utf8n_to_uvchr, since it is used by utf8_to_uvchr() and utf8_to_uvuni().

But actually several kinds of flags are used in perl-current, as below:

    ckWARN(WARN_UTF8) ? 0 : UTF8_ALLOW_ANY
    ckWARN(WARN_UTF8) ? 0 : UTF8_ALLOW_ANYUV
    0
    UTF8_ALLOW_ANYUV
    UTF8_ALLOW_ANY

so I am not sure whether 0 or UTF8_ALLOW_ANYUV should be preferred...

Regards,
SADAHIRO Tomoyuki

P.S. a list of usages of utf8n_to_uvuni and utf8n_to_uvchr (files in subdirectories are not looked into, though)

    cf. legend
    ### <file>:<function>
    <file>(<line number(s)>): <code>

### doop.c:S_do_trans_simple
doop.c(72):
  const UV c = utf8n_to_uvchr(s, send - s, &ulen, 0);

### doop.c:S_do_trans_count
doop.c(122):
  const UV c = utf8n_to_uvchr(s, send - s, &ulen, 0);

### doop.c:Perl_do_vop
doop.c(1230): luc = utf8n_to_uvchr((U8*)lc, lulen, &ulen, UTF8_ALLOW_ANYUV);
doop.c(1233): ruc = utf8n_to_uvchr((U8*)rc, rulen, &ulen, UTF8_ALLOW_ANYUV);
doop.c(1245): luc = utf8n_to_uvchr((U8*)lc, lulen, &ulen, UTF8_ALLOW_ANYUV);
doop.c(1248): ruc = utf8n_to_uvchr((U8*)rc, rulen, &ulen, UTF8_ALLOW_ANYUV);
doop.c(1257): luc = utf8n_to_uvchr((U8*)lc, lulen, &ulen, UTF8_ALLOW_ANYUV);
doop.c(1260): ruc = utf8n_to_uvchr((U8*)rc, rulen, &ulen, UTF8_ALLOW_ANYUV);

### op.c:Perl_pmtrans
op.c(2495): cp[2*i] = utf8n_to_uvuni(t, tend-t, &ulen, 0);
op.c(2499): cp[2*i+1] = utf8n_to_uvuni(t, tend-t, &ulen, 0);
op.c(2553): tfirst = (I32)utf8n_to_uvuni(t, tend - t, &ulen, 0);
op.c(2557): tlast = (I32)utf8n_to_uvuni(t, tend - t, &ulen, 0);
op.c(2567): rfirst = (I32)utf8n_to_uvuni(r, rend - r, &ulen, 0);
op.c(2571): rlast = (I32)utf8n_to_uvuni(r, rend - r, &ulen, 0);

### pp.c:Perl_pp_complement
pp.c(2371): const UV c = utf8n_to_uvchr(tmps, send-tmps, &l, UTF8_ALLOW_ANYUV);
pp.c(2385): const UV c = utf8n_to_uvchr(tmps, send-tmps, &l, UTF8_ALLOW_ANYUV);
pp.c(2397): const U8 c = (U8)utf8n_to_uvchr(tmps, 0, &l, UTF8_ALLOW_ANY);

### pp.c:Perl_pp_ord
pp.c(3265): utf8n_to_uvchr(s, UTF8_MAXBYTES, 0, UTF8_ALLOW_ANYUV) :

### pp_pack.c:uni_to_byte
pp_pack.c(621-622):
  UV val = utf8n_to_uvchr((U8 *) *s, end-*s, &retlen,
  ckWARN(WARN_UTF8) ? 0 : UTF8_ALLOW_ANY);

### pp_pack.c:uni_to_bytes
pp_pack.c(651-652):
  const U32 flags = ckWARN(WARN_UTF8) ?
  UTF8_CHECK_ONLY : (UTF8_CHECK_ONLY | UTF8_ALLOW_ANY);
pp_pack.c(655): val = utf8n_to_uvchr((U8 *) from, end-from, &retlen, flags);
pp_pack.c(674): utf8n_to_uvuni((U8 *) ptr, end-ptr, &retlen, flags);

### pp_pack.c:next_uni_uu
pp_pack.c(694):
  const UV val = utf8n_to_uvchr((U8 *) *s, end-*s, &retlen, UTF8_CHECK_ONLY);

### pp_pack.c:NEXT_UNI_VAL
pp_pack.c(776): val = utf8n_to_uvchr((U8 *) str, end-str, &retlen, utf8_flags);

### pp_pack.c:S_unpack_rec
pp_pack.c(1635-1636):
  const UV val = utf8n_to_uvchr((U8 *) s, strend-s, &retlen,
  ckWARN(WARN_UTF8) ? 0 : UTF8_ALLOW_ANY)
pp_pack.c(1689):
  auv = utf8n_to_uvuni(result, len, &retlen,
  ckWARN(WARN_UTF8) ? 0 : UTF8_ALLOW_ANYUV);
pp_pack.c(1692):
  auv = utf8n_to_uvuni((U8*)s, strend - s, &retlen,
  ckWARN(WARN_UTF8) ? 0 : UTF8_ALLOW_ANYUV);

### regcomp.c:TRIE_READ_CHAR
regcomp.c(759):
  uvc = utf8n_to_uvuni( scan, UTF8_MAXLEN, &len, uniflags );
regcomp.c(764):
  uvc = utf8n_to_uvuni( (const U8*)uc, UTF8_MAXLEN, &len, uniflags);
regcomp.c(770):
  uvc = utf8n_to_uvuni( (const U8*)uc, UTF8_MAXLEN, &len, uniflags);

### regcomp.c:S_make_trie
regcomp.c(809):
  const U32 uniflags = ckWARN(WARN_UTF8) ? 0 : UTF8_ALLOW_ANY;

### regcomp.c:S_regatom
regcomp.c(4269-4270):
  ender = utf8n_to_uvchr((U8*)p, RExC_end - p, &numlen, 0);

### regcomp.c:S_regclass
regcomp.c(4693-4695):
  value = utf8n_to_uvchr((U8*)RExC_parse, RExC_end - RExC_parse, &numlen, 0);
regcomp.c(4705-4707):
  value = utf8n_to_uvchr((U8*)RExC_parse, RExC_end - RExC_parse, &numlen, 0);

### regexec.c:S_find_byclass
regexec.c(1036):
  const U32 uniflags = ckWARN(WARN_UTF8) ? 0 : UTF8_ALLOW_ANY;
regexec.c(1041-42):
  c1 = utf8n_to_uvchr(tmpbuf1, UTF8_MAXBYTES_CASE, 0, uniflags);
regexec.c(1043-44):
  c2 = utf8n_to_uvchr(tmpbuf2, UTF8_MAXBYTES_CASE, 0, uniflags);
regexec.c(1083):
  const U32 uniflags = ckWARN(WARN_UTF8) ? 0 : UTF8_ALLOW_ANY;
regexec.c(1088-89):
  c = utf8n_to_uvchr((U8*)s, UTF8_MAXBYTES, &len, uniflags);
regexec.c(1115-16):
  c = utf8n_to_uvchr((U8*)s, UTF8_MAXBYTES, &len, uniflags);
regexec.c(1185):
  tmp = utf8n_to_uvchr(r, UTF8SKIP(r), 0, 0);
regexec.c(1227):
  tmp = utf8n_to_uvchr(r, UTF8SKIP(r), 0, 0);

### regexec.c:S_regmatch
regexec.c(2411):
  U32 uniflags = ckWARN(WARN_UTF8) ? 0 : UTF8_ALLOW_ANY;
regexec.c(2609):
  uvc = utf8n_to_uvuni( uscan, UTF8_MAXLEN, &len, uniflags );
regexec.c(2615):
  uvc = utf8n_to_uvuni( (U8*)uc, UTF8_MAXLEN, &len, uniflags );
regexec.c(2673):
  uvc = utf8n_to_uvuni( (U8*)uc, UTF8_MAXLEN, &len, uniflags );
regexec.c(2800-01):
  utf8n_to_uvuni((U8*)l, UTF8_MAXBYTES, &ulen, uniflags))
regexec.c(2814-15):
  utf8n_to_uvuni((U8*)s, UTF8_MAXBYTES, &ulen, uniflags))
regexec.c(2976):
  ln = utf8n_to_uvchr(r, UTF8SKIP(r), 0, 0);
regexec.c(3929-30):
  c1 = utf8n_to_uvuni(tmpbuf1, UTF8_MAXBYTES, 0, uniflags);
regexec.c(3931-32):
  c2 = utf8n_to_uvuni(tmpbuf2, UTF8_MAXBYTES, 0, uniflags);
regexec.c(3935-36):
  c2 = c1 = utf8n_to_uvchr(s, UTF8_MAXBYTES, 0, uniflags);
regexec.c(3995-97):
  utf8n_to_uvchr((U8*)locinput, UTF8_MAXBYTES, &len, uniflags) != (UV)c1) {
regexec.c(4006-08):
  UV c = utf8n_to_uvchr((U8*)locinput, UTF8_MAXBYTES, &len, uniflags);
regexec.c(4042-44):
  c = utf8n_to_uvchr((U8*)PL_reginput, UTF8_MAXBYTES, 0, uniflags);
regexec.c(4091-93):
  c = utf8n_to_uvchr((U8*)PL_reginput, UTF8_MAXBYTES, 0, uniflags);
regexec.c(4113-15):
  c = utf8n_to_uvchr((U8*)PL_reginput, UTF8_MAXBYTES, 0, uniflags);

### regexec.c:S_reginclass
regexec.c(4714-16):
  c = utf8n_to_uvchr(p, UTF8_MAXBYTES, &len,
  ckWARN(WARN_UTF8) ? UTF8_CHECK_ONLY :
  UTF8_ALLOW_ANYUV|UTF8_CHECK_ONLY);

### sv.c:Perl_sv_pos_b2u
sv.c(5394): utf8n_to_uvchr(s, UTF8SKIP(s), &n, 0);

### sv.c:Perl_sv_vcatpvfn
sv.c(8393-94): uv = utf8n_to_uvchr(vecstr, veclen, &ulen, UTF8_ALLOW_ANYUV);
sv.c(8479-80): uv = utf8n_to_uvchr(vecstr, veclen, &ulen, UTF8_ALLOW_ANYUV);

### toke.c:Perl_str_to_version
toke.c(1056): n = utf8n_to_uvchr((U8*)start, len, &skip, 0);

### toke.c:S_scan_const
toke.c(1890):
  const UV uv = (this_utf8) ? utf8n_to_uvchr((U8*)s, send - s, &len, 0)
  : (UV) ((U8) *s);

### utf8.c:Perl_utf8_to_uvchr
utf8.c(618-619):
  return utf8n_to_uvchr(s, UTF8_MAXBYTES, retlen,
  ckWARN(WARN_UTF8) ? 0 : UTF8_ALLOW_ANY);

### utf8.c:Perl_utf8_to_uvuni
utf8.c(642-643):
  return Perl_utf8n_to_uvuni(aTHX_ s, UTF8_MAXBYTES, retlen,
  ckWARN(WARN_UTF8) ? 0 : UTF8_ALLOW_ANY);

### utf8.c:Perl_swash_fetch
utf8.c(1728-30):
  const UV code_point = utf8n_to_uvuni(ptr, UTF8_MAXBYTES, 0,
  ckWARN(WARN_UTF8) ?
  0 : UTF8_ALLOW_ANY);

### end of list ###

p5pRT commented 18 years ago

From BQW10602@nifty.com

On Sat, 21 Jan 2006 18:39:26 +0100, Marc Lehmann <schmorp@schmorp.de> wrote

On Sat, Jan 21, 2006 at 04:18:21AM -0800, Tels via RT <perlbug-followup@perl.org> wrote:

doesn't yield any results, maybe it is used as flags += 4; or something - though I doubt it.

Disabling the warnings just works around the bug that FFFF is not allowed and there seems to be no way to actually allow it.

Well, not allowing FFFF has some merit, too, but either all illegal codepoints should be disallowed or none at all.

The "Malformed UTF-8"... is also not quite a warning, as the resulting regex won't work (it works neither in s/// nor in y///, and it's not related to character constants):

    # perl -e '$c = chr 65535; $c=~s/$c//g; print $c'|xxd
    Malformed UTF-8 character (character 0xffff) in regexp compilation at -e line 1.
    Malformed UTF-8 character (character 0xffff) in regexp compilation at -e line 1.
    0000000: efbf bf ...

At least outside the scope of use warnings 'utf8', code points such as U+FFFF should be allowed. Perl-current allows s/\x{ffff}//g (escaped) to remove U+FFFF, but neither tr/\x{ffff}//d nor s/${\chr(0xffff)}//g (interpolated and parsed as a literal) does so; that is inconsistent.

A patch is attached to this mail; the filename is allowFFFF.patch.gz.

There I define a UTF8_ALLOW_DEFAULT macro in utf8.h, to help choose the flags for utf8n_to_uv(chr|uni) consistently.

    #define UTF8_ALLOW_DEFAULT (ckWARN(WARN_UTF8) ? 0 : UTF8_ALLOW_ANYUV)

cf. a report on what flags are used in perl-current: http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2006-01/msg00842.html

The reason why utf8n_to_uvchr in S_reginclass has (UTF8_ALLOW_DEFAULT & UTF8_ALLOW_ANYUV) instead of UTF8_ALLOW_DEFAULT is that the problem of [perl #37836] should not come back if UTF8_ALLOW_DEFAULT were ever changed to include UTF8_ALLOW_ANY instead of UTF8_ALLOW_ANYUV.
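The masking argument can be sketched in a few lines of C. All `DEMO_`/`demo_` names and the numeric values are hypothetical stand-ins chosen for this example; they are not the definitions in perl's utf8.h, and a plain `bool` parameter stands in for `ckWARN(WARN_UTF8)`. The only property assumed is that the ANYUV bits are a subset of the ANY bits.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative values only: ANYUV must be a strict subset of ANY. */
#define DEMO_UTF8_ALLOW_ANYUV 0x0060u
#define DEMO_UTF8_ALLOW_ANY   0x00FFu

/* Stand-in for the proposed default: strict (0) while the warning category
 * is enabled, permissive about code point values otherwise. */
uint32_t demo_allow_default(bool utf8_warnings_enabled)
{
    return utf8_warnings_enabled ? 0u : DEMO_UTF8_ALLOW_ANYUV;
}

/* Stand-in for a hypothetical later default that also tolerates
 * malformed sequences (the ANY bits). */
uint32_t demo_allow_default_wide(bool utf8_warnings_enabled)
{
    return utf8_warnings_enabled ? 0u : DEMO_UTF8_ALLOW_ANY;
}

/* The masking described for S_reginclass: whatever the default contains,
 * only the ANYUV bits survive. */
uint32_t demo_reginclass_flags(uint32_t default_flags)
{
    return default_flags & DEMO_UTF8_ALLOW_ANYUV;
}
```

Under these assumptions `demo_reginclass_flags` returns the same value for both defaults, which is the stability the patch description is after: widening the default later would not change what the character-class matcher accepts.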

This patch also includes a test for #37836 as well as tests for this problem, #38293.

Regards,
SADAHIRO Tomoyuki

p5pRT commented 18 years ago

From BQW10602@nifty.com

allowFFFF.patch.gz

p5pRT commented 18 years ago

From @rgarcia

On 4/2/06, SADAHIRO Tomoyuki <bqw10602@nifty.com> wrote:

At least outside the scope of use warnings 'utf8', code points such as U+FFFF should be allowed.

I don't think this is documented (the effect of warnings on that); it should be, IMO. Care to submit a doc patch? perlunicode.pod is probably the best place.

Thanks, your patch applied as #27688.

p5pRT commented 18 years ago

From BQW10602@nifty.com

On Sun, 2 Apr 2006 23:00:48 +0200, "Rafael Garcia-Suarez" <rgarciasuarez@gmail.com> wrote

On 4/2/06, SADAHIRO Tomoyuki <bqw10602@nifty.com> wrote:

At least outside the scope of use warnings 'utf8', code points such as U+FFFF should be allowed.

I don't think this is documented (the effect of warnings on that); it should be, IMO. Care to submit a doc patch? perlunicode.pod is probably the best place.

Currently perl doesn't search a string for Unicode's noncharacters like U+FFFF by their byte sequences, such as "\xef\xbf\xbf" for U+FFFF in UTF-8 (of course, the byte sequences in UTF-EBCDIC are different).

In other words, perl's warning against a noncharacter is generated only when perl sees the code value as an integer. There are two cases: one is the conversion from integer to string, which is implemented as Perl_uvuni_to_utf8_flags; the other is the conversion from string to integer, which is implemented as Perl_utf8n_to_uvuni.

There are some string operations where a string containing noncharacters causes no warning even under use warnings 'utf8'.

First, operations which don't require the code value include assign (=), cmp, eq, chomp, concat (.), length, repeat (x), substr.

Second, operations which suppress the warning through the flags include complement (~), ord.

Under the circumstances I doubt the documentation would have any explanation of when use warnings 'utf8' should warn against noncharacters.

And the Unicode standard states:

- noncharacters are represented as byte sequences that are well-formed (cf. TUS 4.0, definition D36 in Section 3.9, p.78); they are neither ill-formed nor malformed.
- applications are free to use any of the noncharacters for internal uses (cf. TUS 4.0, Section 15.8, p.400); it is open interchange and interpretation as abstract characters (cf. TUS 4.0, conformance C5 in Section 3.2, p.59) that are forbidden for noncharacters.

Why not allow noncharacters?

Regards,
SADAHIRO Tomoyuki

p5pRT commented 18 years ago

From guest@guest.guest.xxxxxxxx

See also perl #38722

Allowing U+FFFE is a latent security hole\, much the same as decoding overlong utf-8 sequences is. If an attacker can get a U+FFFE character past a security-sensitive syntax checker and into a string\, then if the string is subsequently encoded in UTF-16 and then decoded\, the UTF-16 decoder will see a reversed BYTE ORDER MARK and start byte-swapping the remaining data. That permits the attacker to get characters interpreted that the syntax checker would have refused.

(I have tried multiple times to raise this issue with the Unicode Consortium, but all of my mail on the topic has been ignored.)

John Gardiner Myers jgmyers@​proofpoint.com

p5pRT commented 18 years ago

From schmorp@schmorp.de

On Wed, Apr 19, 2006 at 05:40:31PM -0700, Guest via RT <perlbug-followup@perl.org> wrote:

Allowing U+FFFE is a latent security hole, much the same as decoding

It's as much a security hole as system(), so while system() is security-relevant, it still needs to be there.

The same is true for being able to process U+FFFE.

I completely agree that it must be possible to check for this, but simply not processing it is wrong: combining characters are the same form of "latent security hole", but you just have to live with them.

--
The choice of a GNU generation
Marc Lehmann
pcg@goof.com
http://schmorp.de/
XX11-RIPE

p5pRT commented 18 years ago

From guest@guest.guest.xxxxxxxx

[schmorp@schmorp.de - Wed Apr 19 19:54:13 2006]: It's as much a security hole as system(), so while system() is security-relevant, it still needs to be there.

system() is protected by such things as the tainting subsystem and syntax checking on untrusted input.

Combining characters are also the same form of "latent security hole", but you just have to live with them.

U+FFFE provides a hole by which an attacker can get arbitrary characters, such as '/' and '*', past security-sensitive syntax checkers attempting to block such characters.

Combining characters provide no such method for getting characters such as '/' and '*' past these security-sensitive syntax checkers, so combining characters are not the same form of latent security hole.

p5pRT commented 18 years ago

From schmorp@schmorp.de

On Thu, Apr 27, 2006 at 02:38:53PM -0700, Guest via RT <perlbug-followup@perl.org> wrote:

Combining characters are also the same form of "latent security hole", but you just have to live with them.

U+FFFE provides a hole by which an attacker can get arbitrary characters, such as '/' and '*', past security-sensitive syntax checkers attempting to block such characters.

That is an empty claim, and hardly believable (you seem to confuse aspects of UTF-8 encoding with U+FFFE, but the two don't go together. A security-sensitive syntax checker either correctly flags invalidly encoded characters or it's simply broken).

Nevertheless, perl should handle U+FFFE regardless of whether other software breaks or not, so the whole discussion about security is off-topic.

--
The choice of a GNU generation
Marc Lehmann
pcg@goof.com
http://schmorp.de/
XX11-RIPE

p5pRT commented 14 years ago

From @khwilliamson

Attached is a minimal patch to fix this. More work needs to be done on noncharacter code points, but not in this patch.

p5pRT commented 14 years ago

From @khwilliamson

0001-Allow-U-0FFFF-in-regex.patch

```diff
From 7d043505653febb64618275197c542739e03c1b6 Mon Sep 17 00:00:00 2001
From: Karl Williamson
Date: Thu, 17 Dec 2009 20:07:32 -0700
Subject: [PATCH] Allow U+0FFFF in regex

---
 regexec.c           |    6 ++++--
 t/re/pat_advanced.t |   13 ++++++++++++-
 2 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/regexec.c b/regexec.c
index 11c408f..17a0dc6 100644
--- a/regexec.c
+++ b/regexec.c
@@ -5948,8 +5948,10 @@ S_reginclass(pTHX_ const regexp *prog, register const regnode *n, register const

     if (do_utf8 && !UTF8_IS_INVARIANT(c)) {
         c = utf8n_to_uvchr(p, UTF8_MAXBYTES, &len,
-                (UTF8_ALLOW_DEFAULT & UTF8_ALLOW_ANYUV) | UTF8_CHECK_ONLY);
-                /* see [perl #37836] for UTF8_ALLOW_ANYUV */
+                (UTF8_ALLOW_DEFAULT & UTF8_ALLOW_ANYUV)
+                | UTF8_ALLOW_FFFF | UTF8_CHECK_ONLY);
+                /* see [perl #37836] for UTF8_ALLOW_ANYUV; [perl #38293] for
+                 * UTF8_ALLOW_FFFF */
         if (len == (STRLEN)-1)
             Perl_croak(aTHX_ "Malformed UTF-8 character (fatal)");
     }
diff --git a/t/re/pat_advanced.t b/t/re/pat_advanced.t
index a0eec58..3a66a0c 100644
--- a/t/re/pat_advanced.t
+++ b/t/re/pat_advanced.t
@@ -21,7 +21,7 @@ BEGIN {
 }


-plan tests => 1142;  # Update this when adding/deleting tests.
+plan tests => 1143;  # Update this when adding/deleting tests.

 run_tests() unless caller;

@@ -1770,6 +1770,17 @@ sub run_tests {
         iseq $_, "!Bang!1!Bang!2!Bang!3!Bang!";
     }

+    {
+        # Earlier versions of Perl said this was fatal.
+        local $Message = "U+0FFFF shouldn't crash the regex engine";
+        no warnings 'utf8';
+        my $a = eval "chr(65535)";
+        use warnings;
+        my $warning_message;
+        local $SIG{__WARN__} = sub { $warning_message = $_[0] };
+        eval $a =~ /[a-z]/;
+        ok(1);  # If it didn't crash, it worked.
+    }
 } # End of sub run_tests

 1;
--
1.5.6.3
```

p5pRT commented 14 years ago

From @rgarcia

2009/12/18 karl williamson <public@khwilliamson.com>:

Attached is a minimal patch to fix this. More work needs to be done on noncharacter code points, but not in this patch.

Thanks, applied to blead as 6182169b72782336c6202161aa4cde16ac88296e

p5pRT commented 14 years ago

@rgs - Status changed from 'open' to 'resolved'