Closed p5pRT closed 13 years ago
The terse test would be:
$_ = "\x{df}"; utf8::upgrade($_); print /[\w]/ ^ /[\W]/ ? "ok\n" : "not ok\n";
A more illustrative test to wake you up:
% /usr/local/perl-5.8.0@18125/bin/perl -le '
$_ = "\x{df}"; utf8::upgrade($_);
if (/([\w])/){
  warn "a letter";
}
if (/([\W])/){
  warn "not a letter";
}
'
a letter at -e line 4.
not a letter at -e line 7.
I've tested with all perls since patch 9340 and they all misbehave.
I now took a look at this and it is surprisingly tricky (or so I think). The problem comes from the dual model of Perl: bytes and (Unicode) characters. The [\W] falsely matches because \xDF is *on* in the *byte* side of the \W. My quick, straightforward attempts at fixing this break one of the split.t tests, where the split pattern includes \x80.
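To make the bytes/characters split concrete, here is a minimal sketch that holds the same character in both internal representations and reports what each one matches. The helper name `classify` is hypothetical and not taken from the ticket; the results depend on the perl version, so no expected output is shown.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the ticket): report how a string fares
# against a bracketed \w and a bracketed \W, plus its internal UTF-8 flag.
sub classify {
    my ($str) = @_;
    return {
        utf8_flag => (utf8::is_utf8($str) ? 1 : 0),
        word      => ($str =~ /[\w]/ ? 1 : 0),
        non_word  => ($str =~ /[\W]/ ? 1 : 0),
    };
}

my $bytes = "\x{df}";      # stored internally as a single byte
my $chars = "\x{df}";
utf8::upgrade($chars);     # same character, stored internally as UTF-8

for my $case ([bytes => $bytes], [upgraded => $chars]) {
    my ($label, $str) = @$case;
    my $r = classify($str);
    printf "%-8s utf8=%d [\\w]=%d [\\W]=%d\n",
        $label, @{$r}{qw(utf8_flag word non_word)};
}
```

On an affected perl the upgraded string should report 1 for both [\w] and [\W], which is the misbehaviour described in this ticket. Note that the two strings still compare equal with `eq`; only the internal storage differs.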
(Andreas J. Koenig) (via RT) perlbug@perl.org wrote:
:A more illustrative test to wake you up:
:
:% /usr/local/perl-5.8.0@18125/bin/perl -le '
:$_ = "\x{df}"; utf8::upgrade($_);
:if (/([\w])/){
:  warn "a letter";
:}
:if (/([\W])/){
:  warn "not a letter";
:}
:'
:a letter at -e line 4.
:not a letter at -e line 7.
:
:I've tested with all perls since patch 9340 and they all misbehave.
I've talked a bit with Jarkko about this, and we can't at this time come up with any fix other than to document the behaviour.
The core issue is that in the old 'bytes' world, \xdf (and in fact all of \x80-\xff) are not treated as letters, except when subject to the vagaries of locale. In the Unicode world, \xdf maps to a character that is defined under Unicode as being a letter. And we are still trying to support both definitions.
There doesn't seem to be any consistent way of redefining the cases that doesn't break some existing tests.
Note that when used outside of a character class, \xdf does not match /\W/; I don't currently understand why it is different in this case.
Jarkko's suggested workarounds:
(1) \w and \W: with bytes, \xDF is not a letter (except if using locale and the locale thinks \xDF is a letter); with Unicode, \xDF is a letter
(2) \pL and \PL: input can be either byte or Unicode (and \xDF is a letter)
(3) \p{Word} and \P{Word}: ditto
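Workarounds (2) and (3) can be sketched as follows. The helper names `is_word_char` and `is_letter` are hypothetical and exist only for illustration; the sketch assumes, as the workaround list says, that the property escapes apply Unicode rules whichever way the string is stored.

```perl
use strict;
use warnings;

# Hypothetical helpers (not from the ticket): test the first character
# of a string with Unicode property escapes instead of \w.
sub is_word_char { my ($str) = @_; return $str =~ /\A\p{Word}/ ? 1 : 0 }
sub is_letter    { my ($str) = @_; return $str =~ /\A\pL/      ? 1 : 0 }

my $bytes = "\x{df}";      # byte-encoded
my $chars = "\x{df}";
utf8::upgrade($chars);     # UTF-8-encoded internally

for my $s ($bytes, $chars) {
    printf "utf8=%d \\p{Word}=%d \\pL=%d\n",
        (utf8::is_utf8($s) ? 1 : 0), is_word_char($s), is_letter($s);
}
```

The point of this design is that the answer no longer depends on the string's internal encoding, at the cost of giving up the locale-sensitive behaviour that \w has on byte strings.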
Hugo
I documented this in perlunicode.pod as change #22031.
I think there are other tickets fixed by this, but couldn't find any. I know I wrote one myself a couple years ago.
regen required
This is part of the Unicode Bug.
The bug is fixed only for /u regexes. Yves' analysis was that it wasn't fixable otherwise, and this is part of the reason we are adding /u.
This patch requires [perl #78722] to be applied. Both are also available at git://github.com/khwilliamson/perl.git branch matching.
Essentially, the patch just uses Unicode semantics if that is called for. The macros that do this have been applied earlier, but there was a bug in one of them that didn't surface until this patch.
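As a rough illustration of what the /u fix means for the original test case, the sketch below forces Unicode rules on both representations of the string. It assumes a perl in which the /u modifier is available (5.14 or later); the helper name `classify_u` is hypothetical.

```perl
use strict;
use warnings;

# Assumes perl 5.14 or later, where the /u regex modifier exists.
# Hypothetical helper (not from the patch): return two flags saying
# whether the string matches [\w] and [\W] with Unicode rules forced.
sub classify_u {
    my ($str) = @_;
    return ($str =~ /[\w]/u ? 1 : 0, $str =~ /[\W]/u ? 1 : 0);
}

my $bytes = "\x{df}";      # byte-encoded
my $chars = "\x{df}";
utf8::upgrade($chars);     # UTF-8-encoded internally

for my $case ([bytes => $bytes], [upgraded => $chars]) {
    my ($label, $str) = @$case;
    my ($word, $non_word) = classify_u($str);
    printf "%-8s [\\w]/u=%d [\\W]/u=%d\n", $label, $word, $non_word;
}
```

With /u in effect, both strings should be classified the same way: \xDF is a word character and is not matched by [\W].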
I shouldn't have created a new ticket for this. I've merged #78726 into #18281.
In one of the commit messages, I say this is fixed for /u regexes only, and not fixable for non-/u regexes. However, in the work I'm doing to solve the /iu problem, I've discovered several wrong things in the old code that, when corrected, will also solve this for non-/u regexes.
Patches 4-10 have been applied as:
cbc24f92709e23449028ec3036bda16c0af294fb
d35dd6c678badc24d545f8b7b7a3ebdf0fb0b355
e486b3ccda3754fd159530607148c92cbfcbddf8
aedd44b501ab1196eeb3ebe56ef7647debb77eab
9b7c43baf09d4c57d5cd6c9a052ce398d1626a6a
0399b2152e23eb6ce1f09562d53b87be7fe30924
7bbf947b84c2a0700fd31acb7a31342cd0b8f796
I added a comma before ‘respectively’ to patch number 6.
This is now fixed --Karl Williamson
@khwilliamson - Status changed from 'open' to 'resolved'
Migrated from rt.perl.org#18281 (status was 'resolved')
Searchable as RT18281$