different handling of \w in unicode patterns in regex and re

GoogleCodeExporter commented 9 years ago

Hi,
I think this may be intended behaviour, but I didn't find it mentioned 
anywhere in the docs. Sorry if it is already discussed somewhere I haven't 
looked ...
It seems that for unicode patterns like ur"..." regex implicitly sets the 
unicode flag (?u), while re doesn't seem to do that. 

>>> re.findall(ur"\w", u"aáb")
[u'a', u'b']
>>> regex.findall(ur"\w", u"aáb")
[u'a', u'\xe1', u'b']
>>> re.findall(r"\w", u"aáb")
[u'a', u'b']
>>> regex.findall(r"\w", u"aáb")
[u'a', u'b']
>>> re.findall(ur"(?u)\w", u"aáb")
[u'a', u'\xe1', u'b']
>>> regex.findall(ur"(?u)\w", u"aáb")
[u'a', u'\xe1', u'b']
>>> 

Python 2.7.1, win XPp SP3, 32 bit Czech; regex r902c02d44f

regards,
   Vlastimil Brom

Original issue reported on code.google.com by Vlastimil.Brom@gmail.com on 7 Feb 2011 at 1:13

GoogleCodeExporter commented 9 years ago

Ah, yes, if the pattern is a Unicode string then the matching defaults to 
Unicode, and if the pattern is a bytestring then the matching defaults to ASCII.

You can be explicit with regex.UNICODE or "(?u)" and regex.ASCII or "(?a)".
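
For illustration, a minimal sketch of being explicit about the matching mode. It assumes Python 3 (not the Python 2.7 used in the report) and uses the standard-library re module, which in Python 3 also defaults to Unicode matching for str patterns and accepts re.ASCII and "(?a)". The sample string is the one from the report; the variable names are only for illustration.

```python
import re

text = "a\u00e1b"  # "aáb", the sample string from the report

# str pattern, no flags: Unicode matching is the default in Python 3
unicode_default = re.findall(r"\w", text)

# Inline flag (?a) restricts \w to ASCII word characters
ascii_inline = re.findall(r"(?a)\w", text)

# Equivalent, passing the flag as an argument instead
ascii_flag = re.findall(r"\w", text, flags=re.ASCII)

# bytes pattern: matching is ASCII-only, and the subject must be bytes too
ascii_bytes = re.findall(rb"\w", text.encode("utf-8"))

print(unicode_default)  # ['a', 'á', 'b']
print(ascii_inline)     # ['a', 'b']
print(ascii_flag)       # ['a', 'b']
print(ascii_bytes)      # [b'a', b'b']
```

Spelling the flag out makes the pattern behave the same way regardless of which default a given module or Python version applies.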

The justification is that if you're using Unicode strings then you probably 
want Unicode matching too. I'll make a note to update the docs at some point (I 
don't have any other changes planned).

I would be willing to make it the same as the 're' module if the general 
consensus is that it should be.

Original comment by goo...@mrabarnett.plus.com on 7 Feb 2011 at 2:55

GoogleCodeExporter commented 9 years ago

Thanks for the confirmation; I was just a bit surprised to see different results in 
a script (using re) and my general app (which normally uses regex), where I didn't 
expect a difference between these regex engines.
I am happy with either behaviour; (?u) can simply be added if needed and is 
more explicit. On the other hand, the unicode flag is global and cannot be 
switched off: if one needed a unicode string pattern with special sequences 
interpreted as ASCII, [a-zA-Z0-9_] would be necessary instead of \w (if I 
understand correctly).
That being said, I have no strong personal preference, now that it is 
documented. It would depend on the inclusion policy for the standard library 
(e.g. whether to make this behaviour part of the NEW flag).
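
A small sketch of the workaround mentioned above, where an explicit ASCII class stands in for \w inside a pattern that otherwise uses Unicode matching. It assumes Python 3 and the standard-library re module (str patterns match as Unicode by default there); the sample text and variable names are made up for illustration.

```python
import re

text = "caf\u00e9 \u00e9lan plain_1"  # "café élan plain_1"

# Unicode \w: accented letters count as word characters
unicode_words = re.findall(r"\w+", text)

# Explicit ASCII class: runs break at every non-ASCII letter
ascii_runs = re.findall(r"[a-zA-Z0-9_]+", text)

# Mixed: the first character must be ASCII, the rest may be any Unicode word character
mixed = re.findall(r"\b[a-zA-Z_]\w*", text)

print(unicode_words)  # ['café', 'élan', 'plain_1']
print(ascii_runs)     # ['caf', 'lan', 'plain_1']
print(mixed)          # ['café', 'plain_1']
```

The explicit class only affects the part of the pattern where it is written, so both kinds of matching can be combined in one pattern without a global flag.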

vbr

Original comment by Vlastimil.Brom@gmail.com on 7 Feb 2011 at 1:45

Forever-Young / mrab-regex-hg

different handling of \w in unicode patterns in regex and re #3