Open GoogleCodeExporter opened 9 years ago
Ah, yes, if the pattern is a Unicode string then the matching defaults to
Unicode, and if the pattern is a bytestring then the matching defaults to ASCII.
You can be explicit with regex.UNICODE or "(?u)" and regex.ASCII or "(?a)".
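For illustration, a minimal sketch of the defaults and explicit flags described above. It assumes Python 3 and falls back to the standard-library `re` when the third-party `regex` module is not installed (on Python 3 both modules share these defaults); the alias `rx` and the sample text are only for this example.

```python
# Assumption: Python 3. Use regex if available, otherwise the stdlib re,
# which has the same str/bytes defaults on Python 3.
try:
    import regex as rx
except ImportError:
    import re as rx

text = "caf\u00e9"  # "café"

# str pattern: \w defaults to Unicode matching, so the accented letter matches
default_str = rx.findall(r"\w+", text)

# bytes pattern: \w defaults to ASCII matching, so the UTF-8 bytes of "é" do not match
default_bytes = rx.findall(rb"\w+", text.encode("utf-8"))

# Being explicit about ASCII on a str pattern, via the flag or inline
ascii_flag = rx.findall(r"\w+", text, rx.ASCII)
ascii_inline = rx.findall(r"(?a)\w+", text)

# Being explicit about Unicode inline
unicode_inline = rx.findall(r"(?u)\w+", text)

print(default_str, default_bytes, ascii_flag, ascii_inline, unicode_inline)
```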
The justification is that if you're using Unicode strings then you probably
want Unicode matching too. I'll make a note to update the docs at some point (I
don't have any other changes planned).
I would be willing to make it the same as the 're' module if the general
consensus is that it should be.
Original comment by goo...@mrabarnett.plus.com
on 7 Feb 2011 at 2:55
Thanks for the confirmation; I was just a bit surprised to see different results
in a script (using re) and in my general app (which normally uses regex), where
I didn't expect any difference between the two engines.
I am happy with either behaviour; (?u) can simply be added if needed and is
more explicit. On the other hand, the Unicode flag is global and cannot be
switched off: if one needed a Unicode string pattern with special sequences
interpreted as ASCII, [a-zA-Z0-9_] would be necessary instead of \w (if I
understand correctly).
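A small sketch of that workaround, for comparison: with a Unicode string pattern, the explicit class [a-zA-Z0-9_] gives ASCII-only word matching, and the "(?a)" inline flag mentioned in the reply gives the same result here. This assumes Python 3, where the standard-library `re` is used as a fallback if `regex` is not installed; the sample text and the alias `rx` are made up for the example.

```python
# Assumption: Python 3. Use regex if available, otherwise the stdlib re.
try:
    import regex as rx
except ImportError:
    import re as rx

text = "na\u00efve_42"  # "naïve_42"

# Default for a str pattern: \w is Unicode-aware and matches "ï"
unicode_words = rx.findall(r"\w+", text)

# Explicit ASCII class: "ï" is not in the class, so the match is split around it
explicit_class = rx.findall(r"[a-zA-Z0-9_]+", text)

# Inline ASCII flag: \w is restricted to [a-zA-Z0-9_]
inline_ascii = rx.findall(r"(?a)\w+", text)

print(unicode_words, explicit_class, inline_ascii)
```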
That being said, I have no strong personal preference, now that it is
documented. It would depend on the policy for inclusion in the standard library
(e.g. whether to include this behaviour in the NEW flag).
vbr
Original comment by Vlastimil.Brom@gmail.com
on 7 Feb 2011 at 1:45
Original issue reported on code.google.com by
Vlastimil.Brom@gmail.com
on 7 Feb 2011 at 1:13