Closed GoogleCodeExporter closed 9 years ago
In Python 2 it defaults to ASCII (like the re module), so when working with
Unicode you need to provide the UNICODE flag (or "(?u)" in the regex):
>>> import regex
>>> regex.match(r'\p{Symbol}', u'\ufffd', flags=regex.U)
<_regex.Match object at 0x00EFC3A0>
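For readers on Python 3, here is a minimal sketch of the two equivalent ways to set such a flag: as a `flags` argument or inline in the pattern. It uses the standard-library `re` module as a stand-in (an assumption made so the snippet runs without installing `regex`), and `\w` in place of `\p{Symbol}`, which `re` does not support. The variable names are illustrative only.

```python
import re

text = "\u00e9"  # 'é', a letter outside the ASCII range

# Passing the flag as an argument and embedding it in the pattern
# select the same character-class semantics.
unicode_by_argument = re.match(r"\w", text, flags=re.UNICODE)
unicode_inline = re.match(r"(?u)\w", text)

# The ASCII counterparts: under ASCII semantics, \w does not match 'é'.
ascii_by_argument = re.match(r"\w", text, flags=re.ASCII)
ascii_inline = re.match(r"(?a)\w", text)

print(bool(unicode_by_argument), bool(unicode_inline))
print(bool(ascii_by_argument), bool(ascii_inline))
```

On Python 3 the `(?u)` form is redundant for str patterns, since Unicode matching is already their default there.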
Original comment by re...@mrabarnett.plus.com
on 23 May 2012 at 4:52
I see :) Though admittedly it's counterintuitive, and counter to the
documentation at http://pypi.python.org/pypi/regex, which says explicitly:
If neither the ASCII, LOCALE nor UNICODE flag is specified, it will default to
UNICODE if the regex pattern is a Unicode string and ASCII if it's a bytestring.
That was misleading, even more so as I explicitly use the \p form of the
character classes, which does not exist in the re module. For this reason, I
didn't even consider that this might not be using Unicode :)
diff -r 6f0e839b0db0 regex_2/Features.rst
--- a/regex_2/Features.rst Sun May 06 17:01:09 2012 +0100
+++ b/regex_2/Features.rst Thu May 24 11:00:26 2012 +0200
@@ -93,7 +93,7 @@
The global flags are: ``ASCII``, ``BESTMATCH``, ``ENHANCEMATCH``, ``LOCALE``, ``REVERSE``, ``UNICODE``, ``VERSION0``, ``VERSION1``.
-If neither the ``ASCII``, ``LOCALE`` nor ``UNICODE`` flag is specified, it
will default to ``UNICODE`` if the regex pattern is a Unicode string and
``ASCII`` if it's a bytestring.
+If neither the ``ASCII``, ``LOCALE`` nor ``UNICODE`` flag is specified, the
defaults are set like module re: On python 3, the character class semantics
will default to ``UNICODE`` if the regex pattern is a Unicode string and
``ASCII`` if it's a bytestring. On python 2, the character classes will
default to ``ASCII``.
The ``ENHANCEMATCH`` flag makes fuzzy matching attempt to improve the fit of the next match that it finds.
Original comment by susefroh
on 24 May 2012 at 9:00
As you pointed out, the documentation says (emphasis added):
...it will default to UNICODE if the _regex pattern_ is a Unicode string...
Your example has a _bytestring_ pattern and a _Unicode_ text.
I'll see if I can improve the situation.
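The distinction between pattern type and text type can be sketched on Python 3, again with the standard-library `re` module as a stand-in (an assumption, on the basis that the thread describes `regex` as following `re`'s defaults). The default semantics come from the type of the pattern, and Python 3 refuses to mix a bytes pattern with str text at all, whereas Python 2 allowed the mix used in the original report.

```python
import re

letter = "\u00e9"  # 'é'

# str pattern on str text: Unicode semantics by default, so \w matches.
str_match = re.match(r"\w", letter)

# bytes pattern on bytes text: ASCII semantics by default, so the
# Latin-1 byte for 'é' (0xE9) is not a word character.
bytes_match = re.match(rb"\w", letter.encode("latin-1"))

# bytes pattern on str text: rejected on Python 3 with a TypeError.
try:
    re.match(rb"\w", letter)
    mixed_allowed = True
except TypeError:
    mixed_allowed = False

print(bool(str_match), bool(bytes_match), mixed_allowed)
```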
Original comment by re...@mrabarnett.plus.com
on 24 May 2012 at 12:54
Duh. I *love* Python 2 Unicode. I think there is very little you can do.
If you added a compile-time test for non-Unicode patterns containing
Unicode-specific character classes, what would you do with it? Turn on Unicode
automatically? Would that not potentially harm some existing code? Emit a
run-time warning?
Unless hordes of people run into the same oversight, I think it's not
worth putting more energy into it.
Original comment by susefroh
on 24 May 2012 at 2:06
Original issue reported on code.google.com by
susefroh
on 23 May 2012 at 4:20