Forever-Young / mrab-regex-hg

Automatically exported from code.google.com/p/mrab-regex-hg

regex.match('\p{Symbol}',u'\ufffd') is None, should match #67

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
The Unicode character \ufffd is the 'REPLACEMENT CHARACTER', which Python 
substitutes for borked characters when decoding into Unicode.

I guess some other characters, like the 'REGISTERED SIGN' \u00ae, also fail to 
match \p{Symbol}.
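
For reference, the general category of both characters can be checked with the standard library's unicodedata module. This is only a small sketch, independent of the regex module, and it assumes that \p{Symbol} is meant to cover the general categories starting with "S" (Sm, Sc, Sk, So). The helper name `is_symbol` is made up for illustration:

```python
import unicodedata

def is_symbol(ch):
    """Return True if the character's general category starts with 'S'."""
    return unicodedata.category(ch).startswith('S')

# Print the name, the general category, and the symbol check for each character.
for ch in (u'\ufffd', u'\u00ae'):
    print(unicodedata.name(ch), unicodedata.category(ch), is_symbol(ch))
```

Both characters are in category So ("Symbol, other"), which is why a match is expected here.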

What steps will reproduce the problem?
>>> import regex
>>> regex.match('\p{Symbol}',u'\ufffd') is None
True

What is the expected output? What do you see instead?

The expected output is False

What version of the product are you using? On what operating system?
regex-0.1.20120506/Python2/regex.py on Linux

Please provide any additional information below.

Original issue reported on code.google.com by susefroh on 23 May 2012 at 4:20

GoogleCodeExporter commented 9 years ago
In Python 2 the regex module defaults to ASCII (like the re module), so when 
working with Unicode you need to provide the UNICODE flag (or "(?u)" in the regex):

>>> import regex
>>> regex.match(r'\p{Symbol}', u'\ufffd', flags=regex.U)
<_regex.Match object at 0x00EFC3A0>

Original comment by re...@mrabarnett.plus.com on 23 May 2012 at 4:52

GoogleCodeExporter commented 9 years ago
I see :) Though, admittedly, it's counterintuitive, and counter to the 
documentation at http://pypi.python.org/pypi/regex, which says explicitly:

If neither the ASCII, LOCALE nor UNICODE flag is specified, it will default to 
UNICODE if the regex pattern is a Unicode string and ASCII if it's a bytestring.

That was misleading, even more so as I explicitly use the \p form of the 
character classes, which does not exist in the re module. For this reason, I 
didn't even think of this not using Unicode :)

diff -r 6f0e839b0db0 regex_2/Features.rst
--- a/regex_2/Features.rst  Sun May 06 17:01:09 2012 +0100
+++ b/regex_2/Features.rst  Thu May 24 11:00:26 2012 +0200
@@ -93,7 +93,7 @@

 The global flags are: ``ASCII``, ``BESTMATCH``, ENHANCEMATCH``, ``LOCALE``, ````REVERSE``, ``UNICODE``, ``VERSION0``, ``VERSION1``.

-If neither the ``ASCII``, ``LOCALE`` nor ``UNICODE`` flag is specified, it will default to ``UNICODE`` if the regex pattern is a Unicode string and ``ASCII`` if it's a bytestring.
+If neither the ``ASCII``, ``LOCALE`` nor ``UNICODE`` flag is specified, the defaults are set like module re:  On python 3, the character class semantics will default to ``UNICODE`` if the regex pattern is a Unicode string and ``ASCII`` if it's a bytestring.  On python 2, the character classes will default to ``ASCII``.

 The ``ENHANCEMATCH`` flag makes fuzzy matching attempt to improve the fit of the next match that it finds.
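
The re behaviour that the proposed wording refers to can be observed on Python 3 with the standard library alone. A short sketch, using \w rather than \p{...} because re has no \p classes. The variable names are only for illustration:

```python
import re

# str pattern: \w uses Unicode semantics by default on Python 3.
unicode_default = re.match(r'\w', '\xe9') is not None

# Same pattern with the ASCII flag: U+00E9 is no longer a word character.
ascii_flag = re.match(r'\w', '\xe9', flags=re.ASCII) is not None

# bytes pattern: ASCII semantics, so the Latin-1 byte 0xE9 does not match.
bytes_default = re.match(br'\w', b'\xe9') is not None

print(unicode_default, ascii_flag, bytes_default)  # True False False
```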

Original comment by susefroh on 24 May 2012 at 9:00

GoogleCodeExporter commented 9 years ago
As you pointed out, the documentation says (emphasis added):

   ...it will default to UNICODE if the _regex pattern_ is a Unicode string...

Your example has a _bytestring_ pattern and a _Unicode_ text.

I'll see if I can improve the situation.

Original comment by re...@mrabarnett.plus.com on 24 May 2012 at 12:54

GoogleCodeExporter commented 9 years ago
Duh. I *love* Python 2 Unicode. I think there is very little you can do.

If you added a regex compile-time test for non-Unicode patterns containing 
Unicode-specific character classes, what would you do with it? Turn on Unicode 
automatically? Would that not potentially harm some existing code? Emit a 
run-time warning?

Unless hordes of people run into the same oversight, I think it's not 
worth putting more energy into it.
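
For what it's worth, the run-time warning variant could look roughly like the sketch below. It is purely hypothetical: `warn_on_unicode_class` is not part of the regex module, the detection is a naive search for `\p{` or `\P{` in a bytestring pattern, and it ignores an inline "(?u)" as well as escaped backslashes.

```python
import re
import warnings

# Hypothetical helper, not part of the regex module.
# Naive detection of \p{...} or \P{...} inside a bytestring pattern.
_UNICODE_CLASS = re.compile(br'\\[pP]\{')

def warn_on_unicode_class(pattern, flags=0):
    """Warn if a bytestring pattern uses \\p{...} without the UNICODE flag.

    Returns True if a warning was emitted, False otherwise.
    """
    if isinstance(pattern, bytes) and not flags & re.UNICODE:
        if _UNICODE_CLASS.search(pattern):
            warnings.warn(
                "pattern uses \\p{...} but will be matched with ASCII semantics",
                RuntimeWarning,
                stacklevel=2,
            )
            return True
    return False
```

A caller would run this check before compiling the pattern. Existing code keeps its behaviour, since the check only warns and never changes the flags.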

Original comment by susefroh on 24 May 2012 at 2:06