axiak / pyre2

Python wrapper for RE2
BSD 3-Clause "New" or "Revised" License

Unicode differences between re2 and re? #5

Open turian opened 13 years ago

turian commented 13 years ago

I am seeing differences between re2 and re when re.UNICODE is used.

I am not able to get re2 to detect Unicode alphabetic characters, even when I encode to UTF-8.

Here is an example:

In [24]: print u'\xe8'.encode("utf-8")
è

In [25]: re.compile('[^\W]', re.UNICODE).search(u'\xe8')
Out[25]: <_sre.SRE_Match object at 0x1186850>

In [26]: re2.compile('[^\W]', re.UNICODE).search(u'\xe8')

In [27]: re2.compile('[^\W]', re.UNICODE).search(u'\xe8'.encode("utf-8"))

itsadok commented 13 years ago

This is a glaring omission in prepare_pattern: we only handle \d, \w and \s, but not the corresponding \D, \W and \S. I'll try to find some time to fix it.
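
To illustrate the kind of rewriting involved, here is a minimal sketch of how escapes could be translated into Unicode property classes before the pattern is handed to RE2. It is not pyre2's actual prepare_pattern code: the function name `translate_unicode_escapes`, both tables, and the choice of property classes are assumptions made for illustration, and the classes only approximate what Python's re matches under re.UNICODE.

```python
# Hypothetical translation tables (not taken from pyre2). The \p{...} classes
# only approximate Python's Unicode-aware \d, \w and \s.
_OUTSIDE_CLASS = {
    "d": r"\p{Nd}",
    "D": r"\P{Nd}",
    "w": r"[_\p{L}\p{N}]",
    "W": r"[^_\p{L}\p{N}]",
    "s": r"[\s\p{Z}]",
    "S": r"[^\s\p{Z}]",
}

# Inside [...] the replacement must not open a nested class, so only escapes
# that expand to a plain list of members can be handled this way.
_INSIDE_CLASS = {
    "d": r"\p{Nd}",
    "D": r"\P{Nd}",
    "w": r"_\p{L}\p{N}",
    "s": r"\s\p{Z}",
}


def translate_unicode_escapes(pattern):
    """Rewrite \\d, \\w, \\s (and negations where possible) to Unicode classes.

    Simplifications: a literal ']' placed first in a character class and
    POSIX classes such as [[:alpha:]] are not recognised.
    """
    out = []
    in_class = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            table = _INSIDE_CLASS if in_class else _OUTSIDE_CLASS
            if nxt in table:
                out.append(table[nxt])
            elif in_class and nxt in "WS":
                # A negated set cannot be spliced into an enclosing class
                # without nesting, so this sketch refuses rather than guess.
                raise NotImplementedError(
                    "\\%s inside a character class is not handled" % nxt
                )
            else:
                # Any other escape (including an escaped backslash) is
                # copied through unchanged.
                out.append(ch + nxt)
            i += 2
            continue
        if ch == "[" and not in_class:
            in_class = True
        elif ch == "]" and in_class:
            in_class = False
        out.append(ch)
        i += 1
    return "".join(out)
```

With these tables, `translate_unicode_escapes(r"\w+")` returns `[_\p{L}\p{N}]+`. A pattern such as `[^\W]` from the report above raises NotImplementedError in this sketch, which shows why the uppercase escapes need separate handling rather than a simple lookup.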

turian commented 12 years ago

Please.

axiak commented 12 years ago

We had an issue with \W, \D and \S that itsadok just fixed and I pushed out. However, I think there are still Unicode issues, since the groups in issue #4 don't match up quite right (I added that case as a test). Please pull the latest version and see whether it works for you while I work out why the test is failing.