Punctuation characters not matched using [:punct:]

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Use [:punct:] in a pattern
2. It does not match some punctuation characters. (I've only tried the POSIX/C 
locale.)

What is the expected output? What do you see instead?
These punctuation characters are not matched using [:punct:]
'$', '+', '<', '=', '>', '^', '`', '|', '~'

Which version of Python? 32-bit or 64-bit?
3.4.2
32-bit and 64-bit

Which operating system? Big-endian or little-endian?
Little-endian

Please provide any additional information below.
An example script is attached.

All the punctuation characters are specified here:
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html#tag_07_03

7.3.1 LC_CTYPE

punct
    Define characters to be classified as punctuation characters.

    In the POSIX locale, neither the <space> nor any characters in classes alpha, digit, or cntrl shall be included.

LC_CTYPE Category in the POSIX Locale

punct    <exclamation-mark>;<quotation-mark>;<number-sign>;\
         <dollar-sign>;<percent-sign>;<ampersand>;<apostrophe>;\
         <left-parenthesis>;<right-parenthesis>;<asterisk>;\
         <plus-sign>;<comma>;<hyphen>;<period>;<slash>;\
         <colon>;<semicolon>;<less-than-sign>;<equals-sign>;\
         <greater-than-sign>;<question-mark>;<commercial-at>;\
         <left-square-bracket>;<backslash>;<right-square-bracket>;\
         <circumflex>;<underscore>;<grave-accent>;<left-curly-bracket>;\
         <vertical-line>;<right-curly-bracket>;<tilde>
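
As a cross-check of the list above, Python's `string.punctuation` constant holds the same 32 ASCII characters. The sketch below is not part of the original report. It verifies the POSIX-locale constraints quoted above, then builds an explicit character class as one possible locale-independent workaround using only the standard `re` module. The names `POSIX_PUNCT` and `punct_re` are illustrative.

```python
import re
import string

# string.punctuation is documented as the ASCII characters considered
# punctuation in the C locale; it stands in for the POSIX list here.
POSIX_PUNCT = string.punctuation

# None of these may be a space, letter, digit, or control character.
assert len(POSIX_PUNCT) == 32
assert not any(c.isspace() or c.isalnum() or ord(c) < 32 or ord(c) == 127
               for c in POSIX_PUNCT)

# An explicit character class matches all 32 characters regardless of
# locale or Unicode category (re.escape guards ']', '\\', '^' and '-').
punct_re = re.compile('[' + re.escape(POSIX_PUNCT) + ']')

ascii_printable = ''.join(chr(i) for i in range(32, 127))
matched = punct_re.findall(ascii_printable)
```

Spelling the class out this way avoids depending on either the LOCALE flag or the Unicode property tables, at the cost of covering ASCII only.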

Original issue reported on code.google.com by plane...@gmail.com on 9 Dec 2014 at 2:54

Attachments:

bug-regex-punct.py

GoogleCodeExporter commented 9 years ago

The POSIX Locale is, well, a _locale_. Therefore, you need to use the LOCALE 
flag and bytestrings:

regex.findall(b'(?L)[[:punct:]]', ascii_sorted.encode('ascii'))

On Unicode strings, [[:punct:]] is mapped to \p{Punct}, which uses the Unicode 
definition of 'punctuation'.
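
To see which ASCII characters fall outside the Unicode definition, the standard `unicodedata` module can group them by general category. This is a sketch, assuming that `\p{Punct}` corresponds to the general categories beginning with `P`. It does not call the regex module itself, and the variable names are illustrative.

```python
import string
import unicodedata

# Split the ASCII punctuation characters by Unicode general category.
# Categories starting with 'P' are punctuation; 'S' categories are symbols.
unicode_punct = [c for c in string.punctuation
                 if unicodedata.category(c).startswith('P')]
unicode_symbols = [c for c in string.punctuation
                   if unicodedata.category(c).startswith('S')]

print(unicode_symbols)
# ['$', '+', '<', '=', '>', '^', '`', '|', '~']
```

The nine characters classified as symbols are exactly the ones listed in the report as unmatched.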

Original comment by re...@mrabarnett.plus.com on 9 Dec 2014 at 2:17

Changed state: Invalid

GoogleCodeExporter commented 9 years ago

I thought that doing locale.setlocale(locale.LC_CTYPE, 'C') would set the 
locale used by regex.

Why doesn't this work?
regex.findall(b'[[:punct:]]', ascii_sorted.encode('ascii'), flags=regex.ASCII)

Original comment by plane...@gmail.com on 10 Dec 2014 at 3:28


GoogleCodeExporter commented 9 years ago

The re module requires the LOCALE flag in order to make \w, \s and \b (and 
their complements) locale-sensitive.

The regex module is intended to be compatible with the re module, and it merely 
adds some more character classes.

Original comment by re...@mrabarnett.plus.com on 10 Dec 2014 at 11:18


GoogleCodeExporter commented 9 years ago

According to the Python docs, the LOCALE flag will be going away.

https://docs.python.org/3.5/library/re.html#re.L
re.L
re.LOCALE

    Make \w, \W, \b, \B, \s and \S dependent on the current locale. The use of this flag is discouraged as the locale mechanism is very unreliable, and it only handles one "culture" at a time anyway; you should use Unicode matching instead, which is the default in Python 3 for Unicode (str) patterns. This flag makes sense only with bytes patterns.

    Deprecated since version 3.5, will be removed in version 3.6: Deprecated the use of re.LOCALE with string patterns or re.ASCII.
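
In Python versions where that deprecation has been carried out (3.6 and later, per the note above), the restricted combinations raise `ValueError` at compile time, while `re.LOCALE` with a bytes pattern is still accepted. A small check, assuming such an interpreter; the helper `compiles` is hypothetical and exists only for this illustration:

```python
import re

def compiles(pattern, flags):
    """Return True if re accepts this pattern/flag combination."""
    try:
        re.compile(pattern, flags)
        return True
    except ValueError:
        return False

# bytes pattern + LOCALE: the only combination the docs say makes sense
bytes_with_locale = compiles(rb'\w+', re.LOCALE)
# str pattern + LOCALE: rejected
str_with_locale = compiles(r'\w+', re.LOCALE)
# ASCII + LOCALE together: rejected
ascii_with_locale = compiles(rb'\w+', re.ASCII | re.LOCALE)
```

So the quoted note restricts how the flag may be combined; it does not say that LOCALE matching on bytes patterns is removed.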

Original comment by plane...@gmail.com on 12 Dec 2014 at 3:57


Forever-Young / mrab-regex-hg

Punctuation characters not matched using [:punct:] #130