attardi / wikiextractor

A tool for extracting plain text from Wikipedia dumps
GNU Affero General Public License v3.0
3.74k stars 965 forks source link

Make the regex python 3.11 compatible #313

Open santhoshtr opened 1 year ago

santhoshtr commented 1 year ago

Python 3.11 throws the error: re.error: global flags not at the start of the expression at position

Ref: https://bugs.python.org/issue47066

becaHEse commented 1 year ago

Thanks for your commits, it worked!

xtexChooser commented 1 year ago

Thanks, it worked!

santhoshtr commented 1 year ago

@attardi As python 3.12 is the latest version, people might want to use 3.11 or 3.12. Gentle ping to merge and release new version. Thanks!

ghost commented 11 months ago

Does the regular expression do what it was supposed to do? I think the proposed change does not change the behavior, so it would be correct, but I am not sure if the regular expression was originally supposed to do something different.

If I understand the documentation for (?aiLmsux) correctly, then (?i) changes the whole regular expression to case-insensitivity no matter where it is placed in the regular expression. Probably to remove this confusing behavior, it's only allowed at the beginning of the regex as of Python 3.11.

Since in this code's regular expression it is listed in a sub group, it looks to me like the original author might have (wrongly) tried to make only this sub group case insensitive. If that's the case I think (?aiLmsux-imsx:...) might be the correct version (available since version 3.6).

The letters set or remove the corresponding flags [...] for the part of the expression.

If indeed case-insensitivity for the whole regular expression is desired, I'd propose to rather use the compile flag re.I. Both regular expressions already use other compile flags, so adding the global case-insensitivity flag alongside them seems more readable. The documentation even states that (?i) is basically an alternative for re.I:

set the corresponding flags: [...] for the entire regular expression. This is useful if you wish to include the flags as part of the regular expression, instead of passing a flag argument to the re.compile() function.

ExtLinkBracketedRegex = re.compile(
    '\[((' + '|'.join(wgUrlProtocols) + ')' + EXT_LINK_URL_CLASS + r'+)\s*([^\]\x00-\x08\x0a-\x1F]*?)\]',
    re.S | re.U | re.I)
EXT_IMAGE_REGEX = re.compile(
    r"""^(http://|https://)([^][<>"\x00-\x20\x7F\s]+)
    /([A-Za-z0-9_.,~%\-+&;#*?!=()@\x80-\xFF]+)\.(gif|png|jpg|jpeg)$""",
    re.X | re.S | re.U | re.I)

Update: Looking at both regular expression, I'd say the global version seems right. Even if the original code looks like the author intended to only make parts of the expression case-insensitive.

Let's assume the original author indeed only wanted to make parts of the expressions case-insensitive. Then in ExtLinkBracketedRegex the URL protocols (bitcoin://, ftp://, and so on) would be set to case-insensitive. But in EXT_IMAGE_REGEX only the file extension (gif, png, and so on) would be set to case-insensitive, but the URL protocol (http://, https://) would be left case-sensitive. That would be quite inconsistent.

BeautiLu commented 6 months ago

Thanks for your commits, it worked!