erikrose / parsimonious

The fastest pure-Python PEG parser I can muster
MIT License
1.82k stars 127 forks source link

Use regex instead of re? #162

Closed julienmalard closed 2 years ago

julienmalard commented 4 years ago

Hello,

While using parsimonious I came across an already-documented bug in the builtin re module, whereby abugidas with vowel marks (that is, most of South Asia and Oceania) fail to match the \w directive (e.g., as in re.match("^[\w\s][\w\s]*", "किशोरी", re.UNICODE), as taken from this question here). In my particular case, this bug crashes my system dynamics models when I run them in PySD.

It seems that the recommended "solution" is to simply used the regex instead of re. However, this would mean adding a dependency to parsimonious, so I thought I would ask whether you would look favourably upon such a suggestion before submitting a pull request. I was thinking of proposing something along the lines of:

try:
    import regex as re
except ImportError:
    import re

so that nothing breaks if the dependency is not installed.

Please do let me know. Thanks! -Julien Malard

buhl commented 4 years ago

Hi Julien, I was where not so long ago and made a PR then shortly after discovering it had already been tried and rejected.

https://github.com/erikrose/parsimonious/pull/161

julienmalard commented 4 years ago

Hello @buhl , Thanks for letting me know...@erikrose, any chance you might consider this? I see from the other pull request, though, that you prefer to discourage the use of regexes in grammars in the future. Would you happen to have a recommendation as to how the use case suggested above (\w and p{Mn}) could be implemented according to best practices?

I had considered manually generating a list of acceptable characters (see here), but was not sure whether that would be considered best practice or not.

Thanks!

jenstroeger commented 3 years ago

This might not be the right place, but @erikrose could you please document the meaning of ilmsuxa for regexes? Are those the single-letter flags documented by the regex module?

lonnen commented 2 years ago

@jenstroeger when in doubt, file a new issue. They're cheap. This time there's no need, however, since I addressed this with 6e52df0

lonnen commented 2 years ago

Some more background for the original issue:

According to https://bugs.python.org/issue12731, it looked like regex was going to replace re in the standard lib for 3.3, then maybe it was going to be added as an alternative, but then the issue fell of the radar for 9 or 10 years. It seems like this may still happen in the future, and regex is still actively developed and generally less uncompliant with unicode.

I understand @erikrose has (had?) other ambitions for Parsimonious, but unless a new contributor appears I don't expect to realize the dream of removing regular expressions anytime soon. Given this, pragmatism pushed me in favor of replacing re with regex until a suitable champion for the alternative appears.

lonnen commented 2 years ago

I sat on it another day and decided to ahead with it. @Oderjunkie helpfully had a PR already up so I rebased that, updated the version, and squashed it. This should be resolved now by c0ffc3eae32b9c31b65f959ae75d493de828e170

I still think it is valuable, long term, to achieve the removal of regexs as Erik describes here: https://github.com/erikrose/parsimonious/pull/149#issuecomment-449662784. This change is a pragmatic compromise to improve Parsimonious today while there are no alternatives to using regular expressions for some common use cases

thanks everyone!

julienmalard commented 2 years ago

Many thanks!