faster-cpython / ideas


Improvements to the `re` module #534

Open · lpereira opened 1 year ago

lpereira commented 1 year ago

I've spent some time looking over the re and _sre modules and identified some things that could be improved:

There are many other things that could be improved, but I think this is a good representation of what could be done in the re module on a first pass.

gvanrossum commented 1 year ago

Nice audit. Which of these looks like it would be the quickest to implement? (Maybe the first?)

lpereira commented 1 year ago

I'm already looking at the first item.

At this point I don't want to change the representation of SRE opcodes, so to store a string literal I'm adding a `literals` array to the pattern object and referencing each string by its index. Making the literal inline, or storing it in the info block (to improve prefetching, etc.), is an option I'll probably look at later.
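A minimal sketch of the idea (the opcode names and the `literals` side table here are illustrative, not the real _sre representation): a compile-time pass collapses a run of single-character LITERAL ops into one op that references a string by index.

```python
# Hypothetical fold pass; LITERAL_STRING and the literals table are
# illustrative names, not the actual _sre opcodes.
LITERAL, LITERAL_STRING = "LITERAL", "LITERAL_STRING"

def fold_literals(code):
    """Collapse runs of (LITERAL, char) ops into (LITERAL_STRING, index)."""
    literals = []  # side table on the pattern object, referenced by index
    out = []
    i = 0
    while i < len(code):
        op, arg = code[i]
        if op == LITERAL:
            run = []
            while i < len(code) and code[i][0] == LITERAL:
                run.append(code[i][1])
                i += 1
            if len(run) > 1:
                literals.append("".join(run))
                out.append((LITERAL_STRING, len(literals) - 1))
            else:
                out.append((LITERAL, run[0]))
        else:
            out.append((op, arg))
            i += 1
    return out, literals

code = [(LITERAL, "a"), (LITERAL, "b"), (LITERAL, "c")]
print(fold_literals(code))  # ([('LITERAL_STRING', 0)], ['abc'])
```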

lpereira commented 1 year ago

I have a working prototype in this branch. It needs quite a bit of cleanup, but it seems to be working as expected.

mdboom commented 1 year ago

This is cool. I think you probably already know this, but I wanted to highlight in advance that the regex benchmarks in pyperformance aren't very representative of real-world usage. I think they will be fine as guardrails against any unintended regressions, but I don't think they will be super useful for creating a good sense of "X faster".

That said, there are lots of interesting regexes in the stdlib that do things people probably do all the time (email, json, toml, csv, urllib, etc.). A benchmark over all the regexes that actually appear in the stdlib would probably be a much better approximation. Pulling all of the regexes across the IETF standards would be an interesting exercise, too, since those tend to get used in a lot of places in real code.
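A rough sketch of how such a benchmark could be bootstrapped (my own illustration, assuming we only care about literal string patterns passed to `re.compile`): walk the installed stdlib sources, harvest those patterns, and time each against a fixed haystack.

```python
# Harvest literal re.compile() patterns from the stdlib and crudely time them.
import ast
import pathlib
import re
import sysconfig
import timeit

stdlib = pathlib.Path(sysconfig.get_paths()["stdlib"])
patterns = []
for path in stdlib.rglob("*.py"):
    try:
        tree = ast.parse(path.read_text(encoding="utf-8", errors="ignore"))
    except (SyntaxError, ValueError):
        continue  # skip files that aren't parseable as current Python
    for node in ast.walk(tree):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "compile"
                and isinstance(node.func.value, ast.Name)
                and node.func.value.id == "re"
                and node.args
                and isinstance(node.args[0], ast.Constant)
                and isinstance(node.args[0].value, str)):
            patterns.append(node.args[0].value)

print(f"collected {len(patterns)} literal patterns")

# A fixed haystack is only a crude proxy for real inputs.
haystack = 'user@example.com, 2024-01-01, {"key": [1, 2, 3]}' * 10
for pat in patterns[:20]:  # limit output for the example
    try:
        rx = re.compile(pat)
    except re.error:
        continue
    t = timeit.timeit(lambda: rx.search(haystack), number=1000)
    print(f"{t:.6f}s  {pat[:60]!r}")
```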

lpereira commented 1 year ago

Other things I've found that look like good candidates:

JelleZijlstra commented 1 year ago

> For instance, [0-9] could be folded to \d

That's not entirely correct, since other Unicode digits can also match \d. According to the docs, [0-9] is equivalent to \d only if we're using a bytes pattern or the re.ASCII flag. My intuition would be that the vast majority of regex executions don't fall into those categories.
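A quick demonstration of the difference:

```python
import re

s = "\u0663"  # ARABIC-INDIC DIGIT THREE, a Unicode decimal digit
print(bool(re.fullmatch(r"\d", s)))            # True: \d matches Unicode digits
print(bool(re.fullmatch(r"[0-9]", s)))         # False: [0-9] is ASCII-only
print(bool(re.fullmatch(r"\d", s, re.ASCII)))  # False: with re.ASCII, \d == [0-9]
```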

jonashaag commented 1 year ago
> Similar to the previous possible optimization: matching something like [0-9]*-[a-z]{1,4} should look for - first using a fast byte scan, match [0-9]* backwards, and then finally [a-z]{1,4}.

A generalization of this is to find a "plain" substring in the regex that you can look for with `strstr` as a "prematcher"/"bloom filter". We've had some success with this approach.
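A minimal sketch of the prematcher idea in Python (hypothetical illustration, not our actual implementation): scan for the required literal - with a fast substring search, extend backwards over the optional [0-9]* prefix, and only then run the full pattern at the candidate start.

```python
import re

FULL = re.compile(r"[0-9]*-[a-z]{1,4}")

def prefiltered_search(text: str):
    """Find FULL via a fast scan for its mandatory '-' literal."""
    start = 0
    while (i := text.find("-", start)) != -1:  # fast literal scan, strstr-style
        # Walk backwards over the [0-9]* prefix to the leftmost match start.
        j = i
        while j > 0 and text[j - 1] in "0123456789":
            j -= 1
        m = FULL.match(text, j)
        if m:
            return m
        start = i + 1  # this '-' didn't verify; try the next occurrence
    return None

print(prefiltered_search("abc 123-xyz def"))  # span=(4, 11), match='123-xyz'
```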

markshannon commented 1 year ago

Any progress on this?

lpereira commented 1 year ago

I've paused work on this, but I intend to resume it soon.