faster-cpython / ideas


Improvements to the `re` module #534

Open · lpereira opened 1 year ago

lpereira commented 1 year ago

I've spent some time looking over the re and _sre modules and identified some things that could be improved:

There are many other things that could be improved, but I think this is a good representation of what could be done in the re module on a first pass.

gvanrossum commented 1 year ago

Nice audit. Which of these looks like it would be the quickest to implement? (Maybe the first?)

lpereira commented 1 year ago

I'm already looking at the first item.

At this point I don't want to change the representation of SRE opcodes, so to store a string literal I'm adding a `literals` array to the pattern object and referencing each string by its index. Making the literal inline, or storing it in the info block (to improve prefetching, etc.), is an option I'll probably look at later.
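A minimal sketch of the idea (the opcode names and the `literals` side table here are illustrative, not the real _sre representation): a compile-time pass collapses a run of single-character LITERAL ops into one op that references a string by index.

```python
# Hypothetical fold pass; LITERAL_STRING and the literals table are
# illustrative names, not the actual _sre opcodes.
LITERAL, LITERAL_STRING = "LITERAL", "LITERAL_STRING"

def fold_literals(code):
    """Collapse runs of (LITERAL, char) ops into (LITERAL_STRING, index)."""
    literals = []  # side table on the pattern object, referenced by index
    out = []
    i = 0
    while i < len(code):
        op, arg = code[i]
        if op == LITERAL:
            run = []
            while i < len(code) and code[i][0] == LITERAL:
                run.append(code[i][1])
                i += 1
            if len(run) > 1:
                literals.append("".join(run))
                out.append((LITERAL_STRING, len(literals) - 1))
            else:
                out.append((LITERAL, run[0]))
        else:
            out.append((op, arg))
            i += 1
    return out, literals

code = [(LITERAL, "a"), (LITERAL, "b"), (LITERAL, "c")]
print(fold_literals(code))  # ([('LITERAL_STRING', 0)], ['abc'])
```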

lpereira commented 1 year ago

I have a working prototype in this branch. It needs quite a bit of cleanup, but it seems to be working as expected.

mdboom commented 1 year ago

This is cool. I think you probably already know this, but I wanted to highlight in advance that the regex benchmarks in pyperformance aren't very representative of real-world usage. I think they will be fine as guardrails against any unintended regressions, but I don't think they will be super useful for creating a good sense of "X faster".

That said, there are lots of interesting regexes in the stdlib that do things people probably do all the time (email, json, toml, csv, urllib, etc.). A benchmark over all the regexes that actually appear in the stdlib would probably be a much better approximation. Pulling all of the regexes across the IETF standards would be an interesting exercise, too, since those tend to get used in a lot of places in real code.
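A rough sketch of how such a benchmark could be bootstrapped (my own illustration, assuming we only care about literal string patterns passed to `re.compile`): walk the installed stdlib sources, harvest those patterns, and time each against a fixed haystack.

```python
# Harvest literal re.compile() patterns from the stdlib and crudely time them.
import ast
import pathlib
import re
import sysconfig
import timeit

stdlib = pathlib.Path(sysconfig.get_paths()["stdlib"])
patterns = []
for path in stdlib.rglob("*.py"):
    try:
        tree = ast.parse(path.read_text(encoding="utf-8", errors="ignore"))
    except (SyntaxError, ValueError):
        continue  # skip files that aren't parseable as current Python
    for node in ast.walk(tree):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "compile"
                and isinstance(node.func.value, ast.Name)
                and node.func.value.id == "re"
                and node.args
                and isinstance(node.args[0], ast.Constant)
                and isinstance(node.args[0].value, str)):
            patterns.append(node.args[0].value)

print(f"collected {len(patterns)} literal patterns")

# A fixed haystack is only a crude proxy for real inputs.
haystack = 'user@example.com, 2024-01-01, {"key": [1, 2, 3]}' * 10
for pat in patterns[:20]:  # limit output for the example
    try:
        rx = re.compile(pat)
    except re.error:
        continue
    t = timeit.timeit(lambda: rx.search(haystack), number=1000)
    print(f"{t:.6f}s  {pat[:60]!r}")
```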

lpereira commented 1 year ago

Other things I've found that look like good candidates:

JelleZijlstra commented 1 year ago

> For instance, [0-9] could be folded to \d

That's not entirely correct, since other Unicode digits can also match \d. According to the docs, [0-9] is equivalent to \d only if we're using a bytes pattern or the re.ASCII flag. My intuition would be that the vast majority of regex executions don't fall into those categories.
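A quick demonstration of the difference:

```python
import re

s = "\u0663"  # ARABIC-INDIC DIGIT THREE, a Unicode decimal digit
print(bool(re.fullmatch(r"\d", s)))            # True: \d matches Unicode digits
print(bool(re.fullmatch(r"[0-9]", s)))         # False: [0-9] is ASCII-only
print(bool(re.fullmatch(r"\d", s, re.ASCII)))  # False: with re.ASCII, \d == [0-9]
```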

jonashaag commented 1 year ago
> Similar to the previous possible optimization: matching something like [0-9]*-[a-z]{1,4} should look for - first using a fast byte scan, match [0-9]* backwards, and then finally [a-z]{1,4}.

A generalization of this is to find a "plain" substring in the regex that you can look for with `strstr` as a "prematcher"/"bloom filter". We've had some success with this approach.
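A minimal sketch of the prematcher idea in Python (hypothetical illustration, not our actual implementation): scan for the required literal - with a fast substring search, extend backwards over the optional [0-9]* prefix, and only then run the full pattern at the candidate start.

```python
import re

FULL = re.compile(r"[0-9]*-[a-z]{1,4}")

def prefiltered_search(text: str):
    """Find FULL via a fast scan for its mandatory '-' literal."""
    start = 0
    while (i := text.find("-", start)) != -1:  # fast literal scan, strstr-style
        # Walk backwards over the [0-9]* prefix to the leftmost match start.
        j = i
        while j > 0 and text[j - 1] in "0123456789":
            j -= 1
        m = FULL.match(text, j)
        if m:
            return m
        start = i + 1  # this '-' didn't verify; try the next occurrence
    return None

print(prefiltered_search("abc 123-xyz def"))  # span=(4, 11), match='123-xyz'
```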

markshannon commented 1 year ago

Any progress on this?

lpereira commented 1 year ago

I've paused work on this, but I intend to resume it soon.