Speed up regular expressions

GoogleCodeExporter commented 9 years ago

CPython's regex implementation is slower than it could be. There are a lot
of techniques we could use to make things faster:

- JIT compile regexes. V8 and SquirrelFish Extreme have had good luck with
this approach.
- Thompson NFAs are much faster for some regexes, but don't support the
full range of Python regex syntax. This would necessitate a multi-engine
system that picks the fastest engine for the given regex.
- Apply VM optimizations from the main Python eval loop to the regex eval
loop. This isn't likely to give us the speed we want, but may help in a
multi-engine approach if one of those engines is the existing regex VM.

Longer-form, with links:
http://code.google.com/p/unladen-swallow/wiki/ProjectPlan#Regular_Expressions

Original issue reported on code.google.com by collinw on 14 Apr 2009 at 11:30

GoogleCodeExporter commented 9 years ago

Original comment by collinw on 14 Apr 2009 at 11:38

GoogleCodeExporter commented 9 years ago

As discussed in the Chromium blog post, even '|' in Python regexps, which I 
would
guess is extremely common, requires alternatives to be tried from left to 
right. 
Perhaps there's a clever way to implement that within an NFA, but it's not 
obvious. 
If you leave out alternation, then the only interesting operators you have left 
are
character classes and repetition.  Special casing those may be worth it, though,
especially since regexps are composed out of units like that.

Original comment by reid.kle...@gmail.com on 15 Apr 2009 at 10:06

GoogleCodeExporter commented 9 years ago

Yeah, I saw your comment on that blog post. There's a lot more details (like 
the bit
about | and &) in
http://code.google.com/p/unladen-swallow/wiki/ProjectPlan#Regular_Expressions, 
I just
didn't want to duplicate it all here.

Original comment by collinw on 15 Apr 2009 at 10:13

GoogleCodeExporter commented 9 years ago

Original comment by collinw on 27 May 2009 at 9:44

Added labels: Performance

GoogleCodeExporter commented 9 years ago

Original comment by collinw on 27 May 2009 at 11:06

Added labels: Release-2009Q3

GoogleCodeExporter commented 9 years ago

Original comment by collinw on 27 May 2009 at 11:07

Added labels: Type-Enhancement

GoogleCodeExporter commented 9 years ago

Moving to Fredrik, since he's volunteered to work on this.

Original comment by collinw on 29 May 2009 at 4:12

GoogleCodeExporter commented 9 years ago

Hi, for my own amusement I'm playing around with implementing regular 
expressions on
top of LLVM. Right now I'm taking an NFA/DFA approach but if there's interest I 
might
work on a backtracking implementation too. It seems to me that if you're going 
to use
LLVM as JIT for Python code it might make sense as your JIT for regular 
expression
code too. I'll keep an eye on your work and post if I complete anything that 
might be
useful.

Original comment by i...@mckellar.org on 20 Jun 2009 at 4:21

GoogleCodeExporter commented 9 years ago

> it might make sense as your JIT for regular expression code

Absolutely, and that's exactly what we plan to do for the first iteration - the 
current 
regex engine (SRE) is basically a simple "word code" engine, and generating 
LLVM IR 
instead of word code from SRE's intermediate representation should be pretty 
straightforward.

Original comment by fredrik.lundh on 20 Jun 2009 at 4:30

GoogleCodeExporter commented 9 years ago

Alright, so I've been playing with this a bit and have started building a regex
implementation on top of LLVM. It builds on SRE's parser so that it will accept
exactly the same grammar as the existing implementation. It uses LLVM to JIT and
reuses a few of the unladen-swallow LLVM tools like the optimization support and
PyGlobalLlvmData to get the ExecutionEngine. It's half C++ and half Python. The 
C++
side is for compilation and the Python side provides interface compatibility 
with SRE.

It's pretty incomplete. The Python interface just supports matching right now. 
The
only flag it respects is DOTALL. It supports greedy repetition, literals, 
character
ranges, wildcards (.) and subexpression extraction. The rest shouldn't be too 
hard.

I'm hoping to get complete SRE compatibility, then look at improving the 
quality of
the bytecode being generated, then hopefully look at using Thompson DFA 
techniques
for regular expressions that will allow it. I previously built an LLVM Thompson 
DFA
fnmatch implementation for fun:
http://ianloic.com/2009/07/15/an-fnmatch-implementation-using-finite-state-machi
nes-and-llvm/

The code is pretty messy right now. Most things have been renamed a couple of 
times
and need to be renamed a couple more times. I also haven't looked at 
performance at all.

Is anyone else working on this? Should I take this bug? I'm working mostly 
offline
since I'm staying on the top of a mountain in the French Alps right now so I'm 
not on
email very often.

Original comment by i...@mckellar.org on 27 Jul 2009 at 8:51

Attachments:

llvm-re-1.diff

GoogleCodeExporter commented 9 years ago

I'm also working on this, but have been slowed down by work and vacations, and 
you seem 
to be a bit closer to something useful than I am.  Feel free to take the lead 
here; I 
can help out with tweaking, turning, and testing.

Original comment by fredrik.lundh on 27 Jul 2009 at 8:58

GoogleCodeExporter commented 9 years ago

(and feel free to mail me off-line if you want to discuss something in more 
detail; 
you'll reach me at fredrik at pythonware.com as usual).

Original comment by fredrik.lundh on 27 Jul 2009 at 8:59

GoogleCodeExporter commented 9 years ago

re.sub("turning", "tuning", message)

Original comment by fredrik.lundh on 27 Jul 2009 at 11:09

GoogleCodeExporter commented 9 years ago

ian.l.mckellar: cool! I'll try to play with your patch later this week.

I'm somewhat skeptical that a Thompson-based approach will be applicable to the 
majority of Python regexes, but I'm willing to be proven wrong. I know the V8 
team 
looked at Thompson NFAs for their new Irregexp engine, but found that they 
would 
only apply to a tiny slice of the regexes they see on the Internet, and I think 
that 
Python regex semantics are closer to Javascript regexes than to the semantics 
Thompson was targeting.

I know you said you haven't looked at performance yet, but when you do, our 
benchmark suite has a few different regex benchmarks in it already (see the 
Benchmarks wiki page). I just added a regex_compile benchmark so we can see how 
the LLVM-targeting compiler compares with the existing compiler. `./perf.py -r 
-b 
regex` will run all the regex benchmarks.

Original comment by collinw on 28 Jul 2009 at 9:18

Added labels: Priority-Medium

GoogleCodeExporter commented 9 years ago

Hi, as I mentioned I've been traveling so I've been offline and I'm only 
hacking every once in a while. 
Mostly when I've got some downtime. I just spent a few days in Tuscany and 
ended up staying up late to get 
my regex compiler basically working[*].

I've put my fork/branch up on github at: 
http://github.com/ianloic/unladen-swallow/tree/llvm-re
I'm using git-svn to track mainline unladen swallow. You should be able to 
build it from the llvm-re 
branch of my git repository and generate a patch by diffing that branch against 
master.

Right now if you "import re" you get SRE but if you "import llvmre as re" you 
get my stuff. I'll need to 
do some performance testing next. I'll be surprised if it's great. The codegen 
is function-heavy and I 
haven't worked out how to get the optimizer to inline as much as I want.

I think there are a few memory leaks in there too. Nothing fundamental but I 
haven't gone through the code 
and tidied it up enough yet. I'm sure it's possible to crash the module by 
passing the wrong arguments 
into the native module, I need to check arguments more strictly, but luckily 
the interface is pretty 
small.

I don't feel confident that I will have this code ready for release by the end 
of Q3 but I've got a few 
chunks of time over November, December and January when I know I'll have time 
to finish this off. 

[*] it passes most of the standard Python re tests but doesn't correctly 
backtrack group references, 
doesn't do backward assertions, and a couple of other things. generally 
real-world regular expressions 
will be correct.

Original comment by i...@mckellar.org on 13 Sep 2009 at 3:08

GoogleCodeExporter commented 9 years ago

And of course, I pushed before doing a clean rebuild and everything is broken. 
Some 
LLVM changes (around LLVMContext, mostly) broke everything and now that I've 
fixed it 
there are some memory leaks that are so bad they cause the performance tests to 
crash. 
I've pushed a fix that makes llvmre build (and it's now what you get when you 
"import 
re"), but I need to pull out valgrind and identify what's leaking so badly. 
I'll 
update when there's code out there that I'm not so ashamed of.

Original comment by i...@mckellar.org on 15 Sep 2009 at 8:56

GoogleCodeExporter commented 9 years ago

Original comment by collinw on 26 Sep 2009 at 12:05

Added labels: Release-2009Q4

GoogleCodeExporter commented 9 years ago

Original comment by collinw on 6 Jan 2010 at 11:41

Removed labels: Release-2009Q4

Daetalus / unladen-swallow

Speed up regular expressions #13