JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

pcre error -27 (JIT stack limit) on long regex string #8278

Closed randyzwitch closed 10 years ago

randyzwitch commented 10 years ago

(Edit: Working off nightly 0.4 build) I'm making a package to parse Apache logs. See code here: https://github.com/randyzwitch/LogParser.jl

I'm fairly comfortable with the regex I wrote, having a 99% match rate on my test files. However, one particularly gnarly string triggers the following error:

julia> errorstring = """71.163.72.113 - - [30/Jul/2014:16:40:55 -0700] "GET emptymind.org/thevacantwall/wp-content/uploads/2013/02/DSC_006421.jpg HTTP/1.1" 200 492513 "http://images.search.yahoo.com/images/view;_ylt=AwrB8py9gdlTGEwADcSjzbkF;_ylu=X3oDMTI2cGZrZTA5BHNlYwNmcC1leHAEc2xrA2V4cARvaWQDNTA3NTRiMzYzY2E5OTEwNjBiMjc2YWJhMjkxMTEzY2MEZ3BvcwM0BGl0A2Jpbmc-?back=http%3A%2F%2Fus.yhs4.search.yahoo.com%2Fyhs%2Fsearch%3Fei%3DUTF-8%26p%3Dapartheid%2Bwall%2Bin%2Bpalestine%26type%3Dgrvydef%26param1%3D1%26param2%3Dsid%253Db01676f9c26355f014f8a9db87545d61%2526b%253DChrome%2526ip%253D71.163.72.113%2526p%253Dgroovorio%2526x%253DAC811262A746D3CD%2526dt%253DS940%2526f%253D7%2526a%253Dgrv_tuto1_14_30%26hsimp%3Dyhs-fullyhosted_003%26hspart%3Dironsource&w=588&h=387&imgurl=occupiedpalestine.files.wordpress.com%2F2012%2F08%2F5-peeking-through-the-wall.jpg%3Fw%3D588%26h%3D387&rurl=http%3A%2F%2Fwww.stopdebezetting.com%2Fwereldpers%2Fcompare-the-berlin-wall-vs-israel-s-apartheid-wall-in-palestine.html&size=49.0KB&name=...+%3Cb%3EApartheid+wall+in+Palestine%3C%2Fb%3E...+%7C+Or+you+go+peeking+through+the+%3Cb%3Ewall%3C%2Fb%3E&p=apartheid+wall+in+palestine&oid=50754b363ca991060b276aba291113cc&fr2=&fr=&tt=...+%3Cb%3EApartheid+wall+in+Palestine%3C%2Fb%3E...+%7C+Or+you+go+peeking+through+the+%3Cb%3Ewall%3C%2Fb%3E&b=0&ni=21&no=4&ts=&tab=organic&sigr=13evdtqdq&sigb=19k7nsjvb&sigi=12o2la1db&sigt=12lia2m0j&sign=12lia2m0j&.crumb=.yUtKgFI6DE&hsimp=yhs-fullyhosted_003&hspart=ironsource" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36"""

julia> r = r"""([\d\.]+) ([\w.-]+) ([\w.-]+) (\[.+\]) "([^"\r\n]*|[^"\r\n\[]*\[.+\][^"]+|[^"\r\n]+.[^"]+)" (\d{3}) (\d+|-) ("(?:[^"]|\")+)"? ("(?:[^"]|\")+)"?"""

julia> match(r, errorstring)

error -27
while loading In[28], in expression starting on line 1

 in error at error.jl:21
 in exec at ./pcre.jl:136
 in match at ./regex.jl:119
 in match at ./regex.jl:133

Here's the man page explanation for -27: PCRE_ERROR_JIT_STACKLIMIT (-27)

   This error is returned when a pattern  that  was  successfully  studied
   using  a  JIT compile option is being matched, but the memory available
   for the just-in-time processing stack is  not  large  enough.  See  the
   pcrejit documentation for more details.

http://www.pcre.org/pcre.txt

This much of the regex works fine:

r"""([\d\.]+) ([\w.-]+) ([\w.-]+) (\[.+\]) "([^"\r\n]*|[^"\r\n\[]*\[.+\][^"]+|[^"\r\n]+.[^"]+)" (\d{3}) (\d+|-)"""

Any ideas what to do here or what the problem might be? A try/catch seems like the wrong way to handle this; it looks like a lower-level issue.

stevengj commented 10 years ago

See pcrestack on how to increase the PCRE stack size (or how to rearrange your regex to require less stack). It seems like it has to be done at compile time, and you may also need to increase the OS stack size.
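As an illustration of the "rearrange your regex" route (an editorial sketch, not code from the thread): the original pattern repeats a group containing an alternation, `(?:[^"]|\")+`, once per character of a quoted field. A pattern that matches each quoted field with a single negated character class gives the matcher less state to keep. `SIMPLE_LOG` and the sample line below are hypothetical, the pattern is stricter than the original (it assumes no embedded double quotes inside a field), and whether it avoids error -27 on a given build has not been verified here.

```julia
# Hypothetical simplified pattern: every quoted field is a single negated
# character class, with no alternation inside a repeated group.
# The final closing quote is optional, as in the original pattern.
const SIMPLE_LOG = r"""^([\d.]+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\d+|-) "([^"]*)" "([^"]*)"?$"""

# Made-up log line in the same combined log format as the thread's example.
line = "127.0.0.1 - - [30/Jul/2014:16:40:55 -0700] \"GET /index.html HTTP/1.1\" 200 512 \"http://example.com/\" \"Mozilla/5.0\""

m = match(SIMPLE_LOG, line)
status   = m[6]   # HTTP status code field
referrer = m[8]   # referrer field, without the surrounding quotes
```

Unlike the original, this version captures the bracketed timestamp and the quoted fields without their delimiters, so downstream code would need to be adjusted accordingly.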

dcjones commented 10 years ago

The default stack size is only 32KB. Maybe we should allocate one, say 1MB, stack and set all the regexes to use that when they're compiled.

This from the pcrejit manpage made me laugh:

(7) This is too much of a headache. Isn't there any better solution for JIT stack handling?

No, thanks to Windows. If POSIX threads were used everywhere, we could throw out this complicated API.

ViralBShah commented 10 years ago

It does seem reasonable to have a higher stack size, at least on Linux and Mac, if Windows is a problem.

randyzwitch commented 10 years ago

Thanks for confirming that the issue is a small stack default @dcjones.

randyzwitch commented 10 years ago

Is there a simple setting I can modify while compiling from source to play around with different stack size values?

dcjones commented 10 years ago

Not super simple, but if pat is your regex pattern, you can do this and it should work.

ccall((:pcre_assign_jit_stack, :libpcre),
      Void, (Ptr{Void}, Ptr{Void}, Ptr{Void}), pat.extra, C_NULL,
      ccall((:pcre_jit_stack_alloc, :libpcre),
            Ptr{Void}, (Cint, Cint), 32768, 1048576))

dcjones commented 10 years ago

In that example 32768 is the initial stack size and 1048576 is the maximum.

randyzwitch commented 10 years ago

Thanks @dcjones! I tried this out on the bug example above and it worked. I also tested it on an array of 350,000 Apache log strings (which previously failed on the example string) and didn't get any errors.

Is this something that could be incorporated into Base easily or should I just build this fix into my package (or both)?

JeffBezanson commented 10 years ago

Yes, I think we should use a bigger stack by default; 32k is extremely small. It seems like the only way to do this is for us to explicitly call pcre_assign_jit_stack for every regex? Or at least intercept the error, print a nice message, and provide an easier way to do this.
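The "intercept the error and print a nice message" option could look roughly like the sketch below. `match_with_hint` is a hypothetical helper, not something in Base, and the check on the message text is an assumption: the thread only shows the message `error -27`, and other Julia versions may word it differently.

```julia
# Hypothetical wrapper around match(); not part of Base.
function match_with_hint(re::Regex, s::AbstractString)
    try
        return match(re, s)
    catch err
        # Assumption: the failure surfaces as an ErrorException whose message
        # mentions either the numeric code (-27) or the JIT stack.
        if err isa ErrorException &&
           (occursin("-27", err.msg) || occursin("JIT stack", err.msg))
            error("regex match ran out of PCRE JIT stack; simplify the pattern " *
                  "or assign a larger JIT stack (original error: $(err.msg))")
        end
        rethrow()  # anything else propagates unchanged
    end
end

# Ordinary matches pass straight through the wrapper.
m = match_with_hint(r"a+", "caaat")
```

A wrapper like this only improves the message; it does not make the failing match succeed.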

dcjones commented 10 years ago

I was going to make a PR to set all patterns to use a 1MB stack, but am running into an issue. If I define globals in pcre.jl like so:

const JIT_STACK_START_SIZE = 32768
const JIT_STACK_MAX_SIZE = 1048576
const JIT_STACK = ccall((:pcre_jit_stack_alloc, :libpcre), Ptr{Void},
                        (Cint, Cint), JIT_STACK_START_SIZE, JIT_STACK_MAX_SIZE)

JIT_STACK is always NULL. Yet it works from the REPL. Why would that be?

simonster commented 10 years ago

@dcjones Maybe the ccall has to happen in __init__ since the pointer can't be saved in sys.so? Does it work if you remove sys.so/dylib/dll?

dcjones commented 10 years ago

Thanks @simonster, that was the issue.
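The pattern @simonster describes can be sketched as follows. This is a minimal illustration under stated assumptions, not the change that went into Base: `JitStackDemo`, `STACK_BYTES`, and `STACK` are hypothetical names, and `Libc.malloc` stands in for `pcre_jit_stack_alloc` so that the example does not depend on a particular PCRE library being loadable. The idea is that a raw pointer obtained while a module is being compiled into an image is not usable once that image is loaded in a new process, so the pointer is stored in a `Ref` and filled in by `__init__`, which runs each time the module is loaded.

```julia
module JitStackDemo

const STACK_BYTES = 1024 * 1024  # 1MB, the maximum size proposed in the thread

# The Ref itself is a constant; the pointer inside it starts out NULL
# and is only assigned at load time.
const STACK = Ref{Ptr{Cvoid}}(C_NULL)

function __init__()
    # Stand-in for the real allocation call (an assumption for this sketch).
    # The buffer is deliberately never freed here; it lives for the whole session.
    STACK[] = Libc.malloc(STACK_BYTES)
    return nothing
end

end # module

# After the module has been loaded, the pointer is set.
stack_ready = JitStackDemo.STACK[] != C_NULL
```

Code that needs the stack would read `JitStackDemo.STACK[]` at call time rather than capturing the pointer in a top-level `const`.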

stevengj commented 10 years ago

Isn't there a way to set the stack size when PCRE is compiled?

pao commented 10 years ago

That wouldn't help if your build used USE_SYSTEM_PCRE.

randyzwitch commented 10 years ago

Feels like a person building themselves and changing to use their own system PCRE would presumably know to change the stack size or have done it themselves? So if doing this at compile time takes an extra call out of every regex match function, that seems like a decent trader off to me.

Maybe just out a note in the make file to make sure stack size is large enough if you choose to use system PCRE?

randyzwitch commented 10 years ago

That's "trade off" and "put a note", iOS is not being good to me this morning

nalimilan commented 10 years ago

@randyzwitch People the least involved in Julia development are going to use distribution packages on Linux, and they'll use the system PCRE without even knowing it.

StefanKarpinski commented 10 years ago

Since it's simple for us to set the stack size at run time, I can't see why we wouldn't.