fholm / IronJS

IronJS - A JavaScript implementation for .NET
http://ironjs.wordpress.com
Apache License 2.0

The .NET regular expression engine's capturing behavior is not the same as the ECMAScript standard. #24

Open otac0n opened 13 years ago

otac0n commented 13 years ago

For regular expressions such as this: ((a+)?(b+)?c+)*

There are 3 capturing groups (one for each left-parenthesis).

If this is matched against a string like the following: bbbccaac

The .NET implementation will list the following capture groups:

    ((a+)?(b+)?c+) = "aac"
    (a+) = "aa"
    (b+) = "bbb"

Whereas the ECMAScript spec specifies the following capturing behavior:

    ((a+)?(b+)?c+) = "aac"
    (a+) = "aa"
    (b+) = undefined

The .NET implementation gives no indication that the (b+) capturing group did not participate in its most recent match attempt.
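For reference, the expected ECMAScript result can be checked in any conforming JavaScript engine. The snippet below is a minimal sketch using the pattern and input from above; the helper name `captureGroups` is made up for illustration and is not part of IronJS.

```javascript
// Hypothetical helper (not IronJS code): run a pattern against an input
// and return the match array as a plain array, or null if nothing matched.
function captureGroups(source, input) {
  const match = new RegExp(source).exec(input);
  return match === null ? null : Array.from(match);
}

// Pattern and input taken from the issue description.
const groups = captureGroups("((a+)?(b+)?c+)*", "bbbccaac");

// In ECMAScript, the captures inside a quantified group are reset at the
// start of every iteration of the quantifier. On the final iteration ("aac")
// the (b+) group does not participate, so its capture is undefined rather
// than the "bbb" left over from the first iteration.
console.log(groups[0]); // whole match
console.log(groups[1]); // ((a+)?(b+)?c+)
console.log(groups[2]); // (a+)
console.log(groups[3]); // (b+)
```

A test along these lines would flag the difference, since the .NET engine keeps the earlier "bbb" capture for the third group.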

hakanson commented 13 years ago

Does using RegexOptions.ECMAScript help?

http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regexoptions(v=VS.100).aspx

otac0n commented 13 years ago

@hakanson: We are already using the ECMAScript option, which works well for the most part. It is just this little piece that is different.

fholm commented 13 years ago

I think this is something we'll have to live with for now. Writing a custom regular expression implementation for this small detail is too much work for too little gain at the moment. I'll leave the ticket open, and we'll look into it eventually.

hakanson commented 13 years ago

-1 to me for not looking at the code in Core.fs

    let options = (options ||| RegexOptions.ECMAScript) &&& ~~~RegexOptions.Compiled
    let key = (options, pattern)
    this.RegExp <- env.RegExpCache.Lookup key (fun () -> new Regex(pattern, options ||| RegexOptions.Compiled))

I'm new to F#; does this mean you are implementing your own compiled RegExp cache? I ask because there is a Regex.CacheSize property that controls an internal cache of compiled regular expressions. I assume having your own cache gives you more control, but I thought I would mention it for completeness (at the risk of looking uninformed a second time on the same issue).

http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.cachesize.aspx

fholm commented 13 years ago

Yes, we do maintain our own regexp cache. We actually found it to be faster.

otac0n commented 13 years ago

We found that in a loop like this...

    while (true)
    {
        var r = new RegExp("...");
    }

...that .NET's regex cache was not helping.

When we implemented the regexp cache shown above, we saw a 50% reduction in the time on the SunSpider regexp test.
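The idea behind that cache can be sketched in JavaScript as follows. This is only an illustration of the lookup-or-create pattern keyed on (options, pattern); the names `regexCache`, `lookupRegex`, and `cachedRegExp` are hypothetical, and the real cache lives in the F# code shown earlier.

```javascript
// Hypothetical sketch of a lookup-or-create cache keyed on (flags, pattern).
// This is not the IronJS implementation, just the same idea in JavaScript.
const regexCache = new Map();

function lookupRegex(pattern, flags, create) {
  // Serialize the pair so that different (flags, pattern) combinations
  // can never collide on the same key.
  const key = JSON.stringify([flags, pattern]);
  let regex = regexCache.get(key);
  if (regex === undefined) {
    // Cache miss: build the regex once and remember it.
    regex = create();
    regexCache.set(key, regex);
  }
  return regex;
}

function cachedRegExp(pattern, flags = "") {
  return lookupRegex(pattern, flags, () => new RegExp(pattern, flags));
}

// Repeated construction with the same arguments reuses one compiled object.
const first = cachedRegExp("a+b", "i");
const second = cachedRegExp("a+b", "i");
console.log(first === second);
```

One caveat with this sketch: JavaScript RegExp objects carry mutable state (`lastIndex` with the `g` or `y` flags), so sharing them this way is only safe for stateless use. A .NET Regex instance is immutable, which is what makes caching the underlying object reasonable on the host side.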

ChaosPandion commented 13 years ago

@otac0n - From the looks of it, the BCL only caches regexes used through the static methods on the Regex class, so the increase in performance makes sense.