dlclark / regexp2

A full-featured regex engine in pure Go based on the .NET engine
MIT License
987 stars 83 forks source link

Running MatchString is slow #45

Closed liran closed 2 years ago

liran commented 2 years ago

Run the following example (https://go.dev/play/p/BDU6yN5NvEZ):

package main

import (
    "log"
    "regexp"
    "time"

    "github.com/dlclark/regexp2"
)

func main() {
    url := "https://www.dhgate.com/product/magnetic-liquid-eyeliner-magnetic-false-eyelashes/481362313.html"

    reg1 := regexp.MustCompile(`dhgate(?:.[a-z]+)+\/product\/`)
    log.Println("start regexp match string...")
    begin := time.Now()
    reg1.MatchString(url)
    log.Println("time taken:", time.Since(begin))

    reg2 := regexp2.MustCompile(`dhgate(?:.[a-z]+)+\/product\/`, regexp2.IgnoreCase)
    log.Println("start regexp2 match string...")
    begin = time.Now()
    reg2.MatchString(url)
    log.Println("time taken:", time.Since(begin))
}

output:

2021/12/08 14:16:30 start regexp match string...
2021/12/08 14:16:30 time taken: 21.583µs
2021/12/08 14:16:30 start regexp2 match string...

regexp2 version is v1.4.0 Hope it helps to improve performance.

dlclark commented 2 years ago

Your regex has catastrophic backtracking in it due to the nested repetition that allows any character (.). This type of backtracking isn't possible in RE2 and prevents these kinds of issues, but it also limits the features that RE2 can have.

To prevent the catastrophic backtracking you'll want to use \. in your pattern to match the literal dot instead of any character: dhgate(?:\.[a-z]+)+\/product\/