Large Input (length : 1 Lakh) and pattern has 3999 capture groups, I think code is going into infinite loop or taking so long time

PCRE2Project / pcre2

PCRE2 development is now based here.

Other

922 stars 194 forks source link

Large Input (length : 1 Lakh) and pattern has 3999 capture groups, I think code is going into infinite loop or taking so long time #279

Closed paragikjain closed 1 year ago

paragikjain commented 1 year ago

Below is a basic C code snippet Link. When the input contains 100,000 elements and there are 3,999 capture groups, the pcre_match function is running extremely slowly. I'm unsure whether it's stuck in an infinite loop. How can I enhance the performance of my code given this large number of capture groups? It's important to note that I cannot control the input and pattern as they will be provided by the user.

Code For the above-Mentioned Issue

Please help!!! Thanks,

PhilipHazel commented 1 year ago

Handling that many capture groups takes time. If you set the PCRE2_NO_AUTO_CAPTURE option, which turns the capture groups into non-capture groups, the program completes fairly quickly. I also tried your pattern and subject with JIT and the same is true - it takes a very long time when the groups are all captures.

paragikjain commented 1 year ago

Thank you @PhilipHazel !!

Is there any API PCRE2 provides from which we can re-write pattern optimally. for example (d)(d)(d)(d) can we rewritten as (d){4}.

zherczeg commented 1 year ago

I made a project for that purpose: https://github.com/zherczeg/repan It can, for example, remove those capturing groups, which is not used by the pattern.

PhilipHazel commented 1 year ago

There is actually a difference between (d)(d)(d)(d) and (d){4} because the first one has four different capturing groups but the second one has only one, repeated of course. If you change your pattern to be (d){3999}e the match completes much more quickly, though with JIT it hits the JIT stack limit.

paragikjain commented 1 year ago

Thanks @PhilipHazel, few final questions I have:

Can we increase JIT stack limit, if yes what API we need to use?
Is there any API PCRE2 provides to Timeout if pcre2_match taking more than x seconds ?

Thanks, in advance.

zherczeg commented 1 year ago

1) yes, https://www.pcre.org/current/doc/html/pcre2jit.html#SEC7 2) a match-try counter is provided by the api. It is not based on seconds though, it decreases for every match attempt, and the match fails if it reaches 0. The "match attempt" is not precisely defined.

PhilipHazel commented 1 year ago

Check out the various pcre2set.... functions for various limits that can be set.

paragikjain commented 1 year ago

thank you @zherczeg and @PhilipHazel for you support !!!!!