gagolews / stringi

Fast and portable character string processing in R (with the Unicode ICU)
https://stringi.gagolewski.com/
Other
304 stars 44 forks source link

stri_detect thousands of times slower than grepl in a pathological case #407

Closed aaronrudkin closed 4 years ago

aaronrudkin commented 4 years ago

I wrote an apparently problematic regex. I can see from running some external regex debuggers that the regex has a loop built into it which is nasty and hard to resolve. Some online debuggers say it takes over a million steps to get an answer. My fault. Not looking for regex help.

But this alerted me to the fact that in this pathological case, stri_detect takes a very long time to fail, more than 30 seconds on my desktop (Ryzen 7 3700 at 4.1Ghz) -- compared to grepl (200 microseconds). I am assuming the cause is that grepl internally decide a certain step count before they will give up on attempting to resolve and then give up, but stri_detect doesn't? That's just speculation on my part, I didn't go down the C rabbit hole. Oddly, neither produces an error. Neither a timeout error nor a parsing error.

I guess in order of preferences, mine would be:

Best: Have stri_detect error quickly when it reaches a regex it has trouble parsing and tell me how to fix it (yeah, right!) Good: Have stri_detect's timeout or termination behaviour more closely match grepl's OK: Add an argument to allow the user to terminate execution after a certain number of steps

The example below was generated using R 4.0.2 on Windows, stringi 1.5.3, bench 1.1.1. I've included two examples: one that is easily resolved by both (stri_detect beats grepl!) and then the pathological case.

library(tidyverse)
library(stringi)
speaker_paren_regex <- "^[[:upper:]]((\\w|-)*[.,]? ?){1,4}, ((\\w|\\d|-|\\/|\\.)* ?){1,6}:"
text_ok <- "John Smith, ABC:"

bench::mark(
  stri_detect = {stri_detect(text_ok, regex = speaker_paren_regex)},
  grepl = {grepl(speaker_paren_regex, text_ok)})
#> # A tibble: 2 x 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 2 stri_detect    138us    142us     7030.      15KB     0   
#> 3 grepl          216us    221us     4507.        0B     0

text_bad <- "Mittwoch, den 11.11.2015"
bench::mark(
  stri_detect = {stri_detect(text_bad, regex = speaker_paren_regex)},
  grepl = {grepl(speaker_paren_regex, text_bad)})
#> # A tibble: 2 x 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 2 stri_detect    34.7s    34.7s    0.0288        0B        0
#> 3 grepl        231.2us  237.3us 4187.            0B        0

I also filed this downstream with https://github.com/tidyverse/stringr/issues/350

gagolews commented 4 years ago

This is by-design, and has been well-documented. It is a feature: Simply put, do not write problematic regexes (and problematic code in general). :)

See "Performance Tips" in http://userguide.icu-project.org/strings/regexp.

My recent paper/tutorial available at https://stringi.gagolewski.com/_static/vignette/stringi.pdf mentions this as well - see p. 39 and the time_limit or stack_limit option in stri_opts_regex()

Also, refer to http://qinwenfeng.com/re2r_doc/ and https://github.com/qinwf/re2r for discussion.

aaronrudkin commented 4 years ago

I figured as much -- time_limit/stack_limit are at least a little bit what I was looking for. And I now see why I missed it in the docs (because it's implemented via the separate options function).

Thank you for responding here. I will respond as well in the up-stream issue about exposing that functionality via the wrapper interface.