gagolews / stringi

Fast and portable character string processing in R (with the Unicode ICU)
https://stringi.gagolewski.com/

stringi::stri_extract_last_regex BUG #406

Closed POPOVEL4 closed 3 years ago

POPOVEL4 commented 4 years ago

Goal: Extract the last "stand-alone" integer from the string

a <- "TAB SA FC 24 50 1000X 180 "
stringi::stri_extract_last_regex(a, "\\s+[:digit:]+\\s+")
# returns " 180 " -> correct

b <- "TAB SA FC 24 50 1000 180 "
stringi::stri_extract_last_regex(b, "\\s+[:digit:]+\\s+")
# returns " 1000 " -> incorrect

Comment: stringi::stri_extract_first_regex seems to work fine

Thanks!

gagolews commented 4 years ago

The result is correct: \s+ matches whitespace, and " 1000 " is the last token that matches this regex.

Most likely you meant \b[0-9]+\b.
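
To see why both reported results are consistent with non-overlapping, left-to-right matching, it can help to list every match rather than only the last one. The following is a minimal sketch added for illustration, not part of the original report; it assumes the stringi package is installed and uses `[0-9]` in place of `[:digit:]`.

```r
library(stringi)

a <- "TAB SA FC 24 50 1000X 180 "
b <- "TAB SA FC 24 50 1000 180 "
pattern <- "\\s+[0-9]+\\s+"

# List all non-overlapping matches, found by scanning left to right.
matches_a <- stri_extract_all_regex(a, pattern)[[1]]
matches_b <- stri_extract_all_regex(b, pattern)[[1]]

print(matches_a)  # " 24 " " 180 "
print(matches_b)  # " 24 " " 1000 "
```

In `b`, the match " 1000 " consumes the space that " 180 " would need as its leading whitespace, so " 180 " is never reported.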

POPOVEL4 commented 4 years ago

Thank you for a quick reply, Marek!

There is a space at the end of both strings; if " 180 " matches the regex in string a, it should also match it in string b. " 180 " appears after " 1000 ", so it should be returned as the result of stringi::stri_extract_last_regex(b, "\\s+[:digit:]+\\s+").

Using \b won't work for me, as it treats "." as a word boundary. Example:

c <- "GELULE 5 2.3ML"
stringi::stri_extract_last_regex(c, "\\b[:digit:]+\\b")
# returns "2" instead of the last stand-alone integer "5" -> doesn't work for my purpose

POPOVEL4 commented 4 years ago

@gagolews any update on this issue? thank you!

gagolews commented 4 years ago

It's not a bug, dear @POPOVEL4. It seems that you just need some help with your regex, and GitHub issues are not the right place for this...

Try lookahead/lookbehind assertions: https://www.regular-expressions.info/lookaround.html
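
As a rough illustration of the lookaround suggestion, the sketch below treats a "stand-alone" integer as a run of digits that is neither preceded nor followed by a word character, a dot, or a comma. That definition, and the name `standalone_int`, are assumptions made for this example; they are not taken from the thread.

```r
library(stringi)

x <- c(
  "TAB SA FC 24 50 1000X 180 ",
  "TAB SA FC 24 50 1000 180 ",
  "GELULE 5 2.3ML"
)

# Assumed definition of "stand-alone": digits not adjacent to a word
# character, "." or ",". Lookarounds consume no characters, so
# neighbouring matches cannot steal each other's separators.
standalone_int <- "(?<![\\w.,])[0-9]+(?![\\w.,])"

last_int <- stri_extract_last_regex(x, standalone_int)
print(last_int)  # "180" "180" "5"
```

Because the assertions are zero-width, the space between "1000" and "180" can serve as the boundary for both numbers.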

POPOVEL4 commented 4 years ago

Thanks for the reply, @gagolews. Let's take the complexity of my regex out of the equation here :)

> s1 <- "abaca"
> stringi::stri_extract_last_regex(s1, "a[:alpha:]a")
[1] "aba"
> s2 <- "abaaca"
> stringi::stri_extract_last_regex(s2, "a[:alpha:]a")
[1] "aca"

Question: Shouldn't the result of the first function call be "aca"?

Thoughts: It seems that the initial string is being partitioned by pattern matches found from left to right:
s1 as "aba" + "ca"
s2 as "aba" + "aca"
This explains the results of the example above and works for stri_extract_first_regex, but it is not ideal for stri_extract_last_regex. What do you think?

gagolews commented 4 years ago

Heh, you're right, I should've given it more attention. :/

The matching is - and will be - done from the beginning of the string to the end, so the overlapping matches will not be identified correctly. The result of stri_*_last_regex is essentially the last bit reported by stri_*_all_regex. This is how the ICU regex engine is implemented and I cannot do anything about it (I mean, I could rewrite it, but, oh well...).
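
The relationship described above can be checked directly. This is a small sketch, assuming stringi is installed and using `[a-z]` in place of `[:alpha:]`; it compares the last element returned by stri_extract_all_regex with the result of stri_extract_last_regex.

```r
library(stringi)

s <- c("abaca", "abaaca")
pattern <- "a[a-z]a"

# All non-overlapping matches, scanning from the start of each string.
all_matches <- stri_extract_all_regex(s, pattern)
print(all_matches)
# first string:  "aba"        ("aca" would overlap "aba", so it is not found)
# second string: "aba" "aca"

# Take the final element of each match vector and compare.
last_of_all <- vapply(all_matches, function(m) m[length(m)], character(1))
last_direct <- stri_extract_last_regex(s, pattern)
print(identical(last_of_all, last_direct))  # TRUE
```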

Thanks for noticing that though, I will update the manual and the tutorial to make this clear.

Also note that stri_*_fixed is equipped with the overlap option.
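
For fixed (non-regex) patterns, the overlap option can be sketched as follows. The example string is made up for illustration; `overlap = TRUE` is forwarded to stri_opts_fixed().

```r
library(stringi)

s <- "ababa"

# Default behaviour: only non-overlapping occurrences are located.
no_overlap <- stri_locate_all_fixed(s, "aba")[[1]]

# With overlap = TRUE, occurrences that share characters are located too.
with_overlap <- stri_locate_all_fixed(s, "aba", overlap = TRUE)[[1]]

print(no_overlap[, "start"])    # 1
print(with_overlap[, "start"])  # 1 3
```

This only helps when the pattern is a literal string, so it does not apply to the regex cases discussed in this issue.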

A work-around for stri_*_regex could rely upon a call to stri_reverse and matching with stri_*_first?
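
One possible shape of the reverse-based workaround is sketched below. The helper `extract_last_via_reverse` is hypothetical, not a stringi function, and it assumes the caller supplies a pattern written for the reversed string (the two patterns used here happen to read the same in both directions).

```r
library(stringi)

# Hypothetical helper: `reversed_pattern` must describe the match as it
# appears in the reversed string.
extract_last_via_reverse <- function(str, reversed_pattern) {
  reversed_match <- stri_extract_first_regex(stri_reverse(str), reversed_pattern)
  stri_reverse(reversed_match)
}

res1 <- extract_last_via_reverse(c("abaca", "abaaca"), "a[a-z]a")
res2 <- extract_last_via_reverse("TAB SA FC 24 50 1000 180 ", "\\s+[0-9]+\\s+")

print(res1)  # "aca" "aca"
print(res2)  # " 180 "
```

For patterns that are not symmetric, the regex itself has to be rewritten in reverse order, which is the main cost of this approach.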

Hope this helps

POPOVEL4 commented 4 years ago

Understood, thank you!

No problem, we indeed have 2 workarounds and are now testing which one is better:
1) use reverse/extract_first/reverse
2) use extract_last but add lookahead/lookbehind assertions to the regex to avoid punctuation (as per your advice)

gagolews commented 4 years ago

how about what follows?

> stringi::stri_extract_last_regex(c("TAB SA FC 24 50 1000X 180 ", "TAB SA FC 24 50 1000 180 "), "\\b[:digit:]+\\b")
[1] "180" "180"

gagolews commented 4 years ago

And also:

> stringi::stri_match_last_regex(c("abaca", "abaaca"), ".*(a[:alpha:]a)")[,2]
[1] "aca" "aca"
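
The same greedy-prefix idea appears to carry over to the original whitespace-delimited pattern. The sketch below is an illustration under that assumption: a leading `.*` first consumes the whole string and then backtracks only as far as needed, so the capture group ends up on the rightmost candidate.

```r
library(stringi)

b <- "TAB SA FC 24 50 1000 180 "

# The greedy ".*" swallows as much as possible, then gives characters back
# until the capture group can match; group 1 is therefore the last token.
m <- stri_match_first_regex(b, ".*(\\s+[0-9]+\\s+)")
last_token <- m[, 2]
print(last_token)  # " 180 "
```

Note that `.` does not match line breaks by default, so multi-line input would need the dotall option or a different prefix.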

POPOVEL4 commented 4 years ago

how about what follows?

> stringi::stri_extract_last_regex(c("TAB SA FC 24 50 1000X 180 ", "TAB SA FC 24 50 1000 180 "), "\\b[:digit:]+\\b")
[1] "180" "180"

It doesn't work for float numbers, but I extended it with assertions and it seems to fit the purpose now.

> a <- "TAB SA FC 24 50.6 10,50 180X"
> stringi::stri_extract_last_regex(a, "\\b[:digit:]+\\b")
[1] "50"
> stringi::stri_extract_last_regex(a, "(?<![:punct:])\\b[:digit:]+\\b(?![:punct:])")
[1] "24"

POPOVEL4 commented 4 years ago

From my side, the issue is closed. I understand that due to how the ICU regex engine is implemented, it won't be an easy fix. So we'll just stick to one of the alternative options, not a problem!

Thank you very much for engaging and proposing alternative solutions :)

gagolews commented 4 years ago

Great, I'll keep the issue open though until I update the manual.