Closed POPOVEL4 closed 3 years ago
The result is correct: \s+ means space, and " 1000 " is the last token that matches this regex. Most likely you meant \b[0-9]\b.
Thank you for a quick reply, Marek!
There is a space at the end of the strings; if " 180 " matched the regex in string a, it should also match the regex in string b. " 180 " appears after " 1000 ", so it should be returned as the result of stringi::stri_extract_last_regex(b, "\\s+[:digit:]+\\s+").
Using \b won't work for me, as it treats "." as a word boundary. Example:
c <- "GELULE 5 2.3ML"
stringi::stri_extract_last_regex(c, "\\b[:digit:]+\\b") # returns "2" instead of the last stand-alone integer "5", so it doesn't work for my purpose
@gagolews any update on this issue? thank you!
It's not a bug, dear @POPOVEL4 , it seems that you just need some help with your regex, and github issues is not the right spot for this ...
Try look ahead/look behind assertions https://www.regular-expressions.info/lookaround.html
Thanks for reply @gagolews Let's take complexity of my regex out of equation here :)
s1 <- "abaca"
stringi::stri_extract_last_regex(s1, "a[:alpha:]a")
[1] "aba"
s2 <- "abaaca"
stringi::stri_extract_last_regex(s2, "a[:alpha:]a")
[1] "aca"
Question: Shouldn't the result of the first function call be "aca"?
Thoughts: It seems like the initial string is being partitioned by the pattern, matched from left to right: s1 as "aba" + "ca" and s2 as "aba" + "aca". This explains the results of the example above and works for the stri_extract_first_regex function, but is not ideal for stri_extract_last_regex. What do you think?
Heh, you're right, I should've given it more attention. :/
The matching is - and will be - done from the beginning of the string to the end, so the overlapping matches will not be identified correctly. The result of stri_*_last_regex is essentially the last bit reported by stri_*_all_regex. This is how the ICU regex engine is implemented and I cannot do anything about it (I mean, I could rewrite it, but, oh well...).
Thanks for noticing that though, I will update the manual and the tutorial to make this clear.
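To make the point about left-to-right matching concrete, here is a minimal sketch using the "abaca" example from earlier in this thread. It assumes the stringi package is installed; the variable names are only for illustration.

```r
library(stringi)

s1 <- "abaca"
pattern <- "a[:alpha:]a"

# All non-overlapping matches, found by scanning from left to right.
# After "aba" is consumed, only "ca" is left, so "aca" is never seen.
all_matches <- stri_extract_all_regex(s1, pattern)[[1]]

# The "last" match is simply the final element of the list above.
last_match <- stri_extract_last_regex(s1, pattern)

print(all_matches)
print(last_match)
```

Both calls should report "aba" here, which matches the behaviour described above.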
Also note that stri_*_fixed is equipped with the overlap option.
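A short sketch of the overlap option for fixed-pattern search, assuming stringi is installed. The sample string "abababa" is made up for illustration; the option only applies to fixed patterns, not to regexes.

```r
library(stringi)

s <- "abababa"

# Default behaviour: matches do not overlap.
n_default <- stri_count_fixed(s, "aba")

# With overlap = TRUE the search resumes right after the start of each match.
n_overlap <- stri_count_fixed(s, "aba",
                              opts_fixed = stri_opts_fixed(overlap = TRUE))

# Start and end positions of every overlapping occurrence.
loc_overlap <- stri_locate_all_fixed(s, "aba",
                                     opts_fixed = stri_opts_fixed(overlap = TRUE))[[1]]

print(n_default)
print(n_overlap)
print(loc_overlap)
```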
A work-around for stri_*_regex could rely upon a call to stri_reverse and matching with stri_*_first?
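A rough sketch of how that reverse-based work-around might look, assuming stringi is installed. The helper name extract_last_via_reverse is hypothetical (not part of stringi), and it only behaves as intended when the pattern reads the same against the reversed string, as both patterns from this thread do; an asymmetric pattern would have to be reversed by hand as well.

```r
library(stringi)

# Hypothetical helper, not a stringi function: reverse the input, take the
# first match, then reverse that match back. Assumes a "symmetric" pattern.
extract_last_via_reverse <- function(str, pattern) {
  stri_reverse(stri_extract_first_regex(stri_reverse(str), pattern))
}

res1 <- extract_last_via_reverse("abaca", "a[:alpha:]a")

b <- "TAB SA FC 24 50 1000 180 "
res2 <- extract_last_via_reverse(b, "\\s+[:digit:]+\\s+")

print(res1)
print(res2)
```

For these two inputs the helper should give "aca" and " 180 ", i.e. the results that were expected from stri_extract_last_regex at the start of the thread.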
Hope this helps
Understood, thank you!
No problem, we indeed have 2 workarounds and are now testing which one is better:
1) use reverse/extract_first/reverse
2) use extract_last but add look ahead/look behind assertions to the regex to avoid punctuation (as per your advice)
how about what follows?
> stringi::stri_extract_last_regex(c("TAB SA FC 24 50 1000X 180 ", "TAB SA FC 24 50 1000 180 "), "\\b[:digit:]+\\b")
[1] "180" "180"
And also:
> stringi::stri_match_last_regex(c("abaca", "abaaca"), ".*(a[:alpha:]a)")[,2]
[1] "aca" "aca"
how about what follows?
> stringi::stri_extract_last_regex(c("TAB SA FC 24 50 1000X 180 ", "TAB SA FC 24 50 1000 180 "), "\\b[:digit:]+\\b")
[1] "180" "180"
Doesn't work for float numbers, but I extended it with assertions and it seems to fit purpose now.
a <- "TAB SA FC 24 50.6 10,50 180X"
stringi::stri_extract_last_regex(a, "\\b[:digit:]+\\b")
[1] "50"
stringi::stri_extract_last_regex(a, "(?<![:punct:])\\b[:digit:]+\\b(?![:punct:])")
[1] "24"
From my side, the issue is closed. I understand that due to how the ICU regex engine is implemented, it won't be an easy fix. So we just stick to one of the alternative options, not a problem!
Thank you very much for engaging and proposing alternative solutions :)
Great, I'll keep the issue open though until I update the manual.
Goal: Extract last "stand alone" integer out of the string
a <- "TAB SA FC 24 50 1000X 180 "
stringi::stri_extract_last_regex(a, "\\s+[:digit:]+\\s+") # returns " 180 " -> correct
b <- "TAB SA FC 24 50 1000 180 "
stringi::stri_extract_last_regex(b, "\\s+[:digit:]+\\s+") # returns " 1000 " -> incorrect
Comment: stringi::stri_extract_first_regex seems to work fine
Thanks!