Closed tylerlittlefield closed 5 years ago
Suggested functions (need good unit testing):
step_ahead <- function(.data=NULL){
  val <- regmatches(.data, regexpr("(?<=\\(\\?\\:)[^\\(\\?\\:]*(?=\\)\\??$)", .data, perl = TRUE))
  if (!length(val)) return(.data)
  post <- regmatches(.data, regexpr("(?<=\\(\\?\\:)[^\\(\\?\\:]*$", .data, perl = TRUE))
  pre <- regmatches(.data, regexpr("^.*(?=\\(\\?\\:[^\\(\\?\\:]*$)", .data, perl = TRUE))
  paste0(pre, "(?<=", post)
}
step_back <- function(.data=NULL){
  val <- regmatches(.data, regexpr("(?<=\\(\\?\\:)[^\\(\\?\\:]*(?=\\)\\??$)", .data, perl = TRUE))
  if (!length(val)) return(.data)
  post <- regmatches(.data, regexpr("(?<=\\(\\?\\:)[^\\(\\?\\:]*$", .data, perl = TRUE))
  pre <- regmatches(.data, regexpr("^.*(?=\\(\\?\\:[^\\(\\?\\:]*$)", .data, perl = TRUE))
  paste0(pre, "(?=", post)
}
These functions boil down to tracing back and modifying the previous find() or maybe(), if detected.
Since step_ahead() and step_back() are modifiers of find(), perhaps they can be an argument in find() (with the default being step=0 or step=NULL):
x <- find(value = "(", step = 'after') %>% # or find(value = "(", step = 1)
  begin_capture() %>%
  anything() %>%
  find(")", step = 'before') %>% # or find(")", step = -1)
  end_capture()
x
#> [1] "((?<=\\()(?:.*)(?=\\)))"
This does not cover negative lookarounds. The water is getting pretty deep already with my complex regex, so perhaps implementing (all kinds of) lookarounds would be easier as find() modifiers. Or adverbs lookahead/lookbehind. Or synonyms stop_before(), start_after()?
Thanks for all the suggestions, this is awesome. I like your idea of adding an additional step argument to modify find().
I think the current regex in your example, ((?<=\\()(?:.*)(?=\\))), is greedy, so in something like "(extract) foo (me)" it would run straight through "foo" and match everything between the first "(" and the last ")". So maybe add a greedy argument to anything()? Otherwise, you could get by with something like:
x <- find(value = "(", step = 'forward') %>%
  anything_but(")") %>%
  find(")", step = 'backward')
stringr::str_extract_all("(extract) foo (me)", x)
#> [[1]]
#> [1] "extract" "me"
Or with a greedy argument in anything():
z <- find(value = "(", step = "forward") %>%
  anything(greedy = FALSE) %>%
  find(")", step = "backward")
z
#> [1] "(?<=\\()(?:.*?)(?=\\))"
stringr::str_extract_all("(extract) foo (me)", z)
#> [[1]]
#> [1] "extract" "me"
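To make the greedy/lazy difference concrete, here is a small base-R sketch that runs both patterns against the same string. It is only an illustration: extract_all() is a hypothetical helper (not part of the package), the patterns are typed in by hand rather than built with find() and anything(), and it uses regmatches() with perl = TRUE instead of stringr.

```r
# Hypothetical helper: return every match of a Perl-style pattern in one string.
extract_all <- function(pattern, x) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

text <- "(extract) foo (me)"

# Greedy: .* runs to the last position that is still followed by ")".
greedy <- "(?<=\\()(?:.*)(?=\\))"
# Lazy: .*? stops at the first position that is followed by ")".
lazy <- "(?<=\\()(?:.*?)(?=\\))"

greedy_hits <- extract_all(greedy, text)
lazy_hits <- extract_all(lazy, text)

print(greedy_hits)
#> [1] "extract) foo (me"
print(lazy_hits)
#> [1] "extract" "me"
```

The greedy pattern returns one long match that includes "foo", while the lazy one returns the two bracketed pieces separately.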
Reproducible example:
# in case you want to copy paste and run the example above
library(dplyr)
sanitize <- function(.data) {
  escape_chrs <- c(".", "|", "*", "?", "+", "(", ")", "{", "}", "^", "$", "\\", ":", "=", "[", "]")
  string_chrs <- strsplit(.data, "")[[1]]
  idx <- which(string_chrs %in% escape_chrs)
  idx_new <- paste0("\\", string_chrs[idx])
  paste0(replace(string_chrs, idx, idx_new), collapse = "")
}
# add greedy arg
anything <- function(.data = NULL, greedy = TRUE) {
  if(isTRUE(greedy)) {
    paste0(.data, "(?:.*)")
  } else if(isFALSE(greedy)){
    paste0(.data, "(?:.*?)")
  }
}
# add step arg
find <- function(.data = NULL, value, step = NULL) {
  if(is.null(step)) {
    paste0(.data, "(?:", sanitize(value), ")")
  } else if(step == "forward") {
    paste0(.data, "(?<=", sanitize(value), ")")
  } else if(step == "backward") {
    paste0(.data, "(?=", sanitize(value), ")")
  }
}
I think greedy as an argument looks a bit ugly. How about making lazy (non-greedy) versions of anything() and everything()?
regex  greedy        non-greedy
.*     anything()    whatever()
.+     everything()  something()
I think find(), as it stands now, should only initiate a non-capturing group. We need another group of verbs for creating lookarounds (positive and negative): seek_suffix, seek_prefix and avoid_suffix, avoid_prefix.
seek_prefix <- function(.data = NULL, value) {
  paste0(.data, "(?<=", sanitize(value), ")")
}
seek_suffix <- function(.data = NULL, value) {
  paste0(.data, "(?=", sanitize(value), ")")
}
avoid_prefix <- function(.data = NULL, value) {
  paste0(.data, "(?<!", sanitize(value), ")")
}
avoid_suffix <- function(.data = NULL, value) {
  paste0(.data, "(?!", sanitize(value), ")")
}
I also think that an exact number of repetitions can be expressed as count() (or n() or repeated()):
count <- function(.data = NULL, n = 1) {
  paste0(.data, "{", n, "}")
}
Here are some unit tests for lookarounds, all returning the single value 100:
# positive lookahead
x <- start_of_line() %>%
  digit() %>% count(3) %>%
  seek_suffix(" dollars")
x
stringr::str_extract_all("100 dollars", x)
# negative lookahead
x <- start_of_line() %>%
  digit() %>% count(3) %>%
  avoid_suffix(" dollars")
x
stringr::str_extract_all("100 pesos", x)
# positive lookbehind
x <- seek_prefix(value="USD") %>%
  digit() %>% count(3)
x
stringr::str_extract_all("USD100", x)
# negative lookbehind
x <- avoid_prefix(value="USD") %>%
  digit() %>% count(3)
x
stringr::str_extract_all("JPY100", x)
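The same four checks can be sketched as a self-contained script with explicit assertions. Several things here are assumptions made only so the script runs on its own: start_of_line() and digit() are minimal stand-ins (the thread does not show their definitions), sanitize() is a condensed variant of the one posted earlier, extract_all() is a hypothetical base-R helper replacing stringr::str_extract_all(), and the native pipe |> (R 4.1 or later) is used in place of %>%.

```r
# Condensed variant of sanitize(): escape regex metacharacters one by one.
sanitize <- function(value) {
  escape_chrs <- c(".", "|", "*", "?", "+", "(", ")", "{", "}",
                   "^", "$", "\\", ":", "=", "[", "]")
  chrs <- strsplit(value, "")[[1]]
  paste0(ifelse(chrs %in% escape_chrs, paste0("\\", chrs), chrs), collapse = "")
}

# Assumed stand-ins for verbs whose definitions are not shown in the thread.
start_of_line <- function(.data = NULL) paste0(.data, "^")
digit <- function(.data = NULL) paste0(.data, "\\d")

# Verbs as proposed above.
count <- function(.data = NULL, n = 1) paste0(.data, "{", n, "}")
seek_prefix <- function(.data = NULL, value) paste0(.data, "(?<=", sanitize(value), ")")
seek_suffix <- function(.data = NULL, value) paste0(.data, "(?=", sanitize(value), ")")
avoid_prefix <- function(.data = NULL, value) paste0(.data, "(?<!", sanitize(value), ")")
avoid_suffix <- function(.data = NULL, value) paste0(.data, "(?!", sanitize(value), ")")

# Hypothetical helper: all matches of a Perl-style pattern in one string.
extract_all <- function(pattern, x) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

pos_ahead <- start_of_line() |> digit() |> count(3) |> seek_suffix(" dollars")
neg_ahead <- start_of_line() |> digit() |> count(3) |> avoid_suffix(" dollars")
pos_behind <- seek_prefix(value = "USD") |> digit() |> count(3)
neg_behind <- avoid_prefix(value = "USD") |> digit() |> count(3)

stopifnot(
  identical(extract_all(pos_ahead, "100 dollars"), "100"),
  identical(extract_all(neg_ahead, "100 pesos"), "100"),
  identical(extract_all(pos_behind, "USD100"), "100"),
  identical(extract_all(neg_behind, "JPY100"), "100"),
  # The negative lookahead should reject the string the positive one accepts.
  length(extract_all(neg_ahead, "100 dollars")) == 0
)
```

The last assertion is an extra check not in the original list: it confirms that avoid_suffix() actually blocks a match instead of being ignored.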
Finally, as Hadley suggested, you need to start thinking about a prefix for the function names to avoid namespace collisions. I suggest we go for rx_, so it would be rx_whatever(), rx_digit() or rx_count().
Thanks for this! One thing to keep in mind is that the following currently exist:
anything()
anything_but()
something()
something_but()
Where:
# matches anything, including nothing i.e. an empty character
anything()
#> [1] "(?:.*)"
anything_but(value = "foo")
#> [1] "(?:[^foo]*)"
# matches something, excluding nothing
something()
#> [1] "(?:.+)"
something_but(value = "foo")
#> [1] "(?:[^foo]+)"
grepl(anything(), "")
#> [1] TRUE
grepl(something(), "")
#> [1] FALSE
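One detail worth keeping in mind about the _but() output shown above: the value ends up inside a character class, so [^foo] excludes the individual characters "f" and "o" rather than the word "foo". A minimal base-R sketch, with the pattern written out by hand and with ^ and $ anchors added for this example only, so that the whole string has to match:

```r
# Pattern as produced by anything_but(value = "foo"), anchored on both ends.
pattern <- paste0("^", "(?:[^foo]*)", "$")

no_f_or_o <- grepl(pattern, "bar", perl = TRUE)    # no "f" or "o" anywhere
has_o <- grepl(pattern, "of", perl = TRUE)         # contains "o" and "f"
has_f <- grepl(pattern, "beef", perl = TRUE)       # contains "f", but not "foo"
empty <- grepl(pattern, "", perl = TRUE)           # * also allows nothing

print(c(no_f_or_o, has_o, has_f, empty))
#> [1]  TRUE FALSE FALSE  TRUE
```

So "beef" is rejected even though it does not contain "foo"; excluding a whole word would need a different construct, such as a negative lookahead.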
I like the idea of anything(), whatever(), everything(), and something(), but they all sound greedy to me. What about anything_lazy(), something_lazy()?
This would create 3 options for each, a total of 6 functions:
anything() matches literally anything, including nothing.
anything_but() matches anything but whatever you give it.
anything_lazy() matches anything as little as needed.
something() matches something, excluding nothing.
something_but() matches something but whatever you give it.
something_lazy() matches something as little as needed.
You could then do something like:
something_lazy <- function(.data = NULL) {
  paste0(.data, "(?:.+?)")
}
anything_lazy <- function(.data = NULL) {
  paste0(.data, "(?:.*?)")
}
x <- seek_prefix(value = "(") %>%
  something_lazy() %>%
  seek_suffix(")")
x
#> [1] "(?<=\\()(?:.+?)(?=\\))"
stringr::str_extract_all("(extract) foo (me) then anything ()", x)
#> [[1]]
#> [1] "extract" "me"
y <- seek_prefix(value = "(") %>%
  anything_lazy() %>%
  seek_suffix(")")
y
#> [1] "(?<=\\()(?:.*?)(?=\\))"
stringr::str_extract_all("(extract) foo (me) then anything ()", y)
#> [[1]]
#> [1] "extract" "me" ""
Not sure if this is the right way, but for whatever reason something() and anything() make sense to me.
Also, the lookaround functions and count() look great. rx_ sounds like a good prefix as well. I might just pull the trigger and add rx_ once I get home. Take a look at #1 as well, do you like rx_ better than vex_? I like both, rx_ is nice because it's shorter.
After some thinking I agree that introducing more synonyms for non-greedy variants of existing functions is a bad idea (I like the implicit "laziness" of whatever() though :) ).
Having said that, I don't like the _lazy suffix. There are potentially many functions that may need to be turned non-greedy. Examples are one_or_more() (the + modifier, and its * counterpart, which you could call none_or_more()), or even anything_but(), something_but() and generally every function that results in a regex piece ending with a + or * modifier.
Could the "non-greediness" be turned on in any of these by an argument called lazy? We can even leave greedy=!lazy for more advanced users (who know what "greediness" is).
I think the more fundamental decision that we seem to have landed on is that the default verbs should be greedy (to match default Perl-style regex behavior). Although I just want to drop it in here that we could think of an rx world where lazy verbs are the defaults and you need to turn greediness on. This is a difficult world to comprehend for me right now.
References: Lazy vs greedy
P.S. One afterthought: when inventing new verbs, we should probably try and stay closer to the words that have been implemented in other languages of VerbalExpressions. That would honor the work done by others and make transitions between languages smoother. We are free to invent arguments, though. This means that anything(), anything_but() are here to stay.
Good point, _lazy() isn't going to cut it. rex uses: type = c("greedy", "lazy", "possessive"), what do you think? With the prefix and constructor, it would look like:
x <- rx() %>%
  seek_prefix("(") %>%
  anything(type = "lazy") %>%
  seek_suffix(")")
Regarding lazy by default, I like the idea but regular expressions have been greedy by default for so long that it may just be confusing. There is a thread about this here.
LGTM. Do you want a PR?
Up to you. I don't mind adding the changes but I'd rather let people contribute if they want. So just let me know 😄
Oh and if you do submit a PR, please let me know what you're working on so we don't duplicate efforts. I worked a bit on adding the rx_ prefix last night so no need to work on that.
I am adding lookarounds and implementing the type (lazy/greedy) argument in relevant functions.
Last question: do you think the argument should be called type or mode? Or something else yet?
I think mode sounds better than type.
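For reference, a rough sketch of what a mode argument could look like, following the rex-style choices mentioned above. This is an assumption about one possible shape, not the implementation that ended up in the package: mode_suffix() is a hypothetical helper, and the verbs are shown without the rx_ prefix.

```r
# Hypothetical helper: map a mode name to the quantifier suffix it implies.
mode_suffix <- function(mode) {
  switch(mode, greedy = "", lazy = "?", possessive = "+")
}

# match.arg() picks the first choice ("greedy") by default and
# raises an error for anything outside the listed choices.
anything <- function(.data = NULL, mode = c("greedy", "lazy", "possessive")) {
  mode <- match.arg(mode)
  paste0(.data, "(?:.*", mode_suffix(mode), ")")
}

something <- function(.data = NULL, mode = c("greedy", "lazy", "possessive")) {
  mode <- match.arg(mode)
  paste0(.data, "(?:.+", mode_suffix(mode), ")")
}

# Same pattern as the earlier lazy example, with lookarounds typed by hand.
lazy_between_parens <- paste0(anything("(?<=\\(", mode = "lazy"), "(?=\\))")
print(lazy_between_parens)
#> [1] "(?<=\\()(?:.*?)(?=\\))"
```

Using match.arg() keeps greedy as the default, in line with the decision above, and gives a clear error for a misspelled mode instead of silently returning NULL.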
Add ways to express lookarounds. This was brought up by @dmi3kno and he mentioned a pretty intuitive way using step_ahead() and step_back().
Source: https://twitter.com/dmi3k/status/1103401979152355328