VerbalExpressions / RVerbalExpressions

💬 Create regular expressions easily
https://rverbalexpressions.netlify.com/

Add lookarounds #3

Closed tylerlittlefield closed 5 years ago

tylerlittlefield commented 5 years ago

Add ways to express lookarounds. This was brought up by @dmi3kno, who suggested a fairly intuitive approach using step_ahead() and step_back().

Source: https://twitter.com/dmi3k/status/1103401979152355328

dmi3kno commented 5 years ago

Suggested functions (need good unit testing):

step_ahead <- function(.data=NULL){
  val <- regmatches(.data, regexpr("(?<=\\(\\?\\:)[^\\(\\?\\:]*(?=\\)\\??$)", .data, perl = TRUE))
  if (!length(val)) return(.data)
  post <- regmatches(.data, regexpr("(?<=\\(\\?\\:)[^\\(\\?\\:]*$", .data, perl = TRUE))
  pre <- regmatches(.data, regexpr("^.*(?=\\(\\?\\:[^\\(\\?\\:]*$)", .data, perl = TRUE))
  paste0(pre, "(?<=", post)
}
step_back <- function(.data=NULL){
  val <- regmatches(.data, regexpr("(?<=\\(\\?\\:)[^\\(\\?\\:]*(?=\\)\\??$)", .data, perl = TRUE))
  if (!length(val)) return(.data)
  post <- regmatches(.data, regexpr("(?<=\\(\\?\\:)[^\\(\\?\\:]*$", .data, perl = TRUE))
  pre <- regmatches(.data, regexpr("^.*(?=\\(\\?\\:[^\\(\\?\\:]*$)", .data, perl = TRUE))
  paste0(pre, "(?=", post)
}

These functions boil down to tracing back and modifying the previous find() or maybe(), if one is detected.
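As a rough illustration of that trace-back, the sketch below copies step_ahead() from above and feeds it hand-written strings of the shape find() produces (the inputs are assumptions, not output from the package). A trailing (?:abc) is rewritten into (?<=abc), but a group whose content contains (, ? or : is returned unchanged, because the character class in the detection regex excludes those characters. That is one of the cases the unit tests would need to cover.

```r
# step_ahead() copied from the suggestion above, unchanged in behavior
step_ahead <- function(.data = NULL) {
  val <- regmatches(.data, regexpr("(?<=\\(\\?\\:)[^\\(\\?\\:]*(?=\\)\\??$)", .data, perl = TRUE))
  if (!length(val)) return(.data)
  post <- regmatches(.data, regexpr("(?<=\\(\\?\\:)[^\\(\\?\\:]*$", .data, perl = TRUE))
  pre <- regmatches(.data, regexpr("^.*(?=\\(\\?\\:[^\\(\\?\\:]*$)", .data, perl = TRUE))
  paste0(pre, "(?<=", post)
}

# Hand-written inputs (assumed to mirror what find() would emit)
plain   <- step_ahead("(?:abc)")    # trailing group with plain content
escaped <- step_ahead("(?:\\()")    # content contains "(", so no group is detected

print(plain)
print(escaped)
```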

Since step_ahead() and step_back() are modifiers of find, perhaps they can be an argument in find (with default being step=0 or step=NULL):

x <- find("(", step='after') %>%     # or find("(", step=1)
  begin_capture() %>% 
  anything() %>% 
  find(")", step='before') %>%     # or find(")", step=-1)
  end_capture()

x
#> [1] "((?<=\\()(?:.*)(?=\\)))"

This does not cover negative lookarounds. The water is already getting pretty deep with my complex regex, so perhaps implementing (all kinds of) lookarounds would be easier as find() modifiers, as adverbs such as lookahead/lookbehind, or as synonyms such as stop_before() and start_after()?

tylerlittlefield commented 5 years ago

Thanks for all the suggestions, this is awesome. I like your idea on adding an additional step argument to modify find().

I think the regex in your example, ((?<=\\()(?:.*)(?=\\))), is greedy, so on something like "(extract) foo (me)" it would also swallow the "foo" in between and return "extract) foo (me" as a single match. Maybe add a greedy argument to anything()? Otherwise, you could get by with something like:

x <- find(value = "(", step = 'forward') %>%
  anything_but(")") %>%
  find(")", step = 'backward')

stringr::str_extract_all("(extract) foo (me)", x)
#> [[1]]
#> [1] "extract" "me"  

Or with a greedy argument in anything():

z <- find(value = "(", step = "forward") %>% 
  anything(greedy = FALSE) %>% 
  find(")", step = "backward")

z
#> [1] "(?<=\\()(?:.*?)(?=\\))"

stringr::str_extract_all("(extract) foo (me)", z)
#> [[1]]
#> [1] "extract" "me"  
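To make the difference concrete without the package, here is a small base-R sketch that runs the greedy and the lazy pattern against the same string. extract_all() is a hypothetical helper standing in for stringr::str_extract_all(), and the greedy pattern is written by hand rather than generated.

```r
# Hypothetical helper: return all PCRE matches of `pattern` in a single string
extract_all <- function(pattern, x) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

txt    <- "(extract) foo (me)"
greedy <- "(?<=\\()(?:.*)(?=\\))"   # .*  takes as much as it can
lazy   <- "(?<=\\()(?:.*?)(?=\\))"  # .*? stops at the first closing parenthesis

greedy_hits <- extract_all(greedy, txt)
lazy_hits   <- extract_all(lazy, txt)

print(greedy_hits)  # one long match spanning both parenthesized groups
print(lazy_hits)    # one match per parenthesized group
```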

Reproducible example:

# in case you want to copy paste and run the example above
library(dplyr)

sanitize <- function(.data) {
  escape_chrs <- c(".", "|", "*", "?", "+", "(", ")", "{", "}", "^", "$", "\\", ":", "=", "[", "]")
  string_chrs <- strsplit(.data, "")[[1]]
  idx <- which(string_chrs %in% escape_chrs)
  idx_new <- paste0("\\", string_chrs[idx])
  paste0(replace(string_chrs, idx, idx_new), collapse = "")
}

# add greedy arg
anything <- function(.data = NULL, greedy = TRUE) {
  if(isTRUE(greedy)) {
    paste0(.data, "(?:.*)")
  } else if(isFALSE(greedy)){
    paste0(.data, "(?:.*?)")
  }
}

# add step arg
find <- function(.data = NULL, value, step = NULL) {
  if(is.null(step)) {
    paste0(.data, "(?:", sanitize(value), ")")
  } else if(step == "forward") {
    paste0(.data, "(?<=", sanitize(value), ")")
  } else if(step == "backward") {
    paste0(.data, "(?=", sanitize(value), ")")
  }
}

dmi3kno commented 5 years ago

I think greedy as an argument looks a bit ugly. How about making lazy (non-greedy) versions of anything() and everything()?

regex                 greedy                    non-greedy
 .*                   anything()                whatever()
 .+                   everything()              something()    

I think find(), as it stands now, should only initiate a non-capturing group. We need another group of verbs for creating lookarounds (positive and negative): seek_suffix, seek_prefix and avoid_suffix, avoid_prefix.

seek_prefix <- function(.data = NULL, value) {
    paste0(.data, "(?<=", sanitize(value), ")")
}

seek_suffix <- function(.data = NULL, value) {
    paste0(.data, "(?=", sanitize(value), ")")
}

avoid_prefix <- function(.data = NULL, value) {
    paste0(.data, "(?<!", sanitize(value), ")")
}

avoid_suffix <- function(.data = NULL, value) {
    paste0(.data, "(?!", sanitize(value), ")")
}

I also think that an exact number of repetitions can be expressed as count() (or n() or repeated()):

count <- function(.data = NULL, n = 1) {
  paste0(.data, "{", n,"}")
}

Here are some unit tests for lookarounds, all returning the single value 100:

# positive lookahead
x <- start_of_line() %>% 
  digit() %>% count(3) %>% 
  seek_suffix(" dollars")
x
stringr::str_extract_all("100 dollars", x)

# negative lookahead
x <- start_of_line() %>% 
  digit() %>% count(3) %>%
  avoid_suffix(" dollars")
x
stringr::str_extract_all("100 pesos", x)

# positive lookbehind
x <- seek_prefix(value="USD") %>% 
  digit() %>% count(3)
x
stringr::str_extract_all("USD100", x)

# negative lookbehind
x <- avoid_prefix(value="USD") %>% 
  digit() %>% count(3)
x
stringr::str_extract_all("JPY100", x)
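The four checks above can be made self-verifying. The sketch below is one way to do that with base R only, assuming R >= 4.1 for the native pipe. start_of_line(), digit() and extract_all() are minimal stand-ins written for this example (assumptions, not the package's code); sanitize(), count() and the lookaround verbs are copied from this thread.

```r
# sanitize() as posted earlier in this thread
sanitize <- function(.data) {
  escape_chrs <- c(".", "|", "*", "?", "+", "(", ")", "{", "}", "^", "$", "\\", ":", "=", "[", "]")
  string_chrs <- strsplit(.data, "")[[1]]
  idx <- which(string_chrs %in% escape_chrs)
  idx_new <- paste0("\\", string_chrs[idx])
  paste0(replace(string_chrs, idx, idx_new), collapse = "")
}

# Lookaround verbs and count() as proposed above
seek_prefix  <- function(.data = NULL, value) paste0(.data, "(?<=", sanitize(value), ")")
seek_suffix  <- function(.data = NULL, value) paste0(.data, "(?=", sanitize(value), ")")
avoid_prefix <- function(.data = NULL, value) paste0(.data, "(?<!", sanitize(value), ")")
avoid_suffix <- function(.data = NULL, value) paste0(.data, "(?!", sanitize(value), ")")
count        <- function(.data = NULL, n = 1) paste0(.data, "{", n, "}")

# Stand-ins (assumed) for verbs that live in the package
start_of_line <- function(.data = NULL) paste0(.data, "^")
digit         <- function(.data = NULL) paste0(.data, "\\d")

# Hypothetical helper replacing stringr::str_extract_all() for one string
extract_all <- function(pattern, x) regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]

pos_ahead  <- start_of_line() |> digit() |> count(3) |> seek_suffix(" dollars")
neg_ahead  <- start_of_line() |> digit() |> count(3) |> avoid_suffix(" dollars")
pos_behind <- seek_prefix(value = "USD") |> digit() |> count(3)
neg_behind <- avoid_prefix(value = "USD") |> digit() |> count(3)

stopifnot(
  identical(extract_all(pos_ahead,  "100 dollars"), "100"),
  identical(extract_all(neg_ahead,  "100 pesos"),   "100"),
  identical(extract_all(pos_behind, "USD100"),      "100"),
  identical(extract_all(neg_behind, "JPY100"),      "100")
)
```

Each negative lookaround should also reject the input its positive twin accepts, which is worth asserting alongside the cases above.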

Finally, as Hadley suggested, you need to start thinking about a prefix for the function names to avoid namespace collisions. I suggest we go for rx_, so it would be rx_whatever(), rx_digit() or rx_count().

tylerlittlefield commented 5 years ago

Thanks for this! One thing to keep in mind is that the following currently exist:

  1. anything()
  2. anything_but()
  3. something()
  4. something_but()

Where:

# matches anything, including nothing i.e. an empty character
anything()
#> [1] "(?:.*)"
anything_but(value = "foo")
#> [1] "(?:[^foo]*)"

# matches something, excluding nothing
something()
#> [1] "(?:.+)"
something_but(value = "foo")
#> [1] "(?:[^foo]+)"

grepl(anything(), "")
#> [1] TRUE
grepl(something(), "")
#> [1] FALSE

I like the idea of anything(), whatever(), everything(), and something() but they all sound greedy to me. What about anything_lazy(), something_lazy()?

This would create 3 options for each, a total of 6 functions:

  1. anything() matches literally anything, including nothing.
  2. anything_but() matches anything but whatever you give it.
  3. anything_lazy() matches anything as little as needed.
  4. something() matches something, excluding nothing.
  5. something_but() matches something but whatever you give it.
  6. something_lazy() matches something as little as needed.

You could then do something like:

something_lazy <- function(.data = NULL) {
  paste0(.data, "(?:.+?)")
}

anything_lazy <- function(.data = NULL) {
  paste0(.data, "(?:.*?)")
}

x <- seek_prefix(value = "(") %>% 
  something_lazy() %>% 
  seek_suffix(")")
x
#> [1] "(?<=\\()(?:.+?)(?=\\))"
stringr::str_extract_all("(extract) foo (me) then anything ()", x)
#> [[1]]
#> [1] "extract" "me"

y <- seek_prefix(value = "(") %>% 
  anything_lazy() %>% 
  seek_suffix(")")
y
#> [1] "(?<=\\()(?:.*?)(?=\\))"
stringr::str_extract_all("(extract) foo (me) then anything ()", y)
#> [[1]]
#> [1] "extract" "me"      ""

Not sure if this is the right way, but for whatever reason something() and anything() make sense to me.

Also, the lookaround functions and count() look great. rx_ sounds like a good prefix as well. I might just pull the trigger and add rx_ once I get home. Take a look at #1 as well, do you like rx_ better than vex_? I like both, rx_ is nice because it's shorter.

dmi3kno commented 5 years ago

After some thinking, I agree that introducing more synonyms for non-greedy variants of existing functions is a bad idea (I like the implicit "laziness" of whatever() though :) ).

Having said that, I don't like the _lazy suffix. There are potentially many functions that may need to be turned non-greedy. Examples are one_or_more() (the + modifier) and its * counterpart (which you could call none_or_more()), or even anything_but(), something_but(), and generally every function that results in a regex piece ending with a + or * modifier.

Could the "non-greediness" be turned on in any of these by an argument called lazy? We can even leave greedy=!lazy for more advanced users (who know what "greediness" is).

I think the more fundamental decision that we seem to have landed on is that the default verbs should be greedy (to match default Perl-style regex behavior). I just want to drop in here, though, that we could think of an rx world where lazy verbs are the default and you need to turn greediness on. This is a difficult world for me to comprehend right now.

References: Lazy vs greedy

P.S. One afterthought: when inventing new verbs, we should probably try to stay close to the words that have been implemented in the other language ports of VerbalExpressions. That would honor the work done by others and make transitions between languages smoother. We are free to invent arguments, though. This means that anything() and anything_but() are here to stay.

tylerlittlefield commented 5 years ago

Good point, _lazy() isn't going to cut it. rex uses: type = c("greedy", "lazy", "possessive"), what do you think? With the prefix and constructor, it would look like:

x <- rx() %>% 
  seek_prefix("(") %>% 
  anything(type = "lazy") %>% 
  seek_suffix(")")

Regarding lazy by default, I like the idea but regular expressions have been greedy by default for so long that it may just be confusing. There is a thread about this here.

dmi3kno commented 5 years ago

LGTM. Do you want a PR?

tylerlittlefield commented 5 years ago

Up to you. I don't mind adding the changes but I'd rather let people contribute if they want. So just let me know 😄

tylerlittlefield commented 5 years ago

Oh and if you do submit a PR, please let me know what you're working on so we don't duplicate efforts. I worked a bit on adding the rx_ prefix last night so no need to work on that.

dmi3kno commented 5 years ago

I am adding lookarounds and implementing the type (lazy/greedy) argument in the relevant functions.

dmi3kno commented 5 years ago

Last question: do you think the argument should be called type or mode? Or something else entirely?

tylerlittlefield commented 5 years ago

I think mode sounds better than type.
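For reference, a minimal sketch of what a mode argument could look like, assuming the three values quoted from rex earlier in the thread. quantifier_suffix() and anything_sketch() are hypothetical names for illustration only. This is not the code that was merged.

```r
# Hypothetical helper: map a mode name to the quantifier suffix it implies
quantifier_suffix <- function(mode = c("greedy", "lazy", "possessive")) {
  mode <- match.arg(mode)  # errors on anything outside the three choices
  switch(mode, greedy = "", lazy = "?", possessive = "+")
}

# Hypothetical verb: anything() with a mode argument, greedy by default
anything_sketch <- function(.data = NULL, mode = "greedy") {
  paste0(.data, "(?:.*", quantifier_suffix(mode), ")")
}

greedy_rx     <- anything_sketch()
lazy_rx       <- anything_sketch(mode = "lazy")
possessive_rx <- anything_sketch(mode = "possessive")

# Lazy version of the parenthesis example, with hand-written lookarounds
between_parens <- paste0(anything_sketch("(?<=\\()", mode = "lazy"), "(?=\\))")
print(between_parens)
```

Keeping the suffix lookup in one helper means every verb that ends in + or * could share it, which was the objection to a per-function _lazy suffix.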