VerbalExpressions / RVerbalExpressions

:speech_balloon: Create regular expressions easily
https://rverbalexpressions.netlify.com/

Character sets #9

Open dmi3kno opened 5 years ago

dmi3kno commented 5 years ago

Problem

I think the package will be incomplete until we find a way to express groups of characters. Here's a challenge to express email pattern matching in rx:

[Image: regex-example]

Challenges

First of all, I don't know of a way to express a single "word" character (alnum + _). We used rx_word to denote \\w+, and perhaps it should have been rx_word_char() %>% rx_one_or_more().

rx_char <- function(.data = NULL, value=NULL) {
  if(missing(value))
    return(paste0(.data, "\\w"))
  paste0(.data, sanitize(value))
}

I also extended rx_count to handle ranges of input:

rx_count <- function(.data = NULL, n = 1) {
  if(length(n)>1){
    n[is.na(n)]<-""
    return(paste0(.data, "{", n[1], "," , n[length(n)], "}"))
  }
  paste0(.data, "{", n,"}")
}
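
For illustration, here is how the extended rx_count above behaves for a single count, a closed range, and an open-ended range. The function body is copied from the post so the snippet runs on its own. Passing NA to leave a bound open is inferred from the `n[is.na(n)] <- ""` line, not stated in the post.

```r
# rx_count as defined above, repeated so the snippet is self-contained
rx_count <- function(.data = NULL, n = 1) {
  if (length(n) > 1) {
    n[is.na(n)] <- ""
    return(paste0(.data, "{", n[1], ",", n[length(n)], "}"))
  }
  paste0(.data, "{", n, "}")
}

exact  <- rx_count("a", 3)         # exactly three repetitions
ranged <- rx_count("a", 2:6)       # between two and six repetitions
open   <- rx_count("a", c(2, NA))  # two or more: NA leaves the upper bound empty

exact
#> [1] "a{3}"
ranged
#> [1] "a{2,6}"
open
#> [1] "a{2,}"
```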

Next, we don't have a way to express word boundaries (\\b), and it might be useful to denote them. We shall call this function rx_word_edge.

rx_word_start <- function(.data = NULL){
  paste0(.data, "\\b")
}

rx_word_end <- rx_word_start
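
A small sketch of what the boundary helpers buy us, using only base R. The two functions are repeated from above, and the word "cat" and the test strings are made up for the example.

```r
# Boundary helpers as defined above
rx_word_start <- function(.data = NULL) {
  paste0(.data, "\\b")
}
rx_word_end <- rx_word_start

# Build \bcat\b without a pipe: start boundary, literal text, end boundary
pattern <- rx_word_end(paste0(rx_word_start(), "cat"))

whole_word  <- grepl(pattern, "the cat sat", perl = TRUE)  # "cat" stands alone
inside_word <- grepl(pattern, "concatenate", perl = TRUE)  # "cat" is inside a longer word

whole_word
#> [1] TRUE
inside_word
#> [1] FALSE
```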

Finally, our biggest problem is that there's no way to express groups of characters, other than through rx_any_of(), but if we pass other rx expressions, values will be sanitized twice, meaning that we will get four backslashes before each symbol instead of two.

# this function is exactly like rx_any_of() but without sanitization
rx_group <- function(.data = NULL, value) {
  paste0(.data, "[", value, "]")
}
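
To make the double-sanitization problem concrete, here is a minimal sketch. `sanitize_sketch()` is a hypothetical stand-in for the package's internal sanitize(); the real function may escape a different set of characters. The point is only that an already-escaped `\\w` gets its backslash escaped again.

```r
# Hypothetical stand-in for sanitize(): escapes regex metacharacters,
# including the backslash itself.
sanitize_sketch <- function(x) {
  gsub("([.\\\\|()\\[\\]{}^$*+?])", "\\\\\\1", x, perl = TRUE)
}

word_char    <- "\\w"                       # what rx_char() returns with no value
re_sanitized <- sanitize_sketch(word_char)  # the backslash is escaped again

broken <- paste0("[", re_sanitized, "]")  # sanitizing variant, like rx_any_of()
intact <- paste0("[", word_char, "]")     # non-sanitizing variant, like rx_group()

re_sanitized
#> [1] "\\\\w"
grepl(intact, "a", perl = TRUE)  # the set still contains \w
#> [1] TRUE
grepl(broken, "a", perl = TRUE)  # the set now holds a literal backslash and "w"
#> [1] FALSE
```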

Solution

Here's what it looks like when we put all pieces together:

x <- rx_word_start() %>% 
  rx_group(
    rx() %>% 
      rx_char() %>% 
      rx_char(".%+-")
  ) %>%
  rx_one_or_more() %>% 
  rx_char("@") %>% 
  rx_group(
    rx() %>% 
      rx_char() %>% 
      rx_char(".-")
  ) %>% 
  rx_one_or_more() %>% 
  rx_char(".") %>% 
  rx_alpha() %>% 
  rx_count(2:6) %>% 
  rx_word_end()
x
#> [1] "\\b[\\w\\.%\\+-]+@[\\w\\.-]+\\.[[:alpha:]]{2,6}\\b"

txt <- "This text contains email first.last@gmail.com and noname@post.io. The latter is no longer valid."
regmatches(txt, gregexpr(x, txt, perl = TRUE))
#> [[1]]
#> [1] "first.last@gmail.com" "noname@post.io"  
stringr::str_extract_all(txt, x)
#> [[1]]
#> [1] "first.last@gmail.com" "noname@post.io"  

The code works but I don't like it.

  1. The constructor rx looks redundant (I believe there's a way to get rid of it entirely using a specialized class, see below).
  2. It is not very clear what rx_one_or_more() is referring to. I wonder if all functions should have a rep argument with the default option one and the options some/any, in addition to what rx_count does today.
  3. Should rx_char() without arguments be called rx_wordchar?
  4. Should rx_char() with arguments be called rx_literal() or rx_plain?
  5. We should be very explicit about sanitization of arguments, to the extent that we should just mention: "input will be sanitized".
  6. rx_group is an artificial construct, a duplicate of rx_any_of but without sanitization. Here I see a couple of solutions:
     a. Allow "nested pipes" (as I have done above). Create an S3 class and this way detect when the value argument is not a plain character but an rx_string. Input of this class does not need to be sanitized, because it was sanitized at creation.
     b. Do not allow "nested pipes". Instead, define rx_any_of() to take ... and allow multiple arguments mixing functions and characters. Then a hypothetical pipe would look like this:
    rx_word_edge() %>% 
    rx_any_of(rx_wordchar(), ".%+-", rep="some") %>%
    rx_literal("@") %>% 
    rx_any_of(rx_wordchar(), ".-", rep="some") %>% 
    rx_literal(".") %>% 
    rx_alpha(rep=2:6) %>% 
    rx_word_edge()

    It's a lot to digest, but somehow it is all related to one particular problem. Happy to split the issue once we identify the parts worth tackling.
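
One way the rep argument from point 2 might be wired up, as a rough sketch only. `rep_quantifier()` and `rx_alpha_rep()` are hypothetical names, and the mapping is an assumption: "one" means exactly once, "some" means one or more, "any" means zero or more, and a numeric value behaves like rx_count.

```r
# Hypothetical helper: translate a `rep` value into a regex quantifier.
# Assumed meanings: "one" = exactly once, "some" = one or more, "any" = zero or more.
rep_quantifier <- function(rep = "one") {
  if (is.numeric(rep)) {
    if (length(rep) > 1) {
      return(paste0("{", min(rep), ",", max(rep), "}"))
    }
    return(paste0("{", rep, "}"))
  }
  switch(rep,
    one  = "",
    some = "+",
    any  = "*",
    stop("`rep` must be \"one\", \"some\", \"any\", or numeric")
  )
}

# Hypothetical rx_alpha() variant that accepts `rep`
rx_alpha_rep <- function(.data = NULL, rep = "one") {
  paste0(.data, "[[:alpha:]]", rep_quantifier(rep))
}

rx_alpha_rep(rep = "some")
#> [1] "[[:alpha:]]+"
rx_alpha_rep("\\.", rep = 2:6)
#> [1] "\\.[[:alpha:]]{2,6}"
```

Keeping the translation in one helper would let every rx_* function share the same rep semantics, while rx_one_or_more() and friends could stay as they are.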

tylerlittlefield commented 5 years ago

Hi @dmi3kno I'm going to try and summarize just to make sure I understand this all.

Regarding the challenges:

Regarding the solution:

  1. rx() is redundant but I couldn't get away from needing to pass value parameter at the start so rx was the quick and dirty solution. So more than happy to find a more elegant solution to this. Is it the S3 class you mention or the ... or both?

  2. rx_one_or_more() isn't very clear in the nested pipes example. Using the example from your pull request, am I on the right path with this translation:

# old
x <- rx_word_edge() %>%
  rx_alpha() %>% 
  rx_one_or_more() %>%
  rx_word_edge()

# new
x <- rx_word_edge() %>%
  rx_alpha(rep = "any") %>% 
  rx_word_edge()

  3. Yes, rx_char() (or better yet, rx_literal() as you mention) implies things other than word characters so without an argument it should be rx_word_char() given that this would return \\w.

  4. If rx_char() behaves like I think it does in the example, rx_literal() sounds most fitting to me. rx_literal("@") literally gives you @ and nothing more.

  5. Letting the user know something is going to be sanitized sounds good but might use different words like "special characters will be escaped" or something, don't know if that's clearer to someone (including myself 😅) without much regex knowledge.

  6. I do not like nested pipes, I would prefer to avoid that! The second solution looks much cleaner.

With the latest version of RVerbalExpressions and some of the functions you wrote, the closest I can get without using the rep argument is:

library(RVerbalExpressions)

rx_word_char <- function(.data = NULL, value = NULL) {
  if(missing(value))
    return(paste0(.data, "\\w"))
  paste0(.data, sanitize(value))
}

rx_group <- function(.data = NULL, value) {
  paste0(.data, "[", value, "]")
}

rx_any_of <- function(.data = NULL, value, ...) {
  if(missing(...))
    return(paste0(.data, "[", sanitize(value), "]"))
  paste0(.data, "[", value, sanitize(...), "]")
}

rx_literal <- function(.data = NULL, value) {
  paste0(.data, value)
}

x <- rx_word_edge() %>% 
  rx_any_of(rx_word_char(), ".%+-") %>%
  rx_one_or_more() %>% 
  rx_literal("@") %>% 
  rx_any_of(rx_word_char(), ".-") %>% 
  rx_one_or_more() %>% 
  rx_word_char(".") %>% 
  rx_alpha() %>% 
  rx_count(n = 2:6) %>% 
  rx_word_edge()

txt <- "This text contains email first.last@gmail.com and noname@post.io. The latter is no longer valid."

stringr::str_extract_all(txt, x)[[1]]
#> [1] "first.last@gmail.com" "noname@post.io"

Looking at that long pipe makes the rep argument worth it to me. This would avoid 3 lines (lines 3, 6, and 9).

dmi3kno commented 5 years ago

Sorry for the messy post. I was writing it and contributing new functions at the same time, so it reflects the evolution of my own thinking. I will be more consistent going forward.

rx_literal <- function(.data = NULL, value) {
  res <- paste0(.data, sanitize(value))
  class(res) <- unique(c("rx_string", class(res))) # to avoid accidental double "classing"
  res
}

But then you can do things like:
```r
rx() %>% 
  rx_one_of(rx_word_char(), rx_literal(value="?"), "abc")
#> [1] "[\\w\\?abc]"
```

tylerlittlefield commented 5 years ago

The only thing I feel we should watch out for is that by adding all of these modifying arguments we are on the way back to complexity and away from an intuitive interface. So I say we keep both rx_one_or_more() and rx_none_or_more() as well as implement a more concise rep interface.

100% agree, I would rather have an intuitive API that does less rather than a somewhat clunky API that can do a whole lot. Given the number of functions that have been added, I wonder if a vignette covering common regex use cases and which functions to use would be helpful?

rx_one_of <- function(.data = NULL, ... ) {
  args <- sapply(list(...), function(x) if(inherits(x, "rx_string")) x else sanitize(x)) 
  args_str <- Reduce(paste0, args)
  paste0(.data, "[", args_str, "]")
}

Looks great, I haven't done much or any programming using ellipses but this looks much more elegant! Very excited about this.
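
For anyone who wants to try the idea without the package installed, here is a self-contained sketch of the rx_string approach. It rests on a few assumptions: `sanitize_sketch()` is a hypothetical stand-in for the package's sanitize(), `as_rx_string()` is a made-up helper, and rx_word_char() is assumed to return a classed rx_string too (the version earlier in the thread returns a plain character). The call avoids the pipe so that only base R is needed.

```r
# Hypothetical stand-in for the package's sanitize()
sanitize_sketch <- function(x) {
  gsub("([.\\\\|()\\[\\]{}^$*+?])", "\\\\\\1", x, perl = TRUE)
}

# Hypothetical helper: tag a string as already-escaped regex
as_rx_string <- function(x) {
  class(x) <- unique(c("rx_string", class(x)))
  x
}

rx_word_char <- function(.data = NULL) {
  as_rx_string(paste0(.data, "\\w"))
}

rx_literal <- function(.data = NULL, value) {
  as_rx_string(paste0(.data, sanitize_sketch(value)))
}

# Escape plain strings, pass rx_string inputs through untouched
rx_one_of <- function(.data = NULL, ...) {
  args <- vapply(
    list(...),
    function(x) if (inherits(x, "rx_string")) unclass(x) else sanitize_sketch(x),
    character(1)
  )
  paste0(.data, "[", paste(args, collapse = ""), "]")
}

char_set <- rx_one_of(NULL, rx_word_char(), rx_literal(value = "?"), "abc")
char_set
#> [1] "[\\w\\?abc]"
```

Because each argument is checked individually, this handles any number of inputs in ..., which the earlier rx_any_of() draft with sanitize(...) would not.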

dmi3kno commented 5 years ago

To do here: