Character class helpers

VerbalExpressions / RVerbalExpressions

:speech_balloon: Create regular expressions easily

https://rverbalexpressions.netlify.com/

Character class helpers #8

Closed dmi3kno closed 5 years ago

dmi3kno commented 5 years ago

Should we add generic character class helpers:

# rx_digit() # done
rx_alnum()
rx_alpha()
rx_lowercase()
rx_uppercase()
rx_space()
rx_punctuation() 
rx_whitespace()
rx_non_whitespace()
rx_tab()

tylerlittlefield commented 5 years ago

Agree, these need to be added. rx_whitespace and rx_tab exist as well. These use the \\s style instead of R's [[:space:]] style. We should decide which style we prefer. Not sure whether either one has particular benefits or drawbacks.
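To make the two styles concrete, here is a minimal base R sketch (no RVerbalExpressions functions involved) that checks both patterns against a few sample strings. The sample vector is an assumption chosen for illustration; it only shows that the two styles agree on these inputs, not on every input or locale.

```r
# Base R only; the sample strings are illustrative assumptions.
x <- c("a b", "a\tb", "a\nb", "ab")

# Perl-style shorthand (the backslash is doubled inside an R string).
perl_style <- grepl("\\s", x, perl = TRUE)

# POSIX bracket class, the style used in much of R's own documentation.
posix_style <- grepl("[[:space:]]", x, perl = TRUE)

print(perl_style)
print(posix_style)

# Both styles flag the same elements for this input.
print(identical(perl_style, posix_style))
```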

[x] rx_digit()
[x] rx_alnum()
[x] rx_alpha()
[x] rx_lowercase()
[x] rx_uppercase()
[x] rx_space()
[x] rx_punctuation()
[x] rx_whitespace()
[x] rx_tab()

The checked-off items still need a few more tests. Will add more later.

dmi3kno commented 5 years ago

Could we add an inverse argument to those functions where an inverse regular expression exists? What I mean is that instead of implementing rx_non_whitespace we implement rx_whitespace(inverse=TRUE), with inverse set to FALSE by default in all of these functions. I am a little unprepared to deal with ! and I feel the number of functions is getting a little out of hand.

Let's try to stick with \\s style regex, where possible.
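The proposal above can be sketched in a few lines of base R. The helper name whitespace_pattern is hypothetical and is not the package's actual code; it only illustrates how one function with an inverse flag could stand in for a separate "non-whitespace" function.

```r
# Hypothetical helper for illustration; not the RVerbalExpressions implementation.
whitespace_pattern <- function(inverse = FALSE) {
  # \\s matches a whitespace character, \\S matches anything that is not one.
  if (inverse) "\\S" else "\\s"
}

x <- c(" ", "\t", "a", "1")

# Default: which elements contain whitespace?
is_ws <- grepl(whitespace_pattern(), x, perl = TRUE)

# Inverse: which elements contain a non-whitespace character?
is_not_ws <- grepl(whitespace_pattern(inverse = TRUE), x, perl = TRUE)

print(is_ws)
print(is_not_ws)
```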

tylerlittlefield commented 5 years ago

Sounds good to me, the fewer functions the better! I will start adding the inverse arg later today.

tylerlittlefield commented 5 years ago

As of f4c27ff, most character classes (except upper and lowercase) have the inverse argument. I figured it isn't needed for those two, because if you didn't want lowercase, it would make more sense to use rx_uppercase() instead of rx_lowercase(inverse = TRUE)?

Another thought came to my mind regarding rx_something_but() and rx_anything_but(): these are similar to the inverse argument. For example, if you did rx_something_but(rx_alpha()), a + would be added and the expression enclosed in a non-capturing group, like so: (?:[^[:alpha:]]+), whereas rx_alpha(inverse = TRUE) gives [^[:alpha:]]. I still don't really understand groups but I thought I would bring this up.
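The practical difference between the two patterns mentioned above is how much they consume. The base R sketch below uses the grouped pattern written with balanced brackets and an input string chosen as an assumption for illustration; it does not call any RVerbalExpressions function.

```r
# Illustrative input: letters, then a run of digits, then letters.
x <- "ab123cd"

# Negated class alone: matches exactly one non-alphabetic character.
single <- regmatches(x, regexpr("[^[:alpha:]]", x, perl = TRUE))

# Same class with + inside a non-capturing group: matches the whole run.
run <- regmatches(x, regexpr("(?:[^[:alpha:]]+)", x, perl = TRUE))

print(single)
print(run)
```

The non-capturing group itself does not change what is matched here; it only keeps the quantified class together as one unit if more pattern is appended after it.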

dmi3kno commented 5 years ago

"Anything, but lower case" is not the same as "uppercase". You have numbers and special symbols. Uppercase will lock you to [A-Z]

dmi3kno commented 5 years ago

regarding rx_something_but() and rx_anything_but()

These are indeed similar, except that they "wrap" the group. We could call rx_something_but() a rx_none_of(rep="some"). They are most likely going to get the same ... interface as rx_one_of and allow heterogeneous argument sets like:

# I want to exclude matching N09847newest and N83678old
rx_anything_but("N", rx_digit(rep="some"), rx_alpha(rep="some"))

Your individual inverse=TRUE arguments are useful in pipes, when you want to designate a character via negativa, e.g. "a symbol other than uppercase here". Yes, you can wrap a positive rx statement in rx_anything_but every time, but that would make pipes ugly.

one_of and none_of (which anything_but is) are specially reserved for "grouped" operations, where you want to exclude a sequence of different statements (mix of literals and expressions), IMHO

tylerlittlefield commented 5 years ago

Ah, I see. Inverse added to upper/lower case functions.
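The distinction raised earlier in the thread, that "anything but lowercase" is wider than "uppercase", can be checked with plain character classes in base R. The test characters below are assumptions picked for illustration.

```r
# One uppercase letter, one lowercase letter, one digit, one symbol.
x <- c("A", "a", "7", "_")

# "Not lowercase": accepts the uppercase letter, the digit and the symbol.
not_lower <- grepl("[^a-z]", x, perl = TRUE)

# "Uppercase": accepts only the uppercase letter.
upper <- grepl("[A-Z]", x, perl = TRUE)

print(not_lower)
print(upper)
```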

tylerlittlefield commented 5 years ago

@dmi3kno When you get a chance, please review the inverse rules for character classes. I've just reverted them to the \\s style instead of R's [[:space:]]. There were a couple of things that I'm a little stuck on:

rx_digits inverse gives [^\\d", "{", n, "}] but this doesn't seem to work as expected. Should digits have an inverse argument?
rx_space: if inverse is FALSE, it matches " " (a literal space), and if TRUE it matches [^ ] (not a space). I did it this way because it seems that \\s and \\S include more than just a space (newlines, carriage returns, tabs).
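Both questions above can be explored in base R. The sketch below assumes n = 2 for the digit case and uses sample inputs chosen for illustration. It shows one likely reason the digit pattern misbehaves: a {n} quantifier placed inside the brackets is read as the literal characters {, 2 and }, so it has to sit outside the class. It also shows that a literal space and \\s differ on tabs and newlines.

```r
# Quantifier inside the brackets: {, 2 and } become excluded literals.
inside <- "[^\\d{2}]"
inside_hits <- grepl(inside, c("a", "{", "2", "}"), perl = TRUE)

# Quantifier outside the brackets: two consecutive non-digits.
outside <- "[^\\d]{2}"
outside_hits <- grepl(outside, c("ab", "a1", "12"), perl = TRUE)

# A literal space versus the broader \\s class.
ws <- c(" ", "\t", "\n")
literal_space <- grepl(" ", ws, fixed = TRUE)
any_whitespace <- grepl("\\s", ws, perl = TRUE)

print(inside_hits)
print(outside_hits)
print(literal_space)
print(any_whitespace)
```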

dmi3kno commented 5 years ago

I have dropped rx_digits in my branch. Now that we have the compact rep argument, this function is a little redundant.

I will look at the rx_space again. It might be that rx_whitespace is enough and the rest is achievable with rx_literal

tylerlittlefield commented 5 years ago

I wonder if negate sounds like a better argument instead of inverse?

dmi3kno commented 5 years ago

Agree. Let me do this one

tylerlittlefield commented 5 years ago

Great, sounds good 👍