Closed dmi3kno closed 5 years ago
Agree, these need to be added. rx_whitespace, rx_tab exist as well. These use the \\s style instead of R's [[:space:]] style. We should decide which style we prefer. Not sure if there are benefits or consequences of either.
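For comparison, here is a minimal sketch of the two styles side by side in base R. The sample strings are made up for illustration, and perl = TRUE is assumed so both patterns go through the same engine; this is not code from the package.

```r
# Made-up sample strings: space, tab, newline, and no whitespace at all
x <- c("a b", "a\tb", "a\nb", "ab")

# Escape style: the backslash has to be doubled inside an R string
escape_style <- grepl("\\s", x, perl = TRUE)

# POSIX bracket style: no escaping needed, but more verbose
posix_style <- grepl("[[:space:]]", x, perl = TRUE)

# Do the two styles agree on these inputs?
identical(escape_style, posix_style)
```

On plain ASCII input like this the two styles should agree, so the choice is mostly about readability and escaping. Behaviour on non-ASCII whitespace may depend on the engine and locale, which this sketch does not cover.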
rx_digit()
rx_alnum()
rx_alpha()
rx_lowercase()
rx_uppercase()
rx_space()
rx_punctuation()
rx_whitespace()
rx_tab()
The stuff checked off still needs a bit more tests. Will add more later.
Could we add an inverse argument for those functions where we have an inverse regex expression? What I am talking about is, instead of implementing rx_non_whitespace, we implement rx_whitespace(inverse=TRUE), with inverse set to FALSE by default in all of these functions. I am a little unprepared to deal with ! and I feel the number of functions is getting a little out of hand.
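A rough sketch of what such an argument could look like. The function name rx_whitespace_sketch and its .data argument are hypothetical and only illustrate the idea; this is not the package's actual implementation.

```r
# Hypothetical helper, for illustration only (not the package's code).
# .data is the expression built so far; inverse switches to the negated class.
rx_whitespace_sketch <- function(.data = NULL, inverse = FALSE) {
  # \\s matches a whitespace character, \\S matches anything that is not one
  cls <- if (inverse) "\\S" else "\\s"
  paste0(.data, cls)
}

rx_whitespace_sketch()                # default, inverse = FALSE
rx_whitespace_sketch(inverse = TRUE)  # replaces a separate rx_non_whitespace()

# The negated version rejects a space and accepts a letter
grepl(rx_whitespace_sketch(inverse = TRUE), c(" ", "a"), perl = TRUE)
```

One function with a flag keeps the exported surface small, at the cost of a slightly longer call in the negated case.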
Let's try to stick with \\s style regex, where possible.
Sounds good to me, the fewer functions the better! I will start adding the inverse arg later today.
As of f4c27ff, most character classes (except upper and lowercase) have the inverse argument. I figured they aren't needed because if you didn't want lowercase, it would make more sense to use rx_uppercase() instead of rx_lowercase(inverse = TRUE)?
Another thought came to my mind regarding rx_something_but() and rx_anything_but(). These are similar to the inverse argument: for example, rx_something_but(rx_alpha()) does much the same thing, except that a + is added and the expression is enclosed in a non-capturing group, like so: (?:[^[:alpha:]]+), whereas rx_alpha(inverse = TRUE) gives [^[:alpha:]]. I still don't really understand groups but I thought I would bring this up.
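To make the difference concrete, here is a small base R sketch of the two patterns on a made-up string. It assumes the grouped pattern is meant to have balanced brackets, i.e. (?:[^[:alpha:]]+), and uses perl = TRUE because of the non-capturing group.

```r
# Made-up input: letters, then digits, then letters
x <- "abc123def"

# Single negated class: matches exactly one non-letter character
single <- regmatches(x, regexpr("[^[:alpha:]]", x, perl = TRUE))

# Negated class with + inside a non-capturing group:
# matches a whole run of non-letter characters
run <- regmatches(x, regexpr("(?:[^[:alpha:]]+)", x, perl = TRUE))

single  # "1"
run     # "123"
```

The group itself does not change what is matched here. It only keeps the quantified class together as one unit when more pieces are appended to the expression.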
"Anything, but lower case" is not the same as "uppercase". You have numbers and special symbols. Uppercase will lock you to [A-Z]
Regarding rx_something_but() and rx_anything_but(): these are indeed similar, except that they "wrap" the group. We could call rx_something_but() a rx_none_of(rep="some"). They are most likely going to get the same ... interface as rx_one_of and allow heterogeneous argument sets like:
# I want to exclude matching N09847newest and N83678old
rx_anything_but("N", rx_digit(rep="some"), rx_alpha(rep="some"))
Your individual inverse=TRUE arguments are useful in pipes, when you want to designate a character via negativa, e.g. "a symbol other than uppercase here". Yes, you can wrap a positive rx statement into rx_anything_but every time, but that would make pipes ugly. one_of and none_of (which anything_but is) are specially reserved for "grouped" operations, where you want to exclude a sequence of different statements (mix of literals and expressions), IMHO.
Ah, I see. Inverse added to upper/lower case functions.
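For the record, the point about "anything but lowercase" versus "uppercase" can be checked with a quick base R sketch. The patterns and sample characters below are illustrative only, not taken from the package.

```r
# One uppercase letter, one lowercase letter, a digit and a symbol
x <- c("A", "a", "7", "_")

# "Anything but lowercase": also accepts the digit and the symbol
not_lower <- grepl("^[^a-z]$", x, perl = TRUE)

# "Uppercase": accepts only A-Z
upper <- grepl("^[A-Z]$", x, perl = TRUE)

not_lower  # TRUE FALSE TRUE TRUE
upper      # TRUE FALSE FALSE FALSE
```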
@dmi3kno When you get a chance, please review the inverse rules for character classes. I've just reverted them to the \\s style instead of R's [[:space:]]. There were a couple of things that I'm a little stuck on:
rx_digits: inverse gives [^\\d", "{", n, "}] but this doesn't seem to work as expected. Should digits have an inverse argument?
rx_space: if inverse is FALSE, it matches " " (space) and if TRUE it matches [^ ] (not space). That is because \\s and \\S seem to include more than just a space (newlines, carriage returns, tabs).
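Both points can be reproduced in base R. In the sketch below, the broken pattern is assembled from the pieces quoted above. The name fixed, which moves the quantifier outside the class, is only one possible reading of the intent and is an assumption, not something the package does.

```r
n <- 2

# As quoted above: the quantifier ends up inside the brackets,
# so "{", "2" and "}" become literal members of the negated class
broken <- paste0("[^\\d", "{", n, "}]")  # "[^\\d{2}]"

# Assumed intent: n consecutive non-digits, quantifier outside the class
fixed <- paste0("\\D", "{", n, "}")      # "\\D{2}"

# broken matches a single character that is not a digit, brace or "2"
broken_hits <- grepl(paste0("^", broken, "$"), c("a", "ab", "{"), perl = TRUE)
# fixed matches exactly two non-digits
fixed_hits <- grepl(paste0("^", fixed, "$"), c("a", "ab", "12"), perl = TRUE)

broken_hits  # TRUE FALSE FALSE
fixed_hits   # FALSE TRUE FALSE

# Literal space versus \\s: only \\s also covers tab and newline
ws <- c(" ", "\t", "\n")
space_hits <- grepl(" ", ws, fixed = TRUE)
class_hits <- grepl("\\s", ws, perl = TRUE)

space_hits  # TRUE FALSE FALSE
class_hits  # TRUE TRUE TRUE
```

This explains the rx_digits behaviour: a quantifier inside square brackets is no longer a quantifier. It also shows why a literal " " and \\s are not interchangeable.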
I have dropped rx_digits in my branch. Now that we have the compact rep argument, this function is a little redundant.
I will look at rx_space again. It might be that rx_whitespace is enough and the rest is achievable with rx_literal.
I wonder if negate sounds like a better argument instead of inverse?
Agree. Let me do this one
Great, sounds good 👍
Should we add generic character class helpers: