Open dmi3kno opened 5 years ago
One issue I have is that some of these names abstract things a step further and possibly make it difficult to "google" your way out of confusion. For example, if I don't quite understand what "appenders" do, I'm going to search "appenders regular expressions", which won't give better results than "anchors regular expressions". The same applies to "expression wrappers" vs. "lookarounds".
I think the API should be abstract, expressive, and verbose, but the documentation (when possible) should stay a bit truer to regular expression terminology. I think this would be helpful for both novice and advanced users, even if advanced users might not use this package in the first place. Having said that, the main changes I'm thinking of would be:
Matches positions. Use these to anchor matches at a certain position.
rx_start_of_line
rx_end_of_line
rx_word_edge
Assert matches. Use these to lookaround matches without consuming them.
rx_avoid
rx_seek
Quantify match repetitions. Use these to quantify how often the expression is matched.
rx_one_or_more
rx_none_or_more
rx_zero_or_one (which I've added in the dev branch)
rx_count
If this isn't very convincing, maybe we can compromise and go with what you have suggested, but make sure to include the formal terminology with references within the individual function docs. Maybe with @details.
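To make the three proposed groups concrete, here is a minimal base R sketch of the underlying regex constructs. It does not call the package; the function names in the comments are only my reading of which group each construct belongs to, and the sample strings are made up.

```r
# Base R only. perl = TRUE switches to PCRE, which lookarounds require.
x <- c("cat", "concat", "cats", "cat food")

# Anchors match positions, not characters
# (cf. rx_start_of_line, rx_end_of_line, rx_word_edge).
starts_with_cat <- grepl("^cat", x)
ends_with_cat   <- grepl("cat$", x)
whole_word_cat  <- grepl("\\bcat\\b", x, perl = TRUE)

# Lookarounds assert context without consuming it (cf. rx_seek, rx_avoid).
price  <- "price: 42 USD"
amount <- regmatches(price, regexpr("\\d+(?= USD)", price, perl = TRUE))
no_bar <- grepl("foo(?!bar)", c("foobar", "foobaz"), perl = TRUE)

# Quantifiers say how often the preceding item repeats
# (cf. rx_one_or_more, rx_none_or_more, rx_zero_or_one).
optional_s <- grepl("^cats?$", x)
```

Searching for "anchors", "lookarounds" or "quantifiers" together with "regular expressions" leads straight to explanations of these constructs, which is the search-friendliness argument made above.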
You are absolutely right about the search-friendliness of the terms we use. I guess I was just trying to make an inventory of what we have and how these functions behave differently, more from an architectural point of view than from a user's point of view. I like the terms you picked and agree that documentation should be logically organized around those concepts.
What I am still wondering about is how to guide users regarding "nesting" of functions (i.e. what functions are possible to combine together, in particular through ...) and effects of "naked" expressions vs something that goes into (?: ), ( ) and [ ].
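The difference between a "naked" expression and one wrapped in (?: ), ( ) or [ ] shows up as soon as a quantifier is appended. The following base R sketch uses hand-written patterns, not package output, and the test strings are illustrative.

```r
x <- c("a", "ab", "abc", "")

# Naked: ? binds only to the last character, so "c" alone is optional.
naked <- grepl("^abc?$", x, perl = TRUE)

# Non-capturing group: ? applies to the whole sequence "abc".
grouped <- grepl("^(?:abc)?$", x, perl = TRUE)

# Character set: ? applies to "one character out of a, b, c".
set <- grepl("^[abc]?$", x, perl = TRUE)

# A capturing group matches like (?: ) but also records the submatch.
captured     <- regmatches("abc", regexec("^(abc)?$", "abc", perl = TRUE))[[1]]
not_captured <- regmatches("abc", regexec("^(?:abc)?$", "abc", perl = TRUE))[[1]]
```

Here `captured` holds the full match plus one group, while `not_captured` holds only the full match, which is why the choice of wrapper matters when functions are nested.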
Quantifiers
I have a few minor questions about rx_zero_or_one.
rep="maybe" is confusing?
rx_none_or_more to be rx_zero_or_more
rex() syntax. rep="maybe" makes sense to me. But open to an alternative if you have an idea.
rx_zero_or_more makes sense, is more consistent
rx_none_or_one? Sounds kinda ugly. If everything has a function, would it makes
Maybe we just throw out the quantifiers altogether in favor of the rep argument? I'm not quite sure what the ramifications of that are but I'm all for trimming down the available functions:
rx() %>% rx_find("abc", rep = "maybe")
#> (?:abc)?
rx() %>% rx_find("abc") %>% rx_zero_or_one()
#> (?:abc)?
Regarding the docs, I'd like to mention that pkgdown has a desc argument for adding a subtitle to the reference page headers. Maybe we can use this somehow. The fact that anchors and appenders would be split yet they are technically all "appenders" bothers me a bit. I'm torn on which route to go.
I wonder if we might organize the package into 9 types:
rx_maybe). Expressions are unfriendly and require rx constructor (or dot) when nesting expressions.
# A tibble: 40 x 3
   func             type                args
   <chr>            <chr>               <chr>
 1 rx_end_of_line   anchor              .data
 2 rx_start_of_line anchor              .data
 3 rx_word_edge     anchor              .data, negate
 4 rx_begin_capture capturing group     .data
 5 rx_end_capture   capturing group     .data
 6 rx_alpha         character class     .data, rep, mode, negate
 7 rx_alpha_num     character class     .data, rep, mode, negate
 8 rx_alphanum      character class     .data, rep, mode, negate
 9 rx_br            character class     .data, rep, mode, negate
10 rx_digit         character class     .data, rep, mode, negate
11 rx_line_break    character class     .data, rep, mode, negate
12 rx_lowercase     character class     .data, rep, mode, negate
13 rx_punctuation   character class     .data, rep, mode, negate
14 rx_space         character class     .data, rep, mode, negate
15 rx_tab           character class     .data, rep, mode, negate
16 rx_uppercase     character class     .data, rep, mode, negate
17 rx_whitespace    character class     .data, rep, mode, negate
18 rx_word          character class     .data, mode, negate
19 rx_anything_but  expression          .data, ..., mode
20 rx_either_of     expression          .data, ..., rep, mode
21 rx_find          expression          .data, ..., rep, mode
22 rx_maybe         expression          .data, ..., mode
23 rx_none_of       expression          .data, ..., rep, mode
24 rx_one_of        expression          .data, ..., rep, mode
25 rx_range         expression          .data, ..., rep, mode, negate
26 rx_something_but expression          .data, ..., mode
27 rx_anything      friendly expression .data, mode
28 rx_something     friendly expression .data, mode
29 rx_avoid_prefix  lookaround          .data, ...
30 rx_avoid_suffix  lookaround          .data, ...
31 rx_seek_prefix   lookaround          .data, ...
32 rx_seek_suffix   lookaround          .data, ...
33 rx_with_any_case modifier            .data
34 rx_count         quantifier          .data, n, mode
35 rx_none_or_more  quantifier          .data, mode
36 rx_one_or_more   quantifier          .data, mode
37 %>%              utility             lhs, rhs
38 rx               utility             NA
39 rx_test          utility             x, txt
40 sanitize         utility             x
The one I'm most conflicted about is expressions and friendly expressions. My main motivation for defining that function type is to figure out what can be nested and what can't. I think (though there might be an exception or two) that everything else can be nested and plugged in to one another.
It's interesting to see some consistency between the types and arguments. There is of course some inconsistency that needs to be changed (enable argument, value vs. ..., etc.). In any case, it makes sense for functions belonging to a specific type to have a somewhat consistent argument structure.
tibble gist: https://gist.github.com/tyluRp/701d52c25f277764806695f28b5092a8
I have been thinking how to organize package documentation. We basically have a few "groups" of functions that may make sense to be introduced together (at least in pkgdown):
Single-character functions
These are functions that return one character and do not require any "wrappers".
rx_alpha_num
rx_br and rx_line_break
rx_digit
rx_something
rx_space
rx_tab
rx_whitespace
rx_word_char and rx_word (with default rep="some" argument)
Character "sets"
These functions output ranges or "sets" of characters, wrapped into [ ], for which we don't have a way to express them with a single character. This is important when "nesting" them into supersets below, when the "outer" set of [ ] needs to be "peeled off". From the user's standpoint they may not be any different from single-character functions.
rx_alphanum
rx_alpha
rx_lower and rx_upper
rx_punctuation
rx_range
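As an illustration of the "peeling off" step, here is a minimal sketch in base R. The helpers peel_brackets() and combine_sets() are hypothetical names invented for this example and are not part of the package; the real implementation may handle this differently.

```r
# Hypothetical helper: strip one outer pair of [ ] if present,
# leaving single-character classes such as \d untouched.
peel_brackets <- function(set) {
  sub("^\\[(.*)\\]$", "\\1", set)
}

# Hypothetical helper: merge several sets into one superset.
combine_sets <- function(...) {
  inner <- vapply(c(...), peel_brackets, character(1))
  paste0("[", paste0(inner, collapse = ""), "]")
}

lower <- "[a-z]"   # a character "set"
digit <- "\\d"     # a single-character class, nothing to peel

superset <- combine_sets(lower, digit)
pattern  <- paste0("^", superset, "+$")
matched  <- grepl(pattern, c("abc123", "ABC"), perl = TRUE)
```

Pasting the pieces together without peeling would produce something like "[[a-z]\d]", which does not mean the union of the two sets, so the inner brackets have to go before the outer ones are added.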
"Appenders"
These functions take .data argument and simply append something to it, thus modifying the behavior of previously appended function(s).
rx_capture_groups
rx_count
rx_end_of_line and rx_start_of_line
rx_one_or_more and rx_none_or_more
rx_with_any_case
"Expression-wrappers"
These functions allow the user to specify a sequence of characters, all of which should be matched in the string.
rx_avoid and rx_seek
rx_find (and rx_literal, which I now dropped)
rx_maybe (which is rx_find with rep argument set to "maybe")
rx_or (which might need a bit of extra work, see #16 and thus will be out of this category)
"Superset functions"
These functions specify a list of mutually exclusive symbols/expressions, only one of which should be matched to the string.
rx_one_of
rx_anything_but and rx_something_but
(eventually rx_either_of) will be moved here as well, if we decide to keep it.
I find this grouping helpful when reasoning about the functionality our package covers.
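The contrast between "expression-wrappers" (match the whole sequence) and "superset functions" (match exactly one of the listed symbols) can be sketched with hand-written patterns in base R. The (?:abc) form is what rx_find("abc") produces in the example earlier in the thread; treating [abc] and [^abc] as the output of rx_one_of and rx_anything_but is an assumption made for illustration.

```r
x <- c("abc", "a", "b", "xabcx", "d")

# Expression wrapper: all characters must appear, in order.
sequence_hit <- grepl("(?:abc)", x, perl = TRUE)

# Superset: exactly one character out of the listed alternatives.
one_of_hit <- grepl("^[abc]$", x, perl = TRUE)

# Negated superset: only characters outside the list.
none_of_hit <- grepl("^[^abc]+$", x, perl = TRUE)
```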
There are a few functions I dropped:
rx_any_of (duplicate of rx_one_of)
rx_digits (too little advantage compared to rx_digit(rep=n))
rx_literal (duplicate of rx_find)
rx_not (duplicate of rx_avoid_suffix)
rx_new has been moved to utils.R