Organizing package documentation

I have been thinking about how to organize package documentation. We basically have a few "groups" of functions that might make sense to introduce together (at least in pkgdown):

Single-character functions

These are functions that return a single character and do not require any "wrappers".

rx_alpha_num
rx_br and rx_line_break
rx_digit
rx_something
rx_space
rx_tab
rx_whitespace
rx_word_char and rx_word (with default argument rep="some")

Character "sets"

These functions output ranges or "sets" of characters, wrapped in [ ], which we cannot express with a single character. This matters when "nesting" them into the supersets below, where the "outer" pair of [ ] needs to be "peeled off". From the user's standpoint they may not be any different from the single-character functions.

rx_alphanum
rx_alpha
rx_lower and rx_upper
rx_punctuation
rx_range
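To make the "peeling off" concrete, here is a minimal sketch in base R. The helper names peel_brackets and combine_sets are hypothetical and are not the package's implementation; they only illustrate why a bracketed set has to lose its outer [ ] before it can be merged into a larger set.

```r
# Hypothetical helpers for illustration only; not part of the package.

# Remove one outer pair of square brackets, if present.
peel_brackets <- function(x) {
  sub("^\\[(.*)\\]$", "\\1", x)
}

# Merge several bracketed sets into a single character class.
combine_sets <- function(...) {
  parts <- peel_brackets(c(...))
  paste0("[", paste0(parts, collapse = ""), "]")
}

lower  <- "[a-z]"
digits <- "[0-9]"

superset <- combine_sets(lower, digits)
superset
#> [1] "[a-z0-9]"

# The merged class matches a single lowercase letter or digit.
matches <- grepl(superset, c("q", "7", "-"))
matches
#> [1]  TRUE  TRUE FALSE
```

Simply pasting the two sets together inside another pair of brackets would give "[[a-z][0-9]]", which is no longer one flat character class, so the peeling step cannot be skipped.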

"Appenders"

These functions take a .data argument and simply append something to it, thus modifying the behavior of the previously appended function(s).

rx_capture_groups
rx_count
rx_end_of_line and rx_start_of_line
rx_one_or_more and rx_none_or_more
rx_with_any_case

"Expression-wrappers"

These functions let the user specify a sequence of characters, all of which must be matched in the string.

rx_avoid and rx_seek
rx_find (and rx_literal, which I now dropped)
rx_maybe (which is rx_find with rep argument set to "maybe")
rx_or (which might need a bit of extra work, see #16, and thus will stay out of this category for now)

"Superset functions"

These functions specify a list of mutually exclusive symbols/expressions, only one of which should be matched to the string.

rx_one_of
rx_anything_but and rx_something_but (eventually rx_either_of) will be moved here as well, if we decide to keep them.

I find this grouping helpful when reasoning about the functionality our package covers.

There are a few functions I dropped:

rx_any_of (duplicate of rx_one_of)
rx_digits (too little advantage compared to rx_digit(rep=n))
rx_literal (duplicate of rx_find)
rx_not (duplicate of rx_avoid_suffix)

rx_new has been moved to utils.R.

One issue I have is that some of these terms abstract things a step further and may make it difficult to "google" your way out of confusion. For example, if I don't quite understand what "appenders" do, I'm going to search for "appenders regular expressions", which won't give better results than "anchors regular expressions". The same applies to "expression wrappers" vs. "lookarounds".

I think the API should be abstract, expressive, and verbose, but the documentation should (when possible) stay a bit truer to regular expression terminology. I think this would be helpful for both novice and advanced users, even if advanced users might not use this package in the first place. Having said that, the main changes I'm thinking of would be:

Anchors

Matches positions. Use these to anchor matches at a certain position.

rx_start_of_line
rx_end_of_line
rx_word_edge
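As a quick illustration of what "matching a position" means, here is a small base R sketch using the raw anchor tokens ^, $ and \b. That these are the tokens the three functions above emit is an assumption on my part; the example only shows the regex concept.

```r
x <- c("cat", "concat", "catalog", "a cat sat")

# ^ anchors the match at the start of the line.
starts <- grepl("^cat", x, perl = TRUE)
starts
#> [1]  TRUE FALSE  TRUE FALSE

# $ anchors the match at the end of the line.
ends <- grepl("cat$", x, perl = TRUE)
ends
#> [1]  TRUE  TRUE FALSE FALSE

# \b matches the edge of a word, so "cat" must stand alone.
whole_word <- grepl("\\bcat\\b", x, perl = TRUE)
whole_word
#> [1]  TRUE FALSE FALSE  TRUE
```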

Lookarounds

Assert matches. Use these to check what comes before or after a match without consuming it.

rx_avoid
rx_seek
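A short base R sketch of what "without consuming" means, written with standard PCRE lookaround syntax (hence perl = TRUE). It does not use the package functions, and how rx_avoid and rx_seek map onto these patterns is not shown here.

```r
x <- "cost $42, qty 7, total 100 USD"

# Lookahead: digits followed by " USD"; " USD" is not part of the match.
ahead <- regmatches(x, regexpr("\\d+(?= USD)", x, perl = TRUE))
ahead
#> [1] "100"

# Lookbehind: digits preceded by "$"; "$" is not part of the match.
behind <- regmatches(x, regexpr("(?<=\\$)\\d+", x, perl = TRUE))
behind
#> [1] "42"

# Negative lookbehind: whole numbers NOT preceded by "$".
not_behind <- regmatches(x, gregexpr("(?<!\\$)\\b\\d+\\b", x, perl = TRUE))[[1]]
not_behind
#> [1] "7"   "100"
```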

Quantifiers

Quantify match repetitions. Use these to quantify how often the expression is matched.

rx_one_or_more
rx_none_or_more
rx_zero_or_one (which I've added in the dev branch)
rx_count
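For reference, here is how the four kinds of quantifier behave in plain regex, sketched in base R. The patterns are written by hand; treating them as the output of the functions listed above is an assumption.

```r
x <- c("", "ab", "abab", "ababab")

# ? : zero or one repetition
zero_or_one <- grepl("^(?:ab)?$", x, perl = TRUE)
zero_or_one
#> [1]  TRUE  TRUE FALSE FALSE

# * : zero or more repetitions
zero_or_more <- grepl("^(?:ab)*$", x, perl = TRUE)
zero_or_more
#> [1] TRUE TRUE TRUE TRUE

# + : one or more repetitions
one_or_more <- grepl("^(?:ab)+$", x, perl = TRUE)
one_or_more
#> [1] FALSE  TRUE  TRUE  TRUE

# {n} : exactly n repetitions
exactly_two <- grepl("^(?:ab){2}$", x, perl = TRUE)
exactly_two
#> [1] FALSE FALSE  TRUE FALSE
```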

If this isn't very convincing, maybe we can compromise and go with what you have suggested, but make sure to include the formal terminology with references within the individual function docs, perhaps via @details.

You are absolutely right about the search-friendliness of the terms we use. I guess I was just trying to make an inventory of what we have and how these functions behave differently, more from an architectural point of view than from a user's point of view. I like the terms you picked and agree that the documentation should be logically organized around those concepts.

What I am still wondering about is how to guide users regarding the "nesting" of functions (i.e. which functions can be combined, in particular through ...) and the effects of "naked" expressions vs. something that goes into (?: ), ( ) and [ ].
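To show why the wrapping matters, here is a small base R sketch with hand-written patterns (not generated by the package). The same trailing ? means something different depending on whether the expression is naked, in (?: ), in ( ), or in [ ].

```r
x <- c("", "a", "ab", "abc")

# Naked: ? applies only to the last character, "c".
naked <- grepl("^abc?$", x, perl = TRUE)
naked
#> [1] FALSE FALSE  TRUE  TRUE

# Non-capturing group: ? applies to the whole sequence "abc".
grouped <- grepl("^(?:abc)?$", x, perl = TRUE)
grouped
#> [1]  TRUE FALSE FALSE  TRUE

# Character class: ? applies to ONE character out of a, b or c.
class_opt <- grepl("^[abc]?$", x, perl = TRUE)
class_opt
#> [1]  TRUE  TRUE FALSE FALSE

# Capturing group: same matching as (?: ), but the text can be reused.
captured <- sub("^(abc)?$", "<\\1>", "abc", perl = TRUE)
captured
#> [1] "<abc>"
```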

Quantifiers

I have a few minor questions about rx_zero_or_one.

I see that it is syntactic sugar over rep="maybe". Do you feel rep="maybe" is confusing?
Shouldn't we rename rx_none_or_more to rx_zero_or_more?
This is getting dangerously close to rex() syntax.

rep="maybe" makes sense to me, but I'm open to an alternative if you have an idea.
You're right, rx_zero_or_more makes sense and is more consistent.
Right again... I don't want to reinvent the wheel. Maybe rx_none_or_one? Sounds kinda ugly. If everything has a function, would it makes

Maybe we just throw out the quantifiers altogether in favor of the rep argument? I'm not quite sure what the ramifications of that are, but I'm all for trimming down the available functions:

rx() %>% rx_find("abc", rep = "maybe")
#> (?:abc)?

rx() %>% rx_find("abc") %>% rx_zero_or_one()
#> (?:abc)?

Regarding the docs, I'd like to mention that pkgdown has a desc argument for adding a subtitle to the reference page headers. Maybe we can use this somehow. The fact that anchors and appenders would be split even though they are technically all "appenders" bothers me a bit. I'm torn on which route to take.

I wonder if we might organize the package into 9 types:

anchors: Matches position.
capturing groups: Matches groups.
character classes: Matches predefined sets of characters wrapped between brackets.
expressions: These are like combinations of all the other things, e.g. a non-capturing group paired with a quantifier (rx_maybe). Expressions are unfriendly and require the rx constructor (or dot) when nesting expressions.
friendly expressions: Similar to expressions, but friendly in the sense that you can plug them into expressions.
look-arounds: Look ahead of or behind a match; they match things without consuming them.
modifiers: Modify the expression. Right now there's only one and it modifies the entire expression; this may need to be changed so that it modifies only a specific expression instead of making a global change.
quantifiers: Controls repetition of matches.
utility: Utility functions that users don't really need to worry about.

# A tibble: 40 x 3
   func             type                args                         
   <chr>            <chr>               <chr>                        
 1 rx_end_of_line   anchor              .data                        
 2 rx_start_of_line anchor              .data                        
 3 rx_word_edge     anchor              .data, negate                
 4 rx_begin_capture capturing group     .data                        
 5 rx_end_capture   capturing group     .data                        
 6 rx_alpha         character class     .data, rep, mode, negate     
 7 rx_alpha_num     character class     .data, rep, mode, negate     
 8 rx_alphanum      character class     .data, rep, mode, negate     
 9 rx_br            character class     .data, rep, mode, negate     
10 rx_digit         character class     .data, rep, mode, negate     
11 rx_line_break    character class     .data, rep, mode, negate     
12 rx_lowercase     character class     .data, rep, mode, negate     
13 rx_punctuation   character class     .data, rep, mode, negate     
14 rx_space         character class     .data, rep, mode, negate     
15 rx_tab           character class     .data, rep, mode, negate     
16 rx_uppercase     character class     .data, rep, mode, negate     
17 rx_whitespace    character class     .data, rep, mode, negate     
18 rx_word          character class     .data, mode, negate          
19 rx_anything_but  expression          .data, ..., mode             
20 rx_either_of     expression          .data, ..., rep, mode        
21 rx_find          expression          .data, ..., rep, mode        
22 rx_maybe         expression          .data, ..., mode             
23 rx_none_of       expression          .data, ..., rep, mode        
24 rx_one_of        expression          .data, ..., rep, mode        
25 rx_range         expression          .data, ..., rep, mode, negate
26 rx_something_but expression          .data, ..., mode             
27 rx_anything      friendly expression .data, mode                  
28 rx_something     friendly expression .data, mode                  
29 rx_avoid_prefix  lookaround          .data, ...                   
30 rx_avoid_suffix  lookaround          .data, ...                   
31 rx_seek_prefix   lookaround          .data, ...                   
32 rx_seek_suffix   lookaround          .data, ...                   
33 rx_with_any_case modifier            .data                        
34 rx_count         quantifier          .data, n, mode               
35 rx_none_or_more  quantifier          .data, mode                  
36 rx_one_or_more   quantifier          .data, mode                  
37 %>%              utility             lhs, rhs                     
38 rx               utility             NA                           
39 rx_test          utility             x, txt                       
40 sanitize         utility             x

The ones I'm most conflicted about are expressions and friendly expressions. My main motivation for defining those function types is to figure out what can be nested and what can't. I think (though there might be an exception or two) that everything else can be nested and plugged into one another.

It's interesting to see some consistency between the types and arguments. There is of course some inconsistency that needs to be changed (enable argument, value vs. ..., etc.). In any case, it makes sense for functions belonging to a specific type to have a somewhat consistent argument structure.

tibble gist: https://gist.github.com/tyluRp/701d52c25f277764806695f28b5092a8

VerbalExpressions / RVerbalExpressions