Closed rsh52 closed 1 month ago
API Thoughts
I could see a user wanting to apply this function to many checkboxes across many instruments. I think we need to give them a way of doing this that's better than calling this function over and over.
Can we generalize to allow multiple checkbox fields in a single instrument to be done in one call?
Agreed and figured that would be the next part of this PR. Should be just a matter of opening up some internal mutates with some grouping/.by
specifiers.
Do we want a function (
map_supertbl()
?) that can apply a transformation iteratively to rows of a supertibble? Possibly with the ability to select specific data tibbles and vary arguments across them.
I think that's a great idea, and can be potentially useful for other functions we come up with (or maybe that users come up with).
I could see a user wanting to apply this function to many checkboxes across many instruments. I think we need to give them a way of doing this that's better than calling this function over and over.
Can we generalize to allow multiple checkbox fields in a single instrument to be done in one call?
@ezraporter Coming back to this I have a couple of thoughts and questions.
I could see giving a custom "tidyselect" option at default for cols
called all_checkboxes()
(or just default this to everything()
and have documentation specify it will be applied to all checkboxes) and, if enabled, have that look for all unique checkbox fields in a given form and combine them. I still think the core function needs to be done on a single form/row of the supertibble, but the mapping function can be what we use to iterate over the supertibble.
For times when users don't want to combine all checkboxes but want to combine more than one, they're most likely to use starts_with()
to grab all checkbox___*
fields (unless we were to look solely at the original field name before the ___
). What are your thoughts for the API for how users would implement multiple tidyselect calls for a single parameter?
For times when users don't want to combine all checkboxes but want to combine more than one, they're most likely to use starts_with() to grab all checkbox* fields (unless we were to look solely at the original field name before the ). What are your thoughts for the API for how users would implement multiple tidyselect calls for a single parameter?
Actually I think we can easily update the internals so that users could just give cols
something like starts_with("race") | starts_with("ethnicity")
. I didn't piece together that eval_select()
was already smart enough to handle it. Still curious about doing a default "everything".
I could see giving a custom "tidyselect" option at default for
cols
calledall_checkboxes()
(or just default this toeverything()
and have documentation specify it will be applied to all checkboxes) and, if enabled, have that look for all unique checkbox fields in a given form and combine them.
Love this idea and I think all_checkboxes()
is the way to go. Better to keep it distinct from everything()
IMO.
I still think the core function needs to be done on a single form/row of the supertibble, but the mapping function can be what we use to iterate over the supertibble.
Agree
Actually I think we can easily update the internals so that users could just give
cols
something likestarts_with("race") | starts_with("ethnicity")
.
My concern here would be that tidyselect
deals with sets of variables and I haven't seen examples where groupings within that set are meaningful. For example, in usages I've seen starts_with("x") | starts_with("y")
is equivalent to matches("^x|y")
. I think we'd be breaking that in our case. (Although maybe that's okay!)
My idea would be to keep the tidyselect
in cols
as just a selector of a set of columns and add other parameters to specify how the column names should be parsed into a name-value pair:
# Some dummy data
data <- tibble(
id = 1,
x___1 = TRUE,
x___2 = NA,
y___1 = NA,
y___2 = TRUE
)
metadata <- tibble(
field_name = c("x___1", "x___2", "y___1", "y___2"),
field_type = "checkbox",
select_choices_or_calculations = "1, A | 2, B"
)
suprtbl <- tibble(
redcap_form_name = "tbl",
redcap_data = list(data),
redcap_metadata = list(metadata)
) |>
as_supertbl()
# Proposed API
combine_checkboxes(
supertbl = suprtbl,
tbl = "tbl",
cols = c(starts_with("x"), starts_with("y")),
sep = "___", # make this the default?
values_to = c("x", "y")
) |>
extract_tibble("tbl")
## A tibble: 1 × 7
# id x___1 x___2 y___1 y___2 x y
# <dbl> <lgl> <lgl> <lgl> <lgl> <fct> <fct>
#1 1 TRUE NA NA TRUE A B
# possibly give a way to infer names (inspired by across(.names = ))
combine_checkboxes(
supertbl = suprtbl,
tbl = "tbl",
cols = c(starts_with("x"), starts_with("y")),
sep = "___",
values_to = "{.col}"
)
# possibly allow for more complex specifications (inspired by separate_wider_regex)
combine_checkboxes(
supertbl = suprtbl,
tbl = "tbl",
cols = c(starts_with("x"), starts_with("y")),
patterns = c(name = ".+", "___", value = ".+"),
values_to = "{.col}"
)
One more thought. In messing around with this I'm not sure how I feel about the values_to
name. The analogy to pivot_longer()
seems more tenuous than I initially thought and I'm wondering if something like names
would be better.
Actually, I think we kind of need to support both logicals and c()
for cols
. Below these amount to the same:
> suprtbl$redcap_data[[1]] %>% select(starts_with("x") | starts_with("y"))
# A tibble: 1 × 4
x___1 x___2 y___1 y___2
<lgl> <lgl> <lgl> <lgl>
1 TRUE NA NA TRUE
> suprtbl$redcap_data[[1]] %>% select(c(starts_with("x"), starts_with("y")))
# A tibble: 1 × 4
x___1 x___2 y___1 y___2
<lgl> <lgl> <lgl> <lgl>
1 TRUE NA NA TRUE
Fortunately only the code below is (currently) really concerned with what gets passed to cols
, and it already works with c()
, it just needs to be updated so that the eval_tidy()
works with the logicals:
But if you feel strongly about banning use of logicals, we'd need to implement some enforced block to users doing so.
I like the API suggestions, I need to give some more thought as to how they would get implemented in practice.
values_to = c("x", "y")
: I assume we would need to set up some check to make sure values_to
and cols
are of equal size?values_to = "{.col}"
: This one I like a lot, essentially converting back to the originally defined names before the REDCap changes?One more thought. In messing around with this I'm not sure how I feel about the values_to name. The analogy to
pivot_longer()
seems more tenuous than I initially thought and I'm wondering if something like names would be better.
Hm, not sold either way. If we're trying to align with pivot_longer()
-inspiration then names
also doesn't really replicate what names_*
does in that function right?
Actually, I think we kind of need to support both logicals and
c()
forcols
.
Sorry, my point wasn't that we shouldn't support this. Just that both methods of selecting columns should be equivalent for the user (as I think they are now).
I like the API suggestions, I need to give some more thought as to how they would get implemented in practice.
values_to = c("x", "y")
: I assume we would need to set up some check to make surevalues_to
andcols
are of equal size?values_to = "{.col}"
: This one I like a lot, essentially converting back to the originally defined names before the REDCap changes?
Some more thoughts that might help with implementation. I would break out what we need to do into 3 steps. The important point is that the result of each step basically gives you everything you need to carry out the rest of the data transformation.
Step 1: Select a bunch of checkbox fields from the data.
x___1
, x___2
, y___1
, y___2
Step 2: Convert the result into a representation that maps checkbox field names to checkbox field values.
col | value | name |
---|---|---|
x | 1 | x___1 |
x | 2 | x___2 |
y | 1 | y___1 |
y | 2 | y___2 |
Step 3: Apply our existing logic to each group created by col
API-wise, you can sum up my position as: separate parameters should be responsible for doing step 1 and step 2. That is, cols
just does step 1 and tells you nothing about step 2.
I think the representation in step 2 is kind of nice because all the validation can happen there:
name
is a checkbox fieldvalues_to
has the same number of elements as groups of col
In this light, the parameters I have in my API suggestion above are just helpers to let the user define how we go from step 1 to step 2.
Hm, not sold either way. If we're trying to align with
pivot_longer()
-inspiration thennames
also doesn't really replicate whatnames_*
does in that function right?
Yeah, this comment was just to say I'm not sold either now so don't stick with values_to
because you think I love it 😄. The pivot_longer()
connection seems less strong now that I've thought about the API more.
@ezraporter Check out the newer API when you have a moment. Probably still a bit more to go, but I think this addresses the 3 steps you outlined above. Changes made:
values_to
in favor of a combination of names_prefix
, names_suffix
, and names_sep
(which also draw from pivot_*()
). I think this is a better approach and does a better job of ensuring there's no mismatch in the number of elements between the supplied checkbox fields and what they get converted to. Things still to do pending your thoughts:
all_checkboxes()
as an option to the cols
paramnames_glue
parameter similar to pivot_*()
pivot_*()
doesn't have to do this and it's much simpler to work with. Not saying no to it, but want to make sure the value is there for including it.Also confirming that the supertibble output works as expected when using make_labelled()
.
When applying make_labelled()
on a modified supertibble using combine_checkboxes()
all labels are created as usual, there just aren't any new ones for the newly created columns.
Description
This PR seeks to add out first "analytics" function as a tool for data analysts working with the supertibble after it's been created.
This first function,
combine_checkboxes()
, seeks to take the wide-form output of checkbox fields in a data tibble and do the following:TRUE/FALSE
/1/0
to showing:raw
orlabel
values associated with the checkbox if a single value is selectedmulti_value_label
)values_fill
)I implemented some additional things like
check_*
style functions, but interested in thoughts, feedback, etc. I figure this will go through a few iterations before being finalized.Proposed Changes
List changes below in bullet format:
combine_checkboxes()
check_fields_exist()
) and provide helpful error messagecheckbox
field types (check_fields_are_checkboxes()
) and provide helpful error messagecombine_checkboxes()
to pkgdown siteAdditional Changes:
revdepcheck
is no longer maintained, so I'm trying out a new workflow fromr-devel
called recheck. Another option is to usetools::check_packages_in_dir()
.revdepcheck
folder to stoprenv
issuesrenv
lockfile updated as wellIssue Addressed
Closes #194
PR Checklist
Before submitting this PR, please check and verify below that the submission meets the below criteria:
.RDS
) updated underinst/testdata/create_test_data.R
usethis::use_version()
Code Review
This section to be used by the reviewer and developers during Code Review after PR submission
Code Review Checklist