CHOP-CGTInformatics / REDCapTidieR

Makes it easy to read REDCap Projects into R
https://chop-cgtinformatics.github.io/REDCapTidieR/
Other
32 stars 8 forks source link

combine_checkboxes #196

Closed rsh52 closed 1 month ago

rsh52 commented 2 months ago

Description

This PR seeks to add out first "analytics" function as a tool for data analysts working with the supertibble after it's been created.

This first function, combine_checkboxes(), seeks to take the wide-form output of checkbox fields in a data tibble and do the following:

I implemented some additional things like check_* style functions, but interested in thoughts, feedback, etc. I figure this will go through a few iterations before being finalized.

Proposed Changes

List changes below in bullet format:

Additional Changes:

Issue Addressed

Closes #194

PR Checklist

Before submitting this PR, please check and verify below that the submission meets the below criteria:

Code Review

This section to be used by the reviewer and developers during Code Review after PR submission

Code Review Checklist


rsh52 commented 1 month ago

API Thoughts

I could see a user wanting to apply this function to many checkboxes across many instruments. I think we need to give them a way of doing this that's better than calling this function over and over.

Can we generalize to allow multiple checkbox fields in a single instrument to be done in one call?

Agreed and figured that would be the next part of this PR. Should be just a matter of opening up some internal mutates with some grouping/.by specifiers.

Do we want a function (map_supertbl()?) that can apply a transformation iteratively to rows of a supertibble? Possibly with the ability to select specific data tibbles and vary arguments across them.

I think that's a great idea, and can be potentially useful for other functions we come up with (or maybe that users come up with).

rsh52 commented 1 month ago

I could see a user wanting to apply this function to many checkboxes across many instruments. I think we need to give them a way of doing this that's better than calling this function over and over.

Can we generalize to allow multiple checkbox fields in a single instrument to be done in one call?

@ezraporter Coming back to this I have a couple of thoughts and questions.

I could see giving a custom "tidyselect" option at default for cols called all_checkboxes() (or just default this to everything() and have documentation specify it will be applied to all checkboxes) and, if enabled, have that look for all unique checkbox fields in a given form and combine them. I still think the core function needs to be done on a single form/row of the supertibble, but the mapping function can be what we use to iterate over the supertibble.

For times when users don't want to combine all checkboxes but want to combine more than one, they're most likely to use starts_with() to grab all checkbox___* fields (unless we were to look solely at the original field name before the ___). What are your thoughts for the API for how users would implement multiple tidyselect calls for a single parameter?

rsh52 commented 1 month ago

For times when users don't want to combine all checkboxes but want to combine more than one, they're most likely to use starts_with() to grab all checkbox* fields (unless we were to look solely at the original field name before the ). What are your thoughts for the API for how users would implement multiple tidyselect calls for a single parameter?

Actually I think we can easily update the internals so that users could just give cols something like starts_with("race") | starts_with("ethnicity"). I didn't piece together that eval_select() was already smart enough to handle it. Still curious about doing a default "everything".

ezraporter commented 1 month ago

I could see giving a custom "tidyselect" option at default for cols called all_checkboxes() (or just default this to everything() and have documentation specify it will be applied to all checkboxes) and, if enabled, have that look for all unique checkbox fields in a given form and combine them.

Love this idea and I think all_checkboxes() is the way to go. Better to keep it distinct from everything() IMO.

I still think the core function needs to be done on a single form/row of the supertibble, but the mapping function can be what we use to iterate over the supertibble.

Agree

Actually I think we can easily update the internals so that users could just give cols something like starts_with("race") | starts_with("ethnicity").

My concern here would be that tidyselect deals with sets of variables and I haven't seen examples where groupings within that set are meaningful. For example, in usages I've seen starts_with("x") | starts_with("y") is equivalent to matches("^x|y"). I think we'd be breaking that in our case. (Although maybe that's okay!)

My idea would be to keep the tidyselect in cols as just a selector of a set of columns and add other parameters to specify how the column names should be parsed into a name-value pair:

# Some dummy data

data <- tibble(
  id = 1,
  x___1 = TRUE,
  x___2 = NA,
  y___1 = NA,
  y___2 = TRUE
)

metadata <- tibble(
  field_name = c("x___1", "x___2", "y___1", "y___2"),
  field_type = "checkbox",
  select_choices_or_calculations = "1, A | 2, B"
)

suprtbl <- tibble(
  redcap_form_name = "tbl",
  redcap_data = list(data),
  redcap_metadata = list(metadata)
) |>
  as_supertbl()

# Proposed API

combine_checkboxes(
  supertbl = suprtbl,
  tbl = "tbl",
  cols = c(starts_with("x"), starts_with("y")),
  sep = "___", # make this the default?
  values_to = c("x", "y")
) |>
  extract_tibble("tbl")

## A tibble: 1 × 7
#     id x___1 x___2 y___1 y___2 x     y    
#  <dbl> <lgl> <lgl> <lgl> <lgl> <fct> <fct>
#1     1 TRUE  NA    NA    TRUE  A     B    

# possibly give a way to infer names (inspired by across(.names = ))
combine_checkboxes(
  supertbl = suprtbl,
  tbl = "tbl",
  cols = c(starts_with("x"), starts_with("y")),
  sep = "___",
  values_to = "{.col}"
)

# possibly allow for more complex specifications (inspired by separate_wider_regex)
combine_checkboxes(
  supertbl = suprtbl,
  tbl = "tbl",
  cols = c(starts_with("x"), starts_with("y")),
  patterns = c(name = ".+", "___", value = ".+"),
  values_to = "{.col}"
)
ezraporter commented 1 month ago

One more thought. In messing around with this I'm not sure how I feel about the values_to name. The analogy to pivot_longer() seems more tenuous than I initially thought and I'm wondering if something like names would be better.

rsh52 commented 1 month ago

Actually, I think we kind of need to support both logicals and c() for cols. Below these amount to the same:

> suprtbl$redcap_data[[1]] %>% select(starts_with("x") | starts_with("y"))
# A tibble: 1 × 4
  x___1 x___2 y___1 y___2
  <lgl> <lgl> <lgl> <lgl>
1 TRUE  NA    NA    TRUE 
> suprtbl$redcap_data[[1]] %>% select(c(starts_with("x"), starts_with("y")))
# A tibble: 1 × 4
  x___1 x___2 y___1 y___2
  <lgl> <lgl> <lgl> <lgl>
1 TRUE  NA    NA    TRUE 

Fortunately only the code below is (currently) really concerned with what gets passed to cols, and it already works with c(), it just needs to be updated so that the eval_tidy() works with the logicals:

https://github.com/CHOP-CGTInformatics/REDCapTidieR/blob/2dfac9a2261113eb57bd40b41d97ea277c4ece0a/R/combine_checkboxes.R#L64-L76

But if you feel strongly about banning use of logicals, we'd need to implement some enforced block to users doing so.


I like the API suggestions, I need to give some more thought as to how they would get implemented in practice.


One more thought. In messing around with this I'm not sure how I feel about the values_to name. The analogy to pivot_longer() seems more tenuous than I initially thought and I'm wondering if something like names would be better.

Hm, not sold either way. If we're trying to align with pivot_longer()-inspiration then names also doesn't really replicate what names_* does in that function right?

ezraporter commented 1 month ago

Actually, I think we kind of need to support both logicals and c() for cols.

Sorry, my point wasn't that we shouldn't support this. Just that both methods of selecting columns should be equivalent for the user (as I think they are now).

I like the API suggestions, I need to give some more thought as to how they would get implemented in practice.

  • values_to = c("x", "y"): I assume we would need to set up some check to make sure values_to and cols are of equal size?
  • values_to = "{.col}": This one I like a lot, essentially converting back to the originally defined names before the REDCap changes?

Some more thoughts that might help with implementation. I would break out what we need to do into 3 steps. The important point is that the result of each step basically gives you everything you need to carry out the rest of the data transformation.

Step 1: Select a bunch of checkbox fields from the data.

x___1, x___2, y___1, y___2

Step 2: Convert the result into a representation that maps checkbox field names to checkbox field values.

col value name
x 1 x___1
x 2 x___2
y 1 y___1
y 2 y___2

Step 3: Apply our existing logic to each group created by col

API-wise, you can sum up my position as: separate parameters should be responsible for doing step 1 and step 2. That is, cols just does step 1 and tells you nothing about step 2.

I think the representation in step 2 is kind of nice because all the validation can happen there:

In this light, the parameters I have in my API suggestion above are just helpers to let the user define how we go from step 1 to step 2.

Hm, not sold either way. If we're trying to align with pivot_longer()-inspiration then names also doesn't really replicate what names_* does in that function right?

Yeah, this comment was just to say I'm not sold either now so don't stick with values_to because you think I love it 😄. The pivot_longer() connection seems less strong now that I've thought about the API more.

rsh52 commented 1 month ago

@ezraporter Check out the newer API when you have a moment. Probably still a bit more to go, but I think this addresses the 3 steps you outlined above. Changes made:

Things still to do pending your thoughts:

rsh52 commented 1 month ago

Also confirming that the supertibble output works as expected when using make_labelled().

When applying make_labelled() on a modified supertibble using combine_checkboxes() all labels are created as usual, there just aren't any new ones for the newly created columns.