Closed rsh52 closed 1 month ago
The stuff in https://github.com/CHOP-CGTInformatics/REDCapTidieR/commit/96c3309959caa5d574ae192604ce86ceef04bea1 looks like a good start. I didn't look closely at the code and mainly focused on the API.
My biggest reaction is that the current API takes a supertibble as input and returns a tibble and I'm not sure that's right. I think we want data manipulation functions like this to be pipe-friendly which requires that inputs and outputs are the same type of thing.
I think the best way of doing this is to have this function modify a tibble within the supertibble and return the whole thing. To get the current behavior you'd need to do something like:
supertbl |>
reduce_multi_to_single_column() |>
extract_tibble()
but this buys us composability:
supertbl |>
reduce_multi_to_single_column() |>
reduce_multi_to_single_column() |>
some_other_transformation() |>
...
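A minimal sketch of the supertibble-in, supertibble-out pattern described above, in base R so it runs without packages. Everything here is an assumption for illustration: the toy supertibble layout (`redcap_form_name` plus a `redcap_data` list column), the stand-in transformation `count_checked()`, and the accessor `get_form()`, which only mimics the role `extract_tibble()` plays in the thread. None of this is the package's implementation.

```r
# Toy stand-in for a supertibble: one row per form, with a list column holding
# that form's data. The column names are assumptions for illustration.
supertbl <- data.frame(redcap_form_name = "demographics")
supertbl$redcap_data <- list(data.frame(
  study_id  = 1:3,
  multi___1 = c(TRUE, TRUE, FALSE),
  multi___2 = c(FALSE, TRUE, FALSE)
))

# Hypothetical transformation: adds a column to one form's data and returns
# the whole supertibble, so calls can be chained with the pipe.
count_checked <- function(supertbl, form, prefix, values_to) {
  i <- match(form, supertbl$redcap_form_name)
  data <- supertbl$redcap_data[[i]]
  cols <- startsWith(names(data), prefix)
  data[[values_to]] <- unname(rowSums(data[cols]))
  supertbl$redcap_data[[i]] <- data
  supertbl
}

# Hypothetical accessor that pulls one form's data out at the end of the chain.
get_form <- function(supertbl, form) {
  supertbl$redcap_data[[match(form, supertbl$redcap_form_name)]]
}

out <- supertbl |>
  count_checked("demographics", "multi___", "n_checked") |>
  count_checked("demographics", "multi___1", "n_first") |>
  get_form("demographics")
```

Because each step returns the same kind of object it received, the two transformations compose, and the data is only unwrapped once at the end.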
Naming thoughts
Maybe we should call this unite_checkbox() in reference to tidyr::unite()?
For parameter names what about:
cols_to -> values_to
no_val -> values_fill
multi_val -> multi_value_label
The first 2 are inspired by pivot_* naming conventions in tidyr
Hi,
There isn't a general rule that a pipeline's inputs and outputs must always be the same type of thing. That's more of a dplyr-specific convention that isn't shared by other tidy packages such as tidymodels.
That said, I agree with your point and actually think this function should take a tibble and return a modified tibble. So the workflow would be
supertbl |> extract_tibble(x) |> reduce_multi_to_single_column(...)
I think "unite" is much less specific than "reduce_multi_to_single_column".
My thoughts! S
From: Ezra Porter  Sent: Friday, July 12, 2024 2:43 PM  Subject: Re: [CHOP-CGTInformatics/REDCapTidieR] [FEATURE] Multiple Choice to Single Column Function (Issue #194)
My biggest reaction is that the current API takes a supertibble as input and returns a tibble and I'm not sure that's right. I think we want data manipulation functions like this to be pipe-friendly which requires that inputs and outputs are the same type of thing.
For the API change, I think this is easy to rework if we agree on what it should take in and return. I definitely see what you're saying. My concern with returning a supertibble is that you don't actually see the changes in the output of the function; they're sort of "masked" inside the data tibbles. But maybe that's not a big issue here?
Either way, the function needs access to the metadata raw/label values associated with the checkboxes to be united, so I don't see a way around having users supply the supertibble. Otherwise they'd have to supply the data tibble and metadata tibble separately.
Naming thoughts
I like the naming much better, these all make sense to me.
That said, I agree with your point and actually think this function should take a tibble and return a modified tibble. So the workflow would be supertbl |> extract_tibble(x) |> reduce_multi_to_single_column(...)
@skadauke This also makes some sense, but then users would still need to supply the metadata separately. That's what led me to wrapping the extract_() internally.
What metadata is needed for the transformation?
From: Rich Hanna  Sent: Friday, July 12, 2024 2:56 PM  Subject: Re: [CHOP-CGTInformatics/REDCapTidieR] [FEATURE] Multiple Choice to Single Column Function (Issue #194)
What metadata is needed for the transformation?
The general rule is to consolidate checkboxes under one column, showing the raw/label value associated with the checkbox if only one value is selected OR a custom value (e.g. "multiple" / "many") if multiple are selected. There's no way to get these raw/label values from the data tibble, since its checkbox columns hold only 1s and 0s or TRUEs and FALSEs. Ex:
> nonrepeat_data
# A tibble: 3 × 4
  study_id multi___1 multi___2 multi___3
     <dbl> <lgl>     <lgl>     <lgl>
1        1 TRUE      FALSE     FALSE
2        2 TRUE      TRUE      FALSE
3        3 FALSE     FALSE     FALSE
> nonrepeat_metadata
# A tibble: 4 × 2
  field_name select_choices_or_calculations
  <chr>      <chr>
1 study_id   NA
2 multi___1  1, Red | 2, Yellow | 3, Blue
3 multi___2  1, Red | 2, Yellow | 3, Blue
4 multi___3  1, Red | 2, Yellow | 3, Blue
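To make the rule concrete, here is a rough base R sketch that applies it to the two tables above. It is not the package's implementation: `parse_choices()` and `reduce_checkbox_sketch()` are hypothetical helpers, plain data frames stand in for tibbles, and the argument names `values_to`, `values_fill`, and `multi_value_label` simply follow the naming proposed earlier in the thread. It also assumes the raw code is the suffix after the `___` prefix and that the choices string is formatted exactly as in the example.

```r
# Data and metadata copied from the example above (base data frames stand in
# for tibbles so the sketch has no package dependencies).
nonrepeat_data <- data.frame(
  study_id  = c(1, 2, 3),
  multi___1 = c(TRUE, TRUE, FALSE),
  multi___2 = c(FALSE, TRUE, FALSE),
  multi___3 = c(FALSE, FALSE, FALSE)
)
nonrepeat_metadata <- data.frame(
  field_name = c("study_id", "multi___1", "multi___2", "multi___3"),
  select_choices_or_calculations = c(NA, rep("1, Red | 2, Yellow | 3, Blue", 3))
)

# Parse "1, Red | 2, Yellow | 3, Blue" into a named vector c(`1` = "Red", ...).
parse_choices <- function(choices) {
  parts <- trimws(strsplit(choices, "|", fixed = TRUE)[[1]])
  raw   <- trimws(sub(",.*$", "", parts))
  label <- trimws(sub("^[^,]*,", "", parts))
  stats::setNames(label, raw)
}

# Hypothetical helper: collapse the checkbox columns that share `prefix`
# into a single column named by `values_to`.
reduce_checkbox_sketch <- function(data, metadata, prefix,
                                   values_to = "multi",
                                   values_fill = NA_character_,
                                   multi_value_label = "multiple") {
  cols <- names(data)[startsWith(names(data), prefix)]
  choices <- parse_choices(
    metadata$select_choices_or_calculations[match(cols[1], metadata$field_name)]
  )
  # The raw code is the suffix after the prefix, e.g. "multi___2" -> "2".
  labels <- unname(choices[substring(cols, nchar(prefix) + 1)])
  # `== TRUE` also treats 1/0 coded checkboxes as checked/unchecked.
  checked <- as.matrix(data[cols]) == TRUE
  n <- rowSums(checked)
  # Label of the first checked box in each row (NA when nothing is checked).
  first <- apply(checked, 1, function(r) labels[which(r)[1]])
  data[[values_to]] <- unname(
    ifelse(n == 0, values_fill, ifelse(n == 1, first, multi_value_label))
  )
  data
}

result <- reduce_checkbox_sketch(nonrepeat_data, nonrepeat_metadata,
                                 prefix = "multi___", values_to = "color")
print(result[c("study_id", "color")])
```

With the example data, study 1 gets "Red", study 2 gets "multiple", and study 3 gets NA. The helper needs both the data and the metadata, which is the constraint under discussion: whichever API shape is chosen, the labels have to come from somewhere other than the data tibble.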
Feature Request Description
A common use case is consolidating multiple choice fields (i.e. checkbox fields) into a single column, such as when powering Table 1s for manuscript reporting.
Proposed Solution
This function, placeholder name reduce_multi_to_single_column(), will be the first REDCapTidieR analytic tool that users can implement on columns in their extracted tibbles. It should:
starts_with("race")
)
Additional Context
This was prompted by the request in #192 and should be a more generalizable solution.