Open ezraporter opened 2 months ago
Never got around to giving you a good response, but looking over this now with some thoughts.
So far as the API goes, originally I was going to say return a supertibble since that's what the new combined_checkboxes()
does (based on the idea of it gets a supertibble, so it returns a supertibble). But on second thought, I don't know that I want this function doing the heavy lifting of appending/editing the supertibble. What would the output look like if we returned a supertibble, would you have a new row added that joined the two? Would it then also need to update the metadata tibble and remaining analytics columns? Or would it combine the two original columns (and still need to update the metadata and remaining columns)?
IMO the easiest thing to start with here is returning just a data tibble. I think that's the easiest for a user to visually confirm and do what they wish with, and if we do name the function join_data_tibbles()
(I think this works for the name) then it's not unreasonable to expect a data tibble output.
The one complication I see is the NA redcap_event_instance in the "NR, RT" case. ... Presumably if you join "NR, RT" to "NR, RT" you want the NAs to join in the non repeating event which is a little non-standard.
This is pseudo-non standard if I understand right. extract_keys()
or some intermediate step would give us the relationship/type identifier. Could we just use a helper column from that to join on instead of joining on NA
? I think that's going to be the important piece for all of this including the RS/RT to RS/RT joins. That type designation would help funnel which of the 28 operations to enforce.
My hope would be if we approach this as tackling each relationship one at a time then we can just build this up piece by piece.
@ezraporter Started working on this a little over here.
When you have a moment care to take a look and see if it's structured in a way that makes sense so far, borrowing from your initial suggestions? My hope and first milestone I want to reach is that this works for all non-mixed join relationships (I believe it does) and that the next piece will be to insert logic where is_mixed
is true.
This looks like a good start! Just a few random thoughts:
I'm fine with approaching it non-mixed then mixed but I'm not convinced the final code should follow that structure. Once we consider the mixed case there might be a larger pattern that doesn't split along the mixed/non-mixed line and we should be looking for that. So I say keep going this way but don't get stuck in the distinction.
If we go the route of adding attributes to the tibbles like pks
and structure
we should a) think a bit more about what attributes we want to include and consider making an actual class and b) do those computations in read_redcap()
. I'm not fully convinced we need more than functional programming here though.
I'm not sure the rlang::exec
is necessary. If it's helpful here's what popped into my head:
join_data_tibbles <- function(...) {
join_fn <- get_join_fn(type) # returns a function, ex. `left_join`
tbl_x <- extract_tibble(supertbl, x)
tbl_x_type <- get_structure(tbl_x) # returns a string with the structure, ex. "mixed"
tbl_y <- extract_tibble(supertbl, y)
tbl_y_type <- get_structure(tbl_y)
by <- build_by(tbl_x, tbl_x_type, tbl_y, tbl_y_type) # returns a vector of keys, ex. c(`record_id_field`, "redcap_event")
# do the join!
join_fn(tbl_x, tbl_y, by = by)
}
This looks like a good start! Just a few random thoughts:
- I'm fine with approaching it non-mixed then mixed but I'm not convinced the final code should follow that structure. Once we consider the mixed case there might be a larger pattern that doesn't split along the mixed/non-mixed line and we should be looking for that. So I say keep going this way but don't get stuck in the distinction.
For sure, and I'm sure we can get there but for now, at least the way my brain works, I wanted to approach things stepwise.
- If we go the route of adding attributes to the tibbles like
pks
andstructure
we should a) think a bit more about what attributes we want to include and consider making an actual class and b) do those computations inread_redcap()
. I'm not fully convinced we need more than functional programming here though.
Yea I need to think about this more since between this and record_status_dashboard()
I want to make a few more things accessible from the supertibble.
- I'm not sure the
rlang::exec
is necessary. If it's helpful here's what popped into my head:join_data_tibbles <- function(...) { join_fn <- get_join_fn(type) # returns a function, ex. `left_join` tbl_x <- extract_tibble(supertbl, x) tbl_x_type <- get_structure(tbl_x) # returns a string with the structure, ex. "mixed" tbl_y <- extract_tibble(supertbl, y) tbl_y_type <- get_structure(tbl_y) by <- build_by(tbl_x, tbl_x_type, tbl_y, tbl_y_type) # returns a vector of keys, ex. c(`record_id_field`, "redcap_event") # do the join! join_fn(tbl_x, tbl_y, by = by) }
I like this set up, will restructure to accommodate 👍
Presumably if you join "NR, RT" to "NR, RT" you want the NAs to join in the non repeating event which is a little non-standard. Things get a little more complicated when considering the mixed structure instruments. For example, suppose you're joining "RS, RT" to "RS, RT". Presumably you'd want to join on all the keys for records in the RT event but on only record_id and redcap_event for the RS records. If we wanted to cover all our bases we could try to enumerate the keys we'd expect to join on for all 28 possible combinations.
Coming back to this, I'm wondering if this can actually be made simpler? During the joins if we artificially create a redcap_event_instance
and redcap_form_instance
for the tables that don't have them and fill them with NA
then I believe this should cover things. It would make for some weird outputs, like empty data for left and right joins but would have all of the expected data for full joins which I think users would want to specify anyway.
Going to test this a bit more and hope to update the branch.
@ezraporter When you have a moment, can you trial out the code on the join-tibbles-branch? If you want I can write up a test suite & PR if that makes it easier to review, but thought it might still be too soon to do that.
The main changes I made for mixed structure handling were to check the x and y tables for the existence of redcap_event_instance
/ redcap_form_instance
and add them in if missing. This way joining with NA
is meaningful when there is a non-repeat table.
As I mentioned above, when performing the mixed structure joins I think it's more common that users will need to select full_join()
for their operation if they want to get all the observations from both tables. Otherwise left and right are just going to tack on empty columns from the opposite table.
Try out below and playing with the type
arg:
sprtbl <- read_redcap(Sys.getenv("REDCAP_URI"), token = Sys.getenv("REDCAPTIDIER_MIXED_STRUCTURE_API"),
allow_mixed_structure = TRUE)
join_data_tibbles(suprtbl = sprtbl, x = "nonrepeat_form", y = "mixed_structure_form")
join_data_tibbles(suprtbl = sprtbl, x = "nonrepeat_form", y = "repeat_form")
join_data_tibbles(suprtbl = sprtbl, x = "repeat_form", y = "mixed_structure_form")
At the moment if you join the RT/RS tables (the third join) there will be an empty redcap_event_instance
column leftover. This probably needs to be dropped since it's confusingly not meaningful for repeating events with our set up, but wanted to get your thoughts.
Feature Request Description
Function to join together data tibbles that picks smart default keys to join on based on the events each tibble is mapped to.
Proposed Solution
Some possible APIs:
Additional Context
Data tibble types
API issues aside, I think the key here is that we'd need a way to determine for any two data tibbles which keys to use in a join by default.
Each instrument-event pair can have one of 3 types: Non Repeating, Repeating Separately, or Repeating Together. When I talk about the "type" of an instrument, I mean the aggregation of all the types of instrument-event pairs the instrument is part of. There are 7 options which I mocked up in an example REDCap.
"Type" here is slightly more specific than "structure". I think for joining any two non-mixed instruments the default should be to join on all common keys. The one complication I see is the
NA
redcap_event_instance
in the "NR, RT" case.Presumably if you join "NR, RT" to "NR, RT" you want the
NA
s to join in the non repeating event which is a little non-standard.Things get a little more complicated when considering the mixed structure instruments. For example, suppose you're joining "RS, RT" to "RS, RT". Presumably you'd want to join on all the keys for records in the RT event but on only
record_id
andredcap_event
for the RS records.If we wanted to cover all our bases we could try to enumerate the keys we'd expect to join on for all 28 possible combinations.
Missing keys
The above assumes that all instrument-event pairs have data. I noticed that when this isn't true the keys that come back for each data tibble aren't consistent. For example, in the "NR, RS" instrument
redcap_form_instance
is only included if data is filled out for that instrument in the RS event. Not sure how much this matters for the joins but it might be an additional consideration.Checklist