CHOP-CGTInformatics / REDCapTidieR

Makes it easy to read REDCap Projects into R
https://chop-cgtinformatics.github.io/REDCapTidieR/
Other
33 stars 8 forks source link

[FEATURE] Function to join data tibbles #199

Open ezraporter opened 2 months ago

ezraporter commented 2 months ago

Feature Request Description

Function to join together data tibbles that picks smart default keys to join on based on the events each tibble is mapped to.

Proposed Solution

Some possible APIs:

supertbl |>
  join_data_tibble(
    x = "tbl1",
    y = "tbl2",
    by = c("record_id", "redcap_event"), # optional! Inferred by default
    type = "left"
   )

# Return a tibble

supertbl |>
  join_data_tibble(
    x = "tbl1",
    y = "tbl2",
    by = c("record_id", "redcap_event"), # optional! Inferred by default
    type = "left",
    name = "my_new_tibble"
   )

# Return a supertibble with a new tibble added

Additional Context

Data tibble types

API issues aside, I think the key here is that we'd need a way to determine for any two data tibbles which keys to use in a join by default.

Each instrument-event pair can have one of 3 types: Non Repeating, Repeating Separately, or Repeating Together. When I talk about the "type" of an instrument, I mean the aggregation of all the types of instrument-event pairs the instrument is part of. There are 7 options which I mocked up in an example REDCap.

Screenshot 2024-08-28 at 2 03 54 PM Screenshot 2024-08-28 at 2 02 40 PM
library(REDCapTidieR)
library(tidyverse)

data <- read_redcap(Sys.getenv("REDCAP_URI"), "77C2C3A96E00246436A9E49974BD5674", allow_mixed_structure = TRUE)

extract_keys <- function(data) {
  data |>
    colnames() |>
    intersect(c("record_id", "redcap_event", "redcap_form_instance", "redcap_event_instance")) |>
    paste(collapse = ", ")
}

data |>
  mutate(pks = map_chr(redcap_data, extract_keys)) |>
  select(redcap_form_label, pks, structure)

#> A REDCapTidieR Supertibble with 7 instruments
#>  redcap_form_label pks                                            structure   
#>  <chr>             <chr>                                          <chr>       
#>1 NR                record_id, redcap_event                        nonrepeating
#>2 RS                record_id, redcap_event, redcap_form_instance  repeating   
#>3 RT                record_id, redcap_event, redcap_event_instance nonrepeating
#>4 NR, RS            record_id, redcap_event, redcap_form_instance  mixed       
#>5 RS, RT            record_id, redcap_event, redcap_form_instance  mixed       
#>6 NR, RT            record_id, redcap_event, redcap_event_instance nonrepeating
#>7 NR, RS, RT        record_id, redcap_event, redcap_form_instance  mixed 

"Type" here is slightly more specific than "structure". I think for joining any two non-mixed instruments the default should be to join on all common keys. The one complication I see is the NA redcap_event_instance in the "NR, RT" case.

data |>
  extract_tibble("nr_rt")

#># A tibble: 2 × 5
#>  record_id redcap_event       redcap_event_instance nr_rt form_status_complete
#>      <dbl> <chr>                              <dbl> <dbl> <fct>               
#>1         1 non_repeating                         NA     1 Incomplete          
#>2         1 repeating_together                     1     1 Incomplete          

Presumably if you join "NR, RT" to "NR, RT" you want the NAs to join in the non repeating event which is a little non-standard.

Things get a little more complicated when considering the mixed structure instruments. For example, suppose you're joining "RS, RT" to "RS, RT". Presumably you'd want to join on all the keys for records in the RT event but on only record_id and redcap_event for the RS records.

If we wanted to cover all our bases we could try to enumerate the keys we'd expect to join on for all 28 possible combinations.

Missing keys

The above assumes that all instrument-event pairs have data. I noticed that when this isn't true the keys that come back for each data tibble aren't consistent. For example, in the "NR, RS" instrument redcap_form_instance is only included if data is filled out for that instrument in the RS event. Not sure how much this matters for the joins but it might be an additional consideration.

Checklist

rsh52 commented 1 month ago

Never got around to giving you a good response, but looking over this now with some thoughts.

API:

So far as the API goes, originally I was going to say return a supertibble since that's what the new combined_checkboxes() does (based on the idea of it gets a supertibble, so it returns a supertibble). But on second thought, I don't know that I want this function doing the heavy lifting of appending/editing the supertibble. What would the output look like if we returned a supertibble, would you have a new row added that joined the two? Would it then also need to update the metadata tibble and remaining analytics columns? Or would it combine the two original columns (and still need to update the metadata and remaining columns)?

IMO the easiest thing to start with here is returning just a data tibble. I think that's the easiest for a user to visually confirm and do what they wish with, and if we do name the function join_data_tibbles() (I think this works for the name) then it's not unreasonable to expect a data tibble output.

NR/RT

The one complication I see is the NA redcap_event_instance in the "NR, RT" case. ... Presumably if you join "NR, RT" to "NR, RT" you want the NAs to join in the non repeating event which is a little non-standard.

This is pseudo-non standard if I understand right. extract_keys() or some intermediate step would give us the relationship/type identifier. Could we just use a helper column from that to join on instead of joining on NA? I think that's going to be the important piece for all of this including the RS/RT to RS/RT joins. That type designation would help funnel which of the 28 operations to enforce.

My hope would be if we approach this as tackling each relationship one at a time then we can just build this up piece by piece.

rsh52 commented 1 month ago

@ezraporter Started working on this a little over here.

When you have a moment care to take a look and see if it's structured in a way that makes sense so far, borrowing from your initial suggestions? My hope and first milestone I want to reach is that this works for all non-mixed join relationships (I believe it does) and that the next piece will be to insert logic where is_mixed is true.

ezraporter commented 1 month ago

This looks like a good start! Just a few random thoughts:

  1. I'm fine with approaching it non-mixed then mixed but I'm not convinced the final code should follow that structure. Once we consider the mixed case there might be a larger pattern that doesn't split along the mixed/non-mixed line and we should be looking for that. So I say keep going this way but don't get stuck in the distinction.

  2. If we go the route of adding attributes to the tibbles like pks and structure we should a) think a bit more about what attributes we want to include and consider making an actual class and b) do those computations in read_redcap(). I'm not fully convinced we need more than functional programming here though.

  3. I'm not sure the rlang::exec is necessary. If it's helpful here's what popped into my head:

join_data_tibbles <- function(...) {
  join_fn <- get_join_fn(type) # returns a function, ex. `left_join`
  tbl_x <- extract_tibble(supertbl, x)
  tbl_x_type <- get_structure(tbl_x) # returns a string with the structure, ex. "mixed"
  tbl_y <- extract_tibble(supertbl, y)
  tbl_y_type <- get_structure(tbl_y)

  by <- build_by(tbl_x, tbl_x_type, tbl_y, tbl_y_type) # returns a vector of keys, ex. c(`record_id_field`, "redcap_event")

  # do the join!
  join_fn(tbl_x, tbl_y, by = by)
}
rsh52 commented 1 month ago

This looks like a good start! Just a few random thoughts:

  1. I'm fine with approaching it non-mixed then mixed but I'm not convinced the final code should follow that structure. Once we consider the mixed case there might be a larger pattern that doesn't split along the mixed/non-mixed line and we should be looking for that. So I say keep going this way but don't get stuck in the distinction.

For sure, and I'm sure we can get there but for now, at least the way my brain works, I wanted to approach things stepwise.

  1. If we go the route of adding attributes to the tibbles like pks and structure we should a) think a bit more about what attributes we want to include and consider making an actual class and b) do those computations in read_redcap(). I'm not fully convinced we need more than functional programming here though.

Yea I need to think about this more since between this and record_status_dashboard() I want to make a few more things accessible from the supertibble.

  1. I'm not sure the rlang::exec is necessary. If it's helpful here's what popped into my head:
join_data_tibbles <- function(...) {
  join_fn <- get_join_fn(type) # returns a function, ex. `left_join`
  tbl_x <- extract_tibble(supertbl, x)
  tbl_x_type <- get_structure(tbl_x) # returns a string with the structure, ex. "mixed"
  tbl_y <- extract_tibble(supertbl, y)
  tbl_y_type <- get_structure(tbl_y)

  by <- build_by(tbl_x, tbl_x_type, tbl_y, tbl_y_type) # returns a vector of keys, ex. c(`record_id_field`, "redcap_event")

  # do the join!
  join_fn(tbl_x, tbl_y, by = by)
}

I like this set up, will restructure to accommodate 👍

rsh52 commented 1 month ago

Presumably if you join "NR, RT" to "NR, RT" you want the NAs to join in the non repeating event which is a little non-standard. Things get a little more complicated when considering the mixed structure instruments. For example, suppose you're joining "RS, RT" to "RS, RT". Presumably you'd want to join on all the keys for records in the RT event but on only record_id and redcap_event for the RS records. If we wanted to cover all our bases we could try to enumerate the keys we'd expect to join on for all 28 possible combinations.

Coming back to this, I'm wondering if this can actually be made simpler? During the joins if we artificially create a redcap_event_instance and redcap_form_instance for the tables that don't have them and fill them with NA then I believe this should cover things. It would make for some weird outputs, like empty data for left and right joins but would have all of the expected data for full joins which I think users would want to specify anyway.

Going to test this a bit more and hope to update the branch.

rsh52 commented 1 month ago

@ezraporter When you have a moment, can you trial out the code on the join-tibbles-branch? If you want I can write up a test suite & PR if that makes it easier to review, but thought it might still be too soon to do that.

The main changes I made for mixed structure handling were to check the x and y tables for the existence of redcap_event_instance / redcap_form_instance and add them in if missing. This way joining with NA is meaningful when there is a non-repeat table.

As I mentioned above, when performing the mixed structure joins I think it's more common that users will need to select full_join() for their operation if they want to get all the observations from both tables. Otherwise left and right are just going to tack on empty columns from the opposite table.

Try out below and playing with the type arg:

sprtbl <- read_redcap(Sys.getenv("REDCAP_URI"), token = Sys.getenv("REDCAPTIDIER_MIXED_STRUCTURE_API"), 
                      allow_mixed_structure = TRUE)

join_data_tibbles(suprtbl = sprtbl, x = "nonrepeat_form", y = "mixed_structure_form")

join_data_tibbles(suprtbl = sprtbl, x = "nonrepeat_form", y = "repeat_form")

join_data_tibbles(suprtbl = sprtbl, x = "repeat_form", y = "mixed_structure_form")

At the moment if you join the RT/RS tables (the third join) there will be an empty redcap_event_instance column leftover. This probably needs to be dropped since it's confusingly not meaningful for repeating events with our set up, but wanted to get your thoughts.