CHOP-CGTInformatics / REDCapTidieR

Makes it easy to read REDCap Projects into R
https://chop-cgtinformatics.github.io/REDCapTidieR/
Other
33 stars 8 forks source link

[BUG] Misleading Mixed Data Structure Outputs #204

Closed rsh52 closed 3 weeks ago

rsh52 commented 1 month ago

Expected Behavior

When joining mixed data structures with read_redcap(..., allow_mixed_structure = TRUE), we expect tables to be made with meaningful, unique primary keys.

Current Behavior

During development of #199, we found that joining tables that both "repeat together" (i.e. as events, RT) and "repeat separate" (i.e. as instruments, RS) don't provide the appropriate primary keys to allow for distinction of RS rows during joins.

How to Reproduce the Bug:

Using the REDCap for a single record as set up below: image

Currently read_redcap() gives us the following output:

> sprtbl$redcap_data
[[1]]
# A tibble: 4 × 5
  record_id redcap_event       redcap_form_instance mixed_structure_1 form_status_complete
      <dbl> <chr>                             <dbl> <chr>             <fct>               
1         1 repeating_together                    1 RT1               Complete            
2         1 repeating_together                    2 RT2               Complete            
3         1 repeating_separate                    1 RS1               Complete            
4         1 repeating_separate                    2 RS2               Complete            

[[2]]
# A tibble: 3 × 5
  record_id redcap_event       redcap_form_instance mixed_structure_2 form_status_complete
      <dbl> <chr>                             <dbl> <chr>             <fct>               
1         1 repeating_together                    1 RT1               Complete            
2         1 repeating_together                    2 RT2               Complete            
3         1 repeating_separate                    1 RS1               Complete   

Using join_data_tibbles() with a "full join" we get this:

join_data_tibbles(sprtbl, x = "mixed_structure_1", y = "mixed_structure_2", type = "full")
# A tibble: 4 × 8
  record_id redcap_event       redcap_form_instance mixed_structure_1 redcap_event_instance mixed_structure_2 form_status_complete.x form_status_complete.y
      <dbl> <chr>                             <dbl> <chr>             <lgl>                 <chr>             <fct>                  <fct>                 
1         1 repeating_together                    1 RT1               NA                    RT1               Complete               Complete              
2         1 repeating_together                    2 RT2               NA                    RT2               Complete               Complete              
3         1 repeating_separate                    1 RS1               NA                    RS1               Complete               Complete              
4         1 repeating_separate                    2 RS2               NA                    NA                Complete               NA        

The issue here is that in row 3, data for mixed_structure_1 and mixed_structure_2 should exist on separate rows because they are RS instances. As read_redcap() is currently set up, it is impossible to separate these because the primary keys for both are identical (record_id, redcap_event, redcap_form_instance). This is a by product of how we decided to mix redcap_form_instances meaning between repeat events and instruments

Solution Proposal

To fix this we will need to do the following:

> tibble::tribble(
+   ~"record_id", ~"redcap_event", ~"redcap_form_instance", ~"redcap_event_instance", ~"extra_rs_key", ~"mixed_structure_1", ~"mixed_structure_2", ~"form_status_complete.x", ~"form_status_complete.y",
+   1, "repeat_together", 1, NA, NA, "RT1", "RT1", "Complete", "Complete",
+   1, "repeat_together", 2, NA, NA, "RT2", "RT2", "Complete", "Complete",
+   1, "repeat_separate", 1, NA, "mixed_structure_1", "RS1", NA, "Complete", NA,
+   1, "repeat_separate", 1, NA, "mixed_structure_2", NA, "RS1", NA, "Complete",
+   1, "repeat_separate", 2, NA, "mixed_structure_1", "RS2", NA, "Complete", NA
+   )
# A tibble: 5 × 9
  record_id redcap_event    redcap_form_instance redcap_event_instance extra_rs_key      mixed_structure_1 mixed_structure_2 form_status_complete.x form_status_complete.y
      <dbl> <chr>                          <dbl> <lgl>                 <chr>             <chr>             <chr>             <chr>                  <chr>                 
1         1 repeat_together                    1 NA                    NA                RT1               RT1               Complete               Complete              
2         1 repeat_together                    2 NA                    NA                RT2               RT2               Complete               Complete              
3         1 repeat_separate                    1 NA                    mixed_structure_1 RS1               NA                Complete               NA                    
4         1 repeat_separate                    1 NA                    mixed_structure_2 NA                RS1               NA                     Complete              
5         1 repeat_separate                    2 NA                    mixed_structure_1 RS2               NA                Complete               NA      

Checklist

Before submitting this issue, please check and verify below that the submission meets the below criteria:

rsh52 commented 1 month ago

Hm, going to need to think more about this. After doing some debugging, the solution I proposed essentially is just leading us back to re-implementing redcap_repeat_instrument/redcap_event_name which we originally wanted to avoid.

db_data_long
# A tibble: 5 × 8
  record_id redcap_event_name        redcap_repeat_instrument redcap_repeat_instance mixed_structure_1 mixed_structure_1_complete mixed_structure_2 mixed_structure_2_complete
      <dbl> <chr>                    <chr>                                     <dbl> <chr>             <fct>                      <chr>             <fct>                     
1         1 repeating_together_arm_1 NA                                            1 RT1               Complete                   RT1               Complete                  
2         1 repeating_together_arm_1 NA                                            2 RT2               Complete                   RT2               Complete                  
3         1 repeating_separate_arm_1 mixed_structure_1                             1 RS1               Complete                   NA                NA                        
4         1 repeating_separate_arm_1 mixed_structure_1                             2 RS2               Complete                   NA                NA                        
5         1 repeating_separate_arm_1 mixed_structure_2                             1 NA                NA                         RS1               Complete   

In trying to separate rows based on RS, what we're looking for is sparsity that we nixed on principle when consolidating repeating instances under redcap_form_instance in our data tibbles.

rsh52 commented 1 month ago

This ensures RSs are joined correctly, may repeat information from the table being joined on, but that this is technically a "correct" join. Versus a "stacked" table where the observations are staggered, which is what the REDCapR/sparse output gives you.

What we need to do: update how we determine repeat events and move those values over to redcap_event_instance, then update join_data_tibbles() joining operation.

rsh52 commented 1 month ago

This ensures RSs are joined correctly, may repeat information from the table being joined on, but that this is technically a "correct" join. Versus a "stacked" table where the observations are staggered, which is what the REDCapR/sparse output gives you.

After banging my head against this for a while, I don't think this is actually correct either and am worried that getting this to work would create an extremely opaque output.

Consider this output, with a newly-implemented .repeat_type col taken from the updates in #205 where we can now access the repeat event type from the redcap_events col of the supertibble. This uses the extra mixed test db:

Browse[1]> x
# A tibble: 4 × 7
  record_id redcap_event       redcap_form_instance redcap_event_instance mixed_structure_1 form_status_complete .repeat_type   
      <dbl> <chr>                             <dbl>                 <dbl> <chr>             <fct>                <chr>          
1         1 repeating_together                   NA                     1 RT1               Complete             repeat_together
2         1 repeating_together                   NA                     2 RT2               Complete             repeat_together
3         1 repeating_separate                    1                    NA RS1               Complete             repeat_separate
4         1 repeating_separate                    2                    NA RS2               Complete             repeat_separate
Browse[1]> y
# A tibble: 3 × 7
  record_id redcap_event       redcap_form_instance redcap_event_instance mixed_structure_2 form_status_complete .repeat_type   
      <dbl> <chr>                             <dbl>                 <dbl> <chr>             <fct>                <chr>          
1         1 repeating_together                   NA                     1 RT1               Complete             repeat_together
2         1 repeating_together                   NA                     2 RT2               Complete             repeat_together
3         1 repeating_separate                    1                    NA RS1               Complete             repeat_separate
Browse[1]> dplyr::full_join(x,y,by)
# A tibble: 4 × 10
  record_id redcap_event       redcap_form_instance redcap_event_instance mixed_structure_1 form_status_complete.x .repeat_type.x  mixed_structure_2 form_status_complete.y .repeat_type.y 
      <dbl> <chr>                             <dbl>                 <dbl> <chr>             <fct>                  <chr>           <chr>             <fct>                  <chr>          
1         1 repeating_together                   NA                     1 RT1               Complete               repeat_together RT1               Complete               repeat_together
2         1 repeating_together                   NA                     2 RT2               Complete               repeat_together RT2               Complete               repeat_together
3         1 repeating_separate                    1                    NA RS1               Complete               repeat_separate RS1               Complete               repeat_separate
4         1 repeating_separate                    2                    NA RS2               Complete               repeat_separate NA                NA                     NA  

Before we said that the desired output would be for those last three NAs in row 4 to be repeats of the values in row 3. But this doesn't make sense of what should be done with the extra data in y that also qualifies based on the join-by columns (i.e. repeating separate instance 2+) since we're not using redcap_form_instance as a by column. I tried out separating the behaviors of these into RT x and y tables, nonrepeating form RS event tables, and repeating form RS event tables, joining by the columns that individually mattered, and binding the rows. But this output is complicated and untraceable for a user.

I'll think some more on this, but think this warrants a larger discussion. In the end we need to make sure we "intelligently join" on columns that are transparent to the user and handling this succinctly with the various behaviors an individual form/table gives proves very challenging.