Closed rsh52 closed 3 weeks ago
Hm, going to need to think more about this. After doing some debugging, the solution I proposed essentially is just leading us back to re-implementing redcap_repeat_instrument
/redcap_event_name
which we originally wanted to avoid.
db_data_long
# A tibble: 5 × 8
record_id redcap_event_name redcap_repeat_instrument redcap_repeat_instance mixed_structure_1 mixed_structure_1_complete mixed_structure_2 mixed_structure_2_complete
<dbl> <chr> <chr> <dbl> <chr> <fct> <chr> <fct>
1 1 repeating_together_arm_1 NA 1 RT1 Complete RT1 Complete
2 1 repeating_together_arm_1 NA 2 RT2 Complete RT2 Complete
3 1 repeating_separate_arm_1 mixed_structure_1 1 RS1 Complete NA NA
4 1 repeating_separate_arm_1 mixed_structure_1 2 RS2 Complete NA NA
5 1 repeating_separate_arm_1 mixed_structure_2 1 NA NA RS1 Complete
In trying to separate rows based on RS, what we're looking for is sparsity that we nixed on principle when consolidating repeating instances under redcap_form_instance
in our data tibbles.
redcap_form_instance
moved over to redcap_event_instance
record_id
and redcap_event
and not redcap_form_instance
This ensures RSs are joined correctly, may repeat information from the table being joined on, but that this is technically a "correct" join. Versus a "stacked" table where the observations are staggered, which is what the REDCapR/sparse output gives you.
What we need to do: update how we determine repeat events and move those values over to redcap_event_instance
, then update join_data_tibbles()
joining operation.
This ensures RSs are joined correctly, may repeat information from the table being joined on, but that this is technically a "correct" join. Versus a "stacked" table where the observations are staggered, which is what the REDCapR/sparse output gives you.
After banging my head against this for a while, I don't think this is actually correct either and am worried that getting this to work would create an extremely opaque output.
Consider this output, with a newly-implemented .repeat_type
col taken from the updates in #205 where we can now access the repeat event type from the redcap_events
col of the supertibble. This uses the extra mixed test db:
Browse[1]> x
# A tibble: 4 × 7
record_id redcap_event redcap_form_instance redcap_event_instance mixed_structure_1 form_status_complete .repeat_type
<dbl> <chr> <dbl> <dbl> <chr> <fct> <chr>
1 1 repeating_together NA 1 RT1 Complete repeat_together
2 1 repeating_together NA 2 RT2 Complete repeat_together
3 1 repeating_separate 1 NA RS1 Complete repeat_separate
4 1 repeating_separate 2 NA RS2 Complete repeat_separate
Browse[1]> y
# A tibble: 3 × 7
record_id redcap_event redcap_form_instance redcap_event_instance mixed_structure_2 form_status_complete .repeat_type
<dbl> <chr> <dbl> <dbl> <chr> <fct> <chr>
1 1 repeating_together NA 1 RT1 Complete repeat_together
2 1 repeating_together NA 2 RT2 Complete repeat_together
3 1 repeating_separate 1 NA RS1 Complete repeat_separate
Browse[1]> dplyr::full_join(x,y,by)
# A tibble: 4 × 10
record_id redcap_event redcap_form_instance redcap_event_instance mixed_structure_1 form_status_complete.x .repeat_type.x mixed_structure_2 form_status_complete.y .repeat_type.y
<dbl> <chr> <dbl> <dbl> <chr> <fct> <chr> <chr> <fct> <chr>
1 1 repeating_together NA 1 RT1 Complete repeat_together RT1 Complete repeat_together
2 1 repeating_together NA 2 RT2 Complete repeat_together RT2 Complete repeat_together
3 1 repeating_separate 1 NA RS1 Complete repeat_separate RS1 Complete repeat_separate
4 1 repeating_separate 2 NA RS2 Complete repeat_separate NA NA NA
Before we said that the desired output would be for those last three NA
s in row 4 to be repeats of the values in row 3. But this doesn't make sense of what should be done with the extra data in y
that also qualifies based on the join-by columns (i.e. repeating separate instance 2+) since we're not using redcap_form_instance
as a by
column. I tried out separating the behaviors of these into RT x
and y
tables, nonrepeating form RS event tables, and repeating form RS event tables, joining by the columns that individually mattered, and binding the rows. But this output is complicated and untraceable for a user.
I'll think some more on this, but think this warrants a larger discussion. In the end we need to make sure we "intelligently join" on columns that are transparent to the user and handling this succinctly with the various behaviors an individual form/table gives proves very challenging.
Expected Behavior
When joining mixed data structures with
read_redcap(..., allow_mixed_structure = TRUE)
, we expect tables to be made with meaningful, unique primary keys.Current Behavior
During development of #199, we found that joining tables that both "repeat together" (i.e. as events, RT) and "repeat separate" (i.e. as instruments, RS) don't provide the appropriate primary keys to allow for distinction of RS rows during joins.
How to Reproduce the Bug:
Using the REDCap for a single record as set up below:
Currently
read_redcap()
gives us the following output:Using
join_data_tibbles()
with a "full join" we get this:The issue here is that in row 3, data for
mixed_structure_1
andmixed_structure_2
should exist on separate rows because they are RS instances. Asread_redcap()
is currently set up, it is impossible to separate these because the primary keys for both are identical (record_id
,redcap_event
,redcap_form_instance
). This is a by product of how we decided to mixredcap_form_instance
s meaning between repeat events and instrumentsSolution Proposal
To fix this we will need to do the following:
redcap_events
column of the supertibble. This will need to identify if an event is a RS/RT for us to reference in other functions likejoin_data_tibbles()
redcap_form_instance
is used in these examples for RS rows. This may involve revising howredcap_form_instance
/redcap_event_instance
is defined but boils down to needing an additional primary key to help identify separate repeating instances. @ezraporter after making an example I thinkredcap_form_instance
/redcap_event_instance
might still not address this fully since these share the same event information. I think instead we need to identify RS data and then give it an additional column with the form name it came from. This might be able to get shifted back to the join function's responsibilities.Checklist
Before submitting this issue, please check and verify below that the submission meets the below criteria: