ctsit / redcapcustodian

Simplified, automated data management on REDCap systems
Other
12 stars 6 forks source link

Add data diff function to aid one-way data synch #21

Closed pbchase closed 2 years ago

pbchase commented 2 years ago

Write a data diff function that can generate difference data for two datasets, source and target. The goal is to allow intelligent one-way synchronization between these two datasets using database CRUD operations. There are two assumptions about source and target:

  1. Columns of source are a subset of the columns of target.
  2. Though both source and target share their primary key, target adds a primary key that is different from the primary key in source.

Here is an example that illustrates the problem and provides some prototype code:

dataset_diff <- function(source, source_pk, target, target_pk) {

  updated_records <- source %>%
    anti_join(target) %>%
    inner_join(target, by=source_pk) %>%
    select(any_of(target_pk), any_of(names(source)), ends_with(".x")) %>%
    rename_with(., ~ gsub(".x", "", .x), ends_with(".x"))

  ids_of_updated_records <- updated_records %>% dplyr::pull({{target_pk}})

  new_records <- source %>%
    anti_join(target) %>%
    anti_join(updated_records)

  deleted_records <-
    target %>%
    anti_join(source) %>%
    filter(! (!!as.symbol(target_pk)) %in% ids_of_updated_records)

  return(list(
    updated_records = updated_records,
    new_records = new_records,
    deleted_records = deleted_records    
  ))
}

a <- tribble(
  ~id, ~pk, ~bar, ~bang,
  1,1,1,1,
  2,2,2,1,
  3,3,3,1,
  4,4,4,1,
  6,6,6,1
)

b <- tribble(
  ~pk, ~bar,
  1,1,
  2,2,
  3,1,
  4,1,
  5,5
)

dataset_diff(source = b,
             source_pk = "pk",
             target = a,
             target_pk = "id"
             )

Data synchronization functions could use the output of this function as the primary input for CRUD operations against a database table, with the contents of target that need to reflect the data in source.

Which rows get written and what additional columns might accompany the source data are implementation dependent and outside the scope of this function.

ChemiKyle commented 2 years ago

Addressed by #46