ctsit / redcapcustodian

Simplified, automated data management on REDCap systems
Other
12 stars 6 forks source link

Describe how to extract log update records #144

Open pbchase opened 6 months ago

pbchase commented 6 months ago

I wrote this for a internal project to identify last-update events for fields and thus forms. I thought it woudl be complicated, but it was not that bad. This is the code that located every field update record for a single REDCap project, PID 2633. It has these features:

I feel like this code has a lot of value, but I am not yet sure what to do with it. There is value in knowing that it can be done and that the that the wall-time required to do it is modest even on very large dataset. There is value in the code that gets the raw data out of REDCap. There is value in the transformation from raw data to tidy. That transformation uses some more modern tidyr methods--separate_longer_delim and separate_wider_delim-- currently in the experimental lifecycle. They were perfectly suited for the job. Its hard to imagine expressing that trasnfrmaiton more succinctly.

Having said all that, maybe this begs for two functions, get_redcap_data_update_values() and make_redcap_data_update_values_tidy(), and a vignette that shows a developer how to use them. It would have a refactored version of this script in it.

library(tidyverse)
library(REDCapR)
library(dotenv)
library(lubridate)
library(DBI)
library(RMariaDB)
library(redcapcustodian)
library(openxlsx)
library(rcc.adrc)

library(dotenv)

load_dot_env("redcap_prod.env")

init_etl("extract_log_update_records")

rc_conn <- connect_to_redcap_db()
pid <- 2633

log_event_table <- tbl(rc_conn, "redcap_projects") |>
  filter(project_id == pid) |>
  collect() |>
  pull(log_event_table)

system.time( {
  project_update_records <- tbl(rc_conn, log_event_table) |>
    filter(project_id == pid & event == "UPDATE" & object_type == "redcap_data") |>
    collect() |>
    mutate(ts = lubridate::ymd_hms(ts))
})

project_update_records |>
  count(description)

project_update_records |> saveRDS(here::here("output", "log_db_update_records_from_pid_2633.rds"))

# project_update_records <- readRDS(here::here("output", "log_db_update_records_from_pid_2633.rds"))

field_updates <-
  project_update_records |>
  select(ts, data_values) |>
  separate_longer_delim(data_values, ",\n") |>
  separate_wider_delim(data_values, " = ", names = c("field_name", "value"), too_many = "merge") |>
  mutate(value = str_replace_all(value, "'", ""))

field_updates |>
  group_by(field_name) |>
  summarise(max_date = max(ts))