ctsit / redcapcustodian

Simplified, automated data management on REDCap systems

Read `job_summary_data` from multiple log rows #109

Closed pbchase closed 11 months ago

pbchase commented 1 year ago

It would be helpful to have a function that codifies the recipe for reading the `job_summary_data` out of the log entries for a group of similar records. The need derives from our use of JSON to store the data, MySQL's inability to parse it, and jsonlite's lack of vectorization. Reading one record of `job_summary_data` is easy, but most use cases can't identify a single record of interest, or they want records across a span of time. Here's some sample code that reads all the records saved by a script named `warn_owners_of_impending_bill`:

```r
library(dotenv)
load_dot_env("prod.env")

library(redcapcustodian)
library(rcc.billing)
library(RMariaDB)
library(DBI)
library(tidyverse)
library(lubridate)

init_etl("pbc_scratch")

# connection to the database that holds the rcc_job_log table
log_conn <- get_package_scope_var("log_con")

# fetch every log row written by the script of interest
log_data <- tbl(log_conn, "rcc_job_log") %>%
  dplyr::filter(script_name == "warn_owners_of_impending_bill") %>%
  collect()

# name the vector of row IDs so purrr labels each decoded row with its source ID
log_ids <- log_data$id
names(log_ids) <- log_ids

# decode the JSON in each job_summary_data value, then join the rest of the
# log columns back onto the decoded rows
purrr::map2_dfr(log_ids, log_data$job_summary_data, ~ jsonlite::fromJSON(.y), .id = "id") %>%
  mutate(id = as.integer(id)) %>%
  left_join(log_data %>% select(-job_summary_data), by = "id")
```

The interesting part is the last two stanzas, where we give names to a vector of row IDs, iteratively decode the JSON from those rows, and then join the rest of the log data onto the decoded JSON:

```r
log_ids <- log_data$id
names(log_ids) <- log_ids

purrr::map2_dfr(log_ids, log_data$job_summary_data, ~ jsonlite::fromJSON(.y), .id = "id") %>%
  mutate(id = as.integer(id)) %>%
  left_join(log_data %>% select(-job_summary_data), by = "id")
```

Putting just these components into the function lets the caller decide how many records, from which script or scripts, get decoded. We already have use cases for this function where we want to inspect logged email send failures generated by multiple scripts. Because we logged the email failures the same way in each script, aggregating the send failures across them is trivial with a function like this.
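
For illustration, here is a minimal sketch of what such a helper could look like, assuming it receives log rows the caller has already filtered and collected. The function name, its signature, and the second script name in the usage example are placeholders, not the final API.

```r
library(dplyr)
library(purrr)
library(jsonlite)

# Hypothetical helper (placeholder name): decode the JSON stored in
# job_summary_data for every row of log_data and join the remaining
# log columns back onto the decoded rows.
unnest_job_summary_records <- function(log_data) {
  log_ids <- log_data$id
  names(log_ids) <- log_ids

  purrr::map2_dfr(log_ids, log_data$job_summary_data, ~ jsonlite::fromJSON(.y), .id = "id") %>%
    mutate(id = as.integer(id)) %>%
    left_join(select(log_data, -job_summary_data), by = "id")
}

# Possible usage: aggregate logged email send failures across several scripts.
# log_conn is the connection from the example above; the second script name
# is illustrative.
email_failures <- tbl(log_conn, "rcc_job_log") %>%
  filter(script_name %in% c("warn_owners_of_impending_bill", "another_billing_script")) %>%
  collect() %>%
  unnest_job_summary_records()
```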

This function will need a name. I offer these suggestions:

  1. spread_job_summary_data
  2. decode_job_summary_data
  3. decode_job_summary_records
  4. decode_job_summary
  5. decode_job_summaries
  6. read_job_summaries

We need to keep the distinction clear in people's minds between the redcapcustodian job logs and REDCap's redcap_log_event table.

@ChemiKyle, have any ideas on naming?

pbchase commented 1 year ago

More name suggestions:

  1. unnest_job_summary_records
  2. unnest_job_summaries
  3. unnest_job_summary_data
  4. unnest_job_summary_data_records
  5. unnest_job_summary_data_column
  6. unnest_job_log_data

pbchase commented 1 year ago

I'm choosing the name unnest_job_summary_data

pbchase commented 1 year ago

> I'm choosing the name unnest_job_summary_data

No, let's make that unnest_job_summary_data_json_object
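
With the chosen name, the original example might reduce to something like the sketch below. This assumes the new function takes the collected log rows as its only argument; the actual signature is settled in the pull request, not here.

```r
library(dotenv)
load_dot_env("prod.env")

library(redcapcustodian)
library(dplyr)

init_etl("pbc_scratch")
log_conn <- get_package_scope_var("log_con")

# Assumed usage: filter and collect the log rows of interest, then hand them
# to the new function to decode job_summary_data and rejoin the log columns.
warn_owner_runs <- tbl(log_conn, "rcc_job_log") %>%
  filter(script_name == "warn_owners_of_impending_bill") %>%
  collect() %>%
  unnest_job_summary_data_json_object()
```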

pbchase commented 11 months ago

Addressed in PR #111.