ctsit / redcapcustodian

Simplified, automated data management on REDCap systems

Read `job_summary_data` from multiple log rows #109

Closed pbchase closed 11 months ago

pbchase commented 1 year ago

It would be helpful to have a function that codifies the recipe for reading the `job_summary_data` out of the log entries for a group of similar records. The need derives from our use of JSON to store the data, MySQL's inability to parse it, and jsonlite's lack of vectorization. Reading one record of `job_summary_data` is easy, but most use cases can't identify a single record of interest, or they want records across a span of time. Here's some sample code that reads all the records saved by a script named `warn_owners_of_impending_bill`:

```r
library(dotenv)
load_dot_env("prod.env")

library(redcapcustodian)
library(rcc.billing)
library(RMariaDB)
library(DBI)
library(tidyverse)
library(lubridate)

init_etl("pbc_scratch")

# connection to the database that holds the rcc_job_log table
log_conn <- get_package_scope_var("log_con")

# fetch every log row written by the script of interest
log_data <- tbl(log_conn, "rcc_job_log") %>%
  dplyr::filter(script_name == "warn_owners_of_impending_bill") %>%
  collect()

# name the vector of row IDs so purrr labels each decoded row with its source ID
log_ids <- log_data$id
names(log_ids) <- log_ids

# decode the JSON in each job_summary_data value, then join the rest of the
# log columns back onto the decoded rows
purrr::map2_dfr(log_ids, log_data$job_summary_data, ~ jsonlite::fromJSON(.y), .id = "id") %>%
  mutate(id = as.integer(id)) %>%
  left_join(log_data %>% select(-job_summary_data), by = "id")
```

The interesting part is the last two stanzas, where we give names to a vector of row IDs, iteratively decode the JSON from those rows, and then join the rest of the log data onto the decoded JSON:

```r
log_ids <- log_data$id
names(log_ids) <- log_ids

purrr::map2_dfr(log_ids, log_data$job_summary_data, ~ jsonlite::fromJSON(.y), .id = "id") %>%
  mutate(id = as.integer(id)) %>%
  left_join(log_data %>% select(-job_summary_data), by = "id")
```

Putting just these components into the function lets the caller decide how many records, from which script or scripts, get decoded. We already have use cases for this function where we want to inspect logged email send failures generated by multiple scripts. Because we logged the email failures the same way in each script, aggregating the send failures across them is trivial with a function like this.
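
For illustration, here is a minimal sketch of what such a helper could look like, assuming it receives log rows the caller has already filtered and collected. The function name, its signature, and the second script name in the usage example are placeholders, not the final API.

```r
library(dplyr)
library(purrr)
library(jsonlite)

# Hypothetical helper (placeholder name): decode the JSON stored in
# job_summary_data for every row of log_data and join the remaining
# log columns back onto the decoded rows.
unnest_job_summary_records <- function(log_data) {
  log_ids <- log_data$id
  names(log_ids) <- log_ids

  purrr::map2_dfr(log_ids, log_data$job_summary_data, ~ jsonlite::fromJSON(.y), .id = "id") %>%
    mutate(id = as.integer(id)) %>%
    left_join(select(log_data, -job_summary_data), by = "id")
}

# Possible usage: aggregate logged email send failures across several scripts.
# log_conn is the connection from the example above; the second script name
# is illustrative.
email_failures <- tbl(log_conn, "rcc_job_log") %>%
  filter(script_name %in% c("warn_owners_of_impending_bill", "another_billing_script")) %>%
  collect() %>%
  unnest_job_summary_records()
```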

This function will need a name. I offer these suggestions:

  1. spread_job_summary_data
  2. decode_job_summary_data
  3. decode_job_summary_records
  4. decode_job_summary
  5. decode_job_summaries
  6. read_job_summaries

We need to keep the distinction clear in people's minds between the redcapcustodian job logs and REDCap's redcap_log_event table.

@ChemiKyle, have any ideas on naming?

pbchase commented 1 year ago

More name suggestions:

  1. unnest_job_summary_records
  2. unnest_job_summaries
  3. unnest_job_summary_data
  4. unnest_job_summary_data_records
  5. unnest_job_summary_data_column
  6. unnest_job_log_data

pbchase commented 1 year ago

I'm choosing the name unnest_job_summary_data

pbchase commented 1 year ago

> I'm choosing the name unnest_job_summary_data

No, let's make that unnest_job_summary_data_json_object
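
With the chosen name, the original example might reduce to something like the sketch below. This assumes the new function takes the collected log rows as its only argument; the actual signature is settled in the pull request, not here.

```r
library(dotenv)
load_dot_env("prod.env")

library(redcapcustodian)
library(dplyr)

init_etl("pbc_scratch")
log_conn <- get_package_scope_var("log_con")

# Assumed usage: filter and collect the log rows of interest, then hand them
# to the new function to decode job_summary_data and rejoin the log columns.
warn_owner_runs <- tbl(log_conn, "rcc_job_log") %>%
  filter(script_name == "warn_owners_of_impending_bill") %>%
  collect() %>%
  unnest_job_summary_data_json_object()
```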

pbchase commented 11 months ago

Addressed in PR #111.