ctsit / redcapcustodian

Simplified, automated data management on REDCap systems

Read individual bounce messages to find bad email addresses #69

Closed pbchase closed 1 year ago

pbchase commented 1 year ago

We need a function that identifies bad email addresses from the bounce messages in an email inbox. It would work much like get_bad_emails_from_listserv_digest, but with different patterns for locating the bounce messages and for extracting the email addresses they report. CTS-IT wrote similar code for its billing processes, and I have hacked that into a starter function:

#' Scrape an inbox for bad email addresses in bounce messages
#'
#' Connect to an IMAP mailbox, identify bad email addresses referenced in bounce
#' messages sent after `messages_since_date`, and extract the data from those emails.
#'
#' @param url The IMAP URL of the host that houses the mailbox
#' @param username The username of the IMAP mailbox
#' @param password The password of the IMAP mailbox
#' @param messages_since_date The sent date of the oldest message that should be inspected
#'
#' @return A dataframe of bounced email addresses
#' \describe{
#'   \item{\code{email}}{character, an email address that bounced}
#' }
#' @export
#' @importFrom magrittr "%>%"
#' @importFrom rlang .data
#'
#' @examples
#' \dontrun{
#' get_bad_emails_from_individual_emails(
#'   username = "jdoe",
#'   password = "jane_does_password",
#'   url = "imaps://outlook.office365.com",
#'   messages_since_date = as.Date("2022-01-01", format = "%Y-%m-%d")
#'   )
#' }
get_bad_emails_from_individual_emails <- function(username,
                                                  password,
                                                  url = "imaps://outlook.office365.com",
                                                  messages_since_date) {
  imap_con <- mRpostman::configure_imap(
    url = url,
    username = username,
    password = password
  )

  imap_con$select_folder("INBOX")
  emails_by_subject_search <- imap_con$search_string(expr = "TBD SUBJECT", where = "SUBJECT")
  messages_since_date <- imap_con$search_since(date_char = format(messages_since_date, format = "%d-%b-%Y"))
  emails_found <- dplyr::intersect(emails_by_subject_search, messages_since_date)

  patterns <- c(
    "TBD"
  )

  # Start with an empty tibble that has a character `email` column
  data_from_emails <- tibble::tibble(email = character(0))

  if (length(emails_found) > 0) {
    for (email in emails_found) {

      data_row <- email %>%
        imap_con$fetch_text() %>%
        stringr::str_extract_all(patterns) %>%
        # remove html encoded < and > characters
        sub("&lt;.*&gt;", "", .) %>%
        # remove literal < and > characters
        sub("<.*>", "", .) %>%
        sub("\r\n", "", .)
        # TODO: finish this junky string transformation

      data_from_emails <- dplyr::bind_rows(data_from_emails, data_row)
    }
  }

  return(data_from_emails)
}

I don't recall the patterns that identify a bounce message or the patterns that extract the address from the body.
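One possible starting point for the address pattern, sketched under the assumption that the bounces are standard delivery status notifications (RFC 3464), which carry a `Final-Recipient: rfc822; <address>` field. The helper name `extract_bounced_addresses`, the regular expression, and the sample body are all hypothetical; real bounce messages in the target inbox may need different patterns, and the subject pattern that identifies a bounce is still TBD.

```r
# Hypothetical helper: pull bounced addresses out of one message body.
# Assumes the body contains RFC 3464 "Final-Recipient" fields.
extract_bounced_addresses <- function(body) {
  pattern <- "(?i)Final-Recipient:\\s*rfc822;\\s*[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,}"
  # All occurrences of the field in this body (character(0) when none)
  matches <- regmatches(body, gregexpr(pattern, body, perl = TRUE))[[1]]
  # Strip the field name so only the address remains
  addresses <- sub("(?i)^Final-Recipient:\\s*rfc822;\\s*", "", matches, perl = TRUE)
  data.frame(email = unique(tolower(addresses)), stringsAsFactors = FALSE)
}

# Made-up sample body for illustration only
sample_body <- paste(
  "Reporting-MTA: dns; mail.example.org",
  "",
  "Final-Recipient: rfc822; Nobody@Example.org",
  "Action: failed",
  "Status: 5.1.1",
  sep = "\r\n"
)

extract_bounced_addresses(sample_body)
```

Because the helper returns a one-column data frame named `email`, its result could be passed to `dplyr::bind_rows()` inside the loop in place of the unfinished string transformation.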

pbchase commented 1 year ago

In the example above, use these lines instead so that the search result no longer overwrites the messages_since_date argument:

  emails_by_subject_search <- imap_con$search_string(expr = "TBD SUBJECT", where = "SUBJECT")
  emails_by_since_search <- imap_con$search_since(date_char = format(messages_since_date, format = "%d-%b-%Y"))
  emails_found <- dplyr::intersect(emails_by_subject_search, emails_by_since_search)
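A related caveat on the `search_since` call: `format(..., "%d-%b-%Y")` takes its month abbreviation from the session locale, while IMAP date strings use English month names. A minimal sketch of a locale-independent alternative follows; `format_imap_date` is a hypothetical helper, not part of mRpostman or redcapcustodian.

```r
# Hypothetical helper: build an IMAP date string such as "01-Jan-2022"
# without depending on the session locale.
format_imap_date <- function(date) {
  parts <- as.POSIXlt(date)
  # month.abb is base R's constant vector of English month abbreviations
  sprintf("%02d-%s-%04d", parts$mday, month.abb[parts$mon + 1L], parts$year + 1900L)
}

format_imap_date(as.Date("2022-01-01"))
```

It would slot into the corrected lines as `imap_con$search_since(date_char = format_imap_date(messages_since_date))`.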