Benchmark execution speed and test coverage

ablack3 commented 1 week ago

We would like to show some metrics after the sprint.

Execution speed before (current release) and after the sprint
Test coverage before and after (also consider test quality)
Test speed before and after

mvankessel-EMC commented 1 week ago

I parsed logs from an old log-file, so this is mostly an example:

library(ggplot2)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union


path <- "D:/Users/mvankessel/Documents/CohortDiagnosticsSprint/log.txt"

readLog <- function(path) {
  df <- read.csv(path, sep = "\t", header = FALSE)
  names(df) <- c("time", "thread", "level", "package", "task", "message")

  df %>%
    dplyr::tibble() %>%
    dplyr::mutate(
      time = as.POSIXct(x = .data$time, format = "%Y-%m-%d %H:%M:%OS"),
      time_normalized = difftime(.data$time, max(.data$time), units = "secs"),
      time_normalized = .data$time_normalized - min(.data$time_normalized),
      duration = -as.numeric(difftime(.data$time, dplyr::lead(.data$time), units = "secs"))
    )
}

log <- readLog(path)

g1 <- ggplot(data = log, mapping = aes(x = time_normalized, y = duration)) +
  geom_line()

g1 +
  geom_text(data = log[log$duration > 30, ], mapping = aes(x = time_normalized, y = duration, label = task))
#> Don't know how to automatically pick scale for object of type <difftime>.
#> Defaulting to continuous.
#> Warning: Removed 1 row containing missing values or values outside the scale range
#> (`geom_line()`).
#> Warning: Removed 1 row containing missing values or values outside the scale range
#> (`geom_text()`).

^{Created on 2024-10-08 with reprex v2.1.1}

ablack3 commented 1 week ago

How about using the executionTimes.csv output file instead of parsing the log file? One thing I did notice was that the unit of executionTime could be second, minute, or hour. I fixed that in the the darwin_sprint branch and opened an issue here: https://github.com/OHDSI/CohortDiagnostics/issues/1144

Also I think the nesting/parent idea will need to stay (unless you have a proposal for changing it) so you have a hierachy of times in this file.

For plotting what about something like this from profivis?

It's also important to compare two executionTime.csv files from different databases, executions, and possibly versions of the CohortDiagnostics package.

I was thinking of changing the output column from executionTime to executionTimeMinutes to make the time unit clear but then that would break the output format so I'm unsure about that change.

mvankessel-EMC commented 1 week ago

I like the idea of using the executionTimes.csv file. But like you said, I think its pretty much unusable without knowing what unit the recorded duration time is in. Looking at the screenshot, I think you can use the startTime column sometimes, as the time stamps align with the duration.

As for the hierarchy, I don't really mind either way. I think it would be neat to have a plot like profvis.

mvankessel-EMC commented 1 week ago

The plotly plot didn't render. But it makes it so you can hover over the start of a task, and it will show information, like the task name, and the message, etc.

I tried to emulate the plot you showed, so I did boxes. I tried just horizontal lines aswell, but tasks that last 0 seconds won't show up then. So I think at least keeping one vertical, and one horizontal line would be nice.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

library(ggplot2)

path <- "D:/Users/mvankessel/Documents/CohortDiagnosticsSprint/log.txt"

readLog <- function(path) {
  df <- read.csv(path, sep = "\t", header = FALSE)
  names(df) <- c("time", "thread", "level", "package", "task", "message")

  df %>%
    dplyr::tibble() %>%
    dplyr::mutate(
      time = as.POSIXct(x = .data$time, format = "%Y-%m-%d %H:%M:%OS"),
      time_normalized = difftime(.data$time, max(.data$time), units = "secs"),
      time_normalized = .data$time_normalized - min(.data$time_normalized),
      duration = -as.numeric(difftime(.data$time, dplyr::lead(.data$time), units = "secs"))
    )
}

splitLogOnParent <- function(log) {
  level0Rows <- log$message %>% 
    stringr::str_detect(pattern = "^(?!([ ]{1,})?-).+$")
  level0Rows <- seq(1, nrow(log))[level0Rows]

  l <- sapply(seq_len(length(level0Rows)), function(i) {
    if (!is.na(level0Rows[i + 1])) {
      block <- seq(level0Rows[i], level0Rows[i + 1] - 1)
      d <- dplyr::tibble(log[block, ]) %>%
        dplyr::mutate(time_end = max(time_normalized) - time_normalized)
      return(d)
    }
  })

  l[!sapply(l, is.null)]
}

log <- readLog(path)
splitLog <- splitLogOnParent(log)

f <- function(splitLog) {
  splitLog <- lapply(splitLog, function(log) {
    log %>%
      mutate(level = dplyr::case_when(
        startsWith(.data$message, " -") ~ 1,
        startsWith(.data$message, "  -") ~ 2,
        startsWith(.data$message, "   -") ~ 3,
        startsWith(.data$message, "    -") ~ 4,
        startsWith(.data$message, "     -") ~ 5,
        startsWith(.data$message, "      -") ~ 6,
        .default = 0
      ))
  }) %>%
    dplyr::bind_rows()
  ggplot(
    data = splitLog,
    mapping = aes(
      text = sprintf("task: %s\nmessage: %s", task, message)
    )
  ) +
  geom_segment(
    mapping = aes(
      x = time_normalized,
      xend = time_normalized + duration,
      y = level,
      yend = level
    )
  ) +
  geom_segment(
    mapping = aes(
      x = time_normalized,
      xend = time_normalized + duration,
      y = level + 0.3,
      yend = level + 0.3
    )
  ) +
  geom_segment(
    mapping = aes(
      x = time_normalized,
      xend = time_normalized,
      y = level,
      yend = level + 0.3
    )
  ) +
  geom_segment(
    mapping = aes(
      x = time_normalized + duration,
      xend = time_normalized + duration,
      y = level,
      yend = level + 0.3
    )
  ) +
  theme_minimal()
}

g <- f(splitLog)
g
#> Don't know how to automatically pick scale for object of type <difftime>.
#> Defaulting to continuous.
#> Don't know how to automatically pick scale for object of type <difftime>.
#> Defaulting to continuous.


plotly::ggplotly(g)
#> Don't know how to automatically pick scale for object of type <difftime>.
#> Defaulting to continuous.
#> Don't know how to automatically pick scale for object of type <difftime>.
#> Defaulting to continuous.
#> Error in `with_random_port()`:
#> ! Cannot find an available port. Please try again.
#> Caused by error in `startup()`:
#> ! Failed to start chrome. Error:

^{Created on 2024-10-09 with reprex v2.1.1}

mvankessel-EMC commented 1 week ago

Another log parse, but this time I took the times directly from the messages. With some regex, I was able to get the time units, and converted them all to seconds.

library(ggplot2)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union


path <- "D:/CohortDiagnosticsBench/results/log.txt"

readLog <- function(path) {
  df <- read.csv(path, sep = "\t", header = FALSE)
  names(df) <- c("time", "thread", "level", "package", "task", "message")

  df %>%
    dplyr::tibble() %>%
    dplyr::mutate(
      time = as.POSIXct(x = .data$time, format = "%Y-%m-%d %H:%M:%OS"),
      time_normalized = difftime(.data$time, max(.data$time), units = "secs"),
      time_normalized = .data$time_normalized - min(.data$time_normalized)
    )
}

log <- readLog(path)

df <- bind_rows(
  log[grep(pattern = "took \\d{1,}(\\.\\d{1,})? secs$", log$message), ] |>
    mutate(
      duration = as.numeric(stringr::str_extract(string = .data$message, pattern = "\\d{1,}(\\.\\d{1,})?"))
    ),

  log[grep(pattern = "took \\d{1,}(\\.\\d{1,})? mins$", log$message), ] |>
    mutate(
      duration = as.numeric(stringr::str_extract(string = .data$message, pattern = "\\d{1,}(\\.\\d{1,})?")) * 60
    ),

  log[grep(pattern = "took \\d{1,}(\\.\\d{1,})? hours$", log$message), ] |>
    mutate(
      duration = as.numeric(stringr::str_extract(string = .data$message, pattern = "\\d{1,}(\\.\\d{1,})?")) * 3600
    ),

  log[grep(pattern = "took \\d{1,}(\\.\\d{1,})? days$", log$message), ] |>
    mutate(
      duration = as.numeric(stringr::str_extract(string = .data$message, pattern = "\\d{1,}(\\.\\d{1,})?")) * 3600 * 24
    )
)

ggplot(data = df, mapping = aes(x = .data$task, y = duration, fill = package)) +
  geom_bar(stat = "identity") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

^{Created on 2024-10-11 with reprex v2.1.1}

darwin-eu-dev / CohortDiagnostics

Benchmark execution speed and test coverage #71