brad-cannell / codebookr

Create Codebooks From Data Frames
https://brad-cannell.github.io/codebookr/

Codebook is slow #17

Closed. mbcann01 closed this issue 2 years ago.

mbcann01 commented 2 years ago

While running the codebook function on the L2C data, I realized how slow it is. In some ways, this may not be a huge issue because we probably won't need to recreate codebooks often. Having said that, it would be nice to find ways to speed up the code.

https://www.r-bloggers.com/2021/04/code-performance-in-r-which-part-of-the-code-is-slow/

http://adv-r.had.co.nz/Performance.html

Using HTML instead of Word (#5) might be a good way to speed it up.

Solution

The solution for this problem came from: https://ardata-fr.github.io/officeverse/officer-for-word.html#external-documents

Inserting a document allows you to integrate a previously created Word document into another document. This can be useful when certain parts of a document need to be written manually but automatically integrated into a final document. The document to be inserted must be in docx format, and it can be added with the body_add_docx() function. This approach is advantageous when you are generating huge documents and the generation is getting slower and slower: generate smaller documents, then design a main script that inserts the different documents into a main Word document.
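A minimal sketch of that pattern with officer, assuming two smaller documents (part_1.docx and part_2.docx, hypothetical file names) have already been generated:

library(officer)

# part_1.docx and part_2.docx are placeholders for previously generated chunks
# of the codebook that already exist on disk
main_doc <- read_docx()
main_doc <- body_add_docx(main_doc, src = "part_1.docx")
main_doc <- body_add_docx(main_doc, src = "part_2.docx")

# Write the assembled document
print(main_doc, target = "codebook_combined.docx")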

mbcann01 commented 2 years ago

Working on issue #17. Codebook is slow.

library(dplyr)
library(codebookr)
library(microbenchmark)
library(profvis)
data(study)
data_stata <- haven::read_dta("inst/extdata/study.dta")

How long does it take to run on the regular data?

microbenchmark(
  codebook(study),
  times = 10L
) # 2-3 seconds each run.

How long does it take to run on the Stata data?

microbenchmark(
  codebook(data_stata),
  times = 10L
) # 2-3 seconds each run

So, that doesn't seem to make a huge difference.

What are the slow parts?

profvis(codebook(study))

The Flextable stuff is the slowest part. I'm not sure if I can speed that up or not.

profvis(codebook(data_stata))

Flextable stuff for this one too.

Can I do the flextable stuff all at once, outside of a loop? Would that make any difference? (See the sketch at the end of this comment.)

Do more rows slow it down?

df_short <- tibble(x = rnorm(100)) # 100 rows
df_medium <- tibble(x = rnorm(10000)) # 10,000 rows
df_long <- tibble(x = rnorm(10000000)) # 10,000,000 rows
microbenchmark(
  codebook(df_short),  # Mean = 347 milliseconds
  codebook(df_medium), # Mean = 1589 milliseconds
  codebook(df_long),   # Mean = 4212 milliseconds
  times = 10L
)

So, adding more observations slows it down: going from 100 to 10,000 rows took about 4.6 times as long, and going from 100 to 10,000,000 rows took about 12 times as long.

Do more columns slow it down?

# Keep only the first 100 rows of df_medium
df_medium <- df_medium[1:100, ]
# Make ~100 column names from random three-letter combinations
set.seed(123)
cols <- unique(paste0(sample(letters, 100, TRUE), sample(letters, 100, TRUE), sample(letters, 100, TRUE)))
for (col in cols) {
  df_medium[[col]] <- rnorm(100)
}
microbenchmark(
  codebook(df_short),  # Mean = 300 milliseconds
  codebook(df_medium), # Mean = 52776 milliseconds (52 seconds)
  times = 1L
)

So, adding more columns slows it down A LOT! Going from 1 to 100 columns took about 175 times as long.

What parts of the code take the longest to run?

profvis(codebook(df_short))

The flextable parts take the longest (i.e., body_add_flextable and regular_table).

profvis(codebook(df_medium))

The flextable parts take the longest (i.e., body_add_flextable, body_add_par, and regular_table).

profvis(codebook(df_long))

unique.default and cb_add_summary_stats take the longest.

There isn't a way for me to change the internals of the flextable functions, but I do wonder whether applying them in a different way would speed things up.
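A minimal sketch of one way to test that idea: build all of the flextables up front, then add them to the Word document in a second pass. The make_col_table() helper below is a hypothetical stand-in for the per-column table that codebook() builds; the real internals may differ.

library(flextable)
library(officer)

# Hypothetical stand-in for codebook()'s per-column summary table
make_col_table <- function(df, col) {
  flextable(data.frame(attribute = "n_unique", value = length(unique(df[[col]]))))
}

# Build every flextable first ...
fts <- lapply(names(study), function(col) make_col_table(study, col))

# ... then add them all to the Word document in a second pass
doc <- read_docx()
for (ft in fts) {
  doc <- body_add_flextable(doc, ft)
  doc <- body_add_par(doc, "")
}
# print(doc, target = "flextable_timing_test.docx")

If the second pass still dominates in profvis, the body_add_docx() approach described in the Solution comment above is probably the better route.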