brad-cannell / codebookr

Create Codebooks From Data Frames

Codebook is slow #17

Closed mbcann01 closed 2 years ago

mbcann01 commented 2 years ago

While running the codebook function on the L2C data, I realized how slow it is. In some ways, this may not be a huge issue because we probably won't need to recreate codebooks often. Having said that, it would be nice to find ways to speed up the code.

Using HTML instead of Word (#5) might be a good way to speed it up.


The solution for this problem came from:

Inserting a document of course allows you to integrate a previously-created Word document into another document. This can be useful when certain parts of a document need to be written manually but automatically integrated into a final document. The document to be inserted must be in docx format. This can be done by using function body_add_docx(). This can be advantageous when you are generating huge documents and the generation is getting slower and slower. It is necessary to generate smaller documents and to design a main script that inserts the different documents into a main Word document.
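A minimal sketch of the approach described in that passage, assuming the officer package is installed. The helper `write_column_doc()` and the file names are hypothetical and are not part of codebookr; the point is only to show small documents being written separately and then pulled into a main document with body_add_docx().

```r
library(officer)

# Hypothetical helper: write one small .docx for a single column and return its path
write_column_doc <- function(col_name, values, dir = tempdir()) {
  path <- file.path(dir, paste0("codebook_", col_name, ".docx"))
  doc <- read_docx()
  doc <- body_add_par(doc, col_name, style = "heading 1")
  doc <- body_add_par(doc, paste("Distinct values:", length(unique(values))))
  print(doc, target = path)
  path
}

# Toy data standing in for the real data frame
df <- data.frame(x = rnorm(100), y = rnorm(100))

# Step 1: generate one small document per column
paths <- vapply(names(df), function(nm) write_column_doc(nm, df[[nm]]), character(1))

# Step 2: insert each small document into a main document
main <- read_docx()
for (p in paths) {
  main <- body_add_docx(main, src = p)
}

out <- file.path(tempdir(), "codebook_main.docx")
print(main, target = out)
```

As far as I understand it, the inserted files are merged by Word itself when the main document is opened, so the result may not display as expected in other software.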

mbcann01 commented 2 years ago

Working on issue #17. Codebook is slow.

data_stata <- haven::read_dta("inst/extdata/study.dta")

How long does it take to run regular data?

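microbenchmark::microbenchmark(
  codebook(study), # assumed: `study` is the package's example data frame
  times = 10L
) # 2-3 seconds each run.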

How long does it take to run on Stata data?

microbenchmark::microbenchmark(
  codebook(data_stata),
  times = 10L
) # 2-3 seconds each run

So, that doesn't seem to make a huge difference.

What are the slow parts?


The Flextable stuff is the slowest part. I'm not sure if I can speed that up or not.


Flextable stuff for this one too.

Can I do the Flextable stuff at once outside of a loop? Will that make any difference?

Do more rows slow it down?

df_short <- tibble(x = rnorm(100)) # 100 rows
df_medium <- tibble(x = rnorm(10000)) # 10,000 rows
df_long <- tibble(x = rnorm(10000000)) # 10,000,000 rows
microbenchmark::microbenchmark(
  codebook(df_short),  # Mean = 347 milliseconds
  codebook(df_medium), # Mean = 1589 milliseconds
  codebook(df_long),   # Mean = 4212 milliseconds
  times = 10L
)

So, adding more observations slows it down. Going from 100 to 10,000 rows takes about 4.6 times as long (1589 / 347). Going from 100 to 10,000,000 rows takes about 12 times as long (4212 / 347).
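For reference, here is the arithmetic behind those ratios as a small sketch in base R. The numbers are the mean timings reported above; the object names are just for illustration.

```r
# Row counts and mean run times (milliseconds) from the benchmark above
rows <- c(short = 100, medium = 10000, long = 10000000)
mean_ms <- c(short = 347, medium = 1589, long = 4212)

# Compare how much the data grew with how much the run time grew
growth <- data.frame(
  rows_ratio = rows / rows[["short"]],
  time_ratio = round(mean_ms / mean_ms[["short"]], 1)
)
print(growth)
```

The run time grows far more slowly than the number of rows: 100,000 times the rows costs only about 12 times the run time.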

Do more columns slow it down?

# Keep only the first 100 rows of df_medium
df_medium <- df_medium[1:100, ]
# Make up to 100 unique three-letter column names
cols <- unique(paste0(sample(letters, 100, TRUE), sample(letters, 100, TRUE), sample(letters, 100, TRUE)))
# Add one column of random normal values per name
for (col in cols) {
  df_medium[[col]] <- rnorm(100)
}
microbenchmark::microbenchmark(
  codebook(df_short),  # Mean = 300 milliseconds
  codebook(df_medium), # Mean = 52776 milliseconds (52 seconds)
  times = 1L
)

So, adding more columns slows it down A LOT! Going from 1 column to about 100 takes roughly 176 times as long (52776 / 300)!
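A quick sketch of what that means per column, using only the timings above. The column count of 101 is an assumption: the original column plus up to 100 new ones (unique() may drop a few duplicate names).

```r
# Mean timings reported above, in milliseconds
ms_one_col <- 300
ms_many_cols <- 52776

# Assumption: df_medium ends up with its original column plus up to 100 new ones
n_cols <- 101

total_ratio <- ms_many_cols / ms_one_col # how much longer the wide run took
ms_per_col <- ms_many_cols / n_cols      # average cost per column in the wide run

print(round(c(total_ratio = total_ratio, ms_per_col = ms_per_col)))
```

Under that assumption, the wide run averages about 523 milliseconds per column, compared with 300 milliseconds for the whole one-column codebook. That would fit the note above about generation getting slower as the document grows.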

What parts of the code take the longest to run?


The flextable parts take the longest (i.e., body_add_flextable and regular_table).


The flextable parts take the longest (i.e., body_add_flextable, body_add_par, and regular_table).


unique.default and cb_add_summary_stats take the longest.
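For anyone who wants to repeat this kind of breakdown, here is a rough sketch using base R's sampling profiler. It is not necessarily the tool used for the results above, and `slow_summary()` is a hypothetical stand-in for codebook().

```r
# Hypothetical stand-in for a slow function such as codebook()
slow_summary <- function(n_cols = 100, n_rows = 50000) {
  df <- as.data.frame(matrix(rnorm(n_cols * n_rows), ncol = n_cols))
  lapply(df, function(col) length(unique(col)))
}

# Sample the call stack every 10 milliseconds while the function runs
prof_file <- tempfile(fileext = ".out")
Rprof(prof_file, interval = 0.01)
result <- slow_summary()
Rprof(NULL) # stop profiling

# Functions ranked by total time spent in them, including their callees
prof <- summaryRprof(prof_file)
print(head(prof$by.total))
```

Swapping `slow_summary()` for a real codebook() call should show which functions dominate the run time.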

There isn't a way for me to change the internals of the flextable functions, but I do wonder whether applying them in a different way would speed things up.
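One way to test that idea is sketched below, assuming officer and flextable are installed. The `summaries` list is made-up data standing in for the per-column tables, and stacking everything into one table changes the layout, so this is only an experiment to see whether the number of flextable calls matters, not a proposed fix.

```r
library(officer)
library(flextable)

# Hypothetical per-column summaries standing in for the codebook tables
summaries <- lapply(1:20, function(i) {
  data.frame(
    attribute = c("Column", "Type", "Missing"),
    value = c(paste0("x", i), "numeric", "0")
  )
})

# Approach 1: build and add one flextable per column inside a loop
time_loop <- system.time({
  doc <- read_docx()
  for (s in summaries) {
    doc <- body_add_flextable(doc, flextable(s))
  }
  print(doc, target = tempfile(fileext = ".docx"))
})

# Approach 2: stack the summaries and add a single flextable
time_once <- system.time({
  doc <- read_docx()
  doc <- body_add_flextable(doc, flextable(do.call(rbind, summaries)))
  print(doc, target = tempfile(fileext = ".docx"))
})

# Elapsed seconds for each approach
elapsed <- c(loop = time_loop[["elapsed"]], once = time_once[["elapsed"]])
print(elapsed)
```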