brad-cannell / codebookr

Create Codebooks From Data Frames
https://brad-cannell.github.io/codebookr/

Create a complete README example of using codebook #3

Closed mbcann01 closed 2 years ago

mbcann01 commented 2 years ago

Create a simple example of using codebook from start to finish to go on README. For now, I think that's all we'll need. We can add a vignette later if we really need to.

Note: Because this was my first attempt at using codebook since it was in bfuncs (and the code was somewhat outdated), a lot of changes came up in the process of creating a simple example for README. Rather than create a separate issue for each of them, I document them below.

mbcann01 commented 2 years ago

I have all of the codebook R scripts moved over to the project files. I haven't used them in a long time, though. I can simultaneously relearn how to use the functions and create a complete example directly on the README.

mbcann01 commented 2 years ago

Working on README and got kind of off track. Here's what's going on. I decided to change all the codebook prefixes to cb (5e3092f90fcbe6e74d9f09e9a3636f680ca4ddee and 9071bdbc1199555a2468eaba868c4766a372c94b).

That's where I'm at now. I need to work my way back up through this list and finish README.

mbcann01 commented 2 years ago

Categorical stats and factors

While running check() I came across a problem with cb_summary_stats_few_cats. After changing sex from a character vector to a factor vector in the example study data, this chunk of code no longer worked:

# Change the category label for missing values from NA to "Missing"
dplyr::mutate(cat = tidyr::replace_na(cat, "Missing"))

It doesn't work because if tidyr::replace_na were to change NA to "Missing" it would essentially be adding a new factor level, which is beyond its scope. So, we have to first change the variable to a character vector and then change NA to "Missing". This does NOT change the variable from a factor vector to a character vector in the main data frame -- only in the frequency table data frame that will be inserted into the codebook.

Here is the new code:

dplyr::mutate(
      cat = as.character(cat),
      cat = tidyr::replace_na(cat, "Missing")
    )

We also made a similar change to cb_summary_stats_many_cats.
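
As a rough illustration of the change above, here is a minimal sketch on made-up data. The objects `study` and `freq` and the way the frequency table is built are assumptions for the example; only the two-step mutate() comes from the fix itself.

```r
library(dplyr)
library(tidyr)

# Hypothetical example data: sex is a factor with one missing value
study <- tibble(sex = factor(c("Female", "Male", NA, "Female")))

# Build a frequency table and relabel NA in the table only
freq <- study %>%
  count(sex, name = "n") %>%
  rename(cat = sex) %>%
  mutate(
    cat = as.character(cat),                 # factor -> character first
    cat = tidyr::replace_na(cat, "Missing")  # "Missing" is now just a string
  )

freq$cat
is.factor(study$sex)  # the column in the main data frame is still a factor
```

The conversion happens on the summary table, so the factor in `study` keeps its levels and its NA.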

After making this change and running check() again, there was an error raised by the following code from cb_summary_stats_many_cats:

lowest <- df %>%
    dplyr::group_by({{ .x }}) %>%
    dplyr::summarise(n = n()) %>%

Essentially, the error wanted me to add dplyr:: to n(). Instead, I decided to replace this code with:

lowest <- df %>%
    dplyr::count({{ .x }}) %>%

(The same applies to the code that calculates highest.)
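
To show that the two forms should produce the same counts, here is a small sketch on made-up data. The `study` tibble and its `sex` column are assumptions for the example, and the column is named directly rather than passed through {{ .x }}.

```r
library(dplyr)

# Hypothetical data standing in for the example study data
study <- tibble(sex = c("Female", "Male", "Female", "Female"))

# Original form, with n() namespaced as the check() error asked for
by_summarise <- study %>%
  group_by(sex) %>%
  summarise(n = dplyr::n())

# Replacement form
by_count <- study %>%
  count(sex)

all.equal(as.data.frame(by_summarise), as.data.frame(by_count))
```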

mbcann01 commented 2 years ago

Update to rlang 0.4.0

In the process of trying to make this change, it turns out that rlang::sym(.x) is no longer valid code. It returns the following error: Error in `rlang::sym()` at codebookr/R/cb_summary_stats_many_cats.R:19:2: Can't convert a function to a symbol. As of rlang 0.4.0, this line of code is unnecessary. I'm removing it from the code and replacing every !!x with {{ .x }} -- the new preferred tidy evaluation syntax.
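
For reference, a minimal sketch of the curly-curly pattern. The wrapper `count_col()` is hypothetical and not a codebookr function; it only shows how {{ .x }} forwards a bare column name without a separate sym() or enquo() step.

```r
library(dplyr)

# Hypothetical wrapper, for illustration only
count_col <- function(df, .x) {
  # {{ .x }} forwards the caller's bare column name into dplyr::count()
  dplyr::count(df, {{ .x }})
}

study <- tibble(sex = c("Female", "Male", "Female"))
res <- count_col(study, sex)
res
```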

mbcann01 commented 2 years ago

Updates to the codebook function

I'm finally to the point where I'm running the codebook() function and I've run into a couple of issues.

But there are some bigger, more fundamental changes I want to make too.

Remove the path argument

Currently, the codebook() function essentially expects you to pass it the data frame twice in two different ways:

  1. It expects you to pass an in-memory (i.e., in the global environment) version of the data frame to the df argument. This version of the data frame is the one that gets most of the action inside the codebook function.
  2. It expects you to pass a path to an on-disk (i.e., saved as .csv, .rds, etc.) version of the data frame to the path argument. It looks like this is only used to gather the last modified date value for the metadata table in the codebook.

So, the only purpose of the path argument is to gather the last modified date value for the metadata table in the codebook; yet, it comes with several downsides.

  1. Having a df and path argument in the codebook() function is confusing for the user. What if there isn't an on-disk version of the data frame? What if there are multiple on-disk versions of the data frame saved in different formats? Which one do you use?
  2. I had to copy a bunch of code from utils:::format.object_size. There's a note that says I copied the code because CRAN won't allow me to use ::: inside of my function. All of this code can be removed if I get rid of the path argument.

Instead, I can add a Last updated value to the metadata table. This would get around the downsides of using the path argument, and the date the codebook was last updated is probably no less useful than the date the file was last modified in most cases.
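
A rough sketch of what a metadata table with a Last updated value could look like when nothing is read from disk. The helper `cb_metadata_sketch()`, its arguments, and every row label other than Dataset name: are assumptions; the real metadata table in codebook() may be built differently.

```r
# Hypothetical helper, not part of codebookr
cb_metadata_sketch <- function(df, df_name) {
  data.frame(
    item = c("Dataset name:", "Column count:", "Row count:", "Last updated:"),
    value = c(
      df_name,
      as.character(ncol(df)),
      as.character(nrow(df)),
      format(Sys.Date(), "%Y-%m-%d")  # date the codebook was generated
    ),
    stringsAsFactors = FALSE
  )
}

study <- data.frame(id = 1:3, sex = c("Female", "Male", "Female"))
meta <- cb_metadata_sketch(study, "study")
meta
```

Everything here comes from the in-memory data frame and the system clock, so no path argument is needed.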

The group_by() error

After removing the path argument, I came back to the group_by error. I don't have it completely figured out yet, but it seems like removing all of the enquo() syntax and replacing it with the curly-curly syntax is changing the way that .x is being passed down through the cascade of functions that create the summary tables.

For example, id, the first column in study, gets passed to the cb_add_summary_stats() function inside of a loop inside of the codebook function via the .x argument (i.e., cb_add_summary_stats(col_nms[[i]])).

It then gets passed to the cb_summary_stats_many_cats() function inside of the cb_add_summary_stats() function via the .x argument (i.e., cb_summary_stats_many_cats(df, .x, n_extreme_cats)).

It appears as though this is where the problem is. It's getting passed to cb_summary_stats_many_cats() as a literal .x instead of a quosure. So, I think I need to add the enquo() syntax back to cb_add_summary_stats().

The fix for the group_by error was to use .data[[.x]] syntax instead of {{ .x }} syntax. I documented the solution here
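
A minimal sketch of the .data[[.x]] pattern, assuming .x reaches the inner function as a character string (as it would when looping over column names). The function `count_by_name()` is a hypothetical stand-in, not the actual body of cb_summary_stats_many_cats().

```r
library(dplyr)

# Hypothetical stand-in for a summary function that receives a column name
count_by_name <- function(df, .x) {
  # .x is a string here, so look the column up with the .data pronoun
  df %>%
    group_by(.data[[.x]]) %>%
    summarise(n = dplyr::n())
}

study <- tibble(id = 1:3, sex = c("Female", "Male", "Female"))
col_nms <- names(study)

# Column names arrive as strings, as in the loop over col_nms[[i]]
results <- lapply(col_nms, function(nm) count_by_name(study, nm))
names(results) <- col_nms
results$sex
```

Because the string is resolved inside the data mask, it can be handed down through several layers of functions without any quoting or unquoting.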

dplyr Error in stop_vctrs(): ! x must be a vector, not a <> object when a class is added

The cb_add_summary_stats() function adds a new class to each of the summary stats data frames. The cb_summary_stats_to_ft() function uses that class to determine which method to use to make a flextable from the summary stats data frame.

When the data frame with the added class was passed to the line dplyr::mutate(across(everything(), as.character)) in cb_summary_stats_to_ft.summary_many_cats, it was returning the following error: Error in stop_vctrs(): ! x must be a vector, not a <tbl_df/tbl/data.frame/make_char> object. After Googling a little bit, I found the following in the breaking changes section of the changelog for dplyr 1.0.0:

Extending data frames requires that the extra class or classes are added first, not last. Having the extra class at the end causes some vctrs operations to fail with a message like: Input must be a vector, not a <data.frame/...> object

Adding the new class to the front of the class list fixes the problem.
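
A short sketch of prepending the class and then dispatching on it. The generic `to_ft_sketch()` is hypothetical and only loosely modelled on cb_summary_stats_to_ft(); it returns the converted data frame rather than a flextable.

```r
library(dplyr)

# Hypothetical generic and method, for illustration only
to_ft_sketch <- function(x, ...) UseMethod("to_ft_sketch")

to_ft_sketch.summary_many_cats <- function(x, ...) {
  # With the extra class first, dplyr treats x as a data frame subclass
  dplyr::mutate(x, dplyr::across(dplyr::everything(), as.character))
}

stats <- tibble(cat = c("a", "b"), n = c(3L, 1L))

# Prepend the new class instead of appending it
class(stats) <- c("summary_many_cats", class(stats))

out <- to_ft_sketch(stats)
out
```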

Now, I think I need to add a logical vector and a pure time vector for testing.

Checking to make sure the user isn't piping the data frame into codebook

One of the codebook checks is to make sure the user doesn't pipe the data frame into the codebook function. When they do, Dataset name: in the metadata table (below) is ".". I wanted to fix that, but I don't think it's going to be possible (https://stackoverflow.com/questions/42560389/get-name-of-dataframe-passed-through-pipe-in-r). So, I updated the message to be a little more clear instead.
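
The behavior can be reproduced with a small sketch. The function `get_df_name()` and its message are hypothetical; the wording of the real message in codebook() is not reproduced here.

```r
library(magrittr)

# Hypothetical check, for illustration only
get_df_name <- function(df) {
  df_name <- deparse(substitute(df))
  # With %>%, the argument is the placeholder `.`, so the name is lost
  if (identical(df_name, ".")) {
    stop(
      "Pass the data frame to the df argument by name, ",
      "e.g. get_df_name(study), instead of piping it in with %>%."
    )
  }
  df_name
}

study <- data.frame(id = 1:3)

direct <- get_df_name(study)  # "study"
piped <- tryCatch(
  study %>% get_df_name(),
  error = function(e) conditionMessage(e)
)
```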

mbcann01 commented 2 years ago

Clean up the Word document formatting

mbcann01 commented 2 years ago

Add an example of using data imported from Stata

I think completing this will also help with #9

Haven labeled columns

When you import data from SAS, Stata, or SPSS using Haven, it adds two classes to variables with value labels: haven_labelled and vctrs_vctr. Passing these columns to codebook() results in the following error:

Error in cb_add_summary_stats(., col_nms[[i]]) : 
Column sex is of unknown type. Please set the col_type attribute

One way to get around this is simply to set the col_type attribute like this:

study <- study %>% 
  cb_add_col_attributes(sex, col_type = "Categorical")

However, because Haven labeled data is so common, we decided to specifically look for and remove those classes in cb_summary_stats.R. It should not remove those classes from the column generally -- just for the process of determining the column type and calculating descriptive statistics.
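
One way to picture this is a helper that drops the two classes only while the column type is being decided. The function `classes_for_typing()` is hypothetical and not the actual code in cb_summary_stats.R, and the column below is built by hand to mimic a Haven import rather than read from a real file.

```r
# Hypothetical helper, for illustration only
classes_for_typing <- function(x) {
  setdiff(class(x), c("haven_labelled", "vctrs_vctr"))
}

# Hand-built column that mimics the classes and attributes of a Haven import
sex <- structure(
  c(1, 2, 1),
  label = "Sex",
  labels = c(Female = 1, Male = 2),
  class = c("haven_labelled", "vctrs_vctr", "double")
)

classes_for_typing(sex)  # "double"
class(sex)               # unchanged on the column itself
```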

Using Haven labels

When we import data from Stata, SAS, or SPSS with labels, the attributes are called $label for variable labels and $labels for value labels. Currently, codebook() cannot automatically make use of those attributes because it only recognizes the attributes description, source, and col_type. It's relatively easy to manually set the value of the description attribute to the value of the label attribute like this:

attr(study$sex, "description") <- attr(study$sex, "label") 

This can be extended to every column with a for loop. However, because Haven labeled data is so common, we decided to specifically look for $label and $labels in cb_get_col_attributes.
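
The for loop version could look something like the sketch below. The `study` data frame and its label are made up for the example; in practice the label attribute would already be set by the import.

```r
study <- data.frame(id = 1:3, sex = c(1, 2, 1))

# Stand-in for the label attribute that would be set on import
attr(study$sex, "label") <- "Sex of participant"

# Copy each column's label attribute, when present, to description
for (col in names(study)) {
  lbl <- attr(study[[col]], "label", exact = TRUE)
  if (!is.null(lbl)) {
    attr(study[[col]], "description") <- lbl
  }
}

attr(study$sex, "description")
```

Columns without a label attribute, such as id here, are left untouched.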