co-analysis / a11ytables

R package: generate best-practice stats spreadsheets for publication
https://co-analysis.github.io/a11ytables/
Other
42 stars 3 forks source link

Getting 'number stored as text' error in mixed columns #93

Open cat-hand opened 1 year ago

cat-hand commented 1 year ago

I'm exporting rounded and suppressed numbers and am getting the 'number stored as text' error in all of my numeric columns, even for columns where there are no rounded or suppressed numbers. I tried using numFmt = "#,##0" in the openxlsx createStyle function and it hasn't helped.

It looks like it might be an openxlsx issue and seems like it's solved in openxlsx2.

Can anyone help?

matt-dray commented 1 year ago

Hi Cat, thanks for highlighting this. Are you able to provide a reproducible example of the issue?

I've created a small example below that generates a workbook with a table (in sheet 'Table_1') that has two columns: one column (mpg) that's numeric but has a suppressed value ('[c]') and one column (cyl) that is fully numeric. In the attached screenshot, you can see (as expected) the 'number stored as text' warning is in column mpg, but there's no warning in cyl. Do you get the same result?

(Sidenote: as it happens, I may be converting {a11ytables} to use {openxlsx2} instead of {openxlsx} in future (see #70). This issue may be another reason to make the change.)

# Prepare supporting tables
support_tables <- list(
  cover_df = tibble::tribble(
    ~subsection_title, ~subsection_content,
    "Purpose", "Example results for something.",
    "Workbook properties", "Some placeholder information.",
    "Contact", "Placeholder email"
  ),
  contents_df = tibble::tribble(
    ~"Sheet name", ~"Sheet title",
    "Notes", "Notes",
    "Table_1", "Example sheet title"
  ),
  notes_df = tibble::tribble(
    ~"Note number", ~"Note text",
    "[note 1]", "Placeholder note.",
    "[note 2]", "Placeholder note."
  )
)

# Prepare example dataset
table_df <- mtcars
table_df[["car [note 1]"]] <- row.names(mtcars)  # row names to named column
row.names(table_df) <- NULL  # remove row names
table_df <- table_df[1:5, c("car [note 1]", "mpg", "cyl")]  # add note to header
table_df[2, "mpg"] <- "[c]"  # make a value confidential in the mpg column
table_df["Notes"] <- c("[note 2]", rep(NA_character_, 4))  # include note column

# Create new a11ytable
my_a11ytable <-
  a11ytables::create_a11ytable(
    tab_titles = c(
      "Cover",
      "Contents",
      "Notes",
      "Table_1"
    ),
    sheet_types = c(
      "cover",
      "contents",
      "notes",
      "tables"
    ),
    sheet_titles = c(
      "Cover title (example)",
      "Contents",
      "Notes",
      "Example sheet title"
    ),
    blank_cells = c(
      NA_character_,
      NA_character_,
      NA_character_,
      "Blank cells mean that a row does not have a note."
    ),
    sources = c(
      NA_character_,
      NA_character_,
      NA_character_,
      "Example source."
    ),
    tables = list(
      support_tables$cover_df,
      support_tables$contents_df,
      support_tables$notes_df,
      table_df
    )
  )

# Generate workbook from a11ytable
my_wb <- a11ytables::generate_workbook(my_a11ytable)

# Preview output
openxlsx::openXL(my_wb)

image

cat-hand commented 1 year ago

Thanks for the quick reply Matt! I think the problem is that all my 'numeric' columns are actually text by the time I create my a11ytable - I have a function which rounds and suppresses all of the figures, so they are all stored as strings after using that function. I read that openxlx bases the excel column type on the column type in the input data, so I'm assuming that this is what is causing the error.

Was hoping there was some kind of workaround but I guess not haha. I have loads of tables in my publication so getting rid of the error manually isn't awesome. openxlsx2 sounds very promising though!

For info, this is essentially how that function works:

library(dplyr)
table_df <- mtcars %>%
  mutate(across(where(is.numeric), ~case_when(
    # in this example, figures less than 5 are suppressed
    .x < 5 ~ "[c]",
    # figures less than 6 are 'unreliable' so are shown but marked with 'u'
    .x < 6 ~ paste(janitor::round_half_up(.x, 1), "[u]"),
    TRUE ~ as.character(janitor::round_half_up(.x, 0)))))
matt-dray commented 1 year ago

Okey dokey, I think I understand: columns get converted to character class even if they remain unsuppressed. I think there's a few ways to convert numbers-only columns back to numeric en masse. Below is some demo code that uses type.convert() to do this (see how numeric column y gets converted to character in table_supp, but then ends up as numeric at the end). Let me know if I haven't quite understood.

suppressPackageStartupMessages(library(tidyverse))
library(janitor, warn.conflicts = FALSE)

set.seed(1337)

tbl <- tibble(
  id = paste0("id_", 1:10),
  x  = runif(10, 0, 10),  # will contain values to be suppressed
  y  = runif(10, 6, 10)   # won't need any suppression
)

tbl
#> # A tibble: 10 × 3
#>    id        x     y
#>    <chr> <dbl> <dbl>
#>  1 id_1  5.76   9.92
#>  2 id_2  5.65   9.97
#>  3 id_3  0.740  9.31
#>  4 id_4  4.54   6.78
#>  5 id_5  3.73   9.93
#>  6 id_6  3.31   6.10
#>  7 id_7  9.48   9.89
#>  8 id_8  2.81   9.70
#>  9 id_9  2.45   7.36
#> 10 id_10 1.46   6.99

tbl_supp <- tbl |> 
  mutate(
    across(
      where(is.numeric),
      \(value) case_when(
        value < 5 ~ "[c]",                                  # suppress
        value < 6 ~ paste(round_half_up(value, 1), "[u]"),  # 'unreliable' flag
        TRUE      ~ as.character(round_half_up(value, 0))   # otherwise round
      )
    )
  )

tbl_supp  # note 'y' is character but contains only numbers
#> # A tibble: 10 × 3
#>    id    x       y    
#>    <chr> <chr>   <chr>
#>  1 id_1  5.8 [u] 10   
#>  2 id_2  5.6 [u] 10   
#>  3 id_3  [c]     9    
#>  4 id_4  [c]     7    
#>  5 id_5  [c]     10   
#>  6 id_6  [c]     6    
#>  7 id_7  9       10   
#>  8 id_8  [c]     10   
#>  9 id_9  [c]     7    
#> 10 id_10 [c]     7

type.convert(tbl_supp, as.is = TRUE)  # note 'y' is now numeric
#> # A tibble: 10 × 3
#>    id    x           y
#>    <chr> <chr>   <int>
#>  1 id_1  5.8 [u]    10
#>  2 id_2  5.6 [u]    10
#>  3 id_3  [c]         9
#>  4 id_4  [c]         7
#>  5 id_5  [c]        10
#>  6 id_6  [c]         6
#>  7 id_7  9          10
#>  8 id_8  [c]        10
#>  9 id_9  [c]         7
#> 10 id_10 [c]         7
cat-hand commented 1 year ago

Thank Matt, I didn't know about that function 😃 will give that a go

jake-pitchers commented 3 months ago

Hi there,

Have been using this package for a while now. I was wondering whether there was any plans to have it so columns with both strings and numbers could be stored as numbers where appropriate. Alternatively is there is a neat way to do it with openxslx that would be great to see :)

Thanks,

Jake

cat-hand commented 3 months ago

@jake-pitchers I have a temporary fix for this, stole it from the rapid.spreadsheets package:

#' Fixes column where numbers are stored as text
#'
#' Overwrites selected columns and rows that contain both numeric and character
#' elements. Once this is run, you should add a new style to the modified
#' columns and rows so that they are displayed correctly in the output file. Use
#' [openxlsx::addStyle()] and [get_cell_style()] to do this.
#'
#' @importFrom openxlsx writeData
#' @importFrom dplyr pull if_else
#' @importFrom stringr str_remove_all str_detect
#'
#' @param excel_wb Openxlsx workbook name
#' @param sheet Worksheet (either name as string, or location as numeric)
#' @param cols Vector of column numbers to be overwritten
#' @param rows Vector of row numbers to be overwritten
#' @param df Data frame containing the data from the relevant worksheet
#'
#' @return Updated workbook with modified columns
#'
#' @export

overwrite_num_cols <- function(excel_wb, sheet, cols, rows, df) {

  lapply(seq_along(cols), \(col) {

    full_col <- pull(df[cols], col)
    # Only convert numbers to numeric if they aren't marked with [u]
    full_col_num <- str_remove_all(full_col, "(,|%)(?!.*[u])")

    lapply(seq_along(rows), \(row) {

      # If the cell contains a character (e.g. [c]), return the character value
      if (is.na(suppressWarnings(as.numeric(full_col_num[[row]])))) {

        new_value <- full_col_num[[row]]

      } else {

        # If the cell contains a number, return the numeric value (if it's a
        # percentage, divide the value by 100)
        new_value <- if_else(any(str_detect(full_col, "%")),
                             as.numeric(full_col_num[[row]]) / 100,
                             as.numeric(full_col_num[[row]]))

      }

      openxlsx::writeData(excel_wb, sheet, new_value,
                          startCol = cols[col],
                          startRow = (row - 1) + rows[1])

    })
  })
}