ddsjoberg / gtsummary

Presentation-Ready Data Summary and Analytic Result Tables
http://www.danieldsjoberg.com/gtsummary
Other
1.05k stars 124 forks source link

Bug Report: In tbl_summary and others, unescaped HTML in the data frame gets incorporated into the table #1307

Closed erikvona closed 2 years ago

erikvona commented 2 years ago

I want to use tbl_summary in a Shiny app, where a user is allowed to define groups.

However, any HTML that is in the grouping data gets output as unescaped HTML, allowing for weirdness and XSSing. Also, breaking the markdown is easy, even accidentally.

For example:

trial %>% 
  select(age, grade, response, trt) %>%
  mutate(trt = paste("<p onmouseover=\"alert(1)\">", trt, "</p>")) %>%
  tbl_summary(by = trt)

Outputs a normal looking table, but if you move your mouse over the group, some Javascript is ran: image

I could make a pull request that runs the content through htmltools::htmlEscape if you want. That still allows the user to accidentally break the markdown, but avoids any probability of XSS. I could also fix markdown breaking by bolding with gt::html and <b> instead of gt::md and **, programmatic markdown escaping is a bit of a disaster so outside of my abilities (markdown now breaks on trailing spaces, leading spaces, asterisks and more).

Alternatively, you could clearly document that table contents can be interpreted as HTML in some scenarios so should be escaped by the user/developer.

ddsjoberg commented 2 years ago

To use gtsummary tables in Shiny, use as_gt(), then the Shiny render functions exported from gt. Is that what you're doing already? Do you have a small example you can make public?

ddsjoberg commented 2 years ago

Does this address your issue? https://stackoverflow.com/questions/64197029

erikvona commented 2 years ago

The actual application unfortunately has several thousands of lines of codes, depends on the user to provide a database and requires the user to provide a group definition based on the data, but here's a minimal example of the problem:

library(shiny)
library(gt)
library(gtsummary)

ui <- fluidPage(
  textInput("level1_name", "Name of level 1", "Level 1"),
  textInput("level2_name", "Name of level 2", "Level 2"),
  gt_output("summary")
)

server <- function(input, output){
  my_numbers <- runif(100)
  my_levels <- sample(1:2, 100, T)
  sample_data <- reactive({
    data.frame(numbers = my_numbers, levels = factor(my_levels, levels = 1:2, labels = c(input$level1_name, input$level2_name)))
  })
  output$summary <- render_gt({
    tbl_summary(sample_data(), by = levels) %>%
      as_gt()
  })
}
shinyApp(ui = ui, server = server)

Now, enter <p onmouseover="alert(1)">Level 1</p> in the Name of level 1 inputbox. You'll see that when moving your mouse over level 1, you get a Javascript alert.

I'm actually doing something that's a bit different (having a standardized function that can render both to Shiny to view and to an R markdown to save so analyses can be saved exactly as they are seen, by extracting and rendering dependencies, then inserting the table through insertUI), but that's not relevant for this problem.

ddsjoberg commented 2 years ago

Can you please test whether this is an issue with printing gtsummary tables or gt tables? Print a data frame with gt. I don't think what you're seeing is related to gtsummary 🤔

erikvona commented 2 years ago

I think it's related to gtsummary. gtsummary is marking the column names with gt::md, which causes gt to first parse the markdown to HTML, then output the HTML without escaping (this is documented) to bold them.

As said (perhaps a bit unclear) in the initial reports, I think you should pass all user input through htmltools::htmlEscape before passing it to gt::md, or be very explicit that it will be output without escaping.

An extended example;

library(shiny)
library(gt)
library(gtsummary)
library(dplyr)
library(tidyr)

ui <- fluidPage(
  textInput("level1_name", "Name of level 1", "Level 1"),
  textInput("level2_name", "Name of level 2", "Level 2"),
  gt_output("summary"),
  gt_output("ungrouped"),
  gt_output("manually"),
  gt_output("manually_bolded")
)

server <- function(input, output){
  my_numbers <- runif(100)
  my_levels <- sample(1:2, 100, T)
  sample_data <- reactive({
    data.frame(numbers = my_numbers, levels = factor(my_levels, levels = 1:2, labels = c(input$level1_name, input$level2_name)))
  })
  output$summary <- render_gt({
    tbl_summary(sample_data(), by = levels) %>%
      as_gt()
  })
  output$ungrouped <- render_gt({
    tbl_summary(sample_data()) %>%
      as_gt()# Note: levels are escaped here
  })
  manually_calculate <- reactive({
    sample_data() %>%
      group_by(levels) %>%
      summarize(across(where(is.numeric), \(x) {
          quantiles <- quantile(x, probs = c(0.5, 0.25, 0.75))
          glue::glue("{style_number(quantiles[1],2)} ({style_number(quantiles[2],2)} - {style_number(quantiles[3],2)})")
        })
        )%>%
      tidyr::pivot_wider(names_from = levels, values_from = -levels)
  })

  output$manually <- render_gt({
    manually_calculate() %>%
      gt()
  })
  output$manually_bolded <- render_gt({
    manually_calculate() %>%
      gt() %>%
      cols_label(.list = lapply(colnames(manually_calculate()), \(x) md(glue::glue("**{htmltools::htmlEscape(x)}**"))) %>% setNames(colnames(manually_calculate())))
  })
}
shinyApp(ui = ui, server = server)

As seen, the HTML is only output after being put in the column names (which are marked with md), and if we do it manually by first escaping, then bolding, the tags are properly encoded and the Javascript is visible to the user instead of running. If we'd remove the htmltools::htmlEscape in the manually_bolded example, we'd have the XSS risk but would pass user input to a function that's documented to output HTML without escaping.

ddsjoberg commented 2 years ago

Can you use modify_header(text_interpret = "html") ? This will change the default gt::md() to gt::html(). But you'll need to redefine the column headers where this is needed?

Alternatively, you can access the internals of gtsummary x$table_styling$header and change the text interpreter to htmltools::htmlEscape()

erikvona commented 2 years ago

Here are a couple of experiments:

library(gtsummary)
erroring_table <- trial %>% 
  select(age, grade, response, trt) %>%
  mutate(trt = paste("<p onmouseover=\"alert(1)\">", trt, "</p>")) %>%
  tbl_summary(by = trt)

erroring_table %>%
  modify_header(text_interpret = "html")
# Unchanged, still contains unescaped JS

erroring_table$table_styling$header$interpret_label <- "gt::html"
erroring_table
# Now we're no longer interpreting the markdown, so get asterixes in the header, but the JS still executes 

erroring_table$table_styling$header$interpret_label <- "htmltools::htmlEscape"
erroring_table
# Now we're escaping twice, so < becomes &gt; in the header

erroring_table$table_styling$header$interpret_label <- "force"
erroring_table
# Force does nothing, so now < stays < and gets escaped, but the labels are not bold and contain asterixes

md_no_html<- function(x) gt::md(htmltools::htmlEscape(x))

erroring_table$table_styling$header$interpret_label <- "md_no_html"
erroring_table
# Works 🙂 a bit hacky though
# Advantage: easy to implement, replace all gt::md with md_no_html in package

trial %>% 
  select(age, grade, response, trt) %>%
  mutate(trt = paste("<p onmouseover=\"alert(1)\">", trt, "</p>")) %>%
  tbl_summary(by = trt) %>%
  modify_header(
    label = paste0("**", gtsummary:::translate_text("Characteristic"), "**"),
    all_stat_cols() ~
      ifelse(is.null(by),
             "**N = {style_number(N)}**",
             "**{trimws(htmltools::htmlEscape(level))}**, N = {style_number(n)}")
  )
# Works well 😊 substantially less hacky imo
# Can be implemented in tbl_summary (replace existing modify_header call with this), would be my suggested fix
# trimws to fix markdown breaking on trailing and leading spaces 

I think if a fix were to be implemented, I'd prefer the last one, escaping the HTML whenever user input gets stored somewhere where it will be output as HTML. The md_no_html option would be easier to implement, though, and I could also imagine it's a good option, the double-barreled functionality of gt::md is a bit strange and dangerous.

I realize I'm taking up some of your time, so if you want me to implement it (and check all the other functions, I believe many are affected), then submit a pull request, I'm all up for it. I'm really grateful for your work so I certainly don't want to burden you with things I can do as well.

ddsjoberg commented 2 years ago

gtsummary tables are exported to HTML, Word, RTF, PDF, pptx and more. I am not exactly sure what you're proposing to implement, but we cannot assume anywhere that HTML is the output type.

Also, can you try this first option again? But this time you need to redefine the headers, rather than just run modify_header(text_interpret = "html").

library(gtsummary)
erroring_table <- trial %>% 
  select(age, grade, response, trt) %>%
  mutate(trt = paste("<p onmouseover=\"alert(1)\">", trt, "</p>")) %>%
  tbl_summary(by = trt)

erroring_table %>%
  modify_header(all_stat_cols() ~ "{level}", text_interpret = "html")
erikvona commented 2 years ago

The option you've shared unfortunately still allows the Javascript to run (labels are no longer bold and the header says **Characteristic**).

Other outputs do make it difficult. Perhaps in as_gt, you could htmlEscape content marked with gt::md, which is not the most elegant solution but can certainly work. Strings that went through gt::md can be detected since they have from_markdown as their class.

Alternatively, we could define an output-dependent formatting function. That would also open up the possibility to more easily bold in different output formats by working with formatting attributes, instead of formatting through markdown, then detecting and removing that markdown in some formats. However, that does sound like a substantial amount of work.

The code you've suggested unfortunately only removes bolding, it still allows the Javascript to run. As far as I understand, gt does not offer a way to parse columns for markdown but still escape HTML, it's either both (so escaping is needed to prevent HTML/JS injection), only allow HTML (escape still needed and no formatting), or neither (formatting doesn't work).

There's also the possibility to document this behavior and leave it as-is, since this is a lot more complicated than I thought. For my project, I can just htmlEscape grouping columns before feeding the data to tbl_summary, of course.

ddsjoberg commented 2 years ago

Thank you for the thorough response. I'll go ahead and close this issue for now, and think about how this could be documented.

🌴