Public-Health-Scotland / phsmethods

An R package to standardise methods used in Public Health Scotland (https://public-health-scotland.github.io/phsmethods/)

Percent class for working with percentages #127

Open Nic-Chr opened 3 months ago

Nic-Chr commented 3 months ago

Motivation

Working with percentages in R can be annoying, to say the least. In day-to-day analyses I tend to end up managing two vectors: a numeric vector of proportions for calculations and a formatted character vector of percentages for display.

A percent class could shorten this workflow by combining the two vectors into one, so there are no independent vectors to keep in sync.

Describe the solution you'd like

It would be nice to see a percent class that represents proportions without losing precision and simply prints them as percentages. This would help analysts across PHS spend less time thinking about how to format percentages.
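To make the idea concrete, here is a minimal base R sketch of such a class. It is an illustration only, not the code from the percent package: the class name "pct", the constructor `new_pct()` and the default of 2 decimal places are all assumptions made for this example.

```r
# Hypothetical constructor: keeps the proportions unchanged and only adds a class.
new_pct <- function(x = double()) {
  stopifnot(is.numeric(x))
  structure(x, class = "pct")
}

# Formatting multiplies by 100 and rounds to a fixed number of decimal places.
# Only the displayed text is rounded; the stored proportions are untouched.
format.pct <- function(x, digits = 2, ...) {
  out <- paste0(formatC(unclass(x) * 100, format = "f", digits = digits), "%")
  out[is.na(x)] <- NA_character_
  names(out) <- names(x)
  out
}

# Printing shows the formatted percentages and returns the object invisibly.
print.pct <- function(x, ...) {
  print(format(x, ...), quote = FALSE)
  invisible(x)
}

p <- new_pct(c(a = 1 / 3, b = 0.5))
p           # prints 33.33% and 50.00%
unclass(p)  # the full-precision proportions are still there
```

Because the object is still a numeric vector underneath, nothing is lost by printing it as a percentage, which is the property the request above is asking for.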

Describe alternatives you've considered

I have made a small package that does this; see github.com/NicChr/percent. I'm aware of scales::percent(), but that returns a character vector, whereas as_percent() does no transformation at all: it returns an object of class "percent", preserving the vector of proportions and printing as a "percent" vector in tibbles.

@Tina815 @Moohan Let me know if you think this would be a good fit for phsmethods and if so I'd be happy to assist in future implementations.

If this was deemed to be a good fit, I would be happy for the code to be copied over, reducing the need for another package dependency.

Below I've included some basic examples.

Basic usage

library(remotes)
install_github("NicChr/percent", force = FALSE)
library(percent)

# Motivation --------------------------------------------------------------

### Percentage of NAs by column

## Normal workflow might look like this

library(dplyr)
na_counts <- colSums(is.na(starwars))
prop <- na_counts / nrow(starwars)
perc <- round(prop * 100, 2)
perc <- paste0(perc, "%")
names(perc) <- names(prop)
perc
#>       name     height       mass hair_color skin_color  eye_color birth_year 
#>       "0%"     "6.9%"   "32.18%"    "5.75%"       "0%"       "0%"   "50.57%" 
#>        sex     gender  homeworld    species      films   vehicles  starships 
#>     "4.6%"     "4.6%"   "11.49%"     "4.6%"       "0%"       "0%"       "0%"

## With `as_percent` it's a bit easier

perc2 <- as_percent(prop)
perc2
#>       name     height       mass hair_color skin_color  eye_color birth_year 
#>   "0.000%"   "6.897%"  "32.184%"   "5.747%"   "0.000%"   "0.000%"  "50.575%" 
#>        sex     gender  homeworld    species      films   vehicles  starships 
#>   "4.598%"   "4.598%"  "11.494%"   "4.598%"   "0.000%"   "0.000%"   "0.000%"
class(perc2)
#> [1] "percent"
unclass(perc2) # Under the hood it is just the proportions
#>       name     height       mass hair_color skin_color  eye_color birth_year 
#> 0.00000000 0.06896552 0.32183908 0.05747126 0.00000000 0.00000000 0.50574713 
#>        sex     gender  homeworld    species      films   vehicles  starships 
#> 0.04597701 0.04597701 0.11494253 0.04597701 0.00000000 0.00000000 0.00000000

### We can then work with the perc vector without ever needing to use prop

round(perc2, 0)
#>       name     height       mass hair_color skin_color  eye_color birth_year 
#>       "0%"       "7%"      "32%"       "6%"       "0%"       "0%"      "51%" 
#>        sex     gender  homeworld    species      films   vehicles  starships 
#>       "5%"       "5%"      "11%"       "5%"       "0%"       "0%"       "0%"
round(perc2, 1)
#>       name     height       mass hair_color skin_color  eye_color birth_year 
#>     "0.0%"     "6.9%"    "32.2%"     "5.7%"     "0.0%"     "0.0%"    "50.6%" 
#>        sex     gender  homeworld    species      films   vehicles  starships 
#>     "4.6%"     "4.6%"    "11.5%"     "4.6%"     "0.0%"     "0.0%"     "0.0%"
round(perc2, 2)
#>       name     height       mass hair_color skin_color  eye_color birth_year 
#>    "0.00%"    "6.90%"   "32.18%"    "5.75%"    "0.00%"    "0.00%"   "50.57%" 
#>        sex     gender  homeworld    species      films   vehicles  starships 
#>    "4.60%"    "4.60%"   "11.49%"    "4.60%"    "0.00%"    "0.00%"    "0.00%"

### Halves are rounded up (base R's round(14.5) gives 14)

round(percent(14.5))
#> [1] "15%"

### We can use math operations as well

# Number of NAs
nrow(starwars) * perc # This won't work
#> Error in nrow(starwars) * perc: non-numeric argument to binary operator
nrow(starwars) * perc2 # This does
#>       name     height       mass hair_color skin_color  eye_color birth_year 
#>          0          6         28          5          0          0         44 
#>        sex     gender  homeworld    species      films   vehicles  starships 
#>          4          4         10          4          0          0          0

### Usage in ggplot

library(ggplot2)
df <- starwars %>%
  count(homeworld, sort = TRUE) %>%
  mutate(homeworld = if_else(row_number() %in% 1:5, homeworld, "Other"),
         homeworld = if_else(is.na(homeworld), "Other", homeworld)) %>%
  filter(homeworld != "Other") %>%
  count(homeworld, wt = n, sort = TRUE) %>%
  mutate(homeworld = factor(homeworld, levels = unique(homeworld))) %>%
  mutate(perc = as_percent(n/sum(n))) %>%
  arrange(desc(perc))
df
#> # A tibble: 4 × 3
#>   homeworld     n perc     
#>   <fct>     <int> <percent>
#> 1 Naboo        11 40.741%  
#> 2 Tatooine     10 37.037%  
#> 3 Alderaan      3 11.111%  
#> 4 Coruscant     3 11.111%

### Pie chart

df %>%
  ggplot(aes(x = 1, y = perc, fill = homeworld)) + 
  geom_col() +
  scale_y_continuous(labels = as_percent) +
  coord_polar(theta = "y") +
  geom_text(aes(label = perc),
            position = position_stack(vjust = 0.5),
            size = 3) +
  theme_void(base_size = 12) + 
  labs(title = "Pie-chart of top 5 most common starwars planets")
#> Don't know how to automatically pick scale for object of type <percent>.
#> Defaulting to continuous.

Created on 2024-03-07 with reprex v2.0.2
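
The "halves are rounded up" example in the reprex differs from base R, whose round() follows the IEC 60559 round-half-to-even convention. A minimal sketch of a half-up helper is shown below. `round_half_up()` is a hypothetical name, this is not necessarily how the percent package does it, and the small tolerance is an assumption added to absorb floating-point representation error.

```r
# Hypothetical helper: rounds halves away from zero instead of to even.
round_half_up <- function(x, digits = 0) {
  scale <- 10^digits
  # Shift by one half and floor. The tolerance guards against values such as
  # 0.145 * 100 being stored slightly below 14.5.
  tol <- sqrt(.Machine$double.eps)
  sign(x) * floor(abs(x) * scale + 0.5 + tol) / scale
}

round(14.5)                 # base R: 14 (half to even)
round_half_up(14.5)         # 15
round_half_up(0.145 * 100)  # 15
```

The tolerance is a trade-off: a value that is genuinely a hair below a half (within about 1.5e-8 at this scale) would also be rounded up.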

Moohan commented 2 days ago

It was agreed to take this forward as a PR.

Nic-Chr commented 1 day ago

There are a few things I'm not sure about regarding the implementation from a user perspective.

  1. Right now I have two functions, percent() and as_percent(). percent() creates a percent vector from percentage inputs, e.g. 100 becomes 100%, while as_percent() converts proportions to percentages. It's not clear to me which is more intuitive for users, and whether we should keep just one, both, or something a bit different.
  2. When doing any kind of math involving percent vectors, what do we think is the most logical or expected outcome? For example, to me it seems sensible to return a percent vector when two percent vectors are multiplied. When one operand is a percent vector and the other is a numeric vector, the outcome is less obvious. Right now my implementation always returns a percent vector in this case, but it might make more sense to depend on the order of the classes: if the LHS is a percent and the RHS is not, the result is a percent; likewise, if the LHS is not a percent and the RHS is, the result is a numeric.
  3. Should as.character.percent() apply rounding by default? I opted for this because it plays nicely with ggplot2, which relies on calling as.character() in plots, and rounded labels are easier to read. On the other hand, a user might expect to see all the underlying digits when using as.character.percent().
  4. A similar concern to 3: format.percent() by default applies decimal-digit rounding instead of the usual significant-digit rounding that format() uses. This is because, imo, decimal rounding is generally much nicer for percentages and hence more useful for users. A solution would be to just make this distinction clear in the documentation.
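
To help weigh up point 2, here is a rough sketch of the order-dependent rule using an S3 Ops group method. It illustrates the alternative being discussed, not the current implementation; the class name "pct" and the constructor `new_pct()` are hypothetical, and comparison operators are assumed to return plain logicals.

```r
# Hypothetical class: values are stored as proportions.
new_pct <- function(x) structure(x, class = "pct")

arith_ops <- c("+", "-", "*", "/", "^", "%%", "%/%")

Ops.pct <- function(e1, e2) {
  op <- match.fun(.Generic)
  # Unary plus/minus: e2 is missing, so keep the class of e1.
  if (missing(e2)) {
    return(new_pct(op(unclass(e1))))
  }
  out <- op(unclass(e1), unclass(e2))
  # The result follows the class of the left-hand side for arithmetic;
  # comparisons fall through and return plain logical vectors.
  if (.Generic %in% arith_ops && inherits(e1, "pct")) new_pct(out) else out
}

p <- new_pct(0.25)

r1 <- p * 2    # LHS is pct      -> pct (stored as 0.5)
r2 <- 200 * p  # LHS is numeric  -> plain numeric 50
r3 <- p * p    # both pct        -> pct (stored as 0.0625)

class(r1)
class(r2)
class(r3)
```

R calls `Ops.pct()` whenever either operand has the class, so the method itself has to check which side is the percent. One consequence of this rule is that `p * 2` and `2 * p` return different classes, which may or may not be what users expect.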