gergness / srvyr

R package to add 'dplyr'-like Syntax for Summary Statistics of Survey Data

`survey_count()` is very slow when applied on a lot of groups #164

Open etiennebacher opened 1 year ago

etiennebacher commented 1 year ago

Hello, I have survey data with a few tens of thousands of observations and several grouping variables. I'd like to compute the survey count for each combination of groups, but this is much slower than the "non-survey" count using dplyr. I understand that srvyr has to do extra steps and that it calls survey under the hood, but it seems to me that the difference in timing is due to the way groups of data are passed to `survey_total()`.

In the example below, dplyr::count() is near instantaneous, but survey_count() takes more than a minute:

library(srvyr, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

packageVersion("srvyr")
#> [1] '1.2.0'

N <- 50000
set.seed(123)

test <- data.frame(
  grp1 = sample(letters, N, TRUE),
  grp2 = sample(LETTERS, N, TRUE),
  grp3 = sample(1:10, N, TRUE),
  weight = sample(seq(0, 1, 0.01), N, TRUE)
) |> 
  arrange(grp1, grp2, grp3)

# dplyr is fast  
system.time({
  test |> 
    group_by(grp1, grp2, grp3) |> 
    count() |> 
    ungroup()
})
#>    user  system elapsed 
#>    0.03    0.00    0.04

test_sv <- as_survey_design(test, weights = weight)

# srvyr is much slower
system.time({
  test_sv |> 
    group_by(grp1, grp2, grp3) |> 
    survey_count() |> 
    ungroup()
})
#>    user  system elapsed 
#>   81.16    2.71   84.76

Is this something that could be improved? Or did I miss something?

Thanks,

bschneidr commented 11 months ago

I think it's a good idea to look into ways to speed up grouped operations, if possible. But fundamentally, the 'srvyr' code is doing many, many more calculations than the 'dplyr' code. In 'srvyr', you're computing ~ 7,000 point estimates (one per combination of groups) and then also computing an estimated standard error for each of those point estimates. Computing the standard errors is the expensive part; getting the point estimates is easy.
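
To illustrate why the point estimates are the easy part: for a design specified only by weights, the estimated count in a group is the sum of the weights in that group, so a weighted `dplyr::count()` should give the same totals without computing any standard errors. A minimal sketch, reusing the simulated data from the example above (the object name `point_estimates` is just illustrative):

```r
library(dplyr, warn.conflicts = FALSE)

N <- 50000
set.seed(123)

# Same simulated data as in the original example
test <- data.frame(
  grp1 = sample(letters, N, TRUE),
  grp2 = sample(LETTERS, N, TRUE),
  grp3 = sample(1:10, N, TRUE),
  weight = sample(seq(0, 1, 0.01), N, TRUE)
)

# Weighted count per group: sums `weight` within each group.
# Point estimates only; no standard errors are computed.
point_estimates <- test |>
  count(grp1, grp2, grp3, wt = weight, name = "n")

head(point_estimates)
```

This is not a replacement for `survey_count()` when the standard errors are needed; it only shows that the totals themselves are cheap to obtain.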

Is there something you noticed inside the 'srvyr' code that you think is making it especially slow?
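
One way to look for an answer is to profile the call. A rough sketch using base R's `Rprof()` and `summaryRprof()` on a smaller sample so that it finishes quickly (the reduced size `n_small` and the object names are assumptions for illustration, not part of the original report):

```r
library(srvyr, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

set.seed(123)
n_small <- 5000  # hypothetical reduced size, chosen only to keep the run short

small <- data.frame(
  grp1 = sample(letters, n_small, TRUE),
  grp2 = sample(LETTERS, n_small, TRUE),
  grp3 = sample(1:10, n_small, TRUE),
  weight = sample(seq(0, 1, 0.01), n_small, TRUE)
)
small_sv <- as_survey_design(small, weights = weight)

# Record a sampling profile of the grouped survey_count() call
prof_file <- tempfile(fileext = ".out")
Rprof(prof_file)
res <- small_sv |>
  group_by(grp1, grp2, grp3) |>
  survey_count() |>
  ungroup()
Rprof(NULL)  # stop profiling

# Functions ranked by total time, including time spent in their callees
prof <- summaryRprof(prof_file)
head(prof$by.total, 15)
```

The functions at the top of `by.total` should indicate whether the time goes into splitting the design by group or into the variance calculations in 'survey'.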