epiverse-trace / cleanepi

R package to clean and standardize epidemiological data
https://epiverse-trace.github.io/cleanepi/

convert_to_numeric() in a dataset of 500,000+ rows took 2.5 minutes #163

Open avallecam opened 3 months ago

avallecam commented 3 months ago

Running cleanepi::convert_to_numeric() on a single column of a dataset with 500,000+ rows took about 2.5 minutes.

I'm wondering whether this runtime is expected, and whether the function should eventually be refactored to use data.table or dtplyr.

library(rio)
library(cleanepi)
library(tidyverse)
library(tictoc)

covid <- rio::import(
  "https://raw.githubusercontent.com/Joskerus/Enlaces-provisionales/main/data_limpieza.zip",
  which = "datos_covid_LA.RDS"
) %>% 
  cleanepi::standardize_column_names()

tictoc::tic()
covid %>% 
  dplyr::select(numero_de_hospitalizaciones_recientes) %>% 
  cleanepi::convert_to_numeric(
    target_columns = "numero_de_hospitalizaciones_recientes",
    lang = "es")
#> # A tibble: 502,010 × 1
#>    numero_de_hospitalizaciones_recientes
#>                                    <dbl>
#>  1                                     0
#>  2                                     0
#>  3                                     0
#>  4                                     0
#>  5                                     0
#>  6                                     0
#>  7                                     0
#>  8                                     0
#>  9                                    NA
#> 10                                     0
#> # ℹ 502,000 more rows
#> # ℹ Use `print(n = ...)` to see more rows
tictoc::toc()
#> 150.42 sec elapsed
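
Independent of data.table or dtplyr, one common way to cut the cost of an expensive element-wise conversion is to convert each distinct value once and map the results back to every row. The sketch below assumes the column holds many repeated values. `slow_convert()` and `convert_unique()` are hypothetical names used only for illustration; this is not how cleanepi or numberize is implemented.

```r
# Hypothetical stand-in for an expensive, element-wise text-to-number
# conversion. This is NOT the cleanepi or numberize implementation.
slow_convert <- function(x) {
  lookup <- c(cero = 0, uno = 1, dos = 2, tres = 3)
  vapply(
    x,
    function(value) {
      if (is.na(value)) return(NA_real_)
      if (value %in% names(lookup)) return(lookup[[value]])
      # Fall back to plain numeric parsing; unparseable text becomes NA.
      suppressWarnings(as.numeric(value))
    },
    numeric(1),
    USE.NAMES = FALSE
  )
}

# Convert each distinct value once, then map the results back to every row.
# match() also pairs NA with NA, so missing values keep their positions.
convert_unique <- function(x, converter = slow_convert) {
  distinct_values <- unique(x)
  converted <- converter(distinct_values)
  converted[match(x, distinct_values)]
}

# Synthetic column with 500,000 rows but only five distinct values.
x <- rep(c("cero", "uno", NA, "2", "tres"), times = 100000)
result <- convert_unique(x)
head(result, 5)
```

The saving grows with the ratio of rows to distinct values, so a column made up mostly of unique values would see little benefit from this approach.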

cc: @Joskerus @lgbermeo

Bisaloo commented 3 months ago

Could you give https://github.com/epiverse-trace/numberize/pull/14 a go please? If the performance is still not sufficient, I have a couple of other ideas.