avallecam opened this issue 3 months ago · Status: Open
Running `cleanepi::convert_to_numeric()` on a dataset of 500,000+ rows took about 2.5 minutes.

I'm wondering whether this is an expected scenario, and whether it may require refactoring at an appropriate time to use data.table or dtplyr. Reprex below.
```r
library(rio)
library(cleanepi)
library(tidyverse)
library(tictoc)

covid <- rio::import(
  "https://raw.githubusercontent.com/Joskerus/Enlaces-provisionales/main/data_limpieza.zip",
  which = "datos_covid_LA.RDS"
) %>%
  cleanepi::standardize_column_names()

tictoc::tic()
covid %>%
  dplyr::select(numero_de_hospitalizaciones_recientes) %>%
  cleanepi::convert_to_numeric(
    target_columns = "numero_de_hospitalizaciones_recientes",
    lang = "es"
  )
#> # A tibble: 502,010 × 1
#>    numero_de_hospitalizaciones_recientes
#>                                    <dbl>
#>  1                                     0
#>  2                                     0
#>  3                                     0
#>  4                                     0
#>  5                                     0
#>  6                                     0
#>  7                                     0
#>  8                                     0
#>  9                                    NA
#> 10                                     0
#> # ℹ 502,000 more rows
#> # ℹ Use `print(n = ...)` to see more rows
tictoc::toc()
#> 150.42 sec elapsed
```
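One observation on where the time likely goes: the output above is mostly repeated values (`0`), so converting only the distinct values and mapping the results back would avoid re-parsing the same string half a million times. Below is a minimal sketch of that idea, not cleanepi's actual implementation: `convert_unique_then_map()` is a hypothetical helper, and using `numberize::numberize()` as the per-value converter is an assumption about what the expensive step is. A data.table refactor could get the same effect by grouping on the target column.

```r
# Hypothetical helper, not part of cleanepi: run the expensive
# text-to-number conversion once per distinct value, then broadcast
# the results back onto the full column with match().
convert_unique_then_map <- function(x, lang = "es") {
  ux <- unique(x)
  converted <- vapply(
    ux,
    function(v) numberize::numberize(v, lang = lang),
    numeric(1),
    USE.NAMES = FALSE
  )
  # match() gives, for each element of x, its position in ux,
  # so this indexing expands the per-unique results to all rows
  # (NA handling is left out of this sketch)
  converted[match(x, ux)]
}

# e.g. convert_unique_then_map(covid$numero_de_hospitalizaciones_recientes)
```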
cc: @Joskerus @lgbermeo
Could you give https://github.com/epiverse-trace/numberize/pull/14 a go, please? If the performance is still not sufficient, I have a couple of other ideas.
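One way to try that branch, sketched below under the assumption that the slow path in `cleanepi::convert_to_numeric()` goes through numberize (as the linked PR implies): install numberize straight from the pull request with pak, restart R, and re-run the timing from the reprex above.

```r
# install numberize from the open pull request
# (pak supports the user/repo#<pr> reference syntax)
pak::pak("epiverse-trace/numberize#14")

# after restarting R, re-time the same call as in the reprex
tictoc::tic()
covid %>%
  dplyr::select(numero_de_hospitalizaciones_recientes) %>%
  cleanepi::convert_to_numeric(
    target_columns = "numero_de_hospitalizaciones_recientes",
    lang = "es"
  )
tictoc::toc()
```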