epiverse-trace / numberize

Convert words to numbers in English, French and Spanish
https://epiverse-trace.github.io/numberize/
Other
4 stars 2 forks source link

Improve performance #14

Closed Bisaloo closed 1 week ago

Bisaloo commented 2 months ago

Should fix https://github.com/epiverse-trace/cleanepi/issues/163 for the most part. In total, this allows a ~20x speedup.

I'll likely propose a C implementation of number_from() in a couple of weeks for fun and as an exercise. It won't need to be merged and can live as an alternative branch for users with high performance need & a C compiler.

main version

library(rio)
library(cleanepi)

covid <- rio::import(
  "https://raw.githubusercontent.com/Joskerus/Enlaces-provisionales/main/data_limpieza.zip",
  which = "datos_covid_LA.RDS"
) |> 
  cleanepi::standardize_column_names()
#> Warning: The `trust` argument of `import()` should be explicit for serialization formats
#> as of rio 1.0.3.
#> ℹ Missing `trust` will be set to FALSE by default for RDS in 2.0.0.
#> ℹ The deprecated feature was likely used in the rio package.
#>   Please report the issue at <https://github.com/gesistsa/rio/issues>.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.

tictoc::tic()
test <- numberize::numberize(covid$numero_de_hospitalizaciones_recientes, lang = "es")
tictoc::toc()
#> 73.346 sec elapsed

Created on 2024-07-27 with reprex v2.1.1

proposed version

library(rio)
library(cleanepi)

covid <- rio::import(
  "https://raw.githubusercontent.com/Joskerus/Enlaces-provisionales/main/data_limpieza.zip",
  which = "datos_covid_LA.RDS"
) |> 
  cleanepi::standardize_column_names()
#> Warning: The `trust` argument of `import()` should be explicit for serialization formats
#> as of rio 1.0.3.
#> ℹ Missing `trust` will be set to FALSE by default for RDS in 2.0.0.
#> ℹ The deprecated feature was likely used in the rio package.
#>   Please report the issue at <https://github.com/gesistsa/rio/issues>.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.

tictoc::tic()
test <- numberize::numberize(covid$numero_de_hospitalizaciones_recientes, lang = "es")
tictoc::toc()
#> 3.088 sec elapsed

Created on 2024-07-27 with reprex v2.1.1

TimTaylor commented 2 months ago

I wonder if you can also get a little speed up by switching the ordering here to check the 100 digit branch first? This likely depends on your inputs but I think you're likely to generally have more 100 digits so you can break out of the loop quicker. It may also be worth being "cute" and avoiding the ifelse() call (actually the ifelse() should probably be an if then else which will probably speed things up as is).

https://github.com/epiverse-trace/numberize/blob/dcd2d2ea448d086773b0a185412f9a87292ebe8f/R/numberize.R#L63-L72

Changing the above to something like

for (d in digits) {
    if (d == 100) {
      summed <- d + d * (summed -1 ) * (summed != 0)
    } else if (d %in% c(1E3, 1E6, 1E9, 1E12)) {
      total <- total + d + d * (summed - 1) * (summed != 0)
      summed <- 0
    } else {
      summed <- summed + d
    }
  }

I didn't benchmark the dataset above, just some toy examples locally where it seemed to help. It feels like you could do some more complicated vectorisation as well but would need to ponder that some more and may not be worth the effort.

avallecam commented 2 months ago

Should fix epiverse-trace/cleanepi#163 for the most part. In total, this allows a ~20x speedup.

👍 running the proposed version took me 4.25 sec elapsed. I would only like to note that after installing pak::pak("epiverse-trace/numberize@improve-performance"), I needed to restart R to make this installed branch work.

Bisaloo commented 1 month ago

@bahadzie, would you be able to take care of the conflicts in this PR? Let me know if you've tried and gotten stuck, and I can give it a look myself.

bahadzie commented 4 weeks ago

@Bisaloo there a lot of other changes that were implemented as a result of the full package review during CRAN submission that is causing 6 tests currently in main to fail.

In my opinion our options are

  1. rebase the PR on main and fix the merge conflicts
  2. implement the performance improvements from a new PR branching of main

What's your preference?

Bisaloo commented 4 weeks ago

In my opinion our options are

  1. rebase the PR on main and fix the merge conflicts

  2. implement the performance improvements from a new PR branching of main

What's your preference?

If you're taking care of it, I don't have a preference. Both solutions are valid and will lead to the same result.

bahadzie commented 4 weeks ago

I will take care of it.

Bisaloo commented 1 week ago

Superseded by #15