Closed Bisaloo closed 1 week ago
I wonder if you can also get a little speed up by switching the ordering here to check the 100 digit branch first? This likely depends on your inputs but I think you're likely to generally have more 100 digits so you can break out of the loop quicker. It may also be worth being "cute" and avoiding the ifelse()
call (actually the ifelse()
should probably be an if
then else
which will probably speed things up as is).
Changing the above to something like
for (d in digits) {
if (d == 100) {
summed <- d + d * (summed -1 ) * (summed != 0)
} else if (d %in% c(1E3, 1E6, 1E9, 1E12)) {
total <- total + d + d * (summed - 1) * (summed != 0)
summed <- 0
} else {
summed <- summed + d
}
}
I didn't benchmark the dataset above, just some toy examples locally where it seemed to help. It feels like you could do some more complicated vectorisation as well but would need to ponder that some more and may not be worth the effort.
Should fix epiverse-trace/cleanepi#163 for the most part. In total, this allows a ~20x speedup.
👍 running the proposed version took me 4.25 sec elapsed
. I would only like to note that after installing pak::pak("epiverse-trace/numberize@improve-performance")
, I needed to restart R to make this installed branch work.
@bahadzie, would you be able to take care of the conflicts in this PR? Let me know if you've tried and gotten stuck, and I can give it a look myself.
@Bisaloo there a lot of other changes that were implemented as a result of the full package review during CRAN submission that is causing 6 tests currently in main to fail.
In my opinion our options are
What's your preference?
In my opinion our options are
rebase the PR on main and fix the merge conflicts
implement the performance improvements from a new PR branching of main
What's your preference?
If you're taking care of it, I don't have a preference. Both solutions are valid and will lead to the same result.
I will take care of it.
Superseded by #15
Should fix https://github.com/epiverse-trace/cleanepi/issues/163 for the most part. In total, this allows a ~20x speedup.
I'll likely propose a C implementation of
number_from()
in a couple of weeks for fun and as an exercise. It won't need to be merged and can live as an alternative branch for users with high performance need & a C compiler.main
versionCreated on 2024-07-27 with reprex v2.1.1
proposed version
Created on 2024-07-27 with reprex v2.1.1