Nic-Chr opened 2 months ago
I'm always keen on speed/memory improvements (especially if they hold up on larger datasets). My usual approach when I've made changes like this to other functions in the past is to start by expanding the tests for the function(s), so I'm 100% sure there are no unintended regressions or behaviour changes.
I've also been interested in https://lorenzwalthert.github.io/touchstone/ for a while, which is meant for exactly these types of improvements; it involves a bit of setup, but then you get a benchmark comment added to PRs.
Hi, I think we can improve the speed of `create_age_groups` quite a bit and also remove the dependency on the 'utils' package if we avoid `cut()`, which is inefficient at creating factors because it goes through unnecessary `unique()` + `match()` steps internally. We already have our cleaned age breaks, which are unique and sorted, meaning we can skip `cut()` and use `.bincode()` directly. `.bincode()` is basically a low-level factor constructor and is also what `cut()` uses internally. To get a character vector, all that's needed is to subset our age breaks by our bin codes.

On the topic of `cut()` inefficiency, there is a Stack Overflow thread I opened a while ago: https://stackoverflow.com/questions/76867914/can-cut-be-improved

Proposed function and benchmark:
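The original reprex code does not survive in this copy of the issue, so the snippet below is only a minimal sketch of the `.bincode()` idea described above. The function name `age_groups_bincode`, the label format, and the break-handling choices (`right = FALSE`, a trailing `Inf` break) are illustrative assumptions, not the actual proposal:

```r
# Illustrative sketch only -- not the proposed implementation.
# Bin ages with .bincode(), then build a character vector by
# subsetting precomputed labels with the integer bin codes.
age_groups_bincode <- function(age, breaks) {
  # Assumes breaks are already cleaned: unique, sorted, spanning the data.
  codes <- .bincode(age, breaks = breaks, right = FALSE, include.lowest = TRUE)
  # One label per interval, e.g. breaks c(0, 18, 65, Inf) -> "0-17", "18-64", "65-Inf"
  labels <- paste0(breaks[-length(breaks)], "-", breaks[-1] - 1)
  labels[codes]
}

ages <- c(0, 5, 17, 18, 64, 65, 90)
breaks <- c(0, 18, 65, Inf)
age_groups_bincode(ages, breaks)
#> [1] "0-17"   "0-17"   "0-17"   "18-64"  "18-64"  "65-Inf" "65-Inf"

# .bincode() returns the same integer codes that cut() computes internally:
identical(
  .bincode(ages, breaks, right = FALSE, include.lowest = TRUE),
  as.integer(cut(ages, breaks, right = FALSE, include.lowest = TRUE))
)
#> [1] TRUE
```

The character-vector version avoids the factor bookkeeping entirely: since the breaks are already unique and sorted, the `unique()` + `match()` work that `cut()` does to deduplicate labels is redundant.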
Created on 2024-09-19 with reprex v2.0.2
This obviously relates to issues #93 and #54, which I think are also worthwhile, but as a subsequent step.