EdwinTh / padr

Padding of missing records in time series
https://edwinth.github.io/padr/
Other
132 stars 12 forks source link

Performance improvement for pad() #89

Open joranE opened 1 year ago

joranE commented 1 year ago

I was profiling some code that calls padr::pad() and found that span_all_groups() was using a significant chunk of the total time. This proposed change converts span_all_groups() to use purrr::map2 and tidyr::unnest() to accomplish the same thing. In my tests it is ~2x faster than the original span_all_groups() on my M1 Max Apple laptop and ~3x faster on the virtual Linux machine I use for work. It's a small absolute improvement on my Mac (<0.5sec) but a more noticeable improvement on my slower work Linux machine (~1.5sec), at least in my context, where I'm doing some data processing while loading a Shiny app.

A tradeoff is the additional imports of purrr::map2 and tidyr::unnest. I noticed that padr seems to stick to base R as much as possible, so I'll understand if you're reluctant to add the additional dependencies.

Unrelated, but I think the existing stop_int64() call isn't actually doing anything, since it's being called on the list output from split() and so it's looking at the classes of the individual list elements rather than the columns in the original data frame.

All existing tests in the package are passing with this change.

EdwinTh commented 1 year ago

Thanks a lot for taking the time to seek improvements in padr. For the moment I am pressed for time. Once I have the resources to give padr some love again I will study your proposal.