cmu-delphi / epiprocess

Tools for basic signal processing in epidemiology
https://cmu-delphi.github.io/epiprocess/
Other
13 stars 8 forks source link

GCD calculation #426

Open dajmcdon opened 8 months ago

dajmcdon commented 8 months ago

I'm not entirely sure what the use case for gcd_num() is, but is it worth implementing differently? We don't currently import Rcpp (which the second version uses):

d <- sample(1:1000, 1000, TRUE)

microbenchmark::microbenchmark(
  epiprocess = epiprocess:::gcd_num(d),
  rtestim = rtestim:::gcd(d)
)
#> Warning in microbenchmark::microbenchmark(epiprocess = epiprocess:::gcd_num(d),
#> : less accurate nanosecond times to avoid potential integer overflows
#> Unit: microseconds
#>        expr       min         lq      mean    median       uq      max neval
#>  epiprocess 12249.816 12346.0635 17960.673 12423.984 14510.54 484596.5   100
#>     rtestim    19.024    24.2515  5672.601    34.973    51.66 563304.5   100

Created on 2024-03-06 with reprex v2.1.0

dsweber2 commented 8 months ago

That mean median comparison though.

When I search for gcd_num, the only time it shows up is in guess_period, where it is handed a list of diffs between dates, converted to numeric. So I'd expect it to run once per loop of epix_process, which doesn't seem likely to eat too much time.

The lack of tests may mean there's a bug lurking, so that would be a potential gain from switching over to a non-DIY implementation.

brookslogan commented 4 months ago

Tempting speedups and testedness, but Rcpp also means potential extra installation hassle for some users and potential license headaches (assuming we don't already transitively depend on Rcpp). Without evidence that this impacts epix_slide() performance significantly or of a bug, I'm not sure whether we should proceed.

brookslogan commented 4 months ago

We may not even want to have a GCD algorithm at all. It seems like if the guessed period doesn't appear in the diffs of the unique times, then we have some really weird/wrong data. And if we make that an error, then we can move from calculating the GCD to taking the smallest diff and simply checking that it is a CD.

brookslogan commented 4 months ago

Actually, further, we might not need a GCD at all. I'm not sure why I used this for epix_slide default ref_time_values (i.e. versions).

And we're not using GCD in guess_time_type() either.

[But we might actually be missing a guess_period() in next_after() impls... though that mainly only messes up.... guess_period() later, and trying to guess the period to create new times and being wrong may go south when using time types that don't enforce the period.]

dshemetov commented 3 months ago

assuming we don't already transitively depend on Rcpp

We do. I wonder if tsibble is pushed into GPL because of Rcpp.

r$> pak::pkg_deps_explain("cmu-delphi/epiprocess", "Rcpp")
epiprocess -> tsibble -> anytime -> Rcpp                                   
brookslogan commented 1 month ago

(Side note: we could reduce runtime by 60% by using base R for checks instead of checkmate, and 70%+ by omitting some checks if possible. Still not sure we even want gcd at all anymore though. But this brings to attention that we should take care about validation in inner loops.)