ctmm-initiative / ctmm

Continuous-Time Movement Modeling. Functions for identifying, fitting, and applying continuous-space, continuous-time stochastic movement models to animal tracking data.
http://biology.umd.edu/movement.html
47 stars 12 forks source link

add memoise to langevin #59

Closed katrinabrock closed 2 months ago

katrinabrock commented 3 months ago

Using memoise function to avoid duplicate calls to langevin function by saving the results in memory and reusing them. I validated the speedup by running a few ctmm.fit from the vignettes. I ran the same code 10x with the mainline code and with this version and saved the results of Rprof.

Here are the results. Total time is in seconds of cpu time. lengevin_percent is the percent of cpu time spend within the langevin function or functions it calls. Gains are modest in terms of total time(2% lower mean, 5% lower median), but look real.

This change might be unnecessary if you're removing all unnecessary calls to langevin in a different way.

total_time

langevin_percent

Here's the CSV of the data from above charts: 20240701_memoise_bench.csv

Here's the code that generated the data:

system('git checkout master')
devtools::load_all()
data("buffalo")

Cilla <- buffalo$Cilla

m.iid <- ctmm(sigma=23 %#% "km^2")
m.ou <- ctmm(sigma=23 %#% "km^2",tau=6 %#% "day")
m.ouf <- ctmm(sigma=23 %#% "km^2",tau=c(6 %#% "day",1 %#% "hour"))

TEST <- FALSE
for(i in 1:40){
    branch <- c('master', 'memoise')[(i %% 2) + 1]
    system(paste0('git checkout ', branch))
    source('R/1.R')
    devtools::load_all()
    if(TEST) ctmm.fit <- function(...){rnorm(10000)}
    Rprof(paste(branch, i, 'Rprof.out', sep='.'))
    M.IID <- ctmm.fit(Cilla,m.iid)
    M.OU <- ctmm.fit(Cilla,m.ou)
    M.OUF <- ctmm.fit(Cilla,m.ouf)
    Rprof(NULL)
    memoise::forget('ctmm:::langevin')
}

# Analyse 
library(data.table)

n <- 40
results <- data.table(
    idx = rep(0, n),
    branch = rep('', n),
    run_total_time = rep(0, n),
    pct_langevin = rep(0,n)
)
for(i in 1:n){
    branch <- c('master', 'memoise')[(i %% 2) + 1]
    filename <- paste(branch, i, 'Rprof.out', sep='.')
    summary <- summaryRprof(filename)
    results[i, "idx"] <- i
    results[i, "branch"] <- branch
    results[i, "run_total_time"] <- summary$sampling.time
    results[i, "pct_langevin"] <- summary$by.total['"langevin"', 'total.pct']
}

png('total_time.png')
boxplot(results$run_total_time ~ results$branch)
dev.off()

png('langevin_percent.png')
boxplot(results$pct_langevin ~ results$branch)
dev.off()
chfleming commented 2 months ago

Hi @katrinabrock , I just pushed the discussed code that sorts the time lags once, and then only calculates the requisite matrices once per unique time lag. This should be much faster than memoise, because no check is made per call and the sorting is done outside of the optimization loop.

I will be in China for ~2 weeks before I can get to anything else.

katrinabrock commented 2 months ago

I used the atime package to test pre and post your most recent commit as well as with and without memoise. This is taking different size subsets of the buffalo$Cilla dataset. Conclusion is memoise and your speedup have a similar modest gain in time for datasets the size of that full one (and presumably larger). My memoise version has a substantial memory cost, your speedup has a smaller but notable memory improvement.

image

Code to generate this plot:

N_length <- 8
times <- 8
seconds.limit <- 600

print(glue::glue('Max run time is {N_length*times*seconds.limit/3600} hours'))

N_min <- 100
N_max <- 3527
tdir <- tempfile()
dir.create(tdir)
git2r::clone("https://github.com/katrinabrock/ctmm", tdir)
atime.list <- atime::atime_versions(
 pkg.path=tdir,
 seconds.limit = seconds.limit,
 times=times,
 N=round(exp(seq(log(N_min), log(N_max), length.out=N_length))),
 setup={
    data("buffalo")
    Cilla <- buffalo$Cilla[1:N, ]
    m.iid <- ctmm::ctmm(sigma=23 %#% "km^2")
    m.ou <- ctmm::ctmm(sigma=23 %#% "km^2",tau=6 %#% "day")
    m.ouf <- ctmm::ctmm(sigma=23 %#% "km^2",tau=c(6 %#% "day",1 %#% "hour"))
 },
 expr={
    M.IID <- ctmm::ctmm.fit(Cilla,m.iid)
    M.OU <- ctmm::ctmm.fit(Cilla,m.ou)
    M.OUF <- ctmm::ctmm.fit(Cilla,m.ouf)
 },
 pre_speedup = "66d1f5180323ae5502417c908e4e686ba65d2334",
 post_speedup="37a1aa480ab326f524e4314731cbfb2107182180",
 memoise_no_speedup="cd390665b6076b113d27abad06f1f778f3bc66fe",
 memoise_with_speedup="a50c47178037483c76f521b60329395d1c538e00")
saveRDS(atime.list, './data/20240718_atime.rds')
# ran up to here as a script

# later ran this
atime.list <- readRDS('./data/20240718_atime.rds')
plot(atime.list)
katrinabrock commented 2 months ago

oh...one other thing....it looks both time and memory are increasing linearly on a log scale with data size ...so...exponential growth. Is that expected?

EDIT:walking home I realized both axises are log so it's linear which I think is what you said is expected.