Thanks! That solved the issue of converting the function into a simple expr() form that we can accept as an argument and pass along to the benchmark call inside. It looks like the function is done.
Here it is:
asymptoticTimings = function(expression = expr(), N, RN = NULL)
{
  if(!is.null(RN)) # use the vector RN of data set sizes if specified
  {
    l <- length(RN)
    for(i in 1:l)
    {
      timings <- as.data.frame(microbenchmark(expr(RN[i])))
      if(i == 1)
      {
        timings$datasetsize <- RN[i] # add column 'datasetsize': the first size (10) for observations 1-100
        result <- timings            # data frame with columns expr, time and the newly appended datasetsize
      }
      else
      { # append the subsequent data set sizes with the corresponding timings, combining them using rbind():
        timings$datasetsize <- RN[i]
        result <- rbind(result, timings)
      }
    }
    return(result)
  }
  else # if the optional parameter RN isn't specified
  {
    datasetsizes <- c(10^seq(1,N,by=0.5))
    # total obs. = (N + (N-1)) x 100, e.g. N=2 -> 300 obs., N=3 -> 500 obs., N=4 -> 700 obs. and so on
    l <- length(datasetsizes)
    for(i in 1:l)
    {
      timings <- as.data.frame(microbenchmark(expr(datasetsizes[i])))
      if(i == 1)
      {
        timings$datasetsize <- datasetsizes[i]
        result <- timings
      }
      else
      {
        timings$datasetsize <- datasetsizes[i]
        result <- rbind(result, timings)
      }
    }
    return(result)
  }
}
I tested it on PeakSegPDPA:
library(PeakSegOptimal)
expr2fun <- function(e){
  lang.obj <- substitute(e)
  function(N){
    eval(lang.obj)
  }
}
# create our function to pass to asymptoticTimings:
PeakSegPDPA.fun <- expr2fun({
  PeakSegOptimal::PeakSegPDPA(rpois(N,10), max.segments=3L)
})
df <- asymptoticTimings(PeakSegPDPA.fun(), 4)
# for N = 4, the data set sizes are 10, 31.62278, 100, 316.22777, 1000, 3162.27766, 10000
dftwo <- asymptoticTimings(PeakSegPDPA.fun(), 3, c(10^seq(1,3,by=0.5)))
# for N = 3, with RN = c(10^seq(1,3,by=0.5))
Here are the results in data.tables:
> data.table(df)
expr time datasetsize
1: expr(datasetsizes[i]) 34600 10
2: expr(datasetsizes[i]) 9200 10
3: expr(datasetsizes[i]) 6200 10
4: expr(datasetsizes[i]) 5900 10
5: expr(datasetsizes[i]) 5900 10
---
696: expr(datasetsizes[i]) 3800 10000
697: expr(datasetsizes[i]) 4000 10000
698: expr(datasetsizes[i]) 3900 10000
699: expr(datasetsizes[i]) 3800 10000
700: expr(datasetsizes[i]) 4000 10000
> data.table(dftwo)
expr time datasetsize
1: expr(RN[i]) 46700 10
2: expr(RN[i]) 19500 10
3: expr(RN[i]) 9900 10
4: expr(RN[i]) 9000 10
5: expr(RN[i]) 8400 10
---
496: expr(RN[i]) 4100 1000
497: expr(RN[i]) 8700 1000
498: expr(RN[i]) 12200 1000
499: expr(RN[i]) 10700 1000
500: expr(RN[i]) 8500 1000
Please check if this is what you expected from the timings function. (I'll add documentation and tests for it if it's okay.)
Hi, first of all please read this blog post, which explains why you should avoid the x <- rbind(x, new) idiom: https://tdhock.github.io/blog/2017/rbind-inside-outside/ Instead use the do.call(rbind, list.of.data.tables) idiom described therein.
Second, there is a ton of repeated code in your definition of the asymptotic timings function. Please avoid repetition and stick to DRY programming https://en.wikipedia.org/wiki/Don%27t_repeat_yourself
Third, we should have some (mostly internal) functions which quantify timings of functions of N, and we should also have some other function which quantifies timings of expressions of N (which is more user friendly). So I would recommend something like
asymptoticTimings <- function(e, N.seq){
  lang.obj <- substitute(e)
  fun.obj <- function(N){
    eval(lang.obj)
  }
  asymptoticTimingsFun(fun.obj, N.seq)
}
so then most of your timing logic can be implemented in asymptoticTimingsFun, which assumes you input a function of N -- it will be simpler to work with functions, e.g. fun.obj, instead of expressions.
For variable names use dot or underscore to separate words, e.g. data.set.size
For the return value, the expr column is constant, so it can be ignored/removed without loss of information.
What is the difference between your N and RN arguments?
I have tried to improve on the points you mentioned, please check:
Timing-computation function:
asymptoticTimingsFun <- function(fun.obj, N.seq)
{
  data.set.sizes <- c(10^seq(1,N.seq,by=0.5))
  # total obs. = (N + (N-1)) x 100, e.g. N=2 -> 300 obs., N=3 -> 500 obs., N=4 -> 700 obs. and so on
  l <- length(data.set.sizes)
  timings.list <- list()
  for(i in 1:l)
  {
    benchmarked.timings <- as.data.frame(microbenchmark(expr(data.set.sizes[i])))
    benchmarked.timings$data.set.size <- data.set.sizes[i] # record the data set size at each iteration
    timings.list[[i]] <- data.frame(benchmarked.timings$time, benchmarked.timings$data.set.size)
  }
  resultant.df <- do.call(rbind, timings.list)
  colnames(resultant.df) <- c("Timings", "Data set sizes")
  return(resultant.df)
}
> df <- asymptoticTimings(PeakSegPDPA(rpois(N,10), max.segments = 3L), 5)
> data.table(df)
Timings Data set sizes
1: 25500 1e+01
2: 6500 1e+01
3: 4300 1e+01
4: 4200 1e+01
5: 4100 1e+01
---
896: 4400 1e+05
897: 4600 1e+05
898: 4500 1e+05
899: 4800 1e+05
900: 4700 1e+05
Hi, first of all please read this blog post, which explains why you should avoid the x <- rbind(x, new) idiom: https://tdhock.github.io/blog/2017/rbind-inside-outside/ Instead use the do.call(rbind, list.of.data.tables) idiom described therein.
Gotcha! I didn't know that rbind() inside a loop causes the data.table/data.frame to be copied and reallocated on every iteration.
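For my own reference, a minimal sketch of the recommended pattern (build a list of per-size data frames, then combine them once at the end); compute.timings here is just a hypothetical stand-in for one iteration's work:
# compute.timings is a hypothetical placeholder: any function that returns
# one data frame of timings for a given data set size.
compute.timings <- function(size) data.frame(time = runif(3), data.set.size = size)
sizes <- 10^seq(1, 3, by = 0.5)
# grow a list (cheap), then rbind once, outside the loop:
timings.list <- lapply(sizes, compute.timings)
result <- do.call(rbind, timings.list)
# rather than: result <- rbind(result, compute.timings(size)) inside a loop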
Second, there is a ton of repeated code in your definition of the asymptotic timings function. Please avoid repetition and stick to DRY programming https://en.wikipedia.org/wiki/Don%27t_repeat_yourself
Right, I had unnecessary repetition in the if-else branches and in the statements inside them. I have tried to avoid any repetition in the code above; hopefully it conforms to the DRY principle. (Please check.)
Third, we should have some (mostly internal) functions which quantify timings of functions of N, and we should also have some other function which quantifies timings of expressions of N (which is more user friendly). So I would recommend something like
asymptoticTimings <- function(e, N.seq){
  lang.obj <- substitute(e)
  fun.obj <- function(N){
    eval(lang.obj)
  }
  asymptoticTimingsFun(fun.obj, N.seq)
}
so then most of your timing logic can be implemented in asymptoticTimingsFun, which assumes you input a function of N -- it will be simpler to work with functions, e.g. fun.obj, instead of expressions.
Yes! This was exactly what I was unsure about when thinking of examples, since the user would have had to take the extra step of converting their expression into a function before passing it to our function. It makes perfect sense to implement the expression-to-function logic inside, and then call the timing-computation function, which works on a passed function rather than an expression. Thanks for clearing that up and providing the approach :)
For variable names use dot or underscore to separate words, e.g. data.set.size
Done
For the return value, the expr column is constant, so it can be ignored/removed without loss of information.
Yes, another good point - I am now just extracting the time column (from the data frame returned by microbenchmark) in my code above.
What is the difference between your N and RN arguments?
No difference, I guess. To be honest, I wasn't sure whether RN is meant to depend on N, because N for sure is fixed. If RN depends on N, it would be of the form c(value^seq(1, N, by = step.size)), where we can change the value and the step size, right? If it doesn't depend on N, it can be any vector or sequence, like c(12, 500, 7010) for instance. Either way, N will always be there since RN is optional but N isn't, so if we are using RN we should not use N - which I don't know how to ignore (considering RN values that don't depend on N). This is what I thought of initially:
if(!is.null(RN.seq))
  data.set.sizes <- RN.seq
else
  data.set.sizes <- c(10^seq(1, N, by = 0.5))
RN would be useful only if it's different from c(10^seq(1, N, by = 0.5)), but I can't figure out a way to ignore N while using RN (for values which do not depend on N).
Another thing I don't understand is why these timings differ so much from the ones I got during the earlier test. The timings for PeakSegPDPA and cDPA that I got there, up to N = 10000, were much larger, and the benchmark took quite a few minutes to run. But here the benchmark takes only a few seconds and, as you can see, the timings are much smaller even though they go up to the same data set size. Is there something I am missing?
Please avoid using c when it is not necessary, e.g. c(10^seq(1,N.seq,by=0.5)) should be 10^seq(1,N.seq,by=0.5)
N.seq should have a default value NULL which means to use some smart/adaptive sequence (start with small N and keep increasing until you know what the computational complexity is).
Actually, what does the GuessCompx package do? How can we exploit the work/functions they have already implemented?
Finally, your code does not use fun.obj, and expr is undefined, so why doesn't it give an error?
Finally, your code does not use fun.obj, and expr is undefined, so why doesn't it give an error?
Oh yes, I forgot to call the function on my parameters! Now it's working; I tested it on the same PeakSegOptimal::PeakSegPDPA (and it took a long time to compute for N = 5, as expected).
As for my code above, sorry, I made a mistake and additionally included the expr while committing and pasting it here. When I last ran it (from .Rhistory) it was
benchmarked.timings <- as.data.frame(microbenchmark((data.set.sizes[i])))
which gives no error but incorrect timings, since no function was actually applied. Now I get it.
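To be explicit, the corrected line is roughly the one below, with the benchmarked function of N actually being called on each data set size:
# fun.obj is the function of N passed into asymptoticTimingsFun
benchmarked.timings <- as.data.frame(microbenchmark(fun.obj(data.set.sizes[i])))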
Please avoid using c when it is not necessary, e.g. c(10^seq(1,N.seq,by=0.5)) should be 10^seq(1,N.seq,by=0.5)
Done
N.seq should have a default value NULL which means to use some smart/adaptive sequence (start with small N and keep increasing until you know what the computational complexity is).
That seems ideal for the user, but how can we figure out the size up to which we must test a function in order to distinguish between the different complexities?
And for the functions whose complexities you already know, at what sizes do they generally show their true trend? (Would N = 1,000 be a safe size for most algorithms you know of?)
I understand that we wouldn't want to waste the user's time or hardcode a large data set size, but if we were to use an adaptive sequence, the timings function would have to be linked directly with the classifying function: for small N, if the classifier suggests linear, we would progressively re-check on increasing data set sizes to see whether the complexity class changes, and to check the complexity we need the classifying function. One way to do this is to check the next few data set sizes. Say we find an algorithm to be linear at N = 10; we then check two more increasing sizes of N, and if they fall in the same complexity class we assume that is the class. If they don't, we take the changed complexity class and keep checking larger sizes until the next two values of N agree.
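A rough sketch of that adaptive idea (time.at.size() and classify.timings() are hypothetical placeholders, not existing functions in this package or in GuessCompx):
# time.at.size(size)   -> hypothetical: one data frame of timings for a single data set size
# classify.timings(df) -> hypothetical: complexity label for all timings collected so far
adaptive.classify <- function(time.at.size, classify.timings,
                              start.size = 10, growth = sqrt(10), stable.runs = 3) {
  size <- start.size
  timings.list <- list()
  labels <- character(0)
  repeat {
    timings.list[[length(timings.list) + 1]] <- time.at.size(size)
    if (length(timings.list) > 1)  # need at least two sizes to see a trend
      labels <- c(labels, classify.timings(do.call(rbind, timings.list)))
    if (length(labels) >= stable.runs &&
        length(unique(tail(labels, stable.runs))) == 1)  # last few classifications agree
      return(list(max.size = size, class = tail(labels, 1)))
    size <- size * growth  # increase the data set size and try again
  }
}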
The way I thought of implementing my classifying function is based just on observation, though. For example, consider the timings from the test: for log-linear PeakSegPDPA, the ratio of the last timing to the first is:
whereas for quadratic cDPA the equivalent ratio is:
There is a clear difference between the two (and the difference grows for higher values of N), so if we could fix bounds on these numbers for each complexity class, it would be possible to classify. The bounds could be based on proportions. Consider data set sizes 10, 31.6, 100, 316.3, 1000, etc., with 100 observations each: if for N = 10 we get, say, a ratio of 8 (timing of observation 100 divided by observation 1), and then for N = 31.6 we get a ratio of 60 (timing of observation 200 divided by observation 100, which is close to 64, i.e. 8^2), then we can declare it quadratic. To verify, we can test whether the ratio keeps growing quadratically for the next 2-3 sizes (100, 316.3, 1000); if it does, we can say it's quadratic. (Just a rough idea.)
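To make the ratio idea concrete, a small sketch computing the median timing per data set size and the ratio between consecutive sizes, from a data frame with the column names used above (just an illustration of the idea, not a proposed classifier):
# For sizes growing by a factor of sqrt(10), a linear algorithm should give
# ratios near sqrt(10) ~ 3.16, and a quadratic one ratios near 10.
timing.ratios <- function(timings.df) {
  med <- tapply(timings.df$Timings, timings.df$`Data set sizes`, median)
  med <- med[order(as.numeric(names(med)))]      # order by increasing data set size
  unname(med[-1] / med[-length(med)])            # ratio between consecutive sizes
}
# e.g. timing.ratios(df.two) for the cDPA timings shown below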
Actually, what does the GuessCompx package do? How can we exploit the work/functions they have already implemented?
As far as I understand, it takes a data frame as input, along with the function which operates on it (plus additional parameters); inside CompEst it creates a vector of data sizes with possible replicates and loops through it (like our timings function). The additional parameters include max.time, which is used as a stopping condition for the loop (stopping when the previous iteration exceeded the time limit) - I remember you once mentioned having a maximum time limit. The difference is that sampling is used here, but at the end a data frame is produced. (There is other stuff in between, but overall I think that's it.)
Then a generalized linear model, i.e. a glm() with a formula such as glm(time ~ log(nb_rows)), is fitted for each of the complexity functions, and the best fit to the data is chosen based on the smallest mean squared error (comparing the considered complexity class against the others), as computed from the glm fits via a leave-one-out routine. That could be used for the classification function, as I mentioned in my proposal as the second approach (although I need to understand this a bit more clearly).
One trivial yet notable difference is that all the logic and helper functions are called inside CompEst, so the user only needs to call that one function. In our case we would then need to incorporate everything into our classification function (if we were to follow that approach).
GuessCompx has exceptions though, and may not always predict the complexity correctly.
I had created a commented version (which includes the functions and parts for memory complexity testing) in a forked version of GuessCompx here (for my reference).
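For reference, calling GuessCompx directly might look roughly like this. This is only a sketch: the data frame and function arguments are as described above, max.time is the time cap mentioned above, and everything else (argument names, defaults) should be checked against ?CompEst.
library(GuessCompx)
# a function operating on a data frame, as CompEst expects (sketch)
peak.fun <- function(df) PeakSegOptimal::PeakSegPDPA(df$count, max.segments = 3L)
test.df <- data.frame(count = rpois(10000, 10))
CompEst(test.df, peak.fun, max.time = 60)  # argument names beyond max.time: see ?CompEst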
Then a generalized linear model, i.e. a glm() with a formula such as glm(time ~ log(nb_rows)), is fitted for each of the complexity functions, and the best fit to the data is chosen based on the smallest mean squared error (comparing the considered complexity class against the others), as computed from the glm fits via a leave-one-out routine.
Following what I said above, here's the complexity-classifying function I derived from the logic used in GuessCompx (using glm formulae based on how the timings vary with data set size, and then selecting the best fit, i.e. the complexity class with the least cross-validated error):
asymptoticTimeComplexityClass = function(model.df)
{
  library(boot) # for cv.glm()
  constant <- glm(Timings~1, data = model.df); model.df['constant'] = fitted(constant)
  linear <- glm(Timings~`Data set sizes`, data = model.df); model.df['linear'] = fitted(linear)
  squareroot <- glm(Timings~sqrt(`Data set sizes`), data = model.df); model.df['squareroot'] = fitted(squareroot)
  log <- glm(Timings~log(`Data set sizes`), data = model.df); model.df['log'] = fitted(log)
  log.linear <- glm(Timings~`Data set sizes`*log(`Data set sizes`), data = model.df); model.df['log-linear'] = fitted(log.linear)
  quadratic <- glm(Timings~I(`Data set sizes`^2), data = model.df); model.df['quadratic'] = fitted(quadratic)
  cubic <- glm(Timings~I(`Data set sizes`^3), data = model.df); model.df['cubic'] = fitted(cubic)
  model.list <- list('constant' = constant,
                     'linear' = linear,
                     'squareroot' = squareroot,
                     'log' = log,
                     'log-linear' = log.linear,
                     'quadratic' = quadratic,
                     'cubic' = cubic)
  cross.validated.errors <- lapply(model.list, function(x) cv.glm(model.df, x)$delta[2])
  best.model <- names(which.min(cross.validated.errors))
  print(best.model)
}
I've tested it on data frames obtained from asymptoticTimings for the functions PeakSegPDPA and cDPA with N = 4, and it works as expected:
> df.one <- asymptoticTimings(PeakSegPDPA(rpois(N,10), rep(1,length(rpois(N,10))), 3L), 4)
> data.table(df.one)
Timings Data set sizes
1: 283700 10
2: 165400 10
3: 146900 10
4: 148500 10
5: 128500 10
---
696: 332665200 10000
697: 299911400 10000
698: 298588500 10000
699: 327627000 10000
700: 344481500 10000
> df.two <- asymptoticTimings(cDPA(rpois(N,10), rep(1,length(rpois(N,10))), 3L), 4)
> data.table(df.two)
Timings Data set sizes
1: 105300 10
2: 42000 10
3: 38400 10
4: 37500 10
5: 35600 10
---
696: 5025363600 10000
697: 6859596000 10000
698: 3692323500 10000
699: 3543158700 10000
700: 4220219200 10000
> asymptoticTimeComplexityClass(df.one)
[1] "log-linear"
> asymptoticTimeComplexityClass(df.two)
[1] "quadratic"
OK, that is good to know. This issue seems to be getting a bit off topic; maybe create another one for discussion about implementing the complexity classification? (I think you should avoid doing that yourself if at all possible and instead just use functions from GuessCompx.)
OK, that is good to know. This issue seems to be getting a bit off topic; maybe create another one for discussion about implementing the complexity classification? (I think you should avoid doing that yourself if at all possible and instead just use functions from GuessCompx.)
Sure, I'll do that. And okay!
Hi @Anirban166, here is a simple demonstration of how to convert expressions to functions in R.
Please read the man pages for substitute and eval.
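A minimal sketch of the idea, along the lines of the expr2fun and asymptoticTimings code earlier in this thread: substitute() captures the unevaluated expression, and eval() later evaluates it in an environment where N is bound.
expr.to.fun <- function(e) {
  lang.obj <- substitute(e)   # capture the unevaluated expression, see ?substitute
  function(N) eval(lang.obj)  # evaluated where N is bound, see ?eval
}
timing.fun <- expr.to.fun(PeakSegOptimal::PeakSegPDPA(rpois(N, 10), max.segments = 3L))
timing.fun(100)  # runs the captured expression with N = 100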