Thanks! That solved the issue of converting the function into a simple expr() form that we can accept as an argument and pass along to the benchmark call inside. It looks like the function is done.
Here it is:
asymptoticTimings = function(expression = expr(), N, RN = NULL)
{
  if(!is.null(RN)) # use the vector RN of data set sizes if specified
  {
    l <- length(RN)
    for(i in 1:l)
    {
      timings <- as.data.frame(microbenchmark(expr(RN[i])))
      if(i == 1)
      {
        timings$datasetsize <- RN[i] # add column 'datasetsize': the first size (10) for observations 1-100
        result <- timings            # data frame with columns expr, time and the newly appended datasetsize
      }
      else
      { # append the subsequent data set sizes with the corresponding timings, combining them using rbind():
        timings$datasetsize <- RN[i]
        result <- rbind(result, timings)
      }
    }
    return(result)
  }
  else # if the optional parameter RN isn't specified
  {
    datasetsizes <- c(10^seq(1,N,by=0.5))
    # total obs. = (N + (N-1)) x 100, e.g. N=2 -> 300 obs., N=3 -> 500 obs., N=4 -> 700 obs. and so on
    l <- length(datasetsizes)
    for(i in 1:l)
    {
      timings <- as.data.frame(microbenchmark(expr(datasetsizes[i])))
      if(i == 1)
      {
        timings$datasetsize <- datasetsizes[i]
        result <- timings
      }
      else
      {
        timings$datasetsize <- datasetsizes[i]
        result <- rbind(result, timings)
      }
    }
    return(result)
  }
}
I tested it on PeakSegPDPA:
library(PeakSegOptimal)
expr2fun <- function(e){
  lang.obj <- substitute(e)
  function(N){
    eval(lang.obj)
  }
}
# create our function to pass to asymptoticTimings:
PeakSegPDPA.fun <- expr2fun({
  PeakSegOptimal::PeakSegPDPA(rpois(N,10), max.segments=3L)
})
df <- asymptoticTimings(PeakSegPDPA.fun(), 4)
# for N = 4, the data set sizes are 10, 31.62278, 100, 316.22777, 1000, 3162.27766, 10000
dftwo <- asymptoticTimings(PeakSegPDPA.fun(), 3, c(10^seq(1,3,by=0.5)))
# for N = 3, with RN = c(10^seq(1,3,by=0.5))
Here are the results in data.tables:
> data.table(df)
expr time datasetsize
1: expr(datasetsizes[i]) 34600 10
2: expr(datasetsizes[i]) 9200 10
3: expr(datasetsizes[i]) 6200 10
4: expr(datasetsizes[i]) 5900 10
5: expr(datasetsizes[i]) 5900 10
---
696: expr(datasetsizes[i]) 3800 10000
697: expr(datasetsizes[i]) 4000 10000
698: expr(datasetsizes[i]) 3900 10000
699: expr(datasetsizes[i]) 3800 10000
700: expr(datasetsizes[i]) 4000 10000
> data.table(dftwo)
expr time datasetsize
1: expr(RN[i]) 46700 10
2: expr(RN[i]) 19500 10
3: expr(RN[i]) 9900 10
4: expr(RN[i]) 9000 10
5: expr(RN[i]) 8400 10
---
496: expr(RN[i]) 4100 1000
497: expr(RN[i]) 8700 1000
498: expr(RN[i]) 12200 1000
499: expr(RN[i]) 10700 1000
500: expr(RN[i]) 8500 1000
Please check if this is what you expected from the timings function. (I'll add documentation and tests for it if it's okay.)
Hi, first of all please read this blog post, which explains why you should avoid the x <- rbind(x, new) idiom: https://tdhock.github.io/blog/2017/rbind-inside-outside/ Instead use the do.call(rbind, list.of.data.tables) idiom described therein.
Second, there is a ton of repeated code in your definition of the asymptotic timings function. Please avoid repetition and stick to DRY programming https://en.wikipedia.org/wiki/Don%27t_repeat_yourself
Third, we should have some (mostly internal) functions which quantify timings of functions of N, and we should also have some other function which quantifies timings of expressions of N (which is more user friendly). So I would recommend something like
asymptoticTimings <- function(e, N.seq){
  lang.obj <- substitute(e)
  fun.obj <- function(N){
    eval(lang.obj)
  }
  asymptoticTimingsFun(fun.obj, N.seq)
}
so then most of your timing logic can be implemented in asymptoticTimingsFun, which assumes you input a function of N -- it will be simpler to work with functions, e.g. fun.obj, instead of expressions.
For variable names use dot or underscore to separate words, e.g. data.set.size
For the return value, the expr column is constant, so it can be ignored/removed without loss of information.
What is the difference between your N and RN arguments?
I have tried to improve on the points you mentioned, please check:
Timing-computation function:
asymptoticTimingsFun <- function(fun.obj, N.seq)
{
  data.set.sizes <- c(10^seq(1,N.seq,by=0.5))
  # total obs. = (N + (N-1)) x 100, e.g. N=2 -> 300 obs., N=3 -> 500 obs., N=4 -> 700 obs. and so on
  l <- length(data.set.sizes)
  timings.list <- list()
  for(i in 1:l)
  {
    benchmarked.timings <- as.data.frame(microbenchmark(expr(data.set.sizes[i])))
    benchmarked.timings$data.set.size <- data.set.sizes[i] # record the data set size at each iteration
    timings.list[[i]] <- data.frame(benchmarked.timings$time, benchmarked.timings$data.set.size)
  }
  resultant.df <- do.call(rbind, timings.list)
  colnames(resultant.df) <- c("Timings", "Data set sizes")
  return(resultant.df)
}
> df <- asymptoticTimings(PeakSegPDPA(rpois(N,10), max.segments = 3L), 5)
> data.table(df)
Timings Data set sizes
1: 25500 1e+01
2: 6500 1e+01
3: 4300 1e+01
4: 4200 1e+01
5: 4100 1e+01
---
896: 4400 1e+05
897: 4600 1e+05
898: 4500 1e+05
899: 4800 1e+05
900: 4700 1e+05
Hi, first of all please read this blog post, which explains why you should avoid the x <- rbind(x, new) idiom: https://tdhock.github.io/blog/2017/rbind-inside-outside/ Instead use the do.call(rbind, list.of.data.tables) idiom described therein.
Gotcha! I didn't know that rbind() inside a loop causes the data.table/data.frame to be copied and reallocated on every iteration.
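For my own reference, a minimal sketch of the recommended pattern (build a list of per-size data frames, then combine them once at the end); compute.timings here is just a hypothetical stand-in for one iteration's work:
# compute.timings is a hypothetical placeholder: any function that returns
# one data frame of timings for a given data set size.
compute.timings <- function(size) data.frame(time = runif(3), data.set.size = size)
sizes <- 10^seq(1, 3, by = 0.5)
# grow a list (cheap), then rbind once, outside the loop:
timings.list <- lapply(sizes, compute.timings)
result <- do.call(rbind, timings.list)
# rather than: result <- rbind(result, compute.timings(size)) inside a loop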
Second, there is a ton of repeated code in your definition of the asymptotic timings function. Please avoid repetition and stick to DRY programming https://en.wikipedia.org/wiki/Don%27t_repeat_yourself
Right, I had unnecessary repetition in the if-else branches and in the statements inside them. I have tried to avoid any repetition in the code above; hopefully it conforms to the DRY principle. (Please check.)
Third, we should have some (mostly internal) functions which quantify timings of functions of N, and we should also have some other function which quantifies timings of expressions of N (which is more user friendly). So I would recommend something like
asymptoticTimings <- function(e, N.seq){
  lang.obj <- substitute(e)
  fun.obj <- function(N){
    eval(lang.obj)
  }
  asymptoticTimingsFun(fun.obj, N.seq)
}
so then most of your timing logic can be implemented in asymptoticTimingsFun, which assumes you input a function of N -- it will be simpler to work with functions, e.g. fun.obj, instead of expressions.
Yes! This was exactly what I was unsure about when thinking of examples, since the user would have had to take the extra step of converting their expression into a function before passing it to our function. It makes perfect sense to implement the expression-to-function logic inside, and then call the timing-computation function, which works on a passed function rather than an expression. Thanks for clearing that up and providing the approach :)
For variable names use dot or underscore to separate words, e.g. data.set.size
Done
For the return value, the expr column is constant, so it can be ignored/removed without loss of information.
Yes, another good point - I am now just extracting the time column (from the data frame returned by microbenchmark) in my code above.
What is the difference between your N and RN arguments?
No difference, I guess. To be honest, I wasn't sure whether RN is meant to depend on N, because N for sure is fixed. If RN depends on N, it would be of the form c(value^seq(1, N, by = step.size)), where we can change the value and the step size, right? If it doesn't depend on N, it can be any vector or sequence, like c(12, 500, 7010) for instance. Either way, N will always be there since RN is optional but N isn't, so if we are using RN we should not use N - which I don't know how to ignore (considering RN values that don't depend on N). This is what I thought of initially:
if(!is.null(RN.seq))
  data.set.sizes <- RN.seq
else
  data.set.sizes <- c(10^seq(1, N, by = 0.5))
RN would be useful only if it's different from c(10^seq(1, N, by = 0.5)), but I can't figure out a way to ignore N while using RN (for values which do not depend on N).
Another thing I don't understand is why these timings differ so much from the ones I got during the earlier test. The timings for PeakSegPDPA and cDPA that I got there, up to N = 10000, were much larger, and the benchmark took quite a few minutes to run. But here the benchmark takes only a few seconds and, as you can see, the timings are much smaller even though they go up to the same data set size. Is there something I am missing?
Please avoid using c when it is not necessary, e.g. c(10^seq(1,N.seq,by=0.5)) should be 10^seq(1,N.seq,by=0.5)
N.seq should have a default value NULL which means to use some smart/adaptive sequence (start with small N and keep increasing until you know what the computational complexity is).
Actually, what does the GuessCompx package do? How can we exploit the work/functions they have already implemented?
Finally, your code does not use fun.obj, and expr is undefined, so why doesn't it give an error?
Finally, your code does not use fun.obj, and expr is undefined, so why doesn't it give an error?
Oh yes, I forgot to call the function on my parameters! Now it's working; I tested it on the same PeakSegOptimal::PeakSegPDPA (and it took a long time to compute for N = 5, as expected).
As for my code above, sorry, I made a mistake and additionally included the expr while committing and pasting it here. When I last ran it (from .Rhistory) it was
benchmarked.timings <- as.data.frame(microbenchmark((data.set.sizes[i])))
which gives no error but incorrect timings, since no function was actually applied. Now I get it.
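To be explicit, the corrected line is roughly the one below, with the benchmarked function of N actually being called on each data set size:
# fun.obj is the function of N passed into asymptoticTimingsFun
benchmarked.timings <- as.data.frame(microbenchmark(fun.obj(data.set.sizes[i])))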
Please avoid using c when it is not necessary, e.g. c(10^seq(1,N.seq,by=0.5)) should be 10^seq(1,N.seq,by=0.5)
Done
N.seq should have a default value NULL which means to use some smart/adaptive sequence (start with small N and keep increasing until you know what the computational complexity is).
That seems ideal for the user, but how can we figure out the size up to which we must test a function in order to distinguish between the different complexities?
And for the functions whose complexities you already know, at what sizes do they generally show their true trend? (Would N = 1,000 be a safe size for most algorithms you know of?)
I understand that we wouldn't want to waste the user's time or hardcode a large data set size, but if we were to use an adaptive sequence, the timings function would have to be linked directly with the classifying function: for small N, if the classifier suggests linear, we would progressively re-check on increasing data set sizes to see whether the complexity class changes, and to check the complexity we need the classifying function. One way to do this is to check the next few data set sizes. Say we find an algorithm to be linear at N = 10; we then check two more increasing sizes of N, and if they fall in the same complexity class we assume that is the class. If they don't, we take the changed complexity class and keep checking larger sizes until the next two values of N agree.
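A rough sketch of that adaptive idea (time.at.size() and classify.timings() are hypothetical placeholders, not existing functions in this package or in GuessCompx):
# time.at.size(size)   -> hypothetical: one data frame of timings for a single data set size
# classify.timings(df) -> hypothetical: complexity label for all timings collected so far
adaptive.classify <- function(time.at.size, classify.timings,
                              start.size = 10, growth = sqrt(10), stable.runs = 3) {
  size <- start.size
  timings.list <- list()
  labels <- character(0)
  repeat {
    timings.list[[length(timings.list) + 1]] <- time.at.size(size)
    if (length(timings.list) > 1)  # need at least two sizes to see a trend
      labels <- c(labels, classify.timings(do.call(rbind, timings.list)))
    if (length(labels) >= stable.runs &&
        length(unique(tail(labels, stable.runs))) == 1)  # last few classifications agree
      return(list(max.size = size, class = tail(labels, 1)))
    size <- size * growth  # increase the data set size and try again
  }
}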
The way I thought of implementing my classifying function is based just on observation, though. For example, consider the timings from the test: for log-linear PeakSegPDPA, the ratio of the last timing to the first is:
whereas for quadratic cDPA the equivalent ratio is:
There is a clear difference between the two (and the difference grows for higher values of N), so if we could fix bounds on these numbers for each complexity class, it would be possible to classify. The bounds could be based on proportions. Consider data set sizes 10, 31.6, 100, 316.3, 1000, etc., with 100 observations each: if for N = 10 we get, say, a ratio of 8 (timing of observation 100 divided by observation 1), and then for N = 31.6 we get a ratio of 60 (timing of observation 200 divided by observation 100, which is close to 64, i.e. 8^2), then we can declare it quadratic. To verify, we can test whether the ratio keeps growing quadratically for the next 2-3 sizes (100, 316.3, 1000); if it does, we can say it's quadratic. (Just a rough idea.)
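To make the ratio idea concrete, a small sketch computing the median timing per data set size and the ratio between consecutive sizes, from a data frame with the column names used above (just an illustration of the idea, not a proposed classifier):
# For sizes growing by a factor of sqrt(10), a linear algorithm should give
# ratios near sqrt(10) ~ 3.16, and a quadratic one ratios near 10.
timing.ratios <- function(timings.df) {
  med <- tapply(timings.df$Timings, timings.df$`Data set sizes`, median)
  med <- med[order(as.numeric(names(med)))]      # order by increasing data set size
  unname(med[-1] / med[-length(med)])            # ratio between consecutive sizes
}
# e.g. timing.ratios(df.two) for the cDPA timings shown below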
Actually, what does the GuessCompx package do? How can we exploit the work/functions they have already implemented?
As far as I understand, it takes a data frame as input, along with the function which operates on it (plus additional parameters); inside CompEst it creates a vector of data sizes with possible replicates and loops through it (like our timings function). The additional parameters include max.time, which is used as a stopping condition for the loop (stopping when the previous iteration exceeded the time limit) - I remember you once mentioned having a maximum time limit. The difference is that sampling is used here, but at the end a data frame is produced. (There is other stuff in between, but overall I think that's it.)
Then a generalized linear model, i.e. a glm() with a formula such as glm(time ~ log(nb_rows)), is fitted for each of the complexity functions, and the best fit to the data is chosen based on the smallest mean squared error (comparing the considered complexity class against the others), as computed from the glm fits via a leave-one-out routine. That could be used for the classification function, as I mentioned in my proposal as the second approach (although I need to understand this a bit more clearly).
One trivial yet notable difference is that all the logic and helper functions are called inside CompEst, so the user only needs to call that one function. In our case we would then need to incorporate everything into our classification function (if we were to follow that approach).
GuessCompx has exceptions though, and may not always predict the complexity correctly.
I had created a commented version (which includes the functions and parts for memory complexity testing) in a forked version of GuessCompx here (for my reference).
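For reference, calling GuessCompx directly might look roughly like this. This is only a sketch: the data frame and function arguments are as described above, max.time is the time cap mentioned above, and everything else (argument names, defaults) should be checked against ?CompEst.
library(GuessCompx)
# a function operating on a data frame, as CompEst expects (sketch)
peak.fun <- function(df) PeakSegOptimal::PeakSegPDPA(df$count, max.segments = 3L)
test.df <- data.frame(count = rpois(10000, 10))
CompEst(test.df, peak.fun, max.time = 60)  # argument names beyond max.time: see ?CompEst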
Then a generalized linear model, i.e. a glm() with a formula such as glm(time ~ log(nb_rows)), is fitted for each of the complexity functions, and the best fit to the data is chosen based on the smallest mean squared error (comparing the considered complexity class against the others), as computed from the glm fits via a leave-one-out routine.
Following what I said above, here's the complexity-classifying function I derived from the logic used in GuessCompx (using glm formulae based on how the timings vary with data set size, and then selecting the best fit, i.e. the complexity class with the least cross-validated error):
asymptoticTimeComplexityClass = function(model.df)
{
  library(boot) # for cv.glm()
  constant <- glm(Timings~1, data = model.df); model.df['constant'] = fitted(constant)
  linear <- glm(Timings~`Data set sizes`, data = model.df); model.df['linear'] = fitted(linear)
  squareroot <- glm(Timings~sqrt(`Data set sizes`), data = model.df); model.df['squareroot'] = fitted(squareroot)
  log <- glm(Timings~log(`Data set sizes`), data = model.df); model.df['log'] = fitted(log)
  log.linear <- glm(Timings~`Data set sizes`*log(`Data set sizes`), data = model.df); model.df['log-linear'] = fitted(log.linear)
  quadratic <- glm(Timings~I(`Data set sizes`^2), data = model.df); model.df['quadratic'] = fitted(quadratic)
  cubic <- glm(Timings~I(`Data set sizes`^3), data = model.df); model.df['cubic'] = fitted(cubic)
  model.list <- list('constant' = constant,
                     'linear' = linear,
                     'squareroot' = squareroot,
                     'log' = log,
                     'log-linear' = log.linear,
                     'quadratic' = quadratic,
                     'cubic' = cubic)
  cross.validated.errors <- lapply(model.list, function(x) cv.glm(model.df, x)$delta[2])
  best.model <- names(which.min(cross.validated.errors))
  print(best.model)
}
I've tested it on data frames obtained from asymptoticTimings for the functions PeakSegPDPA and cDPA with N = 4, and it works as expected:
> df.one <- asymptoticTimings(PeakSegPDPA(rpois(N,10), rep(1,length(rpois(N,10))), 3L), 4)
> data.table(df.one)
Timings Data set sizes
1: 283700 10
2: 165400 10
3: 146900 10
4: 148500 10
5: 128500 10
---
696: 332665200 10000
697: 299911400 10000
698: 298588500 10000
699: 327627000 10000
700: 344481500 10000
> df.two <- asymptoticTimings(cDPA(rpois(N,10), rep(1,length(rpois(N,10))), 3L), 4)
> data.table(df.two)
Timings Data set sizes
1: 105300 10
2: 42000 10
3: 38400 10
4: 37500 10
5: 35600 10
---
696: 5025363600 10000
697: 6859596000 10000
698: 3692323500 10000
699: 3543158700 10000
700: 4220219200 10000
> asymptoticTimeComplexityClass(df.one)
[1] "log-linear"
> asymptoticTimeComplexityClass(df.two)
[1] "quadratic"
OK, that is good to know. This issue seems to be getting a bit off topic; maybe create another one for discussion about implementing the complexity classification? (I think you should avoid doing that yourself if at all possible and instead just use functions from GuessCompx.)
OK, that is good to know. This issue seems to be getting a bit off topic; maybe create another one for discussion about implementing the complexity classification? (I think you should avoid doing that yourself if at all possible and instead just use functions from GuessCompx.)
Sure, I'll do that. And okay!
Hi @Anirban166, here is a simple demonstration of how to convert expressions to functions in R.
Please read the man pages for substitute and eval.
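A minimal sketch of the idea, along the lines of the expr2fun and asymptoticTimings code earlier in this thread: substitute() captures the unevaluated expression, and eval() later evaluates it in an environment where N is bound.
expr.to.fun <- function(e) {
  lang.obj <- substitute(e)   # capture the unevaluated expression, see ?substitute
  function(N) eval(lang.obj)  # evaluated where N is bound, see ?eval
}
timing.fun <- expr.to.fun(PeakSegOptimal::PeakSegPDPA(rpois(N, 10), max.segments = 3L))
timing.fun(100)  # runs the captured expression with N = 100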