Anirban166 / testComplexity

Asymptotic complexity testing framework
https://anirban166.github.io/testComplexity/

Quantifying functions' code style #22

Closed: Anirban166 closed this issue 4 years ago

Anirban166 commented 4 years ago

I've been thinking about making some minor optimizations to the code style of the quantifying functions, for instance using a sequence along our parameter data.sizes to discard the extra if check for an empty data.sizes:

asymptoticTimings <- function(e, data.sizes, max.seconds)
{
  if(!all(!is.infinite(data.sizes) & !is.na(data.sizes) & !is.nan(data.sizes)))
  {
    stop("data.sizes must not contain any NA/NaN/Infinite value.")
  }
  lang.obj <- substitute(e)
  fun.obj  <- function(data.sizes)
  {
    eval(lang.obj)
  }
  time.limit = ifelse(missing(max.seconds), 10^8, max.seconds*10^9)
  timings.list <- list()
  for(i in seq(along = data.sizes))
  {
    benchmarked.timings <- as.data.frame(microbenchmark(fun.obj(data.sizes[i])))
    benchmarked.timings$data.size <- data.sizes[i]
    timings.list[[i]] <- data.frame(benchmarked.timings$time, benchmarked.timings$data.size)
    ifelse((mean(benchmarked.timings$time) > time.limit), break, next)
  }
  resultant.df <- do.call(rbind, timings.list)
  colnames(resultant.df) <- c("Timings", "Data sizes")
  return(resultant.df)
}
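
For context, here is why a sequence along data.sizes makes the separate emptiness check redundant (a quick illustration, not part of the package code):

x <- integer(0)  # an empty data.sizes
1:length(x)      # c(1, 0): a loop over this runs twice, wrongly
seq(along = x)   # integer(0): a loop over this never runs
seq_along(x)     # equivalent, and the preferred modern spelling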

That version works fine, but furthermore, an apply-family function is often considered better practice on the assumption that it is faster than a loop, so I thought of using lapply:

break.bool <- TRUE
lapply(seq(along = data.sizes), function(i)
  { if(break.bool)
    { benchmarked.timings <- as.data.frame(microbenchmark(fun.obj(data.sizes[i])))
      benchmarked.timings$data.size <- data.sizes[i]
      timings.list[[i]] <- data.frame(benchmarked.timings$time, benchmarked.timings$data.size)
      if(mean(benchmarked.timings$time) > time.limit)
      break.bool <<- FALSE
    }
  })

Note that I had to discard the use of break, since the chunk of code isn't in a loop anymore. Instead I introduced a boolean, break.bool, initially set to TRUE outside the scope of the lapply, and set to FALSE internally (via the scoping operator <<-) if the mean of the benchmarked timings for an iteration exceeds the user-set/hardcoded time limit.
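
As a minimal illustration of the difference between local assignment and the scoping operator (a toy sketch, not thread code):

break.bool <- TRUE
f <- function() break.bool <- FALSE   # local assignment; outer break.bool untouched
g <- function() break.bool <<- FALSE  # <<- searches the enclosing scopes instead
f(); break.bool                       # still TRUE
g(); break.bool                       # now FALSE
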
That approach, however, led to an error:

 Error in `colnames<-`(`*tmp*`, value = c("Timings", "Data sizes")) : 
  attempt to set 'colnames' on an object with less than two dimensions 
3.
stop("attempt to set 'colnames' on an object with less than two dimensions") 
2.
`colnames<-`(`*tmp*`, value = c("Timings", "Data sizes")) 
1.
asymptoticTimings(bubble.sort(sample(1:100, data.sizes, replace = TRUE)), 
    data.sizes = 10^seq(1, 3, by = 0.5)) 

So *tmp* was being assigned in place of my expected data frame (the one composed by combining the list elements of timings.list), and I'm not sure why, but one potential reason could be that the chunk now sits inside lapply, since the output of lapply is always a list. So I tried unlisting the lapply:

asymptoticTimings <- function(e, data.sizes, max.seconds)
{
  if(!all(!is.infinite(data.sizes) & !is.na(data.sizes) & !is.nan(data.sizes)))
  {
    stop("data.sizes must not contain any NA/NaN/Infinite value.")
  }
  lang.obj <- substitute(e)
  fun.obj  <- function(data.sizes)
  {
    eval(lang.obj)
  }
  time.limit = ifelse(missing(max.seconds), 10^8, max.seconds*10^9)
  timings.list <- list()
  break.bool <- TRUE
  unlist(lapply(seq(along = data.sizes), function(i)
  {
    if(break.bool)
    { benchmarked.timings <- as.data.frame(microbenchmark(fun.obj(data.sizes[i])))
      benchmarked.timings$data.size <- data.sizes[i]
      timings.list[[i]] <- data.frame(benchmarked.timings$time, benchmarked.timings$data.size)
      if(mean(benchmarked.timings$time) > time.limit)
      break.bool <<- FALSE
    }
  }))
  resultant.df <- do.call(rbind, timings.list)
  colnames(resultant.df) <- c("Timings", "Data sizes")
  return(resultant.df)
}

But it still results in the same error:

Error in `colnames<-`(`*tmp*`, value = c("Timings", "Data sizes")) : 
  attempt to set 'colnames' on an object with less than two dimensions 
3.
stop("attempt to set 'colnames' on an object with less than two dimensions") 
2.
`colnames<-`(`*tmp*`, value = c("Timings", "Data sizes")) 
1.
asymptoticTimings(bubble.sort(sample(1:100, data.sizes, replace = TRUE)), 
    data.sizes = 10^seq(1, 3, by = 0.5)) 
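
For reference, the underlying issue can be reproduced in a few lines (a minimal sketch with toy values): the assignment timings.list[[i]] <- ... inside the lapply function only modifies a local copy of timings.list, so the outer list stays empty; do.call(rbind, ...) on an empty list then returns NULL, and setting colnames on NULL produces exactly this error. (*tmp* is just R's internal placeholder for the object being modified by the replacement function colnames<-.)

timings.list <- list()
invisible(lapply(1:3, function(i) {
  timings.list[[i]] <- data.frame(t = i) # modifies a local copy only
}))
length(timings.list)                     # 0: the outer list was never filled
resultant.df <- do.call(rbind, timings.list)
is.null(resultant.df)                    # TRUE: rbind() of nothing is NULL
colnames(resultant.df) <- c("Timings", "Data sizes")
# Error: attempt to set 'colnames' on an object with less than two dimensions
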
Anirban166 commented 4 years ago

@tdhock Also here, is the reason for the *tmp* allocation in place of my expected data frame the function scope, i.e. that the values assigned to timings.list are not accessible outside the scope of the lapply?

Anirban166 commented 4 years ago

Okay, I got it to work; it seems I was doing it the wrong way when I could just collect the results directly into timings.list from the lapply.

Anirban166 commented 4 years ago

@tdhock Is this any better than what we had?

asymptoticTimings <- function(e, data.sizes, max.seconds)
{
  if(!all(!is.infinite(data.sizes) & !is.na(data.sizes) & !is.nan(data.sizes)))
  {
    stop("data.sizes must not contain any NA/NaN/Infinite value.")
  }
  lang.obj <- substitute(e)
  fun.obj  <- function(data.sizes)
  {
    eval(lang.obj)
  }
  time.limit = ifelse(missing(max.seconds), 10^8, max.seconds*10^9)
  break.bool <- TRUE
  timings.list <- list()
  timings.list <- lapply(seq(along = data.sizes), function(i)
  {
    if(break.bool)
    { benchmarked.timings <- as.data.frame(microbenchmark(fun.obj(data.sizes[i])))
      if(mean(benchmarked.timings$time) > time.limit)
      break.bool <<- FALSE
      benchmarked.timings$data.size <- data.sizes[i]
      return(data.frame(benchmarked.timings$time, benchmarked.timings$data.size))
    }
  })
  resultant.df <- do.call(rbind, timings.list)
  colnames(resultant.df) <- c("Timings", "Data sizes")
  return(resultant.df)
}

Seems to be working fine, but I'm not sure it improved the speed much (I noticed no difference).

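For what it's worth, a quick way to compare the two patterns in isolation (a toy sketch, not the package functions):

library(microbenchmark)
x <- 1:1000
with.for <- function() {
  out <- list()
  for(i in seq_along(x)) out[[i]] <- x[i]^2
  out
}
with.lapply <- function() lapply(seq_along(x), function(i) x[i]^2)
microbenchmark(with.for(), with.lapply(), times = 100)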

Anirban166 commented 4 years ago

Could also use a pipe here

library(magrittr)
asymptoticTimings <- function(e, data.sizes, max.seconds)
{
  if(!all(!is.infinite(data.sizes) & !is.na(data.sizes) & !is.nan(data.sizes)))
  {
    stop("data.sizes must not contain any NA/NaN/Infinite value.")
  }
  lang.obj <- substitute(e)
  fun.obj  <- function(data.sizes)
  {
    eval(lang.obj)
  }
  time.limit = ifelse(missing(max.seconds), 10^8, max.seconds*10^9)
  break.bool <- TRUE
  timings.list <- list()
  timings.list <- seq(along = data.sizes) %>% lapply(function(i)
  {
    if(break.bool)
    { benchmarked.timings <- as.data.frame(microbenchmark(fun.obj(data.sizes[i])))
      if(mean(benchmarked.timings$time) > time.limit)
      break.bool <<- FALSE
      benchmarked.timings$data.size <- data.sizes[i]
      return(data.frame(benchmarked.timings$time, benchmarked.timings$data.size))
    }
  })
  resultant.df <- do.call(rbind, timings.list)
  colnames(resultant.df) <- c("Timings", "Data sizes")
  return(resultant.df)
}


tdhock commented 4 years ago

hi

  1. for loops used to be slower than lapply/etc in old versions of R but that is no longer the case. in current R for loop is just as fast as apply etc, so please use whichever is easier to understand.
  2. use seq_along(some.vector) instead of 1:length(some.vector)
  3. please avoid using pipes, makes it harder to debug
  4. use if(scalar.logical) something else something.else if you only need to test one scalar value, and use ifelse(vector.logical, something.vector, something.else) if you need to test a vector of values (a quick sketch follows).
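
As a quick illustration of that last point (a toy sketch, not package code):

scalar.flag <- TRUE
if(scalar.flag) "one thing" else "another"  # scalar condition: if/else
x <- c(-2, 5, -1)
ifelse(x > 0, "positive", "negative")       # vector condition: ifelse
# [1] "negative" "positive" "negative"

does that help?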
Anirban166 commented 4 years ago

hi

  1. for loops used to be slower than lapply/etc in old versions of R but that is no longer the case. in current R for loop is just as fast as apply etc, so please use whichever is easier to understand.

ah okay, I'll stick to for-loops then (easier to understand)

  2. use seq_along(some.vector) instead of 1:length(some.vector)

yes I am using seq(along = data.sizes) currently

  3. please avoid using pipes, makes it harder to debug

alright, sure (plus I would otherwise need to import magrittr/dplyr additionally)

  4. use if(scalar.logical) something else something.else if you only need to test one scalar value, and use ifelse(vector.logical, something.vector, something.else) if you need to test a vector of values.

okay, so for instance such changes would look like:

ifelse(!all(!is.infinite(data.sizes) & !is.na(data.sizes) & !is.nan(data.sizes)), stop("data.sizes must not contain any NA/NaN/Infinite value."), return)
if(missing(max.seconds))
  time.limit = 10^8
else time.limit = max.seconds*10^9

does that help?

It certainly does, I'll keep these points in mind while writing future code - thanks!

Anirban166 commented 4 years ago

previous asymptoticTimings:

asymptoticTimings <- function(e, data.sizes, max.seconds)
{
  if(!all(!is.infinite(data.sizes) & !is.na(data.sizes) & !is.nan(data.sizes)))
  {
    stop("data.sizes must not contain any NA/NaN/Infinite value.")
  }
  if(length(data.sizes) == 0)
  {
    stop("Cannot run on an empty vector for 'data.sizes'.")
  }
  lang.obj <- substitute(e)
  fun.obj  <- function(data.sizes)
  {
    eval(lang.obj)
  }
  time.limit = ifelse(missing(max.seconds), 10^8, max.seconds*10^9)
  l <- length(data.sizes)
  timings.list <- list()
  for(i in 1:l)
  {
    benchmarked.timings <- as.data.frame(microbenchmark(fun.obj(data.sizes[i])))
    benchmarked.timings$data.size <- data.sizes[i]
    timings.list[[i]] <- data.frame(benchmarked.timings$time, benchmarked.timings$data.size)
    ifelse((mean(benchmarked.timings$time) > time.limit), break, next)
  }
  resultant.df <- do.call(rbind, timings.list)
  colnames(resultant.df) <- c("Timings", "Data sizes")
  return(resultant.df)
}

asymptoticTimings after the changes discussed above:

asymptoticTimings <- function(e, data.sizes, max.seconds)
{
  ifelse(!all(!is.infinite(data.sizes) & !is.na(data.sizes) & !is.nan(data.sizes)), stop("data.sizes must not contain any NA/NaN/Infinite value."), return)
  lang.obj <- substitute(e)
  fun.obj  <- function(data.sizes)
  {
    eval(lang.obj)
  }
  if(missing(max.seconds))
    time.limit = 10^8
  else time.limit = max.seconds*10^9
  timings.list <- list()
  for(i in seq(along = data.sizes))
  {
    benchmarked.timings <- as.data.frame(microbenchmark(fun.obj(data.sizes[i])))
    benchmarked.timings$data.size <- data.sizes[i]
    timings.list[[i]] <- data.frame(benchmarked.timings$time, benchmarked.timings$data.size)
    if(mean(benchmarked.timings$time) > time.limit)
      break
    else next
  }
  resultant.df <- do.call(rbind, timings.list)
  colnames(resultant.df) <- c("Timings", "Data sizes")
  return(resultant.df)
}
tdhock commented 4 years ago

also for assigning variables you should use

time.limit <- if(missing(max.seconds)) 10^8 else max.seconds*10^9
Anirban166 commented 4 years ago

also for assigning variables you should use

time.limit <- if(missing(max.seconds)) 10^8 else max.seconds*10^9

Yes that looks better, done

tdhock commented 4 years ago

also, usually with a for loop people would expect all of the iterations to be computed, but here they are not, because you are using break. Maybe that works, but it is confusing. It would be easier to understand if you changed that to a while loop, I think.


Anirban166 commented 4 years ago

also, usually with a for loop people would expect all of the iterations to be computed, but here they are not, because you are using break. Maybe that works, but it is confusing. It would be easier to understand if you changed that to a while loop, I think.

okay, then it would look like this (since we can't use seq(along = data.sizes) or seq_along(data.sizes) inside a while loop):

asymptoticTimings <- function(e, data.sizes, max.seconds)
{
  ifelse(!all(!is.infinite(data.sizes) & !is.na(data.sizes) & !is.nan(data.sizes)), stop("data.sizes must not contain any NA/NaN/Infinite value."), return)
  lang.obj <- substitute(e)
  fun.obj  <- function(data.sizes)
  {
    eval(lang.obj)
  }
  time.limit <- if(missing(max.seconds)) 10^8 else max.seconds*10^9
  timings.list <- list()
  i <- 1
  while(i <= length(data.sizes))
  {
    benchmarked.timings <- as.data.frame(microbenchmark(fun.obj(data.sizes[i])))
    benchmarked.timings$data.size <- data.sizes[i]
    timings.list[[i]] <- data.frame(benchmarked.timings$time, benchmarked.timings$data.size)
    if(mean(benchmarked.timings$time) > time.limit) break 
    i <- i + 1
  }
  resultant.df <- do.call(rbind, timings.list)
  colnames(resultant.df) <- c("Timings", "Data sizes")
  return(resultant.df)
}

is this implementation ok?

I understand the need for the change, since for() tends to be better only when the repetition count is known and all values are iterated over, as you mentioned (whereas while tends to be more general, and suitable for breaking out when required). But is the lapply version suitable in place of this one? i.e., are while loops favourable, or is a function with lapply better, since we are avoiding for-loops?