dipterix / filearray

Out-of-memory Arrays in R
https://dipterix.org/filearray/

min and argmin #10

Closed: talegari closed this issue 3 months ago

talegari commented 4 months ago

I have a use case where I need to know both the min and the argmin of a filearray. Right now, calling min followed by argmin does the job.

Question: This takes two passes over the data. Could a single pass do it instead? For example, when min is called, the index could also be stored (perhaps as an attribute).

dipterix commented 4 months ago

Maybe I can use existing functions to implement this feature. It will be slower than loading the data fully into memory and running min/max, but it works in out-of-memory situations.

Let me know if the performance is OK for you and I will add the function to the next release. (The function name might differ depending on compatibility; I might just edit the existing fwhich function.)

x <- filearray::as_filearray(array(rnorm(1000000), c(100,100,100)))

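filearray_which <- function(x, fwhich = which.max) {
  # fwhich may return positions (e.g. which.max, which.min)
  # or a logical vector (e.g. function(data) data < 1)
  filearray::mapreduce(
    x, 
    map = function(data, size, start_index){
      # restrict to the first `size` elements of this chunk
      idx <- fwhich(data[seq_len(size)])
      if(is.logical(idx)) { idx <- which(idx, useNames = FALSE) }
      if(!length(idx)) { return(NULL) }
      # column 1: linear index into the whole array; column 2: value there
      cbind(start_index + idx - 1, data[idx])
    },
    reduce = function(mapped_list) {
      mapped_data <- do.call("rbind", mapped_list)
      if(!length(mapped_data)) { return(integer(0L)) }

      # apply fwhich once more across the per-chunk candidates
      idx <- fwhich(mapped_data[, 2])
      res <- mapped_data[idx, , drop = FALSE]
      structure(
        res[, 1],
        value = res[, 2],
        location = arrayInd(res[, 1], dim(x))
      )
    }
  )
}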

# check & microbenchmark
microbenchmark::microbenchmark(
  filearray = { filearray_which(x, which.max) },
  native = { which.max(x[]) },

  times = 10, 
  check = function(v) { all(v[[1]] == v[[2]]) }
)

microbenchmark::microbenchmark(
  filearray = { filearray_which(x, which.min) },
  native = { which.min(x[]) },

  times = 10, 
  check = function(v) { all(v[[1]] == v[[2]]) }
)

microbenchmark::microbenchmark(
  filearray = { filearray_which(x, function(data) {data < 1}) },
  native = { which(x[] < 1) },

  times = 10, 
  check = function(v) { all(v[[1]] == v[[2]]) }
)
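
The map/reduce idea above can also be illustrated without filearray. The following is a minimal base R sketch, assuming a plain in-memory vector stands in for the file-backed array; `chunked_which_min` and `chunk_size` are hypothetical names used only for illustration and are not part of the package. Each chunk is read once, and both the index and the value of its minimum are kept, so min and argmin come out of a single pass.

```r
# Hypothetical helper (not part of filearray): single-pass min/argmin over
# fixed-size chunks of a plain vector, with a map step and a reduce step.
chunked_which_min <- function(v, chunk_size = 1000L) {
  if (!length(v)) { return(integer(0L)) }
  starts <- seq(1L, length(v), by = chunk_size)

  # map: for each chunk, record the global index and value of its minimum
  mapped <- lapply(starts, function(start) {
    end <- min(start + chunk_size - 1L, length(v))
    chunk <- v[start:end]
    idx <- which.min(chunk)
    if (!length(idx)) { return(NULL) }  # which.min found nothing (e.g. all NA)
    c(start + idx - 1L, chunk[idx])
  })

  # reduce: pick the smallest candidate among the per-chunk minima
  mapped <- do.call("rbind", mapped)
  if (!length(mapped)) { return(integer(0L)) }
  best <- which.min(mapped[, 2])
  structure(mapped[best, 1], value = mapped[best, 2])
}

set.seed(1)
v <- rnorm(10000)
res <- chunked_which_min(v)
as.integer(res)     # argmin
attr(res, "value")  # min
```

Because `which.min` returns the first minimum and chunks are visited in order, ties resolve to the earliest index, the same as calling `which.min` on the whole vector.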

dipterix commented 3 months ago

The requested feature has been integrated into the fwhich function in the dev version (see commit https://github.com/dipterix/filearray/commit/24a4765f76f6f850d807acd98787034c909c51ad). This change will be included in the next CRAN release. The dev version is available at https://rave-ieeg.r-universe.dev/filearray