Closed: talegari closed this issue 3 months ago
Maybe I can use existing functions to implement this feature. It will be slower than loading the data fully into memory and running min/max, but it works in out-of-memory situations.
Let me know if the performance is OK for you and I will add the function to the next release. (The function name might differ depending on compatibility; I might just edit the existing fwhich function.)
x <- filearray::as_filearray(array(rnorm(1000000), c(100,100,100)))
filearray_which <- function(x, fwhich = which.max) {
  filearray::mapreduce(
    x,
    # map: runs on each chunk and returns (global index, value) pairs
    map = function(data, size, start_index){
      idx <- fwhich(data[seq_len(size)])
      # a logical result (e.g. data < 1) is converted to positions
      if(is.logical(idx)) { idx <- which(idx, useNames = FALSE) }
      if(!length(idx)) { return(NULL) }
      cbind(start_index + idx - 1, data[idx])
    },
    # reduce: applies fwhich again to the per-chunk candidates
    reduce = function(mapped_list) {
      mapped_data <- do.call("rbind", mapped_list)
      if(!length(mapped_data)) { return(integer(0L)) }
      idx <- fwhich(mapped_data[, 2])
      res <- mapped_data[idx, , drop = FALSE]
      structure(
        res[, 1],
        value = res[, 2],
        location = arrayInd(res[, 1], dim(x))
      )
    }
  )
}
# check & microbenchmark
microbenchmark::microbenchmark(
  filearray = { filearray_which(x, which.max) },
  native = { which.max(x[]) },
  times = 10,
  check = function(v) { all(v[[1]] == v[[2]]) }
)
microbenchmark::microbenchmark(
  filearray = { filearray_which(x, which.min) },
  native = { which.min(x[]) },
  times = 10,
  check = function(v) { all(v[[1]] == v[[2]]) }
)
microbenchmark::microbenchmark(
  filearray = { filearray_which(x, function(data) {data < 1}) },
  native = { which(x[] < 1) },
  times = 10,
  check = function(v) { all(v[[1]] == v[[2]]) }
)
The requested feature has been integrated into the fwhich function in the dev version (see commit https://github.com/dipterix/filearray/commit/24a4765f76f6f850d807acd98787034c909c51ad). This change will be included in the next CRAN release. For the dev version, check https://rave-ieeg.r-universe.dev/filearray
I have a use case where I need to know the min and argmin of the filearray. Right now, min followed by argmin does the job.
Question: This is two passes over the data. Can a single pass do it better, say, when min is called we also store the index (as an attribute maybe)?
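To make the single-pass idea concrete, here is a minimal base-R sketch. It assumes the data can be read in consecutive chunks by linear index; chunked_min_argmin and chunk_size are hypothetical names used only for illustration and are not part of filearray. Each chunk is visited once, and the running minimum is kept together with its global index.

```r
# Hypothetical helper (not part of filearray): one pass over `x` in chunks,
# tracking the running minimum and its linear index together.
chunked_min_argmin <- function(x, chunk_size = 1000L) {
  best_value <- Inf
  best_index <- NA_integer_
  n <- length(x)
  start <- 1L
  while (start <= n) {
    end <- min(start + chunk_size - 1L, n)
    chunk <- x[start:end]            # only this chunk is held in memory
    local_idx <- which.min(chunk)    # integer(0) if the chunk is all NA
    # strict `<` keeps the first occurrence, matching which.min()
    if (length(local_idx) && chunk[local_idx] < best_value) {
      best_value <- chunk[local_idx]
      best_index <- start + local_idx - 1L
    }
    start <- end + 1L
  }
  list(value = best_value, index = best_index)
}

set.seed(1)
x <- array(rnorm(1000), c(10, 10, 10))
res <- chunked_min_argmin(x, chunk_size = 64L)
# convert the linear index to array subscripts
arrayInd(res$index, dim(x))
```

The value and index come out of the same loop, so the data is read only once. On a real filearray, the chunk boundaries would presumably follow the on-disk partitions rather than a fixed chunk_size.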