hope-data-science opened this issue 2 years ago
Hi @hope-data-science, thanks for posting your request.
I'm interested in learning your specific use case for chunked data: are your calculations memory constrained, or would you like to process chunks in parallel to speed up calculations?
Because `fst` reads are already multithreaded, using chunks won't do much for read (and write) times, but it will reduce the memory needed. However, many operations cannot easily be performed on chunks and then correctly combined into an overall result. For example, the `median` cannot be applied to chunks and later combined; you would need additional logic for that (the same applies to any function where some ordering of the column data is needed).
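To see why, here is a small base-R illustration (the values are made up for this example): the median of chunk medians is generally not the overall median.

```r
# nine values split into three chunks of three
x <- c(1, 2, 100,   3, 4, 5,   6, 7, 8)

median(x)
#> [1] 5          # overall median

chunk_medians <- sapply(split(x, rep(1:3, each = 3)), median)
chunk_medians
#> 1 2 3
#> 2 4 7          # per-chunk medians

median(chunk_medians)
#> [1] 4          # not the overall median of 5
```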
But indeed you can use the `from` and `to` arguments to read chunks if you want:
```r
library(dplyr)
library(fst)

tmp_file <- tempfile(fileext = ".fst")

# write sample fst file
x <- data.frame(
  X = sample(1:100, 1000, replace = TRUE)
) %>%
  write_fst(tmp_file)

# determine chunks
nr_of_chunks <- 8
chunk_size <- metadata_fst(tmp_file)$nrOfRows / nr_of_chunks

# custom function to run on each chunk
my_funct <- function(tbl, chunk) {
  tbl %>%
    summarise(
      Mean = mean(X)
    ) %>%
    mutate(
      Chunk = chunk
    )
}

# run custom function on each chunk
z <- lapply(1:nr_of_chunks, function(chunk, custom_function) {
  y <- read_fst(
    tmp_file,
    from = 1 + (chunk - 1) * chunk_size,
    to = chunk * chunk_size
  )
  custom_function(y, chunk)
}, my_funct) %>%
  bind_rows()

print(z)
#>     Mean Chunk
#> 1 51.680     1
#> 2 46.936     2
#> 3 51.304     3
#> 4 47.824     4
#> 5 52.000     5
#> 6 53.712     6
#> 7 55.440     7
#> 8 51.256     8
```
From there, it depends on the actual custom function how the chunk results need to be combined; in this case:

```r
z %>%
  summarise(
    Mean = mean(Mean)
  )
#>     Mean
#> 1 51.269
```
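One caveat worth noting (my addition, not from the example above): a plain `mean` of the chunk means only gives the correct overall mean because every chunk has the same number of rows. With unequal chunk sizes, each chunk mean should be weighted by its row count, e.g. with base R's `weighted.mean` (the values below are hypothetical):

```r
chunk_means <- c(10, 20)    # hypothetical per-chunk means
chunk_sizes <- c(100, 300)  # hypothetical per-chunk row counts

mean(chunk_means)
#> [1] 15      # wrong when chunk sizes differ

weighted.mean(chunk_means, w = chunk_sizes)
#> [1] 17.5    # correct overall mean: (10*100 + 20*300) / 400
```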
I think `from` and `to` do a good job. I just wonder whether the speed would slow down when `from` and `to` change. If it never changes, I think this issue is done.
Recently, I learned about a package named chunked. Since fst supports row access via row number, I suggest that the `read.fst` function could support this sort of chunkwise operation. Any ideas on including this as a new feature? Are there possible solutions? Thanks.