fstpackage / fst

Lightning Fast Serialization of Data Frames for R
http://www.fstpackage.org/fst/
GNU Affero General Public License v3.0

Chunkwise support for `read.fst`? #269

Open hope-data-science opened 2 years ago

hope-data-science commented 2 years ago

Recently, I came across a package named chunked. Since fst supports row access via row numbers, perhaps the read.fst function could support this sort of chunkwise operation. Any ideas about including this as a new feature? Are there possible solutions?
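
For reference, the row access I mean looks something like this (a minimal sketch; the file name is hypothetical):

library(fst)

# read only rows 1001 through 2000 of an existing fst file
read_fst("my_data.fst", from = 1001, to = 2000)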

Thanks.

MarcusKlik commented 1 year ago

Hi @hope-data-science, thanks for posting your request.

I'm interested in learning about your specific use case for chunked data. Are your calculations memory constrained? Or would you like to process chunks in parallel to speed up calculations?

Because fst reads are already multithreaded, using chunks won't do much for the read (and write) times, but it will reduce the memory needed. However, a lot of operations cannot easily be performed on chunks and then correctly combined into an overall result. For example, the median cannot be applied to chunks and then later combined; you would need additional logic for that.

(the same applies to any function where some ordering of column data is needed)
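
A toy illustration of that point (plain R, independent of fst):

# chunk medians do not combine into the overall median
x <- c(1, 2, 3, 100, 101, 102, 103, 1000)
chunks <- split(x, rep(1:2, each = 4))

median(x)
#> [1] 100.5

mean(sapply(chunks, median))
#> [1] 52.5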

MarcusKlik commented 1 year ago

But indeed, you can use the `from` and `to` arguments of `read_fst` to read chunks if you want:

library(dplyr)
library(fst)

tmp_file <- tempfile(fileext = ".fst")

# write sample fst file
data.frame(
  X = sample(1:100, 1000, replace = TRUE)
) %>%
  write_fst(tmp_file)

# determine chunks (assumes the row count divides evenly into chunks; here 1000 / 8 = 125)
nr_of_chunks <- 8
chunk_size <- metadata_fst(tmp_file)$nrOfRows / nr_of_chunks

# custom function to run on each chunk
my_funct <- function(tbl, chunk) {
  tbl %>%
    summarise(
      Mean = mean(X)
    ) %>%
    mutate(
      Chunk = chunk
    )
}

# run custom function on each chunk
z <- lapply(1:nr_of_chunks, function(chunk, custom_function) {
  y <- read_fst(
    tmp_file,
    from = 1 + (chunk - 1) * chunk_size,
    to = chunk * chunk_size
  )

  custom_function(y, chunk)
}, my_funct) %>%
  bind_rows()

print(z)
#>     Mean Chunk
#> 1 51.680     1
#> 2 46.936     2
#> 3 51.304     3
#> 4 47.824     4
#> 5 52.000     5
#> 6 53.712     6
#> 7 55.440     7
#> 8 51.256     8

From there, how you combine the chunks depends on the actual custom function used; in this case:

z %>%
  summarise(
    Mean = mean(Mean)
  )
#>     Mean
#> 1 51.269
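
Note that taking the plain mean of the chunk means only gives the overall mean because all chunks contain the same number of rows. For unequal chunk sizes you would need a weighted mean, something like this (a sketch with hypothetical per-chunk results):

# hypothetical per-chunk means and row counts for unequal chunks
chunk_means  <- c(51.6, 47.2, 52.9)
chunk_counts <- c(125, 125, 75)

# weight each chunk mean by its chunk's row count
weighted.mean(chunk_means, chunk_counts)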
hope-data-science commented 1 year ago

I think `from` and `to` do a good job. I just wonder whether the read speed slows down as `from` and `to` change (e.g. for chunks further into the file). If the speed stays the same, I think this issue is done.
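
One way to check that yourself (a sketch, reusing the tmp_file from the example above and assuming the microbenchmark package is installed):

library(fst)
library(microbenchmark)

# compare reading a chunk at the start of the file with one at the end
microbenchmark(
  first_chunk = read_fst(tmp_file, from = 1, to = 125),
  last_chunk  = read_fst(tmp_file, from = 876, to = 1000),
  times = 100
)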