edwindj / chunked

Chunkwise Text-file Processing for 'dplyr'
https://edwindj.github.io/chunked

Implementing sample_n and sample_frac #8

Open ColinFay opened 6 years ago

ColinFay commented 6 years ago

We could implement a chunkwise sample_n / sample_frac with:

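library(tidyverse)
# Build a large test file: 1000 stacked copies of iris written to a temp CSV
big <- rerun(1000, iris) %>% bind_rows()
path <- tempfile()
write_csv(big, path)

library(chunked)
# S3 method for chunkwise objects: record the sample_n call lazily so that
# it is replayed on every chunk (chunked:::record is a non-exported function)
sample_n.chunkwise <- function(.data, size){
  cmd <- lazyeval::lazy(sample_n(.data, size))
  chunked:::record(.data, cmd)
}

# Draw one row from each chunk and collect the result
read_csv_chunkwise(path) %>%
  sample_n(1) %>%
  collect()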

That way, the sampling is performed separately within each chunk.

What do you think about that? If it sounds like a good idea, let me know and I'll send you a PR.

edwindj commented 6 years ago

I like the idea! A minor problem with sample_n is that it would not have the same semantics: it would return a sample of (number of chunks * n) rows instead of n rows, but if we document that I can live with that :-)
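
To illustrate the row counts involved, here is a minimal base-R sketch that only simulates chunkwise sampling; it does not use chunked or dplyr. The chunk size of 100 rows and the helper name sample_each_chunk are assumptions made for this example, not part of the package.

```r
# Base-R simulation of chunkwise sampling (no chunked/dplyr needed).
set.seed(42)
df <- data.frame(id = seq_len(1000))

# Assumed chunk size of 100 rows, giving 10 chunks
chunk_size <- 100
chunks <- split(df, ceiling(seq_len(nrow(df)) / chunk_size))

# Hypothetical helper: sample k_fun(chunk rows) rows from each chunk,
# then bind the per-chunk samples together
sample_each_chunk <- function(chunks, k_fun) {
  parts <- lapply(chunks, function(ch) {
    k <- k_fun(nrow(ch))
    ch[sample.int(nrow(ch), k), , drop = FALSE]
  })
  do.call(rbind, parts)
}

# Per-chunk "sample_n(1)": one row per chunk
n_rows <- nrow(sample_each_chunk(chunks, function(m) 1))

# Per-chunk "sample_frac(0.1)": 10% of each chunk
frac_rows <- nrow(sample_each_chunk(chunks, function(m) round(0.1 * m)))

n_rows     # 10  = number of chunks * n, not n
frac_rows  # 100 = 0.1 * total rows
```

In this simulation the per-chunk sample_n returns 10 rows rather than 1, while the per-chunk sample_frac still returns 10% of the total, so only sample_n would need the documented caveat.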

xiaodaigh commented 5 years ago

disk.frame has implemented sample_frac; sample_n is still pending.