Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

Chunked fread #1721

UnkindPartition opened 8 years ago

UnkindPartition commented 8 years ago

Consider the case where there's a large CSV file that can nonetheless be processed in chunks. It would be nice if fread could read the file in chunks. See also Reading in chunks at a time using fread in package data.table on StackOverflow.

The interface would be something like fread.apply(input, fun, chunk.size = 1000, ...), where fun would be applied (as in lapply) to successive data.table chunks read from input, each with at most chunk.size rows.
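
A rough sketch of what such a wrapper could look like today, built only on fread's existing skip= and nrows= arguments (fread.apply itself is hypothetical, and the sketch assumes a Unix wc for the row count and a file with a single header row):

library(data.table)

# Hypothetical fread.apply: apply `fun` to successive chunks of at most
# `chunk.size` rows and return the list of results.
fread.apply <- function(input, fun, chunk.size = 1000, ...) {
  header <- names(fread(input, nrows = 0L, ...))   # column names only
  n <- as.integer(system(paste("wc -l <", shQuote(input)), intern = TRUE)) - 1L
  starts <- seq(1L, n, by = chunk.size)            # first data row of each chunk
  lapply(starts, function(s) {
    chunk <- fread(input, skip = s, nrows = min(chunk.size, n - s + 1L),
                   header = FALSE, col.names = header, ...)
    fun(chunk)
  })
}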

If there's a consensus, I could work on a PR.

st-pasha commented 6 years ago

fread already reads the input file in chunks, avoiding holding large parts of the input in memory. This process is invisible to the user (unless "verbose" mode is turned on) and results in a DataTable containing the data from the entire file.

If I understand correctly, this FR asks for a new fread "mode" where the result is not a single DataTable, but rather a sequence of smaller DataTables emitted one-at-a-time via some callback or generator mechanism.

The only complication that I foresee is that fread sometimes has to re-read the data (say, due to "out-of-sample type exceptions" or similar problems). Currently at most one re-read may happen, but in the future more re-read passes may be necessary to address some of the corner cases.

st-pasha commented 6 years ago

It would be easier to emit DataTables of approximate size (i.e. not exactly 1,000,000 rows, but roughly that many). The logic could then be implemented in the pushBuffers function, without changing freadMain.

UnkindPartition commented 6 years ago

FWIW, this is no longer relevant for me.

mattdowle commented 6 years ago

To me, CSV is a one-off step on the way to a binary format or a database. If the file is so large that it won't fit in memory and chunking is needed, then the data should be in a database or a binary format. We have implemented parallel reading in chunks, but only to build the final full table; the sampling uses the full file, etc. It's very unlikely we'll provide a user API for chunked CSV access. skip= and nrows= are really only for working around format errors; they aren't meant to provide chunked access.

MichaelChirico commented 6 years ago

Just came across this Q & A, which suggests that a simple (???) way to accomplish this would be to allow fread to handle connections as input; i.e., this is a subset of #561:

https://stackoverflow.com/a/30403877/3576984

st-pasha commented 6 years ago

@MichaelChirico I'm not sure we are all on the same page about what it means to "support file connections". For example, Python's fread supports reading from file-like objects (the analogue of file connections in R) in a trivial way: you give me a read()-able object, I dump its contents into a temporary file, then fread that file.

However, this wouldn't allow chunking the input. If you ask read() to return only a certain number of bytes, you're virtually guaranteed to break the last line (perhaps even leaving an unterminated string literal). Re-reading that last line with the next chunk would not work either, because a file connection is not always seekable. Generally, the fread algorithm likes to see the whole file in order to properly detect types, the number of columns, etc., and it sometimes needs several passes to get the result right. All of these requirements break if you insist that the input must be a streamable, non-seekable object (unless you dump it into a file, that is).
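
For reference, the temporary-file route is easy to imitate from R today; a minimal sketch, not data.table API (the helper name is made up):

library(data.table)

# Hypothetical helper: materialise a (possibly non-seekable) connection
# into a temporary file so that fread can see the whole input at once.
fread_connection <- function(con, ...) {
  tmp <- tempfile(fileext = ".csv")
  on.exit(unlink(tmp))
  writeLines(readLines(con), tmp)   # stream the connection's lines to disk
  fread(tmp, ...)
}

# e.g. fread_connection(url("https://example.com/data.csv"))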

MichaelChirico commented 6 years ago

What I have in mind is not for fread to handle the chunking automatically, but for the user to be able to rig something up themselves that amounts to chunking.

So, perhaps the user won't have access to all the bells & whistles (e.g. the column re-reads you mentioned), but if they can supply colClasses, that goes away.

I might not be fully grokking the engineering requirements, though. fread(paste(readLines(f, n = 1e5), collapse = '\n')) is a way to accomplish it now, quite inefficiently of course. Is there no way to imitate that without substantial code refactoring (conditional on fread accepting connections)?
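
Something along those lines can already be rigged up against an open connection, at the cost of building an R character vector for each chunk (a rough sketch using fread's text= argument in place of the paste() above; the file name is made up, and it assumes a single header row and no newlines embedded in quoted fields):

library(data.table)

con <- file("big.csv", open = "r")
header <- readLines(con, n = 1L)           # keep the header for every chunk
repeat {
  lines <- readLines(con, n = 1e5)         # next block of raw lines
  if (length(lines) == 0L) break
  chunk <- fread(text = c(header, lines))  # parse the block as a small data.table
  # ... process `chunk` here, e.g. aggregate and append to a list of results ...
}
close(con)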

st-pasha commented 6 years ago

paste(readLines(f, n = 1e5), collapse = '\n') reads the entire input into a character vector, then copies the content into a single memory buffer, then passes that buffer to fread. This is less efficient than directly dumping the content of f into a temporary file and reading that file.

malcook commented 6 years ago

@mattdowle - I disagree with your reason for closing this. There are lots of examples of huge datasets that have no business in a database, and lots of streaming algorithms to work with them. Why not keep your hat in the ring in this arena? data.table excels in so many ways, and it already interoperates with non-seekable inputs. I expect @st-pasha's suggestion that this "could be implemented in the pushBuffers function" is probably spot on.

And clearly, hope springs eternal: Reading in chunks at a time using fread in package data.table.

Just sayin'

mattdowle commented 6 years ago

@malcook

There are lots of examples of huge datasets that have no business in a database

Please provide one example so I can understand.

and lots of Streaming algorithms to work with them

Please provide one example.

If I understand correctly, you're asking for fread to support this on a CSV file. We want to do such things, but not on CSV; rather, on a columnar binary format, i.e. a database. That doesn't mean SQL, nor Oracle, MySQL, or MS SQL Server, but a columnar binary format. You get the CSV into that columnar binary format (which is also a transpose, for cache efficiency) and then do chunking and streaming properly in that suitable format. Traditional row-store SQL databases are wrong for this because they're row-store and SQL. CSV has the same problem ... it's row-store.

I've reopened so we can discuss further. The examples will help a lot.

jangorecki commented 6 years ago

Just to clarify, they are wrong for analytics, but for transaction processing they make perfect sense. Before starting any work on a binary columnar format, it might be good to keep in mind compatibility with this "new" "thing", https://blog.rstudio.com/2018/04/19/arrow-and-beyond/, whatever it turns out to be :)

malcook commented 6 years ago

@mattdowle - Next-generation sequencing commonly produces tens of millions (or more) of rows in Sequence Alignment Map (SAM) files, which are then further scanned and tabulated, often using fread (at least by me). I have wished for "chunked" reading due to memory considerations, and also in order to dispatch chunks to worker processes. These SAM files are often ephemeral and intermediate to an analysis, so databasing them would waste resources. They are produced as flat files; they have a few binary variants that facilitate selective retrieval, which are also probably best considered ephemeral; and regardless, they still occasionally require linear scanning. Perhaps I am misunderstanding your considerations, or am overlooking an approach I could be taking.

On a related topic, would you entertain a request for fread to produce a list of data.tables, one for each section within a .csv, where section delimiters would be identified during the scan via a regular expression? This could easily be accomplished in two passes - one to find the sections based on the regular expression, and the next passing line numbers and counts to fread - but I thought it might make a natural extension to fread's inner loop. Thanks for your consideration.
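
The two-pass version is straightforward to rig up outside fread today; a rough sketch (the function name and delimiter regex are hypothetical, and pass 1 holds all lines in memory just to locate the sections):

library(data.table)

# Pass 1: locate the section delimiter lines with a regex.
# Pass 2: fread each section using skip= and nrows=.
read_sections <- function(file, section_regex) {
  all_lines <- readLines(file)
  starts <- grep(section_regex, all_lines)        # line numbers of the delimiters
  ends   <- c(starts[-1] - 1L, length(all_lines))
  Map(function(s, e) fread(file, skip = s, nrows = e - s), starts, ends)
}

# e.g. read_sections("multi_section.csv", "^## ")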

mattdowle commented 6 years ago

@malcook Thanks for the info and links. If there's enough demand for it, I should reconsider.

Kodiologist commented 5 years ago

I'd just like to chime in as another programmer who'd appreciate this. I'd be the first to agree that a single CSV file is a poor choice of format for a massive dataset. Unfortunately, one might still get massive CSV files from other people. Two examples I've encountered so far are US voter-registration data, which can comprise tens of millions of observations per state but which state governments may provide as single CSV files, and daily or hourly observations from Mexican weather stations. Such files would be easy to handle with line-wise Unix tools if it weren't for quoted fields with embedded newlines. My group will probably use read_csv_chunked from the readr package.
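
For reference, a sketch of that readr route, feeding each chunk through data.table for the filtering step (read_csv_chunked and DataFrameCallback are readr's documented interface; the file and column names here are made up):

library(readr)
library(data.table)

# Keep only the rows of interest from each chunk; readr row-binds the
# callback results, and the reduced table then fits comfortably in memory.
keep_tx <- function(chunk, pos) {
  dt <- as.data.table(chunk)
  dt[state == "TX"]
}

res <- read_csv_chunked("voters.csv",
                        callback = DataFrameCallback$new(keep_tx),
                        chunk_size = 100000)
setDT(res)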

zachmayer commented 5 years ago

Just to throw in an example I have: I've got a CSV that's too big to fit into memory on my laptop.

However, I have some data munging I want to do on that dataset that amounts to:

  1. Parse dates into Date class objects (see also #1450)
  2. Clean up some messy numeric data and convert it from categorical to numeric
  3. Combine some categorical levels and convert to factor (I know the levels ahead of time)
  4. Combine many numeric columns into a single numeric column

After these operations, the data is small enough to fit into memory, but I can't load the raw file all at once to do them.

Right now I'm using a function like this (it isn't perfect yet):

library(pbapply)
library(data.table)

chunked_fread <- function(file, chunk_size, FUN = NULL, verbose = TRUE, ...) {
  # Count the rows (Unix wc) to work out how many chunks are needed
  lines <- as.integer(system(paste("wc -l <", file), intern = TRUE))
  n_chunks <- ceiling(lines / chunk_size)
  if (verbose) {
    print(paste(lines, 'lines,', n_chunks, 'chunks, chunk size of', chunk_size))
  }

  # Show a progress bar when verbose
  apply_fun <- if (verbose) pblapply else lapply

  out <- apply_fun(seq_len(n_chunks), function(idx) {
    # Skip only the rows already covered by earlier chunks
    # (assumes no header row, so columns get default names V1, V2, ...)
    chunk <- fread(file, skip = (idx - 1L) * chunk_size, nrows = chunk_size,
                   header = FALSE, ...)
    if (!is.null(FUN)) {
      chunk <- FUN(chunk)
    }
    chunk
  })
  rbindlist(out)
}

chunked_fread(my_file, chunk_size = 100, FUN = function(x) x[, list(new = V1 + V2)])

xiaodaigh commented 5 years ago

@zachmayer I am also using a similar function. But I like readr's chunked reader, which calls a callback after reading each chunk. I want something similar for data.table so badly!

adamwaring commented 5 years ago

I'd like to chime in with an example as well. I am currently analysing a genetic variant file with 421 million rows, and long rows at that. I've got a few of these files that are hundreds of GBs, so I run into memory issues when trying to deal with them on the server.

It is common in genetics to work with very large tabular data. Often what I would like to do is stream and filter a file to reduce its size. I usually do this filtering in bash, but it would be a lot smoother if I could do it all in R/data.table.

jangorecki commented 5 years ago

Related to @adamwaring's use case: #583

Joe-Wasserman commented 4 years ago

{disk.frame} provides chunked reading (and manipulation) of larger-than-RAM tabular data. Notably, it imports {data.table}.
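
For example, a minimal sketch of that workflow (assuming disk.frame's setup_disk.frame() and csv_to_disk.frame() functions; the file and output directory names are made up):

library(disk.frame)
setup_disk.frame()   # start the parallel workers used for chunk-wise work

# One-off conversion: split the big CSV into on-disk chunks
# ({disk.frame} uses {data.table} underneath)
df <- csv_to_disk.frame("big.csv", outdir = "big.df")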

zachmayer commented 4 years ago

Oh awesome, disk.frame totally solves my problem. Nice that it imports data.table!