Open UnkindPartition opened 8 years ago
fread already reads the input file in chunks, avoiding carrying large parts of the input in memory. This process is invisible to the user (unless "verbose" mode is turned on), and results in a DataTable that has data from the entire file.
If I understand correctly, this FR asks for a new fread "mode" where the result is not a single DataTable, but rather a sequence of smaller DataTables emitted one at a time via some callback or generator mechanism.
The only complication that I foresee is that fread sometimes has to re-read the data (say, due to "out-of-sample type exceptions" or similar problems). Currently at most one re-read may happen, but in the future more re-read steps may be necessary to address some of the corner cases.
It would be easier to implement emitting DataTables of approximate size (i.e. not exactly 1,000,000 rows, but approximately that many). Then the logic could be implemented in the pushBuffers function, without changing freadMain.
FWIW, this is no longer relevant for me.
To me, CSV is a one-off on the way to a binary or database. If it's so large that it won't fit and chunking is needed, then the data should be in a database or binary format.
We have done parallel reading in chunks, but to result in the final full table. The sampling uses the full file, etc.
Very unlikely we'll provide a user API for chunked CSV access. skip= and nrows= are really only for working around format errors; they aren't meant to provide chunked access.
Just came across this Q & A, which suggests a simple (???) way to accomplish this would be to allow fread to handle connections as input, i.e., this is a subset of #561:
@MichaelChirico I'm not sure we are all on the same page about what it means to "support file connections". For example, Python's fread supports reading from file-like objects (the analogue of file connections in R) in a trivial way: you give me a read()-able object, I dump its content into a temporary file, then fread that file.
However, this wouldn't allow chunking the input. I mean, if you ask read() to return just a certain number of bytes, then you're virtually guaranteed to have the last line broken (perhaps even an unterminated string literal). Re-reading the last line with the next chunk would not work either: a file connection is not always seekable. Generally, the fread algorithm likes to see the whole file in order to properly detect types, the number of columns, etc. Also, sometimes it needs several passes to get the result correct. All of these requirements would break if you insist that the input must be a streamable, non-seekable object. (Unless you dump it into a file, that is.)
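The broken-last-line problem can be sketched in a few lines of base R: read raw bytes from a (possibly non-seekable) connection and carry the trailing partial line over to the next chunk. The function name and interface below are illustrative, not part of fread, and the sketch still fails on quoted fields containing embedded newlines, which is exactly the point made above: byte boundaries are not row boundaries.

```r
# Sketch: split a byte stream into chunks that end on complete lines,
# carrying any trailing partial line into the next chunk.
# chunk_complete_lines is a hypothetical helper, not a data.table API.
chunk_complete_lines <- function(con, chunk_bytes) {
  carry <- raw(0)
  chunks <- list()
  repeat {
    bytes <- readBin(con, what = "raw", n = chunk_bytes)
    if (length(bytes) == 0L) break
    buf <- c(carry, bytes)
    nl <- which(buf == charToRaw("\n"))
    if (length(nl) == 0L) { carry <- buf; next }  # no newline yet, keep buffering
    cut <- nl[length(nl)]                         # last complete line boundary
    chunks[[length(chunks) + 1L]] <- rawToChar(buf[seq_len(cut)])
    carry <- buf[-seq_len(cut)]                   # partial line for next round
  }
  if (length(carry)) chunks[[length(chunks) + 1L]] <- rawToChar(carry)
  chunks
}
```

Each emitted chunk then parses cleanly on its own, as long as no field contains an embedded newline.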
What I have in mind is not for fread to handle the chunking automatically, but for the user to be able to rig something up themselves that amounts to chunking.
So, perhaps the user won't have access to all the bells & whistles (e.g. column re-read, as you said), but if they can supply colClasses, that goes away.
I might not be fully grokking the engineering requirements, though. fread(paste(readLines(f, n = 1e5), collapse = '\n')) is a way to accomplish it now, quite inefficiently of course. Is there no way to imitate that without substantial code refactoring (conditional on fread accepting connections)?
paste(readLines(f, n = 1e5), collapse = '\n') reads the entire input into a character vector, then copies the content into a single memory buffer, then passes that buffer to fread. This is less efficient than directly dumping the content of f into a temporary file and reading that file.
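The temp-file approach described above can be sketched with nothing beyond base R: drain the connection to disk in fixed-size raw pieces, then hand the file to a reader (data.table::fread in practice). The fread_connection name and the reader argument are illustrative, not an existing data.table API.

```r
# Sketch: dump a (possibly non-seekable) connection into a temporary file,
# then read that file. `reader` defaults to fread but is injectable, which
# also makes the copy loop easy to test without data.table installed.
fread_connection <- function(con, reader = data.table::fread, ...) {
  tmp <- tempfile(fileext = ".csv")
  on.exit(unlink(tmp))
  out <- file(tmp, "wb")
  while (length(bytes <- readBin(con, what = "raw", n = 1024L * 1024L)) > 0L) {
    writeBin(bytes, out)  # copy the stream to disk in 1 MiB pieces
  }
  close(out)
  reader(tmp, ...)
}
```

Because the whole file lands on disk first, fread keeps all of its usual abilities (type detection over the full sample, re-reads), at the cost of one extra copy.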
@mattdowle - I disagree with your reason for closing this. There are lots of examples of huge datasets that have no business in a database and lots of Streaming algorithms to work with them. Why not continue to keep your hat in this arena. DT excels so well in so many ways, and it already interoperates with non-seekable inputs. I expect @st-pasha's suggestion that this "could be implemented in pushBuffers function" is probably spot on.
And clearly, hope springs eternal: Reading in chunks at a time using fread in package data.table.
Just sayin'
@malcook
There are lots of examples of huge datasets that have no business in a database
Please provide one example so I can understand.
and lots of Streaming algorithms to work with them
Please provide one example.
If I understand correctly, you're asking for fread to support this in a CSV file. We want to do such things, but not in CSV. Rather, in a columnar binary format; i.e. a database. That doesn't mean SQL, nor Oracle, MySQL or MS SQL Server, but a columnar binary format. You get the CSV format into that columnar binary format (which is a transpose, too, for cache efficiency) and then do chunking and streaming properly in that suitable format. Traditional row-store SQL databases are wrong for that because they're row-store and SQL. Same problem with CSV ... row-store.
I've reopened so we can discuss further. The examples will help a lot.
Just to clarify, they are wrong for analytics, but for transaction processing they make perfect sense. Before starting any work on a binary columnar format, it might be good to keep in mind compatibility with this "new" "thing" https://blog.rstudio.com/2018/04/19/arrow-and-beyond/ whatever it will be :)
@mattdowle - Next-gen sequencing commonly produces tens of millions (more?) of rows in Sequence Alignment Map (SAM) files, which are then further scanned / tabulated, often using fread (at least by me). I have wished for "chunked" reading due to memory considerations, and also the desire to dispatch chunks to worker processes. These SAM files are often themselves ephemeral and intermediate to an analysis; databasing them would waste resources. They are produced as flat files, have a few binary variants that facilitate selected retrieval operations (which are also probably best considered ephemeral), and regardless, they still occasionally require linear scanning. Perhaps I am misunderstanding your considerations, or am overlooking an approach I might be taking.
On a similar/related topic, would you entertain a request for fread to produce a list of data.tables, one for each section within a .csv, where section delimiters would be identified during the scan via regular expressions? This could easily be accomplished in two passes - one to find sections based on regular expressions, and the next passing line numbers and counts to fread - however, I thought it might make a natural extension to fread's inner loop. Thanks for your consideration.
@malcook Thanks for the info and links. If there's enough demand for it, then I should reconsider.
I'd just like to chime in as another programmer who'd appreciate this. I'd be the first to agree that a single CSV file is a poor choice of format for a massive dataset. Unfortunately, one might still get massive CSV files from other people. Two examples I've encountered so far are US voter-registration data, which can comprise tens of millions of observations per state but which state governments may provide as single CSV files, and daily or hourly observations from Mexican weather stations. Such files would be easy to handle with linewise Unix tools if it weren't for quoted fields with embedded newlines. My group will probably use read_csv_chunked from the readr package.
Just to throw in an example I have: I've got a csv that's too big to fit into memory on my laptop.
However, I have some data munging I want to do on that dataset that amounts to:
After this operation, the file size is small enough to fit into memory, but I can't load it all at once to do it.
Right now I'm using a function like this (it isn't perfect yet):
```r
library(pbapply)
library(data.table)

chunked_fread <- function(file, chunk_size, FUN = NULL, verbose = TRUE, ...) {
  # count lines with wc so we know how many chunks to schedule
  lines <- as.integer(system(paste("wc -l <", file), intern = TRUE))
  n_chunks <- ceiling((lines - 1L) / chunk_size)  # exclude the header line
  if (verbose) {
    print(paste(lines, "lines,", n_chunks, "chunks, chunk size of", chunk_size))
  }
  apply_fun <- if (verbose) pblapply else lapply
  col_names <- names(fread(file, nrows = 0L))  # read the header once
  out <- apply_fun(seq_len(n_chunks), function(idx) {
    # skip the header plus all rows consumed by earlier chunks
    chunk <- fread(file, skip = (idx - 1L) * chunk_size + 1L,
                   nrows = chunk_size, header = FALSE, col.names = col_names, ...)
    if (!is.null(FUN)) chunk <- FUN(chunk)
    chunk
  })
  rbindlist(out)
}

chunked_fread(my_file, chunk_size = 100, FUN = function(x) x[, list(new = V1 + V2)])
```
@zachmayer I am also using a similar function. But I like readr's chunked function, which calls a callback after reading each chunk. I want something similar for data.table so bad!
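The callback pattern readr uses can be sketched in base R: each chunk of rows is parsed, handed to a callback together with its starting row position, and only the callback's (usually much smaller) results are retained. Here read.csv stands in for fread, and fread_chunked with this signature is illustrative, not an existing data.table API.

```r
# Sketch of a callback-per-chunk reader, assuming a header line and
# no embedded newlines in quoted fields.
fread_chunked <- function(file, callback, chunk_size = 10000L) {
  con <- file(file, "r")
  on.exit(close(con))
  header <- readLines(con, n = 1L)  # keep the header to re-attach per chunk
  results <- list()
  pos <- 1L                         # 1-based row position of the chunk start
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break
    chunk <- read.csv(text = paste(c(header, lines), collapse = "\n"))
    results[[length(results) + 1L]] <- callback(chunk, pos)
    pos <- pos + length(lines)
  }
  do.call(rbind, results)           # only the reduced results are kept
}
```

Only one chunk of raw data is ever in memory at a time; the accumulated memory use is whatever the callback returns.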
I'd like to chime in with an example as well. I am just analysing a genetic variant file with 421 million rows, and long rows at that. I've got a few of these files that are 100s of GBs, so I encounter memory issues when trying to deal with them on the server.
It is common in genetics to work with very large tabular data. Often what I would like to be able to do is stream and filter the file to reduce its size. I usually do this filtering in bash, but it would be a lot smoother if I could do it all in R/data.table.
Related to @adamwaring's use case: #583
{disk.frame} provides a method for chunked read (and manipulation) of too-large-for-RAM tabular data. Notably, it imports {data.table}.
Oh awesome, disk.frame totally solves my problem. Nice that it imports data.table!
Consider a case where there's a large CSV file, but it can be processed in chunks. It would be nice if fread could read the file in chunks. See also Reading in chunks at a time using fread in package data.table on StackOverflow.

The interface would be something like fread.apply(input, fun, chunk.size = 1000, ...), where fun would be applied (similarly to lapply) to subsequent data table chunks read from input, each of size at most chunk.size.

If there's a consensus, I could work on a PR.